WO2008111760A1 - Method and apparatus for providing video synthesizing call service using voice recognition - Google Patents

Method and apparatus for providing video synthesizing call service using voice recognition

Info

Publication number
WO2008111760A1
WO2008111760A1 PCT/KR2008/001268 KR2008001268W WO2008111760A1 WO 2008111760 A1 WO2008111760 A1 WO 2008111760A1 KR 2008001268 W KR2008001268 W KR 2008001268W WO 2008111760 A1 WO2008111760 A1 WO 2008111760A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
words
images
terminal
call
Prior art date
Application number
PCT/KR2008/001268
Other languages
French (fr)
Inventor
Gil-Soo Lee
Bong-Kyu Heo
Original Assignee
Ti Square Technology Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from KR1020070062908A external-priority patent/KR100893546B1/en
Application filed by Ti Square Technology Ltd. filed Critical Ti Square Technology Ltd.
Publication of WO2008111760A1 publication Critical patent/WO2008111760A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/272 Means for inserting a foreground image in a background image, i.e. inlay, outlay

Definitions

  • the present invention relates, in general, to a method and apparatus for providing a video synthesis (video overlay) call service using voice recognition, and, more particularly, to a method and apparatus which can analyze the content of a voice call of a video terminal user during a video call conversation, synthesize (overlay) images or video corresponding to words spoken by the user with a video call signal in real time, and provide a synthesized video call signal to a video terminal.
  • a user can make a call while personally viewing the image of the other party, rather than merely making a voice call, in a Wideband Code Division Multiple Access (WCDMA) environment or the like, and various types of supplementary service using such a video call have been gradually developed and have been provided to users.
  • a conventional video call service is disadvantageous in that, since video is generally transmitted around the face of a user, the content of a video call is simple and the call may be more unnatural than a voice call.
  • a video synthesis method of synthesizing background images has been proposed so as to provide a decorative effect to a screen.
  • an object of the present invention is to provide a method and apparatus for providing a video synthesis call service, which analyze the content of voices input by a user during a video call conversation in real time, synthesize images or video corresponding to words contained in a voice signal, with a video call signal, transmitted between a calling video terminal and a called video terminal, in real time, and provide a synthesized signal to the terminal of the other party, thus increasing video call users' interest, and transmitting various types of video screens in real time to overcome the simplicity of a typical video call.
  • Another object of the present invention is to provide a method and apparatus for providing a video synthesis call service, which automatically provide screens, synthesized with various types of stereographical video or images, to the video terminal of the other party during a call, without requiring the user of a video terminal to perform specific manipulation in each video call.
  • a further object of the present invention is to provide a method and apparatus for providing a video synthesis call service, which download video or images corresponding to words, contained in voice content, from an external video or image provision device during a video call conversation, store the downloaded video or images, synthesize the video or images with a video call signal with respect to the same words when subsequently spoken, and transmit a synthesized video call signal, thus enabling various types of video or images to be synthesized.
  • a method of providing a video synthesis call service using voice recognition in an apparatus for providing a video synthesis call service using voice recognition, the apparatus being connected to a calling video terminal and a called video terminal through a video call network, comprising the steps of extracting a voice signal from a video call signal transmitted between the calling video terminal and the called video terminal; extracting one or more words from the extracted voice signal; searching for images or video corresponding to the extracted words; synthesizing the found images or video with the video call signal transmitted between the calling video terminal and the called video terminal; and transmitting the synthesized video call signal to at least one of the calling video terminal and the called video terminal.
  • the step of extracting the words from the extracted voice signal may be performed to convert the extracted voice signal into a sentence and to extract one or more words from the sentence by separating the sentence into one or more words.
  • the images or video may be stored in advance to correspond to respective words.
  • the step of extracting the voice signal from the video call signal transmitted between the calling video terminal and the called video terminal may be performed at regular periods.
  • the step of extracting the voice signal from the video call signal transmitted between the calling video terminal and the called video terminal may be performed only when an amplitude of the voice signal is greater than a predetermined level.
  • an apparatus for providing a video synthesis call service comprising a video call network cooperation unit for receiving a video call signal, transmitted between the calling video terminal and the called video terminal, in cooperation with the video call network; a video call voice extraction unit for extracting a voice signal from the video call signal received from the video call network cooperation unit; a voice recognition processing unit for converting the voice signal extracted by the video call voice extraction unit into a sentence containing one or more words; a sentence word processing unit for extracting respective words from the sentence, converted by the voice recognition processing unit, by separating the sentence into words; an image/video search unit for comparing the words extracted by the sentence word processing unit with images or video stored in advance and selecting, from among the images or video stored in advance, the images or video corresponding to the extracted words; and an image/video synthesis unit for synthesizing the images or video selected by the image/video search unit with the video call signal transmitted between the calling video terminal and the called video terminal, and transmitting the video call signal synthesized with the images or video to the video call network cooperation unit, wherein the video call network cooperation unit transmits the video call signal synthesized with the images or video to at least one of the calling video terminal and the called video terminal.
  • the apparatus may further comprise an image/video database for storing images or video corresponding to one or more words.
  • a video terminal for providing a video synthesis call service through a video call network, comprising a video call signal reception unit for receiving a video call signal; a video call voice extraction unit for extracting a voice signal from the video call signal received from the video call signal reception unit; a voice recognition processing unit for converting the voice signal extracted by the video call voice extraction unit into a sentence containing one or more words; a sentence word processing unit for extracting respective words from the sentence output from the voice recognition processing unit by separating the sentence into words; an image/video search unit for comparing the words extracted by the sentence word processing unit with images or video stored in advance and selecting, from among the images or video stored in advance, the images or video corresponding to the extracted words; and an image/video synthesis unit for synthesizing the images or video selected by the image/video search unit with the video call signal, and transmitting the video call signal synthesized with the images or video to a video/voice communication unit, wherein the video/voice communication unit receives the video call signal synthesized with the images or video from the image/video synthesis unit, and transmits the synthesized video call signal to a video terminal of another party.
  • the video terminal may further comprise an image/video database for storing images or video corresponding to one or more words.
  • the video terminal may further comprise a download processing unit for downloading images or video from an external image/video provision device, and transmitting the images or video to the image/video database.
  • the present invention can provide a method and apparatus for providing a video synthesis call service, which analyze the content of voices input by a user during a video call conversation in real time, synthesize images or video corresponding to words contained in a voice signal, with a video call signal, transmitted between a calling video terminal and a called video terminal, in real time, and provide a synthesized signal to the terminal of the other party, thus increasing video call users' interest, and transmitting various types of video screens in real time to overcome the simplicity of a typical video call.
  • the present invention can provide a method and apparatus for providing a video synthesis call service, which automatically provide screens, synthesized with various types of stereographical video or images, to the video terminal of the other party during a call, without requiring the user of a video terminal to perform specific manipulation in each video call.
  • the present invention can provide a method and apparatus for providing a video synthesis call service, which download video or images corresponding to words, contained in voice content, from an external video or image provision device during a video call conversation, store the downloaded video or images, synthesize the video or images with a video call signal with respect to the same words when subsequently spoken, and transmit a synthesized video call signal, thus enabling various types of video or images to be synthesized.
  • FIG. 1 is a diagram of the entire construction showing the connection of an apparatus 30 for providing a video synthesis call service using voice recognition, a calling video terminal 10 and a called video terminal 20 according to the present invention;
  • FIG. 2 is a block diagram showing the detailed construction of the video synthesis call service provision apparatus 30 of FIG. 1;
  • FIG. 3 is a flowchart showing the procedure of the video synthesis call service of
  • FIG. 4 is a diagram showing a method of extracting words contained in a voice signal in FIG. 3;
  • FIG. 5 is a diagram showing a method of searching for images or video corresponding to the extracted words in FIG. 3;
  • FIG. 6 is a diagram showing the detailed construction of a video terminal 60 capable of providing a video synthesis call service using voice recognition according to another embodiment of the present invention.
  • FIG. 1 is a diagram of the entire construction showing the connection of an apparatus for providing a video synthesis (video overlay) call service using voice recognition, a calling video terminal and a called video terminal according to the present invention.
  • an apparatus 30 for providing a video synthesis call service is connected to a calling video terminal 10 and a called video terminal 20 through a video call network.
  • the calling video terminal 10 and the called video terminal 20 perform a video call over the video call network, and are connected to the video synthesis call service provision apparatus 30 through the video call network, thus being provided with a video synthesis call service.
  • the video synthesis call service provision apparatus 30 extracts and recognizes a voice signal from the video call signal of the user of the calling video terminal 10 or the called video terminal 20 in real time, detects words contained in the voice signal, searches for images or video corresponding to the detected words, synthesizes found images or video with the video call signal, and transmits the synthesized video call signal to either or both of the calling video terminal 10 and the called video terminal 20.
  • FIG. 2 is a block diagram showing the detailed construction of the video synthesis call service provision apparatus 30 of FIG. 1.
  • the video synthesis call service provision apparatus 30 includes a video call network cooperation unit 31, a video call voice extraction unit 32, a voice recognition processing unit 33, a sentence word processing unit 34, an image/video search unit 35, an image/video synthesis unit 36 and an image/video database (DB) 37.
  • the video call network cooperation unit 31 functions to receive a video call signal transmitted between the calling video terminal 10 and the called video terminal 20 in cooperation with the video call network and transmits the received video call signal to the video call voice extraction unit 32, and also functions to transmit a video call signal synthesized by the image/video synthesis unit 36 to the calling or called video terminal over the video call network.
  • the video call voice extraction unit 32 functions to extract a voice signal from the video call signal received from the video call network cooperation unit 31 by separating the video call signal into a video signal and the voice signal, and to transmit the extracted voice signal to the voice recognition processing unit 33.
  • the voice recognition processing unit 33 functions to convert the voice signal, received from the video call voice extraction unit 32, into a sentence and to transmit the sentence to the sentence word processing unit 34.
  • the sentence word processing unit 34 functions to extract one or more words from the sentence, received from the voice recognition processing unit 33, by separating the sentence into one or more words, and to transmit the extracted words to the image/video search unit 35.
  • the image/video search unit 35 functions to search the image/video DB 37 for images or video corresponding to the words received from the sentence word processing unit 34, and transmit found images or video corresponding to the extracted words to the image/video synthesis unit 36.
  • the image/video synthesis unit 36 functions to synthesize the images or video, received from the image/video search unit 35, with the video call signal transmitted between the users of the calling video terminal and the called video terminal, and to transmit a video call signal synthesized with the images or video to the video call network cooperation unit 31. That is, the video call signal transmitted between the users of the calling video terminal 10 and the called video terminal 20 is synthesized with the found images or video, and thus the synthesized video call signal is transmitted to either or both of the calling video terminal 10 and the called video terminal 20.
  • the image/video DB 37 stores in advance words frequently used by typical users, and images or video corresponding to the words.
  • FIG. 3 is a flowchart showing the procedure of the video synthesis call service described with reference to FIGS. 1 and 2.
  • the video call voice extraction unit 32 of the video synthesis call service provision apparatus 30 extracts only a voice signal from a video call signal at step S301.
  • as the extracted voice signal passes through the voice recognition processing unit 33 and the sentence word processing unit 34, the words contained in the voice signal are extracted at step S303.
  • the image/video search unit 35 searches the image/video DB 37 for images or video corresponding to the extracted words at step S305, and transmits the images or video, corresponding to the extracted words and found in the search, to the image/video synthesis unit 36.
  • the image/video synthesis unit 36 synthesizes the received images or video with the video call signal transmitted between the calling video terminal 10 and the called video terminal 20 at step S307, and transmits the video call signal synthesized with the images or video to the video call network cooperation unit 31.
  • the video call network cooperation unit 31 transmits the video call signal, synthesized with the images or video, to either or both of the calling video terminal 10 and the called video terminal 20 at step S309.
  • FIG. 4 is a diagram showing a method of extracting words contained in the voice signal in FIG. 3.
  • FIG. 4 illustrates the method (step S303) of extracting the words contained in the voice signal in FIG. 3.
  • the voice signal extracted by the video call voice extraction unit 32 is transmitted to the voice recognition processing unit 33.
  • the voice recognition processing unit 33 converts the extracted voice signal into a sentence at step S401, and transmits the sentence to the sentence word processing unit 34.
  • the term 'sentence' means text containing one or more words.
  • the voice recognition technology used to convert a voice signal into a sentence in the present invention may be implemented using conventional, well-known technology. The present invention is not intended to propose voice recognition technology itself, and thus a detailed description thereof is omitted.
  • the sentence word processing unit 34 extracts one or more words from the sentence received from the voice recognition processing unit 33 by separating the sentence into one or more words at step S403, and transmits the extracted words to the image/video search unit 35. Thereafter, the above-described step S305 and the steps following it are performed.
  • FIG. 5 is a diagram showing a method of searching for images or video corresponding to the extracted words in FIG. 3.
  • FIG. 5 illustrates a method (step S305) of searching for images or video corresponding to the extracted words in FIG. 3.
  • the image/video search unit 35 searches the image/video DB 37 for the images or video corresponding to the extracted words at step S501.
  • the images or video, corresponding to the extracted words and found in the search, are transmitted to the image/video synthesis unit 36 at step S503.
  • steps S501 and S503 are shown in the lower portion of FIG. 5. That is, when the extracted words, such as "today", "home", and "beer", are transmitted to the image/video search unit 35, the image/video search unit 35 searches the image/video DB 37 for images or video corresponding to the respective words. It is determined that images or video corresponding to the words "home" and "beer" exist in the image/video DB 37 shown in the lower portion of FIG. 5, but that no image or video corresponding to the word "today" exists in the DB. Therefore, the image/video search unit 35 transmits only the images or video corresponding to the words "home" and "beer" to the image/video synthesis unit 36.
  • the image/video synthesis unit 36 synthesizes the received images or video with the video call signal transmitted between the calling video terminal 10 and the called video terminal 20, and transmits the synthesized video call signal to either or both of the calling video terminal 10 and the called video terminal 20.
  • FIG. 6 is a diagram showing the detailed construction of a video terminal capable of providing a video synthesis call service using voice recognition according to another embodiment of the present invention. Unlike the above embodiment, described with reference to FIGS. 1 to 3, FIG. 6 illustrates a block diagram showing the construction of a video terminal capable of performing a video synthesis call service without requiring a separate video synthesis call service provision apparatus.
  • a video terminal 60 for performing a video synthesis call service using voice recognition includes a video call reception unit 61, a video call voice extraction unit 62, a voice recognition processing unit 63, a sentence word processing unit 64, an image/video search unit 65, an image/video synthesis unit 66, a video/voice communication unit 67, an image/video database (DB) 69, and a download processing unit 68.
  • the video call reception unit 61 functions to receive a video call signal from the user of the video terminal and transmit the received video call signal to the video call voice extraction unit 62.
  • the video call voice extraction unit 62 functions to extract only a voice signal from the video call signal, received from the video call reception unit 61, by separating the video call signal into a video signal and the voice signal, and to transmit the extracted voice signal to the voice recognition processing unit 63.
  • the voice recognition processing unit 63 functions to convert the voice signal, received from the video call voice extraction unit 62, into a sentence, and to transmit the sentence to the sentence word processing unit 64.
  • the sentence word processing unit 64 functions to extract one or more words from the sentence, received from the voice recognition processing unit 63, by separating the sentence into one or more words, and to transmit the extracted words to the image/video search unit 65.
  • the image/video search unit 65 functions to search the image/video DB 69 for images or video corresponding to the words received from the sentence word processing unit 64, and to transmit found images or video corresponding to the extracted words to the image/video synthesis unit 66.
  • the image/video synthesis unit 66 functions to synthesize the images or video, received from the image/video search unit 65, with the video call signal transmitted between the users of the calling video terminal and the called video terminal, and to transmit the video call signal synthesized with the images or video to the video/voice communication unit 67.
  • the image/video DB 69 functions to store in advance words frequently used by typical users, and images or video corresponding to the words, thus enabling the image/video search unit 65 to search for images or video corresponding to the words received from the sentence word processing unit 64.
  • the download processing unit 68 functions to download images or video from an external image/video provision device, and to transmit the images or video to the image/video DB 69.
  • the external image/video provision device may be a specific server, a Personal Computer (PC), etc.
  • the method can be configured to perform the series of procedures (extracting a voice signal, converting the voice signal into a sentence, and extracting words from the sentence) at regular periods throughout the entire video call conversation. Further, a method of extracting only those voice signals whose amplitudes are greater than a predetermined level, among the voice signals contained in the video call signals transmitted between the calling video terminal 10 and the called video terminal 20, may be used.
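
The FIG. 5 lookup described in the list above (words "today", "home", and "beer", of which only "home" and "beer" have stored media) can be sketched as a plain dictionary lookup. This is a minimal illustration, not the patent's implementation: the dictionary `IMAGE_VIDEO_DB`, the function `search_images`, and the file names are hypothetical stand-ins for the image/video DB 37 and the image/video search unit 35.

```python
from typing import Dict, List

# Hypothetical stand-in for the image/video DB 37: word -> stored media name.
IMAGE_VIDEO_DB: Dict[str, str] = {
    "home": "home.png",
    "beer": "beer.gif",
}


def search_images(words: List[str], db: Dict[str, str]) -> List[str]:
    """Return the stored media for every extracted word that has an entry.

    Words with no entry (such as "today" in the FIG. 5 example) are skipped,
    so nothing is forwarded to the synthesis step for them.
    """
    return [db[word] for word in words if word in db]


found = search_images(["today", "home", "beer"], IMAGE_VIDEO_DB)
```

Here `found` holds only the entries for "home" and "beer", mirroring steps S501 and S503.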

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

The present invention relates to a method and apparatus for providing a video synthesis call service using voice recognition. The present invention provides a method comprising the steps of extracting a voice signal from a video call signal transmitted between a calling video terminal and a called video terminal; extracting one or more words contained in the extracted voice signal; searching for images or video corresponding to the extracted words; synthesizing the found images or video with the video call signal transmitted between the calling video terminal and the called video terminal; and transmitting the synthesized video call signal to either or both of the calling video terminal and the called video terminal. The present invention also provides an apparatus using the method.

Description

METHOD AND APPARATUS FOR PROVIDING VIDEO SYNTHESIZING CALL SERVICE USING VOICE RECOGNITION
Technical Field
[1] The present invention relates, in general, to a method and apparatus for providing a video synthesis (video overlay) call service using voice recognition, and, more particularly, to a method and apparatus which can analyze the content of a voice call of a video terminal user during a video call conversation, synthesize (overlay) images or video corresponding to words spoken by the user with a video call signal in real time, and provide a synthesized video call signal to a video terminal.
Background Art
[2] Recently, with the rapid development of mobile communication technology, a user can make a call while viewing the image of the other party, rather than merely making a voice call, in a Wideband Code Division Multiple Access (WCDMA) environment or the like, and various supplementary services using such video calls have gradually been developed and provided to users. However, a conventional video call service is disadvantageous in that, since the transmitted video generally centers on the face of the user, the content of a video call is simple and the call may feel more unnatural than a voice call. In consideration of this fact, a video synthesis method of synthesizing background images has been proposed so as to provide a decorative effect on the screen. However, such a method is problematic in that, since still images are generally used, the simplicity of the screen cannot be overcome, and the user must set the decorative effect every time in order to change such a simple screen. For example, it is inconvenient that a user must store images or video for background images in his or her terminal, and must decide during a video call conversation whether to use the images or video and which background image to use.
Disclosure of Invention
Technical Problem
[3] Accordingly, the present invention has been made keeping in mind the above problems occurring in the prior art, and an object of the present invention is to provide a method and apparatus for providing a video synthesis call service, which analyze the content of voices input by a user during a video call conversation in real time, synthesize images or video corresponding to words contained in a voice signal, with a video call signal, transmitted between a calling video terminal and a called video terminal, in real time, and provide a synthesized signal to the terminal of the other party, thus increasing video call users' interest, and transmitting various types of video screens in real time to overcome the simplicity of a typical video call.
[4] Another object of the present invention is to provide a method and apparatus for providing a video synthesis call service, which automatically provide screens, synthesized with various types of stereographical video or images, to the video terminal of the other party during a call, without requiring the user of a video terminal to perform specific manipulation in each video call.
[5] A further object of the present invention is to provide a method and apparatus for providing a video synthesis call service, which download video or images corresponding to words, contained in voice content, from an external video or image provision device during a video call conversation, store the downloaded video or images, synthesize the video or images with a video call signal with respect to the same words when subsequently spoken, and transmit a synthesized video call signal, thus enabling various types of video or images to be synthesized.
Technical Solution
[6] In accordance with an aspect of the present invention to accomplish the above objects, there is provided a method of providing a video synthesis call service using voice recognition in an apparatus for providing a video synthesis call service using voice recognition, the apparatus being connected to a calling video terminal and a called video terminal through a video call network, comprising the steps of extracting a voice signal from a video call signal transmitted between the calling video terminal and the called video terminal; extracting one or more words from the extracted voice signal; searching for images or video corresponding to the extracted words; synthesizing the found images or video with the video call signal transmitted between the calling video terminal and the called video terminal; and transmitting the synthesized video call signal to at least one of the calling video terminal and the called video terminal.
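As a reading aid, the five steps recited above (extract voice, extract words, search, synthesize, transmit) can be sketched as a chain of small functions. This is only a sketch under stated assumptions: `CallSignal`, `provide_service`, and the other names are hypothetical, strings stand in for real audio and video data, and the recognizer is passed in as a callable because the source treats voice recognition as conventional technology.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CallSignal:
    """Hypothetical container for one slice of a video call signal."""
    video: str                                   # stand-in for video frames
    voice: str                                   # stand-in for the audio track
    overlays: List[str] = field(default_factory=list)


def extract_voice(signal: CallSignal) -> str:
    """Step 1: separate the voice signal from the video call signal."""
    return signal.voice


def extract_words(voice: str, recognize: Callable[[str], str]) -> List[str]:
    """Step 2: convert the voice to a sentence, then split it into words."""
    return recognize(voice).lower().split()


def search_media(words: List[str], media_db: Dict[str, str]) -> List[str]:
    """Step 3: keep only the media stored for the extracted words."""
    return [media_db[w] for w in words if w in media_db]


def synthesize(signal: CallSignal, media: List[str]) -> CallSignal:
    """Step 4: attach the found media to a copy of the call signal."""
    return CallSignal(signal.video, signal.voice, signal.overlays + media)


def provide_service(signal: CallSignal,
                    recognize: Callable[[str], str],
                    media_db: Dict[str, str],
                    send: Callable[[CallSignal], None]) -> CallSignal:
    """Run steps 1 to 5 once for a single slice of the call."""
    words = extract_words(extract_voice(signal), recognize)
    out = synthesize(signal, search_media(words, media_db))
    send(out)                                    # Step 5: transmit to a terminal
    return out
```

A caller would supply its own recognizer and transport; in a quick experiment, `recognize` can be an identity function over text and `send` can append to a list.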
[7] Preferably, the step of extracting the words from the extracted voice signal may be performed to convert the extracted voice signal into a sentence and to extract one or more words from the sentence by separating the sentence into one or more words.
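The sentence-to-words separation described above might look like the following sketch. It assumes the recognizer returns English text in which words are delimited by spaces and punctuation; the function name `split_sentence` is hypothetical, and languages without such boundaries would need a proper tokenizer instead.

```python
import re
from typing import List


def split_sentence(sentence: str) -> List[str]:
    """Separate a recognized sentence into lowercase words.

    Runs of letters, digits, and apostrophes count as words; everything
    else (spaces, punctuation) is treated as a separator.
    """
    return re.findall(r"[a-z0-9']+", sentence.lower())


words = split_sentence("Let's have a beer at home today.")
```

Lowercasing is a design choice made here so that later lookups are case-insensitive; the source does not specify any normalization.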
[8] Further, at the step of searching for images or video corresponding to the extracted words, the images or video may be stored in advance to correspond to respective words.
[9] Further, the step of extracting the voice signal from the video call signal transmitted between the calling video terminal and the called video terminal may be performed at regular periods.
[10] Further, the step of extracting the voice signal from the video call signal transmitted between the calling video terminal and the called video terminal may be performed only when an amplitude of the voice signal is greater than a predetermined level.
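The two optional conditions above (extraction at regular periods, and extraction only above a predetermined amplitude) can be combined in a short sketch. The sample values, the window length `period`, and the use of peak absolute amplitude as the level measure are all assumptions made for illustration; the source does not say how amplitude is measured.

```python
from typing import Iterator, List, Sequence


def windows(samples: Sequence[float], period: int) -> Iterator[Sequence[float]]:
    """Yield consecutive windows of `period` samples (one per extraction period)."""
    for start in range(0, len(samples), period):
        yield samples[start:start + period]


def peak_amplitude(window: Sequence[float]) -> float:
    """Largest absolute sample value in the window (0.0 for an empty window)."""
    return max((abs(s) for s in window), default=0.0)


def gated_windows(samples: Sequence[float], period: int,
                  threshold: float) -> List[Sequence[float]]:
    """Keep only the windows whose peak amplitude exceeds the threshold."""
    return [w for w in windows(samples, period) if peak_amplitude(w) > threshold]


# Quiet, loud, quiet: only the middle window would be passed to recognition.
speech = gated_windows([0.01, -0.02, 0.5, -0.6, 0.03, 0.0], period=2, threshold=0.1)
```

Gating before recognition avoids running the recognizer on silence, which is one plausible motivation for the amplitude condition.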
[11] In accordance with another aspect of the present invention to accomplish the above objects, there is provided an apparatus for providing a video synthesis call service, the apparatus being connected to a calling video terminal and a called video terminal through a video call network and configured to provide the video synthesis call service, comprising a video call network cooperation unit for receiving a video call signal, transmitted between the calling video terminal and the called video terminal, in cooperation with the video call network; a video call voice extraction unit for extracting a voice signal from the video call signal received from the video call network cooperation unit; a voice recognition processing unit for converting the voice signal extracted by the video call voice extraction unit into a sentence containing one or more words; a sentence word processing unit for extracting respective words from the sentence, converted by the voice recognition processing unit, by separating the sentence into words; an image/video search unit for comparing the words extracted by the sentence word processing unit with images or video stored in advance and selecting, from among the images or video stored in advance, the images or video corresponding to the extracted words; and an image/video synthesis unit for synthesizing the images or video selected by the image/video search unit with the video call signal transmitted between the calling video terminal and the called video terminal, and transmitting the video call signal synthesized with the images or video to the video call network cooperation unit, wherein the video call network cooperation unit transmits the video call signal synthesized with the images or video to at least one of the calling video terminal and the called video terminal.
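The source does not describe how the image/video synthesis unit combines the selected images with the call video. One simple possibility, shown purely as an assumption, is to paste a small image onto each frame at a fixed position, treating some pixels as transparent. The names `overlay`, `Frame`, and `Sprite` are hypothetical, and frames are modeled as nested lists of grayscale values.

```python
from typing import List, Optional

Frame = List[List[int]]                # grayscale pixels, row-major
Sprite = List[List[Optional[int]]]     # None marks a transparent pixel


def overlay(frame: Frame, sprite: Sprite, top: int, left: int) -> Frame:
    """Return a copy of `frame` with `sprite` pasted at (top, left).

    Transparent sprite pixels and pixels falling outside the frame are
    skipped, so the original frame is never modified or overrun.
    """
    out = [row[:] for row in frame]
    for r, sprite_row in enumerate(sprite):
        for c, pixel in enumerate(sprite_row):
            y, x = top + r, left + c
            if pixel is not None and 0 <= y < len(out) and 0 <= x < len(out[y]):
                out[y][x] = pixel
    return out


blank = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
composited = overlay(blank, [[9, None], [None, 9]], top=1, left=1)
```

A production system would more likely blend decoded video frames with an alpha channel, but the control flow would be similar.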
[12] Preferably, the apparatus may further comprise an image/video database for storing images or video corresponding to one or more words.
[13] In accordance with a further aspect of the present invention to accomplish the above objects, there is provided a video terminal for providing a video synthesis call service through a video call network, comprising a video call signal reception unit for receiving a video call signal; a video call voice extraction unit for extracting a voice signal from the video call signal received from the video call signal reception unit; a voice recognition processing unit for converting the voice signal extracted by the video call voice extraction unit into a sentence containing one or more words; a sentence word processing unit for extracting respective words from the sentence output from the voice recognition processing unit by separating the sentence into words; an image/video search unit for comparing the words extracted by the sentence word processing unit with images or video stored in advance and selecting, from among the images or video stored in advance, the images or video corresponding to the extracted words; and an image/video synthesis unit for synthesizing the images or video selected by the image/video search unit with the video call signal, and transmitting the video call signal synthesized with the images or video to a video/voice communication unit, wherein the video/voice communication unit receives the video call signal synthesized with the images or video from the image/video synthesis unit, and transmits the synthesized video call signal to a video terminal of another party.
[14] Preferably, the video terminal may further comprise an image/video database for storing images or video corresponding to one or more words.
[15] Further, the video terminal may further comprise a download processing unit for downloading images or video from an external image/video provision device, and transmitting the images or video to the image/video database.
Advantageous Effects
[16] Accordingly, the present invention can provide a method and apparatus for providing a video synthesis call service, which analyze the content of voices input by a user during a video call conversation in real time, synthesize images or video corresponding to words contained in a voice signal, with a video call signal, transmitted between a calling video terminal and a called video terminal, in real time, and provide a synthesized signal to the terminal of the other party, thus increasing video call users' interest, and transmitting various types of video screens in real time to overcome the simplicity of a typical video call.
[17] Further, the present invention can provide a method and apparatus for providing a video synthesis call service, which automatically provide screens, synthesized with various types of stereographical video or images, to the video terminal of the other party during a call, without requiring the user of a video terminal to perform specific manipulation in each video call.
[18] Furthermore, the present invention can provide a method and apparatus for providing a video synthesis call service, which download video or images corresponding to words, contained in voice content, from an external video or image provision device during a video call conversation, store the downloaded video or images, synthesize the video or images with a video call signal with respect to the same words when subsequently spoken, and transmit a synthesized video call signal, thus enabling various types of video or images to be synthesized.
Brief Description of the Drawings
[19] FIG. 1 is a diagram of the entire construction showing the connection of an apparatus 30 for providing a video synthesis call service using voice recognition, a calling video terminal 10 and a called video terminal 20 according to the present invention;
[20] FIG. 2 is a block diagram showing the detailed construction of the video synthesis call service provision apparatus 30 of FIG. 1;
[21] FIG. 3 is a flowchart showing the procedure of the video synthesis call service of FIGS. 1 and 2;
[22] FIG. 4 is a diagram showing a method of extracting words contained in a voice signal in FIG. 3;
[23] FIG. 5 is a diagram showing a method of searching for images or video corresponding to the extracted words in FIG. 3; and
[24] FIG. 6 is a diagram showing the detailed construction of a video terminal 60 capable of providing a video synthesis call service using voice recognition according to another embodiment of the present invention.
Best Mode for Carrying Out the Invention
[25] Hereinafter, the construction of the present invention will be described in detail with reference to the attached drawings.
[26] FIG. 1 is a diagram of the entire construction showing the connection of an apparatus for providing a video synthesis (video overlay) call service using voice recognition, a calling video terminal and a called video terminal according to the present invention.
[27] Referring to FIG. 1, an apparatus 30 for providing a video synthesis call service according to the present invention is connected to a calling video terminal 10 and a called video terminal 20 through a video call network. The calling video terminal 10 and the called video terminal 20 perform a video call over the video call network, and are connected to the video synthesis call service provision apparatus 30 through the video call network, thus being provided with a video synthesis call service. The video synthesis call service provision apparatus 30 extracts and recognizes a voice signal from the video call signal of the user of the calling video terminal 10 or the called video terminal 20 in real time, detects words contained in the voice signal, searches for images or video corresponding to the detected words, synthesizes found images or video with the video call signal, and transmits the synthesized video call signal to either or both of the calling video terminal 10 and the called video terminal 20.
[28] FIG. 2 is a block diagram showing the detailed construction of the video synthesis call service provision apparatus 30 of FIG. 1.
[29] Referring to FIG. 2, the video synthesis call service provision apparatus 30 includes a video call network cooperation unit 31, a video call voice extraction unit 32, a voice recognition processing unit 33, a sentence word processing unit 34, an image/video search unit 35, an image/video synthesis unit 36 and an image/video database (DB) 37.
[30] The video call network cooperation unit 31 functions to receive a video call signal transmitted between the calling video terminal 10 and the called video terminal 20 in cooperation with the video call network and transmits the received video call signal to the video call voice extraction unit 32, and also functions to transmit a video call signal synthesized by the image/video synthesis unit 36 to the calling or called video terminal over the video call network.
[31] The video call voice extraction unit 32 functions to extract a voice signal from the video call signal received from the video call network cooperation unit 31 by separating the video call signal into a video signal and the voice signal, and to transmit the extracted voice signal to the voice recognition processing unit 33.
[32] The voice recognition processing unit 33 functions to convert the voice signal, received from the video call voice extraction unit 32, into a sentence and to transmit the sentence to the sentence word processing unit 34.
[33] The sentence word processing unit 34 functions to extract one or more words from the sentence, received from the voice recognition processing unit 33, by separating the sentence into one or more words, and to transmit the extracted words to the image/video search unit 35.
[34] The image/video search unit 35 functions to search the image/video DB 37 for images or video corresponding to the words received from the sentence word processing unit 34, and transmit found images or video corresponding to the extracted words to the image/video synthesis unit 36.
[35] The image/video synthesis unit 36 functions to synthesize the images or video, received from the image/video search unit 35, with the video call signal transmitted between the users of the calling video terminal and the called video terminal, and to transmit a video call signal synthesized with the images or video to the video call network cooperation unit 31. That is, the video call signal transmitted between the users of the calling video terminal 10 and the called video terminal 20 is synthesized with the found images or video, and thus the synthesized video call signal is transmitted to either or both of the calling video terminal 10 and the called video terminal 20.
[36] The image/video DB 37 stores in advance words frequently used by typical users (for example, "love", "drink", "meal", "home", etc.) and images or video corresponding to the words, thus enabling the image/video search unit 35 to search for the images or video corresponding to the words received from the sentence word processing unit 34.
[37] FIG. 3 is a flowchart showing the procedure of the video synthesis call service described with reference to FIGS. 1 and 2.
[38] Referring to FIG. 3, when the calling video terminal 10 makes a video call to the called video terminal 20 over the video call network, the video call voice extraction unit 32 of the video synthesis call service provision apparatus 30 extracts only a voice signal from a video call signal at step S301. When the extracted voice signal passes through the voice recognition processing unit 33 and the sentence word processing unit 34, words contained in the voice signal are extracted from the voice signal at step S303. The image/video search unit 35 searches the image/video DB 37 for images or video corresponding to the extracted words at step S305, and transmits the images or video, corresponding to the extracted words and found in the search, to the image/video synthesis unit 36. The image/video synthesis unit 36 synthesizes the received images or video with the video call signal transmitted between the calling video terminal 10 and the called video terminal 20 at step S307, and transmits the video call signal synthesized with the images or video to the video call network cooperation unit 31. The video call network cooperation unit 31 transmits the video call signal, synthesized with the images or video, to either or both of the calling video terminal 10 and the called video terminal 20 at step S309.
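The flow of steps S301 to S309 can be illustrated with a minimal Python sketch. It is not the implementation of the apparatus 30: the `VideoCallSignal` type, the `recognize` callable standing in for the voice recognition processing unit, the dictionary used as the image/video DB, and the file names are all hypothetical, and "synthesis" is reduced to tagging each placeholder frame with the media found.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VideoCallSignal:
    """Hypothetical stand-in for a video call signal (video plus voice)."""
    video: List[str]  # placeholder frames
    voice: bytes      # placeholder audio


def provide_video_synthesis(
    signal: VideoCallSignal,
    recognize: Callable[[bytes], str],
    media_db: Dict[str, str],
) -> VideoCallSignal:
    # S301: extract only the voice signal from the video call signal.
    voice = signal.voice
    # S303: convert the voice into a sentence, then separate it into words.
    sentence = recognize(voice)
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    # S305: look up images or video stored in advance for each word.
    media = [media_db[w] for w in words if w in media_db]
    # S307: synthesize the found media with the video (here: tag each frame).
    frames = [frame + "".join("+" + m for m in media) for frame in signal.video]
    # S309: the returned signal is what would be sent to the terminals.
    return VideoCallSignal(video=frames, voice=voice)
```

Calling `provide_video_synthesis` with a recognizer that returns a fixed sentence and a DB containing only some of its words leaves the voice untouched and marks each frame with the media of the matching words, in the order in which they were spoken.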
[39] FIG. 4 is a diagram showing a method of extracting words contained in the voice signal in FIG. 3.
[40] FIG. 4 illustrates the method (step S303) of extracting the words contained in the voice signal in FIG. 3. When the voice signal, extracted by the video call voice extraction unit 32, is transmitted to the voice recognition processing unit 33, the voice recognition processing unit 33 converts the extracted voice signal into a sentence at step S401, and transmits the sentence to the sentence word processing unit 34. Here, the term 'sentence' means text containing one or more words. Meanwhile, voice recognition technology, used in the procedure for converting a voice signal into a sentence in the present invention, may be implemented using conventional well-known technology. The present invention is not intended to propose voice recognition technology itself, and thus a detailed description thereof is omitted. The sentence word processing unit 34 extracts one or more words from the sentence received from the voice recognition processing unit 33 by separating the sentence into one or more words at step S403, and transmits the extracted words to the image/video search unit 35. In this case, the steps after the above-described step S305 are performed.
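The separation of a sentence into words at step S403 can be sketched as follows. The text does not specify how the sentence word processing unit 34 splits a sentence, so the regular expression and the function name `split_sentence` are assumptions made only for illustration; the input is the sentence shown in the lower portion of FIG. 4.

```python
import re
from typing import List


def split_sentence(sentence: str) -> List[str]:
    """S403 (sketch): separate a recognized sentence into lowercase words.

    Assumption: a word is a run of letters and apostrophes. The source
    does not define word boundaries or any filtering of common words.
    """
    return [word.lower() for word in re.findall(r"[A-Za-z']+", sentence)]


words = split_sentence("Let's drink beer at home today.")
```

This sketch keeps every word of the sentence. Words with no stored image or video are simply not matched in the later search step, so no separate filtering is needed here.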
[41] Meanwhile, an example used to describe steps S401 and S403 is shown in the lower portion of FIG. 4. That is, when the voice signal is extracted and passes through the voice recognition processing unit 33, the voice signal is converted into the sentence "Let's drink beer at home today." In this case, the voice recognition processing unit 33 converts voices into the sentence using voice recognition technology. The sentence word processing unit 34 receives the sentence and separates the sentence into one or more words, thus extracting the words from the sentence. That is, in the above example, the sentence "Let's drink beer at home today" is separated into words, such as "today", "home", and "beer", and thus the words are extracted.
[42] FIG. 5 is a diagram showing a method of searching for images or video corresponding to the extracted words in FIG. 3.
[43] FIG. 5 illustrates a method (step S305) of searching for images or video corresponding to the extracted words in FIG. 3. When the words contained in the voice signal are extracted from the voice signal at step S303, the image/video search unit 35 searches the image/video DB 37 for the images or video corresponding to the extracted words at step S501. The images or video, corresponding to the extracted words and found in the search, are transmitted to the image/video synthesis unit 36 at step S503.
[44] Meanwhile, an example used to describe steps S501 and S503 is shown in the lower portion of FIG. 5. That is, when the extracted words, such as "today", "home", and "beer", are transmitted to the image/video search unit 35, the image/video search unit 35 searches the image/video DB 37 for images or video corresponding to respective words, such as "today", "home", and "beer." It is determined that images or video corresponding to the words "home" and "beer" exist in the image/video DB 37 shown in the lower portion of FIG. 5, but no image or video corresponding to the word "today" exists in the DB. Therefore, the image/video search unit 35 transmits images or video corresponding to the words "home" and "beer" to the image/video synthesis unit 36.
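The search of steps S501 and S503 in this example can be sketched as a dictionary lookup. The dictionary stands in for the image/video DB 37, and the file names and the function name `search_media` are hypothetical; the source does not describe how the DB is organized.

```python
from typing import Dict, List, Tuple

# Hypothetical contents of the image/video DB for the FIG. 5 example:
# entries exist for "home" and "beer" but not for "today".
IMAGE_VIDEO_DB: Dict[str, str] = {
    "home": "home.png",
    "beer": "beer.png",
}


def search_media(
    words: List[str], db: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """S501 (sketch): return the media found per word and the unmatched words."""
    found = {word: db[word] for word in words if word in db}
    missing = [word for word in words if word not in db]
    return found, missing


found, missing = search_media(["today", "home", "beer"], IMAGE_VIDEO_DB)
# S503: only the entries in `found` would be passed on for synthesis.
```

Returning the unmatched words separately is a design choice of this sketch, not something the source requires; it makes the "today" case explicit.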
[45] Meanwhile, the image/video synthesis unit 36 synthesizes the received images or video with the video call signal transmitted between the calling video terminal 10 and the called video terminal 20, and transmits the synthesized video call signal to either or both of the calling video terminal 10 and the called video terminal 20.
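The synthesis (overlay) step can be pictured with a small sketch that pastes an image onto a single video frame. The source does not describe the overlay algorithm, so everything here is an assumption: frames and images are modeled as 2D lists of pixel values, `None` marks a transparent pixel, and the function name `overlay` and the placement arguments are hypothetical.

```python
from typing import List, Optional

Frame = List[List[int]]
Image = List[List[Optional[int]]]


def overlay(frame: Frame, image: Image, top: int, left: int) -> Frame:
    """Return a copy of `frame` with `image` pasted at (top, left).

    Pixels equal to None are treated as transparent, and pixels that
    would fall outside the frame are ignored.
    """
    out = [row[:] for row in frame]  # do not modify the caller's frame
    for r, image_row in enumerate(image):
        for c, pixel in enumerate(image_row):
            y, x = top + r, left + c
            if pixel is not None and 0 <= y < len(out) and 0 <= x < len(out[y]):
                out[y][x] = pixel
    return out
```

In a real system the same operation would be applied to every frame of the video call signal for as long as the image or video is to be shown.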
[46] FIG. 6 is a diagram showing the detailed construction of a video terminal capable of providing a video synthesis call service using voice recognition according to another embodiment of the present invention. Unlike the above embodiment, described with reference to FIGS. 1 to 3, FIG. 6 illustrates a block diagram showing the construction of a video terminal capable of performing a video synthesis call service without requiring a separate video synthesis call service provision apparatus.
[47] Referring to FIG. 6, a video terminal 60 for performing a video synthesis call service using voice recognition includes a video call reception unit 61, a video call voice extraction unit 62, a voice recognition processing unit 63, a sentence word processing unit 64, an image/video search unit 65, an image/video synthesis unit 66, a video/voice communication unit 67, an image/video database (DB) 69, and a download processing unit 68.
[48] The video call reception unit 61 functions to receive a video call signal from the user of the video terminal and transmit the received video call signal to the video call voice extraction unit 62. The video call voice extraction unit 62 functions to extract only a voice signal from the video call signal, received from the video call reception unit 61, by separating the video call signal into a video signal and the voice signal, and to transmit the extracted voice signal to the voice recognition processing unit 63. The voice recognition processing unit 63 functions to convert the voice signal, received from the video call voice extraction unit 62, into a sentence, and to transmit the sentence to the sentence word processing unit 64. The sentence word processing unit 64 functions to extract one or more words from the sentence, received from the voice recognition processing unit 63, by separating the sentence into one or more words, and to transmit the extracted words to the image/video search unit 65. The image/video search unit 65 functions to search the image/video DB 69 for images or video corresponding to the words received from the sentence word processing unit 64, and to transmit found images or video corresponding to the extracted words to the image/video synthesis unit 66. The image/video synthesis unit 66 functions to synthesize the images or video, received from the image/video search unit 65, with the video call signal transmitted between the users of the calling video terminal and the called video terminal, and to transmit the video call signal synthesized with the images or video to the video/voice communication unit 67. The image/video DB 69 functions to store in advance words frequently used by typical users, and images or video corresponding to the words, thus enabling the image/video search unit 65 to search for images or video corresponding to the words received from the sentence word processing unit 64.
The download processing unit 68 functions to download images or video from an external image/video provision device, and to transmit the images or video to the image/video DB 69. Here, the external image/video provision device may be a specific server, a Personal Computer (PC), etc.
[49] Meanwhile, when all call content is analyzed in real time from video call signals transmitted between the calling video terminal 10 and the called video terminal 20 and thus the search and comparison are performed on the call content with reference to the image/video DB 37 in the above embodiments, there may be problems in that the communication quality of a video call may be deteriorated or the results of video synthesis may be transmitted late due to the delay of processing time. In order to solve these problems, a method of extracting voice signals from video call signals at regular periods, rather than extracting voice signals from complete video call signals, may be used. For example, when the period is set to 5 or 10 seconds, the method can be designated to perform a series of procedures, such as procedures for extracting a voice signal, converting the voice signal into a sentence, and extracting words from the sentence at each period during the entire video call conversation. Further, a method of extracting only voice signals, the amplitudes of which are greater than a predetermined level, among voice signals contained in video call signals transmitted between the calling video terminal 10 and the called video terminal 20, may be used. This is used to prevent the above problems by synthesizing images or video corresponding to relevant words with video call signals only when the amplitudes of voice signals are greater than a predetermined level, because a talking person generally has a tendency to pronounce important words in a louder voice than typical words so as to emphasize the important words in content desired to be transmitted.
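The two load-reduction methods described above, extraction at regular periods and extraction only above a predetermined amplitude, can be sketched as follows. The source fixes neither the window length taken at each period nor how amplitude is measured, so `window_s`, the use of the peak absolute sample value, and both function names are assumptions.

```python
from typing import List, Sequence


def periodic_windows(
    samples: Sequence[float], sample_rate: int, period_s: float, window_s: float
) -> List[Sequence[float]]:
    """Take one window of `window_s` seconds at the start of each period."""
    step = int(sample_rate * period_s)
    size = int(sample_rate * window_s)
    if step <= 0 or size <= 0:
        raise ValueError("period and window must cover at least one sample")
    return [samples[i:i + size] for i in range(0, len(samples), step)]


def exceeds_level(window: Sequence[float], level: float) -> bool:
    """True if the peak absolute amplitude of the window is above `level`."""
    return len(window) > 0 and max(abs(s) for s in window) > level
```

Only the windows returned by `periodic_windows`, or only those for which `exceeds_level` is true, would then be passed to voice recognition, which reduces the amount of audio processed during the call.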
[50] Although the construction of the present invention has been disclosed with reference to preferred embodiments of the present invention, those skilled in the art will appreciate that the present invention is not limited to the embodiments, and that various modifications, additions and substitutions are possible, without departing from the scope and spirit of the invention as disclosed in the accompanying claims and the attached drawings.

Claims

[1] A method of providing a video synthesis (video overlay) call service using voice recognition in an apparatus for providing a video synthesis call service using voice recognition, the apparatus being connected to a calling video terminal and a called video terminal through a video call network, comprising the steps of: extracting a voice signal from a video call signal transmitted between the calling video terminal and the called video terminal; extracting one or more words from the extracted voice signal; searching for images or video corresponding to the extracted words; synthesizing (overlaying) the found images or video with the video call signal transmitted between the calling video terminal and the called video terminal; and transmitting the synthesized (overlaid) video call signal to at least one of the calling video terminal and the called video terminal.
[2] The method according to claim 1, wherein the step of extracting the words from the extracted voice signal is performed to convert the extracted voice signal into a sentence and to extract one or more words from the sentence by separating the sentence into one or more words.
[3] The method according to claim 1, wherein, at the step of searching for images or video corresponding to the extracted words, the images or video are stored in advance to correspond to respective words.
[4] The method according to claim 1, wherein the step of extracting the voice signal from the video call signal transmitted between the calling video terminal and the called video terminal is performed at regular periods.
[5] The method according to claim 1, wherein the step of extracting the voice signal from the video call signal transmitted between the calling video terminal and the called video terminal is performed only when an amplitude of the voice signal is greater than a predetermined level.
[6] An apparatus for providing a video synthesis call service, the apparatus being connected to a calling video terminal and a called video terminal through a video call network and configured to provide the video synthesis call service, comprising: a video call network cooperation unit for receiving a video call signal, transmitted between the calling video terminal and the called video terminal, in cooperation with the video call network; a video call voice extraction unit for extracting a voice signal from the video call signal received from the video call network cooperation unit; a voice recognition processing unit for converting the voice signal extracted by the video call voice extraction unit into a sentence containing one or more words; a sentence word processing unit for extracting respective words from the sentence, converted by the voice recognition processing unit, by separating the sentence into words; an image/video search unit for comparing the words extracted by the sentence word processing unit with images or video stored in advance and selecting images or video corresponding to the extracted words; and an image/video synthesis unit for synthesizing the images or video selected by the image/video search unit with the video call signal transmitted between the calling video terminal and the called video terminal, and transmitting the video call signal synthesized with the images or video to the video call network cooperation unit, wherein the video call network cooperation unit transmits the video call signal synthesized with the images or video to at least one of the calling video terminal and the called video terminal.
[7] The apparatus according to claim 6, further comprising an image/video database for storing images or video corresponding to one or more words.
[8] A video terminal for providing a video synthesis call service through a video call network, comprising: a video call signal reception unit for receiving a video call signal; a video call voice extraction unit for extracting a voice signal from the video call signal received from the video call signal reception unit; a voice recognition processing unit for converting the voice signal extracted by the video call voice extraction unit into a sentence containing one or more words; a sentence word processing unit for extracting respective words from the sentence output from the voice recognition processing unit by separating the sentence into words; an image/video search unit for comparing the words extracted by the sentence word processing unit with images or video stored in advance and selecting images or video corresponding to the extracted words; and an image/video synthesis unit for synthesizing the images or video selected by the image/video search unit with the video call signal, and transmitting the video call signal synthesized with the images or video to a video/voice communication unit, wherein the video/voice communication unit receives the video call signal synthesized with the images or video from the image/video synthesis unit, and transmits the synthesized video call signal to a video terminal of another party.
[9] The video terminal according to claim 8, further comprising an image/video database for storing images or video corresponding to one or more words.
[10] The video terminal according to claim 9, further comprising a download processing unit for downloading images or video from an external image/video provision device, and transmitting the images or video to the image/video database.
PCT/KR2008/001268 2007-03-12 2008-03-06 Method and apparatus for providing video synthesizing call service using voice recognition WO2008111760A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR10-2007-0024116 2007-03-12
KR20070024116 2007-03-12
KR10-2007-0062908 2007-06-26
KR1020070062908A KR100893546B1 (en) 2007-03-12 2007-06-26 Method and apparatus for providing video synthesizing call service using voice recognition

Publications (1)

Publication Number Publication Date
WO2008111760A1 true WO2008111760A1 (en) 2008-09-18

Family

ID=39759672

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2008/001268 WO2008111760A1 (en) 2007-03-12 2008-03-06 Method and apparatus for providing video synthesizing call service using voice recognition

Country Status (1)

Country Link
WO (1) WO2008111760A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104023150A (en) * 2013-02-28 2014-09-03 联想(北京)有限公司 Information processing method and electronic device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002215184A (en) * 2001-01-19 2002-07-31 Casio Comput Co Ltd Speech recognition device and program for the same
KR20040028038A (en) * 2002-09-28 2004-04-03 주식회사 케이티 System and method of automatically converting text to image using by language processing technology
US20050102139A1 (en) * 2003-11-11 2005-05-12 Canon Kabushiki Kaisha Information processing method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08723305

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08723305

Country of ref document: EP

Kind code of ref document: A1