KR100868638B1 - System and method for balloon providing during video communication - Google Patents


Info

Publication number
KR100868638B1
Authority
KR
South Korea
Prior art keywords
speech
terminal
voice
mouth
caller
Application number
KR1020070078898A
Other languages
Korean (ko)
Inventor
김진식
Original Assignee
에스케이 텔레콤주식회사
Application filed by 에스케이 텔레콤주식회사
Priority to KR1020070078898A
Application granted
Publication of KR100868638B1


Abstract

A system and a method for providing a speech balloon during a video call are provided: the caller's voice is recognized during the video call, the recognized words and sentences are rendered in a speech balloon, and the result is transmitted to the counterpart's mobile terminal. A speech balloon DB stores the speech balloon shapes designated by each mobile terminal to be displayed together with voice-recognized words and sentences during the video call. A user voice/face image storage (320) stores the voice and face image used to identify the terminal user's voice and face image. A voice recognition processing part (340) recognizes the voice received from the terminal through the audio logical channel and converts it into words and sentences. A face recognition and mouth shape analysis part (350) infers the user's words and sentences through mouth shape analysis.

Description

Video call speech balloon providing system and method {SYSTEM AND METHOD FOR BALLOON PROVIDING DURING VIDEO COMMUNICATION}

The present invention relates to a system and method for providing a video call speech balloon, and more particularly, to a system and method for providing a video call speech balloon so that the voice of a caller can be displayed as text in a speech balloon during a video call.

With the rapid development of mobile communication technology and infrastructure, mobile communication terminals now provide various supplementary services such as Internet search, wireless data communication, electronic organizers, and video calls in addition to ordinary voice calls.

Among these, the video call function lets callers talk while exchanging images captured by their cameras; conventionally, it is implemented to simply transmit the caller's voice during the video call.

Accordingly, in conventional video calls, the emotions, moods, and feelings contained in the caller's words cannot be conveyed at the same time, and the words may be transmitted unclearly due to the RF environment or ambient noise, so the caller's feelings fail to reach the other party and it is difficult to accurately convey the meaning of the words.

In addition, when two or more callers are photographed simultaneously with the single camera provided in a terminal and the image is delivered to the other party, and the two speak at the same time, the other party may not know exactly who is speaking.

The present invention has been made to solve the above problems. An object of the present invention is to provide a video call speech balloon providing system and method that recognize the caller's voice during a video call, convert it into words and sentences, and transmit the converted words and sentences together with a speech balloon to the counterpart terminal, so that the other party can check the caller's speech through the displayed words and sentences.

According to an aspect of the present invention, a video call speech balloon providing system includes: a speech balloon DB for storing the speech balloon shapes designated for each terminal to be displayed together with voice-recognized words and sentences during a video call; a user voice and face image storage unit for receiving and storing, for each terminal, the voice used to identify the terminal user's voice in two-or-more-caller mode and the face image used to identify the terminal user's face image; a video call session processing unit which, when a session setup request is received from a terminal that had been performing a video call over a session with a counterpart terminal, establishes a session with the terminal according to the request and requests session setup from the counterpart terminal based on the identification information of the counterpart terminal provided by the terminal, thereby establishing a session with the counterpart terminal; a speech recognition processing unit for recognizing the voice received from the terminal through an audio logical channel and converting it into words and sentences; a face recognition and mouth shape analysis unit which extracts face images from the video received from the terminal through a video logical channel, determines the position of each mouth by analyzing the pattern of each extracted face image, and infers the words and sentences spoken by each caller through mouth shape analysis; and a speech balloon display processing unit for displaying the speech balloon shape retrieved from the speech balloon DB, together with the words and sentences recognized by the speech recognition processing unit, at the mouth position identified by the face recognition and mouth shape analysis unit and transmitting the result to the counterpart terminal.

Further, the speech balloon DB preferably comprises a manual speech balloon DB which stores the speech balloon shapes used in the manual speech balloon mode matched one-to-one with the key button value assigned to each shape, and an automatic speech balloon DB which stores the speech balloon shapes used in the automatic speech balloon mode matched with the words and sentences designated for each shape.
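The two databases described above amount to simple lookup tables. The following Python sketch is illustrative only; the names and balloon shapes are hypothetical, not taken from the patent.

```python
# Hypothetical sketch of the two speech balloon DBs (names are illustrative).

# Manual mode: each key button value is matched one-to-one with a balloon shape.
MANUAL_BALLOON_DB = {
    "1": "angry_balloon",
    "2": "smiling_balloon",
    "3": "pleasant_balloon",
}

# Automatic mode: each balloon shape is matched with designated words/sentences.
AUTO_BALLOON_DB = {
    "angry_balloon": {"heat", "anger"},
    "pleasant_balloon": {"good", "happy"},
}

def lookup_manual(key_button):
    """Return the balloon shape designated by a key button value, if any."""
    return MANUAL_BALLOON_DB.get(key_button)
```

A key button with no designated shape simply yields no balloon, leaving the video unchanged.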

The speech recognition processing unit preferably includes a one-person mode speech recognition processing unit which recognizes the voice received through the audio logical channel in one-person mode and converts it into words and sentences, and a two-or-more mode speech recognition processing unit which, in two-or-more-caller mode, separates the two or more voices received through the audio logical channel by waveform analysis, converts each separated voice into words and sentences through voice recognition, and identifies the terminal user's voice among the separated voices by comparing them with the terminal user voices stored in the user voice and face image storage unit.
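One way to realize the comparison of the separated voices against the stored terminal user voice is to compare feature vectors by cosine similarity. This is a sketch under that assumption; the patent does not specify the comparison metric, and `identify_user_voice` is a hypothetical name.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def identify_user_voice(separated_voices, stored_user_voice, threshold=0.9):
    """Return the index of the separated voice whose features best match the
    stored terminal user voice, or None if no score exceeds the threshold."""
    best_index, best_score = None, threshold
    for i, features in enumerate(separated_voices):
        score = cosine_similarity(features, stored_user_voice)
        if score > best_score:
            best_index, best_score = i, score
    return best_index
```

Any remaining (unmatched) voices would be attributed to the other callers in the frame.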

The speech balloon display processing unit preferably includes a one-person mode speech balloon display processing unit which displays the speech balloon shape retrieved from the speech balloon DB, together with the words and sentences recognized by the one-person mode speech recognition processing unit, at the mouth position identified by the face recognition and mouth shape analysis unit; and a two-or-more mode speech balloon display processing unit which, when only one person speaks at a time in two-or-more-caller mode, identifies the speaking caller from the mouth movement analysis of the face recognition and mouth shape analysis unit and displays the speech balloon shape retrieved from the speech balloon DB, together with the words and sentences recognized by the two-or-more mode speech recognition processing unit, at that caller's mouth position, and which, when two or more people speak at the same time, identifies each caller's voice based on the terminal user voice information recognized by the two-or-more mode speech recognition processing unit and the terminal user face image identified by the face recognition and mouth shape analysis unit, and then displays the retrieved speech balloon shape and the recognized words and sentences at the mouth position of each caller whose voice was identified.

Meanwhile, a video call speech balloon providing method according to an embodiment of the present invention includes: a first process of, when a terminal that had been performing a video call over a session with a counterpart terminal terminates that session in order to use the speech balloon function and requests session setup, establishing sessions with the terminal and the counterpart terminal, respectively; a second process of, when the speech balloon mode is set to the one-person manual speech balloon mode, retrieving from the manual speech balloon DB the speech balloon shape designated by the speech-balloon-shape selection key button value received from the terminal through the control logical channel, displaying the retrieved shape at the mouth of the caller identified by the face recognition and mouth shape analysis unit, displaying the words and sentences recognized by the one-person mode speech recognition processing unit inside the shape, and transmitting the result to the counterpart terminal; and a third process of, when the speech balloon mode is set to the one-person automatic speech balloon mode, converting the voice received through the audio logical channel into words and sentences through voice recognition in the one-person mode speech recognition processing unit, retrieving from the automatic speech balloon DB the speech balloon shape matching the recognized words and sentences, displaying the retrieved shape at the mouth of the caller identified by the face recognition and mouth shape analysis unit, displaying the recognized words and sentences inside the shape, and transmitting the result to the counterpart terminal.

In addition, the first process may include: establishing, in the video call session processing unit that received the session setup request from the terminal, a session with the terminal according to the request; and requesting session setup from the counterpart terminal using the identification information of the counterpart terminal provided by the terminal with the session setup request.

The second process may include: retrieving, in the one-person mode speech balloon display processing unit, the speech balloon shape designated by the speech-balloon-shape selection key button value received through the control logical channel from the manual speech balloon DB; converting, in the one-person mode speech recognition processing unit, the voice received through the audio logical channel into words and sentences through voice recognition; determining the position of the mouth by analyzing the pattern of the face image contained in the video received through the video logical channel in the face recognition and mouth shape analysis unit; displaying, in the one-person mode speech balloon display processing unit, the retrieved speech balloon shape at the mouth of the caller identified by the face recognition and mouth shape analysis unit; and displaying the voice-recognized words and sentences inside the speech balloon shape and transmitting the result to the counterpart terminal through the video logical channel.

The third process may include: converting, in the one-person mode speech recognition processing unit, the voice received through the audio logical channel into words and sentences through voice recognition; determining the position of the mouth by analyzing the pattern of the face image contained in the video received through the video logical channel in the face recognition and mouth shape analysis unit; retrieving, in the one-person mode speech balloon display processing unit, the speech balloon shape matching the voice-recognized words and sentences from the automatic speech balloon DB; displaying the retrieved speech balloon shape at the mouth of the caller identified by the face recognition and mouth shape analysis unit; and displaying the voice-recognized words and sentences inside the speech balloon shape and transmitting the result to the counterpart terminal through the video logical channel.

Meanwhile, a video call speech balloon providing method according to another embodiment of the present invention includes: a first process of, when a terminal that had been performing a video call over a session with a counterpart terminal terminates that session in order to use the speech balloon function and requests session setup, establishing sessions with the terminal and the counterpart terminal, respectively; a second process of, when the speech balloon mode is set to the two-or-more manual speech balloon mode, retrieving from the manual speech balloon DB the speech balloon shape designated by the speech-balloon-shape selection key button value received from the terminal through the control logical channel, identifying the voice of each caller, displaying the speech balloon shape and the voice-recognized words and sentences at the mouth of the caller whose voice was recognized, and transmitting the result to the counterpart terminal; and a third process of, when the speech balloon mode is set to the two-or-more automatic speech balloon mode, converting the voices received through the audio logical channel into words and sentences through voice recognition in the two-or-more mode speech recognition processing unit, retrieving from the automatic speech balloon DB the speech balloon shapes matching the recognized words and sentences, identifying the voice of each caller, displaying the retrieved speech balloon shape and the recognized words and sentences at the mouth of the caller whose voice was recognized, and transmitting the result to the counterpart terminal.

Further, the second process may include: retrieving, in the two-or-more mode speech balloon display processing unit, the speech balloon shape designated by the speech-balloon-shape selection key button value received through the control logical channel from the manual speech balloon DB; when two or more voices are received simultaneously through the audio logical channel and the system is set to identify each caller's voice through mouth movement analysis, separating the voices received through the audio logical channel by waveform in the two-or-more mode speech recognition processing unit and converting each separated voice into words and sentences through voice recognition; extracting, in the face recognition and mouth shape analysis unit, the face images contained in the video received through the video logical channel, determining the position of each caller's mouth by analyzing the pattern of each face image, and inferring the words and sentences spoken by each caller by analyzing the mouth movements; identifying each caller's voice by comparing the inferred words and sentences with the voice-recognized words and sentences; and displaying the retrieved speech balloon shape and the voice-recognized words and sentences at the mouth of each caller according to the identified voice and transmitting the result to the counterpart terminal through the video logical channel.
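The step of comparing mouth-shape-inferred words against voice-recognized words can be approximated by a simple word-overlap score. This sketch is a hypothetical reading of that step; the patent does not describe the exact matching rule.

```python
def attribute_speech(recognized_words, inferred_words_per_caller):
    """Attribute a voice-recognized utterance to the caller whose
    mouth-shape-inferred words overlap it the most.

    Returns None when no caller's inferred words overlap at all."""
    recognized = {w.lower() for w in recognized_words}
    best_caller, best_overlap = None, 0
    for caller, inferred in inferred_words_per_caller.items():
        overlap = len(recognized & {w.lower() for w in inferred})
        if overlap > best_overlap:
            best_caller, best_overlap = caller, overlap
    return best_caller
```

In practice, lip-reading is imprecise, so a partial overlap (rather than exact equality) is the natural criterion here.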

In addition, when two or more voices are received simultaneously through the audio logical channel and the system is set to identify each caller's voice through waveform analysis, the method may further include: separating the voices received through the audio logical channel by waveform in the two-or-more mode speech recognition processing unit, identifying the terminal user's voice among the two or more voices by comparing the separated voices with the terminal user voices stored in the user voice and face image storage unit, and converting each separated voice into words and sentences through voice recognition; extracting, in the face recognition and mouth shape analysis unit, the face images contained in the video received through the video logical channel, identifying the terminal user's face image among the two or more face images by comparing each extracted face image with the terminal user face images stored in the user voice and face image storage unit, and determining the position of each caller's mouth through pattern analysis of each face image; determining each caller's voice based on the identification results for the terminal user's voice and face image; and displaying the retrieved speech balloon shape and the voice-recognized words and sentences at the mouth of each caller according to the identified voice and transmitting the result to the counterpart terminal through the video logical channel.

When only one voice is received at a time through the audio logical channel, the method may further include: converting the received voice into words and sentences through voice recognition; extracting the face images contained in the video received through the video logical channel in the face recognition and mouth shape analysis unit and determining the position of each caller's mouth through pattern analysis of each face image; and identifying the caller who is currently speaking by analyzing mouth movements, displaying the retrieved speech balloon shape and the voice-recognized words and sentences at the speaking caller's mouth, and transmitting the result to the counterpart terminal through the video logical channel.
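When only one voice is present, the speaking caller can be picked as the one showing the largest mouth movement. A minimal sketch, assuming a per-caller motion metric has already been computed upstream by the mouth shape analysis:

```python
def active_talker(mouth_motion_by_caller):
    """Return the caller whose mouth movement metric is largest, on the
    assumption that the person currently speaking moves their mouth most."""
    if not mouth_motion_by_caller:
        return None
    return max(mouth_motion_by_caller, key=mouth_motion_by_caller.get)
```

The single recognized utterance is then attached to this caller's balloon.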

In the third process, when two or more voices are received simultaneously through the audio logical channel and the system is set to identify each caller's voice through mouth movement analysis, the method may include: separating the voices received through the audio logical channel by waveform in the two-or-more mode speech recognition processing unit, converting each separated voice into words and sentences through voice recognition, and retrieving from the automatic speech balloon DB the speech balloon shapes matching the recognized words and sentences; extracting, in the face recognition and mouth shape analysis unit, the face images contained in the video received through the video logical channel, determining the position of each caller's mouth by analyzing the pattern of each face image, and inferring the words and sentences spoken by each caller by analyzing the mouth movements; identifying each caller's voice by comparing the inferred words and sentences with the voice-recognized words and sentences; and displaying the retrieved speech balloon shape and the voice-recognized words and sentences at the mouth of each caller according to the identified voice and transmitting the result to the counterpart terminal through the video logical channel.

Here, when two or more voices are received simultaneously through the audio logical channel and the system is set to identify each caller's voice through waveform analysis, the method may include: separating the voices received through the audio logical channel by waveform in the two-or-more mode speech recognition processing unit and identifying the terminal user's voice among the two or more voices by comparing the separated voices with the terminal user voices stored in the user voice and face image storage unit; converting each separated voice into words and sentences through voice recognition and retrieving from the automatic speech balloon DB the speech balloon shapes matching the recognized words and sentences; extracting, in the face recognition and mouth shape analysis unit, the face images contained in the video received through the video logical channel, identifying the terminal user's face image among the two or more face images by comparing each extracted face image with the terminal user face images stored in the user voice and face image storage unit, and determining the position of each caller's mouth through pattern analysis of each face image; determining each caller's voice based on the identification results for the terminal user's voice and face image; and displaying the retrieved speech balloon shape and the voice-recognized words and sentences at the mouth of each caller according to the identified voice and transmitting the result to the counterpart terminal through the video logical channel.

When only one voice is received at a time through the audio logical channel, the method may further include: converting the received voice into words and sentences through voice recognition and retrieving from the automatic speech balloon DB the speech balloon shape matching the recognized words and sentences; extracting the face images contained in the video received through the video logical channel in the face recognition and mouth shape analysis unit and determining the position of each caller's mouth through pattern analysis of each face image; and identifying the caller who is currently speaking by analyzing mouth movements, displaying the retrieved speech balloon shape and the voice-recognized words and sentences at the speaking caller's mouth, and transmitting the result to the counterpart terminal through the video logical channel.

According to the video call speech balloon providing system and method of the present invention, the caller's voice is recognized during a video call and converted into words and sentences, and the converted words and sentences are displayed on the image inside a speech balloon designated by the user or matched to the recognized words and sentences, so that the caller's words, emotions, and feelings can be delivered to the other party. In addition, when two or more people are photographed by one camera and the image is delivered to the other party, each voice is separated and recognized, and the recognized words and sentences are displayed on the image in each caller's speech balloon, so that the other party can tell exactly who is speaking.

Hereinafter, a system and method for providing a video call speech balloon according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 schematically shows the configuration of a communication network including a video call speech balloon providing system according to an embodiment of the present invention. The video call speech balloon providing system according to the present invention can be applied to any network capable of video calls, such as a WCDMA (Wideband Code Division Multiple Access) network, a Long Term Evolution (LTE) network, a High Speed Downlink Packet Access (HSDPA) network, a CDMA 2000 1x EV-DO network, and an IP network. The system can also be applied to wired networks and used during video calls over a wired network.

In FIG. 1, the calling terminal 100 and the called terminal 200 connect to the video call speech balloon providing system 300 in advance via wireless data communication and each select and designate the speech balloon shapes to be used in the manual speech balloon mode, assigning each shape to a key button (0-9, *, #). For example, as shown in FIG. 2, the first key may designate a speech balloon for anger, the second key a speech balloon for smiling, the third key a speech balloon for pleasure, the fourth key a speech balloon for depression, and the fifth key a speech balloon for telling the truth.

Here, when the terminal user wants more speech balloon shapes than there are key buttons, a shape may be designated by a combination of two or more buttons.

In addition, the calling terminal 100 and the called terminal 200 connect to the video call speech balloon providing system 300 in advance and designate the speech balloon shapes to be displayed automatically according to the words and sentences recognized in the automatic speech balloon mode. For example, if a recognized word or sentence contains "heat" or "anger", the angry speech balloon is displayed; if it contains "good" or "happy", the pleasant speech balloon is displayed.
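The keyword-to-balloon matching in the example above can be sketched as follows; the trigger words and shape names mirror the example but are otherwise hypothetical.

```python
# Trigger words per balloon shape, mirroring the example in the text.
AUTO_TRIGGERS = {
    "angry_balloon": {"heat", "anger"},
    "pleasant_balloon": {"good", "happy"},
}

def select_auto_balloon(recognized_words):
    """Return the first balloon shape whose trigger words appear among the
    voice-recognized words, or None when no trigger matches."""
    words = {w.lower() for w in recognized_words}
    for shape, triggers in AUTO_TRIGGERS.items():
        if words & triggers:
            return shape
    return None
```

When no trigger matches, a neutral default balloon (or none at all) could be shown instead.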

In addition, the calling terminal 100 and the called terminal 200 connect to the video call speech balloon providing system 300 in advance and register the unique voice and face image of each terminal user, so that the voice of each caller can be separated when two or more callers are photographed with the single camera provided in the terminal and the image is delivered to the other party.

Meanwhile, after the calling terminal 100 establishes a session with the called terminal 200 by placing a video call, if the speech balloon function is selected by the calling or called terminal user during the video call, the terminal requesting the speech balloon function terminates the session established between the calling terminal 100 and the called terminal 200 so that the video call can instead be performed through the video call speech balloon providing system 300, and the session is then re-established through the system 300.

For example, when the speech balloon function is selected by the calling terminal user, the calling terminal 100 disconnects the session established with the called terminal 200 and then requests session setup from the video call speech balloon providing system 300. The system 300, having established a session with the calling terminal 100 at its request, requests session setup from the called terminal 200 based on the identification information of the called terminal 200 received from the calling terminal 100 and establishes a session with it, so that the calling terminal 100 and the called terminal 200 can perform the video call through the video call speech balloon providing system 300.
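The session re-establishment described above amounts to replacing the direct terminal-to-terminal session with two relay-side sessions. A minimal state sketch (class and method names are hypothetical, not from the patent):

```python
class BalloonRelay:
    """Hypothetical model of the relay-side session handling: the system
    establishes one session with the requesting terminal and one with the
    counterpart terminal identified in the request."""

    def __init__(self):
        self.sessions = set()

    def handle_setup_request(self, requesting_terminal, counterpart_id):
        # Establish a session with the terminal that sent the request.
        self.sessions.add(requesting_terminal)
        # Request session setup from the counterpart using its identifier.
        self.sessions.add(counterpart_id)
        return sorted(self.sessions)
```

Once both sessions exist, all audio, video, and control traffic flows through the relay, which is what lets it overlay balloons on the video stream.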

As described above, the calling terminal 100 and the called terminal 200 connected to the video call speech balloon providing system 300 transmit the image captured by the camera provided in the terminal to the system 300 through the video logical channel, and the voice captured by the microphone provided in the terminal through the audio logical channel; when the speech balloon mode is set to the manual speech balloon mode, the key button value entered by the terminal user to select a speech balloon shape is transmitted to the system 300 through the control logical channel.

Meanwhile, when the video call speech balloon providing system 300 receives a session setup request from the calling terminal 100 or the called terminal 200 wishing to use the speech balloon function, it establishes a session with the requesting terminal and then requests session setup from the counterpart terminal based on the identification information of the counterpart terminal received from the requesting terminal.

As described above, when the speech balloon mode is set to the manual speech balloon mode, the video call speech balloon providing system 300, having established sessions with the calling terminal 100 and the called terminal 200, receives the speech-balloon-shape key button value from the calling terminal 100 or the called terminal 200 through the control logical channel, relays the caller's voice received in real time through the audio logical channel to the counterpart terminal, and at the same time converts the received voice into words and sentences through voice recognition. It then displays, on the image received in real time from the calling terminal 100 or the called terminal 200 through the video logical channel, the speech balloon shape corresponding to the key button value received through the control logical channel together with the voice-recognized words and sentences, and transmits the result to the counterpart terminal through the video logical channel.

In addition, when the speech balloon mode is set to the automatic speech balloon mode, the video call speech balloon providing system 300 relays the caller's voice received in real time from the calling terminal 100 or the called terminal 200 through the audio logical channel to the counterpart terminal, and at the same time converts the received voice into words and sentences through voice recognition. It then retrieves the speech balloon shape assigned to the converted words and sentences, displays the retrieved speech balloon shape and the voice-recognized words and sentences on the image received in real time through the video logical channel, and transmits the result to the counterpart terminal through the video logical channel.

In addition, in the two-or-more-caller mode, in which two or more callers are photographed together and transmitted from the calling terminal 100 or the called terminal 200 through the video logical channel, when the voices of two or more callers are received simultaneously through the audio logical channel, the video call speech balloon providing system 300 separates the received voices by waveform and compares the separated voice waveforms with the terminal user's voice waveform registered in advance, thereby identifying the terminal user's voice among them; it also analyzes the face images contained in the video received through the video logical channel and compares them with the face image registered in advance by the terminal user. In this way, when two or more callers speak at the same time, the system recognizes each caller's voice and displays a speech balloon containing the recognized words and sentences next to the face image of the caller whose voice was recognized.

In addition, when only one person speaks at a time in the two-or-more-caller mode, the video call speech balloon providing system 300 analyzes the mouth shapes in the video received through the video logical channel to determine who is speaking and displays a speech balloon containing the recognized words and sentences next to that caller's face image. When two or more callers speak at the same time, the system analyzes the mouth shape of each caller in the received video, infers the words and sentences spoken by each caller, compares the inferred words and sentences with the words and sentences recognized from the voices received through the audio logical channel, identifies each speaker, and displays a speech balloon containing the recognized words and sentences next to the face image of the caller identified as the speaker.

As described above, when displaying a speech balloon shape and recognized words and sentences on the image, the video call speech balloon providing system 300 preferably displays them at the mouth position determined through pattern analysis of the caller's face image.
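Placing the balloon at the detected mouth position reduces to simple coordinate arithmetic once a detector has returned a mouth bounding box. A sketch, with the detector itself assumed to exist upstream and the pixel offsets chosen arbitrarily for illustration:

```python
def balloon_anchor(mouth_box, frame_w, frame_h, balloon_w=120, balloon_h=60):
    """Given a mouth bounding box (x, y, w, h) from a face/mouth detector,
    return the top-left corner for a balloon beside the mouth, clamped so
    the balloon stays fully inside the frame."""
    x, y, w, h = mouth_box
    bx = min(max(x + w + 10, 0), frame_w - balloon_w)  # to the right of the mouth
    by = min(max(y - balloon_h // 2, 0), frame_h - balloon_h)  # vertically centered
    return bx, by
```

The clamping matters near the frame edges: a caller at the right edge would otherwise get a balloon drawn partly off-screen.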

FIG. 3 schematically shows the internal configuration of the video call speech balloon providing system applied to the present invention, which includes a speech balloon DB 310, a user voice and face image storage unit 320, a video call session processing unit 330, a speech recognition processing unit 340, a face recognition and mouth shape analysis unit 350, and a speech balloon display processing unit 360.

In this configuration, the speech balloon DB 310 stores, for each of the calling terminal 100 and the called terminal 200 connected by wireless data communication, the speech balloon shapes designated to be displayed together with the words and sentences recognized from speech during a video call. It includes a manual speech balloon DB 313, which stores the speech balloon shapes used in the manual speech balloon mode, and an automatic speech balloon DB 315, which stores the speech balloon shapes used in the automatic speech balloon mode.

The manual speech balloon DB 313 stores each speech balloon shape used in the manual speech balloon mode matched one-to-one with the key button value assigned to that shape.

The automatic speech balloon DB 315 stores each speech balloon shape used in the automatic speech balloon mode matched with the words and sentences designated for that shape.
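As an illustration only, the two databases behave like simple lookup tables: the manual DB maps a key button value one-to-one to a balloon shape, while the automatic DB maps designated words to a balloon shape. The shape names, keywords, and default shape below are assumptions for the sketch, not details taken from the patent.

```python
# Hypothetical sketch of the manual (313) and automatic (315) speech balloon DBs.
# Shape names, keywords, and the "round" default are illustrative assumptions.

MANUAL_BALLOON_DB = {  # key button value -> balloon shape (one-to-one)
    "1": "round",
    "2": "cloud",
    "3": "spiky",
}

AUTOMATIC_BALLOON_DB = {  # designated word -> balloon shape
    "hello": "round",
    "wow": "spiky",
    "hmm": "cloud",
}

def lookup_manual(key_button):
    """Manual mode: the balloon shape assigned to the pressed key button."""
    return MANUAL_BALLOON_DB.get(key_button, "round")  # fall back to a default

def lookup_automatic(sentence):
    """Automatic mode: the balloon shape matching a designated word in the
    recognized sentence; a default shape is used when nothing matches."""
    for word in sentence.lower().split():
        if word in AUTOMATIC_BALLOON_DB:
            return AUTOMATIC_BALLOON_DB[word]
    return "round"
```

In this sketch a miss in either table falls back to a plain default shape, so a balloon can always be drawn even for unrecognized input.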

Meanwhile, the user voice and face image storage unit 320 receives, from each of the calling terminal 100 and the called terminal 200 connected by wireless data communication, the terminal user's voice, used to separate the terminal user's voice in the two-or-more-person mode, and the terminal user's face image, used to separate the terminal user's face image, and stores them for each terminal user. In addition to the terminal user's own voice and face image, the voices and face images of acquaintances who may join a video call with the other party may also be stored for each terminal user.

When the video call session processing unit 330 receives a session setup request from the calling terminal 100 or the called terminal 200 wishing to use the speech balloon function, it sets up a session with the requesting terminal and, based on the identification information of the counterpart terminal provided by the requesting terminal, requests a session setup from the counterpart terminal to establish a session with it.
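The relay role of the session processing unit can be sketched as follows. This is a minimal illustration under assumed types (the `SessionRequest` fields and the dictionary bookkeeping are hypothetical, not the patented signaling):

```python
# Hypothetical sketch of the session relay performed by the video call
# session processing unit (330): set up a session with the requesting
# terminal, then with the counterpart identified in the request.

from dataclasses import dataclass, field

@dataclass
class SessionRequest:
    requester_id: str
    counterpart_id: str  # identification info of the other terminal

@dataclass
class SessionProcessor:
    sessions: dict = field(default_factory=dict)  # terminal id -> peer id

    def handle_setup_request(self, req):
        # Session with the requesting terminal
        self.sessions[req.requester_id] = req.counterpart_id
        # Session requested from the counterpart terminal
        self.sessions[req.counterpart_id] = req.requester_id

    def peer_of(self, terminal_id):
        """The terminal to which relayed video/audio should be forwarded."""
        return self.sessions[terminal_id]
```

Once both sessions exist, media received from either side is simply forwarded to `peer_of(sender)` after the balloon overlay is applied.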

The video call session processing unit 330 receives the video over the video logical channel and the audio over the audio logical channel from the calling terminal 100 or the called terminal 200 with which a session has been set up, and transmits them to the counterpart terminal.

Meanwhile, the speech recognition processing unit 340 recognizes the voice received from the calling terminal 100 or the called terminal 200 over the audio logical channel and converts it into words and sentences. When a single voice is received over the audio logical channel, the one-person mode speech recognition processing unit 343 converts the voice into words and sentences through speech recognition; when two or more voices are received over the audio logical channel, the two-or-more-person mode speech recognition processing unit 345 separates and analyzes the voice waveforms and converts each separated voice into words and sentences through speech recognition.

As described above, the two-or-more-person mode speech recognition processing unit 345 separates the two or more voice waveforms received over the audio logical channel, compares the separated waveforms with the terminal user's voice waveform stored in the user voice and face image storage unit 320 to identify the terminal user's voice among them, and converts each separated voice into words and sentences through speech recognition.
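The comparison of separated waveforms against the enrolled user waveform could be sketched as a similarity search. This is only an illustration: a production system would use speaker embeddings or spectral features, whereas the toy version below scores raw sample lists with cosine similarity.

```python
# Hypothetical sketch of identifying the terminal user's voice among
# separated voices (unit 345): score each separated waveform against the
# enrolled user waveform and pick the best match.

import math

def cosine_similarity(a, b):
    """Cosine similarity of two waveform segments (truncated to equal length)."""
    n = min(len(a), len(b))
    dot = sum(x * y for x, y in zip(a[:n], b[:n]))
    na = math.sqrt(sum(x * x for x in a[:n]))
    nb = math.sqrt(sum(y * y for y in b[:n]))
    return dot / (na * nb) if na and nb else 0.0

def identify_user_voice(separated_voices, enrolled_voice):
    """Index of the separated voice most similar to the enrolled user's voice."""
    scores = [cosine_similarity(v, enrolled_voice) for v in separated_voices]
    return scores.index(max(scores))
```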

The face recognition and mouth shape analysis unit 350 extracts face images through face recognition from the video received from the calling terminal 100 or the called terminal 200 over the video logical channel, and determines the mouth position through pattern analysis of each extracted face image. When two or more face images are extracted through face recognition, that is, when the video transmitted over the video logical channel contains two or more face images, the unit compares each extracted face image with the terminal user's face image stored in the user voice and face image storage unit 320 to identify the terminal user's face image among them, analyzes the mouth movements to identify the caller who is speaking, and infers the words and sentences spoken by each caller through mouth shape analysis.
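The "mouth position through pattern analysis" step can be approximated, for illustration, by the common heuristic that the mouth lies in the lower third of a detected face bounding box. The patent does not specify the detector or heuristic; the sketch below simply assumes a face box is already available.

```python
# Hypothetical sketch of locating the mouth from a detected face region
# (unit 350). The lower-third heuristic is an illustrative assumption.

from typing import NamedTuple

class Box(NamedTuple):
    x: int  # top-left corner
    y: int
    w: int  # width and height of the face bounding box
    h: int

def mouth_position(face):
    """Approximate (x, y) centre of the mouth within a face bounding box."""
    cx = face.x + face.w // 2           # horizontally centred in the face
    cy = face.y + (face.h * 5) // 6     # within the lower third of the face
    return cx, cy
```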

The speech balloon display processing unit 360 displays, at the mouth position determined by the face recognition and mouth shape analysis unit 350, either the speech balloon shape selected by the key button value received over the control logical channel or the speech balloon shape matching the words and sentences recognized by the speech recognition processing unit 340, together with those recognized words and sentences, and transmits the result to the counterpart terminal. It includes a one-person mode speech balloon display processing unit 363 and a two-or-more-person mode speech balloon display processing unit 365.
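Anchoring a balloon to the mouth position while keeping it on screen is a small geometry problem. The sketch below is a hypothetical illustration (frame size, balloon metrics, and the 10-pixel offset are assumptions, not values from the patent):

```python
# Hypothetical sketch of placing a speech balloon near a recognized mouth
# position (unit 360), clamped so the balloon stays inside the frame.

def place_balloon(mouth, balloon_w, balloon_h, frame_w, frame_h):
    """Top-left corner for a balloon anchored just above the mouth."""
    x = mouth[0] - balloon_w // 2
    y = mouth[1] - balloon_h - 10          # 10 px above the mouth (assumed)
    x = max(0, min(x, frame_w - balloon_w))  # clamp inside the frame
    y = max(0, min(y, frame_h - balloon_h))
    return x, y

def render_caption(shape, text, pos):
    """Bundle what the overlay step needs before sending to the peer."""
    return {"shape": shape, "text": text, "position": pos}
```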

When the speech balloon mode is set to the manual speech balloon mode, the one-person mode speech balloon display processing unit 363 searches the manual speech balloon DB 313 for the speech balloon shape corresponding to the key button value received over the control logical channel; when the speech balloon mode is set to the automatic speech balloon mode, it searches the automatic speech balloon DB 315 for the speech balloon shape matching the words and sentences recognized by the one-person mode speech recognition processing unit 343. It then displays the speech balloon shape retrieved from the manual speech balloon DB 313 or the automatic speech balloon DB 315 together with the recognized words and sentences at the mouth position determined by the face recognition and mouth shape analysis unit 350 and transmits the result to the counterpart terminal.

In addition, when only one caller speaks at a time in the two-or-more-person mode, the two-or-more-person mode speech balloon display processing unit 365 identifies the speaking caller based on the result received from the face recognition and mouth shape analysis unit 350, which recognizes and analyzes the mouth shapes in the video received over the video logical channel, and then displays the speech balloon shape retrieved from the manual speech balloon DB 313 or the automatic speech balloon DB 315 together with the words and sentences recognized by the two-or-more-person mode speech recognition processing unit 345 at the mouth position determined by the face recognition and mouth shape analysis unit 350, and transmits the result to the counterpart terminal.

When displaying the speech balloon shape retrieved from the manual speech balloon DB 313 or the automatic speech balloon DB 315 together with the words and sentences recognized by the two-or-more-person mode speech recognition processing unit 345 on the video, the two-or-more-person mode speech balloon display processing unit 365 preferably displays the speech balloon with the recognized words and sentences at the mouth position of the caller identified as the speaker.

In addition, when two or more callers speak simultaneously in the two-or-more-person mode, the two-or-more-person mode speech balloon display processing unit 365 identifies each caller's voice based on the result received from the two-or-more-person mode speech recognition processing unit 345, which separates the waveforms of the voices received over the audio logical channel and compares each separated waveform with the voice waveform previously registered by the terminal user, and on the result received from the face recognition and mouth shape analysis unit 350, which compares the face images in the video received over the video logical channel with the face image previously registered by the terminal user. It then displays the speech balloon shape retrieved from the manual speech balloon DB 313 or the automatic speech balloon DB 315 together with the words and sentences recognized by the two-or-more-person mode speech recognition processing unit 345 at the mouth position of the caller whose voice was identified, and transmits the result to the counterpart terminal.

In addition, when two or more callers speak simultaneously in the two-or-more-person mode, the two-or-more-person mode speech balloon display processing unit 365 may instead identify which caller produced each recognized utterance by comparing the words and sentences inferred by the face recognition and mouth shape analysis unit 350, which analyzes each caller's mouth shape in the video received over the video logical channel, with the words and sentences recognized by the two-or-more-person mode speech recognition processing unit 345. It then displays the speech balloon shape retrieved from the manual speech balloon DB 313 or the automatic speech balloon DB 315 together with the recognized words and sentences at the identified caller's mouth position and transmits the result to the counterpart terminal.
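The matching of lip-read text against recognized text can be sketched with a string-similarity measure. This is only an illustration: the patent does not specify the comparison method, and the standard-library `SequenceMatcher` stands in for whatever matcher an implementation would actually use.

```python
# Hypothetical sketch of attributing a recognized utterance to a caller by
# comparing ASR output with the words inferred from each caller's mouth
# shapes. The caller ids and texts are illustrative assumptions.

from difflib import SequenceMatcher

def attribute_speaker(recognized, lip_read_by_caller):
    """Caller id whose lip-read text best matches the recognized text.

    recognized: the sentence produced by speech recognition.
    lip_read_by_caller: dict mapping caller id -> text inferred from mouth shapes.
    """
    def score(caller):
        return SequenceMatcher(None, recognized.lower(),
                               lip_read_by_caller[caller].lower()).ratio()
    return max(lip_read_by_caller, key=score)
```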

FIGS. 4 and 5 are process diagrams for explaining a video call speech balloon providing method according to an embodiment of the present invention: FIG. 4 illustrates the operation in the one-person mode, and FIG. 5 illustrates the operation in the two-or-more-person mode.

First, when the calling terminal 100 and the called terminal 200 have established a session and are performing a video call, and the calling or called terminal user requests use of the one-person mode speech balloon function, the terminal that received the request terminates the session established with the counterpart terminal so that the video call can continue through the video call speech balloon providing system 300, and then requests a session setup from the video call speech balloon providing system 300. The video call session processing unit 330, having received the session setup request from the calling terminal 100 or the called terminal 200, sets up a session with the requesting terminal (S10, S12) and requests a session setup from the counterpart terminal using the identification information of the counterpart terminal provided with the request, thereby establishing a session with the counterpart terminal (S14).

As described above, once the session between the calling terminal 100 and the called terminal 200 has been re-established through the video call speech balloon providing system 300, the calling terminal 100 and the called terminal 200 transmit video over the video logical channel and audio over the audio logical channel to the video call speech balloon providing system 300, and the system transmits the received video and audio to the counterpart terminal over the video and audio logical channels. If the speech balloon mode is set to the one-person manual speech balloon mode, the one-person mode speech balloon display processing unit 363 uses the key button value for speech balloon shape selection, received over the control logical channel, to retrieve from the manual speech balloon DB 313 the speech balloon shape assigned to that key button value (S16, S18).

In addition, the one-person mode speech recognition processing unit 343 converts the voice received over the audio logical channel into words and sentences through speech recognition, while the face recognition and mouth shape analysis unit 350 determines the mouth position by analyzing the pattern of the face image contained in the video received over the video logical channel (S20, S22).

Thereafter, the one-person mode speech balloon display processing unit 363 displays the speech balloon shape retrieved in step S18 at the mouth position of the caller identified by the face recognition and mouth shape analysis unit 350 (S24), displays the words and sentences recognized by the one-person mode speech recognition processing unit 343 inside the speech balloon (S26), and transmits the video, in which the recognized words and sentences are displayed together with the speech balloon shape, to the counterpart terminal over the video logical channel (S28).

Meanwhile, if the speech balloon mode is set to the one-person automatic speech balloon mode, the one-person mode speech recognition processing unit 343 converts the voice received over the audio logical channel into words and sentences through speech recognition, and the face recognition and mouth shape analysis unit 350 determines the mouth position through pattern analysis of the face image contained in the video received over the video logical channel (S16, S30, S32).

Thereafter, the one-person mode speech balloon display processing unit 363 searches the automatic speech balloon DB 315, using the words and sentences recognized by the one-person mode speech recognition processing unit 343 in step S30, for the speech balloon shape matching those recognized words and sentences (S34).

The speech balloon shape retrieved in step S34 is then displayed at the mouth position of the caller identified by the face recognition and mouth shape analysis unit 350 (S36), the words and sentences recognized by the one-person mode speech recognition processing unit 343 are displayed inside the speech balloon (S38), and the video, in which the recognized words and sentences are displayed together with the speech balloon shape, is transmitted to the counterpart terminal over the video logical channel (S40).

The counterpart terminal that receives the video in which the recognized words and sentences are displayed together with the speech balloon shape through steps S28 and S40 can provide the counterpart caller with a screen such as that shown in FIG. 6.
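Putting steps S16 through S40 together, the one-person mode reduces to choosing a balloon shape (by key button in manual mode, by keyword match in automatic mode) and composing it with the recognized text at the mouth position. The helper below is a hypothetical sketch: the database contents, default shape, and dictionary output format are all assumptions.

```python
# Hypothetical end-to-end sketch of the one-person mode (steps S16-S40).

def one_person_frame(mode, recognized_text, mouth_pos,
                     manual_db, auto_db, key_button=None):
    if mode == "manual":          # S18: shape assigned to the pressed key
        shape = manual_db.get(key_button, "round")
    else:                         # S34: shape matching a recognized word
        shape = next((auto_db[w] for w in recognized_text.lower().split()
                      if w in auto_db), "round")
    # S24/S36: balloon at the mouth position; S26/S38: text inside it.
    return {"shape": shape, "text": recognized_text, "position": mouth_pos}
```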

Meanwhile, when the calling terminal 100 and the called terminal 200 have established a session and are performing a video call, and the calling or called terminal user requests use of the two-or-more-person mode speech balloon function, the terminal that received the request terminates the session established with the counterpart terminal so that the video call can continue through the video call speech balloon providing system 300, and then requests a session setup from the video call speech balloon providing system 300. As shown in FIG. 5, the video call session processing unit 330, having received the session setup request from the calling terminal 100 or the called terminal 200, sets up a session with the requesting terminal (S50, S52) and requests a session setup from the counterpart terminal using the identification information of the counterpart terminal provided with the request, thereby establishing a session with the counterpart terminal (S54).

As described above, once the session between the calling terminal 100 and the called terminal 200 has been re-established through the video call speech balloon providing system 300, the calling terminal 100 and the called terminal 200 transmit video over the video logical channel and audio over the audio logical channel to the video call speech balloon providing system 300, and the system transmits the received video and audio to the counterpart terminal over the video and audio logical channels. If the speech balloon mode is set to the two-or-more-person manual speech balloon mode, the two-or-more-person mode speech balloon display processing unit 365 uses the key button value for speech balloon shape selection, received over the control logical channel, to retrieve from the manual speech balloon DB 313 the speech balloon shape assigned to that key button value (S56, S58).

When, as a result of analyzing the voice received over the audio logical channel, the two-or-more-person mode speech recognition processing unit 345 determines that one person's voice has been received, it converts the received voice into words and sentences through speech recognition, while the face recognition and mouth shape analysis unit 350 extracts the face images contained in the video received over the video logical channel and then determines each caller's mouth position through pattern analysis of each face image (S60, S62, S64, S66).

Then, the caller who is currently speaking is identified by analyzing the mouth movements of each caller (S68); the two-or-more-person mode speech balloon display processing unit 365 displays the speech balloon shape retrieved in step S58 at the mouth position of the caller identified as the speaker (S70), displays the words and sentences recognized by the two-or-more-person mode speech recognition processing unit 345 in step S62 inside the speech balloon (S72), and transmits the video, in which the recognized words and sentences are displayed together with the speech balloon shape, to the counterpart terminal over the video logical channel (S74).

On the other hand, when two or more voices are received simultaneously over the audio logical channel, it is determined whether the system is set to identify each caller's voice by analyzing mouth movements (S60, S76). If it is set to identify each caller's voice by analyzing mouth movements, the two-or-more-person mode speech recognition processing unit 345 separates the voices received over the audio logical channel by waveform (S78) and converts each separated voice into words and sentences through speech recognition (S80).

Meanwhile, the face recognition and mouth shape analysis unit 350 extracts the face images contained in the video received over the video logical channel (S82), determines each caller's mouth position through pattern analysis of each face image (S84), and infers the words and sentences spoken by each caller by analyzing the mouth movements of each caller (S86).

Thereafter, the face recognition and mouth shape analysis unit 350 compares the words and sentences inferred through mouth shape analysis with the words and sentences recognized by the two-or-more-person mode speech recognition processing unit 345 to determine which caller each voice separated by the two-or-more-person mode speech recognition processing unit 345 belongs to (S88). The two-or-more-person mode speech balloon display processing unit 365 then displays the speech balloon shape retrieved in step S58 at each caller's mouth position (S90), displays the words and sentences recognized from each voice inside the speech balloon displayed at the mouth of the caller identified as the owner of that voice in step S88 (S92), and transmits the video, in which the recognized words and sentences are displayed together with the speech balloon shapes, to the counterpart terminal over the video logical channel (S94).

On the other hand, if the result of the determination in step S76 is that the system is set to identify who is currently speaking by analyzing the voice waveforms, the two-or-more-person mode speech recognition processing unit 345 separates the voices received over the audio logical channel by waveform (S96), compares each separated voice with the voices stored in the user voice and face image storage unit 320 to determine which voice is that of the terminal user (S98), and converts each separated voice into words and sentences through speech recognition (S100).

Meanwhile, the face recognition and mouth shape analysis unit 350 extracts the face images contained in the video received over the video logical channel (S102), compares each extracted face image with the face images stored in the user voice and face image storage unit 320 to determine which of the two or more face images is that of the terminal user (S104), and determines each caller's mouth position through pattern analysis of each face image (S106).

Thereafter, according to the voice of each caller identified based on the result of identifying the terminal user's voice in step S98 and the result of identifying the terminal user's face image in step S104, the two-or-more-person mode speech balloon display processing unit 365 displays the speech balloon shape retrieved in step S58 at each caller's mouth position (S108), displays the words and sentences recognized from the corresponding voice inside the speech balloon displayed at each caller's mouth (S110), and transmits the video, in which the recognized words and sentences are displayed together with the speech balloon shapes, to the counterpart terminal over the video logical channel (S112).

On the other hand, when the speech balloon mode is set to the two-or-more-person automatic speech balloon mode and, as a result of analyzing the voice received over the audio logical channel, the two-or-more-person mode speech recognition processing unit 345 determines that one voice has been received, it converts the voice into words and sentences through speech recognition (S56, S114, S116) and searches the automatic speech balloon DB 315 for the speech balloon shape matching the recognized words and sentences (S118).

Meanwhile, the face recognition and mouth shape analysis unit 350 extracts the face images contained in the video received over the video logical channel and then determines each caller's mouth position through pattern analysis of each face image (S120, S122).

Then, the caller who is currently speaking is identified by analyzing the mouth movements of each caller (S124); the two-or-more-person mode speech balloon display processing unit 365 displays the speech balloon shape retrieved in step S118 at the mouth position of the caller identified as the speaker (S126), displays the words and sentences recognized by the two-or-more-person mode speech recognition processing unit 345 inside the speech balloon (S128), and transmits the video, in which the recognized words and sentences are displayed together with the speech balloon shape, to the counterpart terminal over the video logical channel (S130).

On the other hand, when two or more voices are received simultaneously over the audio logical channel, it is determined whether the system is set to identify who is currently speaking by analyzing mouth movements (S114, S132). If it is set to identify the currently speaking caller by analyzing mouth movements, the two-or-more-person mode speech recognition processing unit 345 separates the voices received over the audio logical channel by waveform (S134), converts each voice into words and sentences through speech recognition (S136), and searches the automatic speech balloon DB 315 for the speech balloon shapes matching the words and sentences recognized by the two-or-more-person mode speech recognition processing unit 345 (S138).

Meanwhile, the face recognition and mouth shape analysis unit 350 extracts the face images contained in the video received over the video logical channel (S140), determines each caller's mouth position through pattern analysis of each face image (S142), and infers the words and sentences spoken by each caller by analyzing the mouth movements of each caller (S144).

Thereafter, the face recognition and mouth shape analysis unit 350 compares the words and sentences inferred through mouth shape analysis with the words and sentences recognized by the two-or-more-person mode speech recognition processing unit 345 to determine which caller each voice separated by the two-or-more-person mode speech recognition processing unit 345 belongs to (S146). The two-or-more-person mode speech balloon display processing unit 365 then displays the speech balloon shapes retrieved in step S138 at each caller's mouth position (S148), displays the words and sentences recognized by the two-or-more-person mode speech recognition processing unit 345 in step S136 inside the speech balloon displayed at the mouth of the caller identified in step S146 (S150), and transmits the video, in which the recognized words and sentences are displayed together with the speech balloon shapes, to the counterpart terminal over the video logical channel (S152).

On the other hand, if the result of the determination in step S132 is that the system is set to identify who is currently speaking by analyzing the voice waveforms, the two-or-more-person mode speech recognition processing unit 345 separates the voices received over the audio logical channel by waveform (S154), compares each separated voice with the voices stored in the user voice and face image storage unit 320 to determine which voice is that of the terminal user (S156), converts each separated voice into words and sentences through speech recognition (S158), and searches the automatic speech balloon DB 315 for the speech balloon shapes matching the words and sentences recognized by the two-or-more-person mode speech recognition processing unit 345 (S160).

Meanwhile, the face recognition and mouth shape analysis unit 350 extracts the face images contained in the video received over the video logical channel (S162), compares each extracted face image with the face images stored in the user voice and face image storage unit 320 to determine which of the two or more face images is that of the terminal user (S164), and determines each caller's mouth position through pattern analysis of each face image (S166).

Thereafter, according to the voice of each caller identified based on the result of identifying the terminal user's voice in step S156 and the result of identifying the terminal user's face image in step S164, the two-or-more-person mode speech balloon display processing unit 365 displays the speech balloon shape retrieved in step S160 at each caller's mouth position (S168), displays the words and sentences recognized by the two-or-more-person mode speech recognition processing unit 345 in step S158 inside the speech balloon displayed at each caller's mouth (S170), and transmits the video, in which the recognized words and sentences are displayed together with the speech balloon shapes, to the counterpart terminal over the video logical channel (S172).

The counterpart terminal that receives the video in which the recognized words and sentences are displayed together with the speech balloon shapes through steps S74, S94, S112, S130, S152, and S172 can provide the counterpart caller with a screen such as that shown in FIG. 7.

The video call speech balloon providing system and method of the present invention are not limited to the above-described embodiments, and various modifications can be made within the scope of the technical idea of the present invention.

The video call speech balloon providing system and method of the present invention are applied to a video call service: a caller's voice is recognized during a video call and converted into words and sentences, and the converted words and sentences are displayed on the video inside a speech balloon shape designated by the terminal user or matched to the recognized words and sentences, thereby achieving accurate communication.

FIG. 1 is a view schematically showing the configuration of a mobile communication network including a video call speech balloon providing system according to an embodiment of the present invention.

FIG. 2 is a view showing examples of speech balloon shapes assigned to key button values according to the present invention.

FIG. 3 is a view schematically showing the internal configuration of the video call speech balloon providing system applied to the present invention.

FIGS. 4 and 5 are process diagrams for explaining a video call speech balloon providing method according to an embodiment of the present invention.

FIG. 6 is an exemplary view showing an operation screen in the one-person mode.

FIG. 7 is an exemplary view showing an operation screen in the two-or-more-person mode.

*** Explanation of symbols for the main parts of the drawing ***

100. Calling terminal, 200. Called terminal,

300. Video call speech balloon providing system, 310. Speech balloon DB,

313. Manual speech balloon DB, 315. Automatic speech balloon DB,

320. User voice and face image storage unit, 330. Video call session processing unit,

340. Speech recognition processing unit, 343. One-person mode speech recognition processing unit,

345. Two-or-more-person mode speech recognition processing unit, 350. Face recognition and mouth shape analysis unit,

360. Speech balloon display processing unit, 363. One-person mode speech balloon display processing unit,

365. Two-or-more-person mode speech balloon display processing unit,

Claims (20)

A speech balloon DB for storing a speech balloon shape designated for each terminal to be displayed together with a voice recognized word or sentence during a video call; The camera provided with the terminal uses a voice used to separate the terminal user's voice in two or more modes in which two or more callers are photographed and transmitted, and a face image used to separate the face image of the terminal user. A user voice and face image storage unit configured to receive and store each user; If a session setup request is received from a terminal having a video call by establishing a session with the counterpart terminal, a session is established with the terminal according to the session setup request, and the counterpart is based on identification information of the counterpart terminal provided from the terminal. A video call session processing unit for requesting a session establishment from a terminal to establish a session with the counterpart terminal; A speech recognition processor for recognizing a speech received from the terminal through an audio logical channel and converting the speech into a word or a sentence; The face image is extracted from the image received from the terminal through a video logical channel, the position of the mouth is analyzed by analyzing the pattern of the extracted face image, and the face is inferred from the word or sentence spoken by the caller through the shape analysis. A recognition and mouth analysis unit; And a speech balloon display processing unit for displaying the speech balloon shape searched in the speech balloon DB and the word or sentence recognized by the speech recognition processing unit at the mouth position identified by the face recognition and mouth analysis unit and transmitting the same to the counterpart terminal. Video call speech bubble providing system. 
2. The system of claim 1, wherein the speech balloon DB comprises:
a manual speech balloon DB which stores, in one-to-one correspondence, the speech balloon shapes used in manual speech balloon mode and the key button value assigned to each shape; and
an automatic speech balloon DB which stores, in correspondence, the speech balloon shapes used in automatic speech balloon mode and the word or sentence designated for each shape.

3. The system of claim 1, wherein the terminal, having established a session with the counterpart terminal and being in a video call, releases the session with the counterpart terminal when the terminal user selects the speech balloon function, and requests session setup from the video call speech balloon providing system.

4. The system of claim 1, wherein the video call session processing unit forwards the video received from the terminal over a video logical channel to the counterpart terminal over a video logical channel, and forwards the voice received from the terminal over an audio logical channel to the counterpart terminal over an audio logical channel.

5. The system of claim 1, wherein the speech recognition processing unit comprises:
a one-person-mode speech recognition processing unit which, in a one-person mode in which the camera of the terminal captures and transmits a single caller, recognizes the single voice received from the terminal over the audio logical channel and converts it into a word or sentence; and
a two-or-more-mode speech recognition processing unit which, in a two-or-more mode in which the camera of the terminal captures and transmits two or more callers, separates the two or more voices received from the terminal over the audio logical channel by waveform analysis, converts each separated voice into a word or sentence through speech recognition, and identifies the terminal user's voice among the separated voices by comparing each separated voice with the terminal user voice stored in the user voice and face image storage unit.

6. The system of claim 1, wherein the face recognition and mouth-shape analysis unit, when two or more face images are extracted from the video received from the terminal over the video logical channel, compares each extracted face image with the face image stored in the user voice and face image storage unit to identify the terminal user's face image among them.

7. The system of claim 1, wherein the speech balloon display processing unit selects a speech balloon shape from the speech balloon DB based on a speech-balloon-shape selection key button value received from the terminal over a control logical channel when the speech balloon mode is set to manual speech balloon mode, and selects a speech balloon shape from the speech balloon DB based on the word or sentence recognized by the speech recognition processing unit when the speech balloon mode is set to automatic speech balloon mode.
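Claims 2 and 7 together describe a simple lookup: in manual mode a key button value indexes the balloon shape one-to-one, while in automatic mode the recognized word or sentence selects the shape. A hypothetical sketch of the two DBs (the shape names, keywords, and fallback default below are invented for illustration):

```python
# Hypothetical contents for the two speech balloon DBs of claim 2.
MANUAL_BALLOON_DB = {"1": "round", "2": "cloud", "3": "spiky"}   # key button -> shape
AUTO_BALLOON_DB = {"love": "heart", "angry": "spiky", "?": "thought"}  # word -> shape

def select_balloon(mode, key_button=None, recognized_text=""):
    """Choose a balloon shape per the mode distinction in claim 7."""
    if mode == "manual":
        # One-to-one match on the key button value (manual speech balloon DB).
        return MANUAL_BALLOON_DB[key_button]
    # Automatic mode: the first registered keyword found in the recognized
    # word or sentence wins; fall back to a default when nothing matches.
    for keyword, shape in AUTO_BALLOON_DB.items():
        if keyword in recognized_text:
            return shape
    return "round"
```

The fallback shape is an assumption; the claims do not say what happens when no registered word or sentence matches.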
8. The system of claim 1, wherein the speech balloon display processing unit comprises:
a one-person-mode speech balloon display processing unit which displays the speech balloon shape retrieved from the speech balloon DB, together with the word or sentence recognized by the one-person-mode speech recognition processing unit, at the mouth position identified by the face recognition and mouth-shape analysis unit; and
a two-or-more-mode speech balloon display processing unit which, when only one caller speaks at a time in the two-or-more mode, identifies the speaking caller from the mouth-movement analysis of the face recognition and mouth-shape analysis unit and then displays the speech balloon shape retrieved from the speech balloon DB, together with the word or sentence recognized by the two-or-more-mode speech recognition processing unit, at the identified caller's mouth position, and which, when two or more callers speak at the same time, identifies each caller's voice based on the terminal user's voice identified by the two-or-more-mode speech recognition processing unit and the terminal user's face image identified by the face recognition and mouth-shape analysis unit, and then displays the retrieved speech balloon shape, together with each recognized word or sentence, at the mouth position of the caller whose voice was recognized.
9. The system of claim 8, wherein the two-or-more-mode speech balloon display processing unit, when two or more callers speak at the same time, identifies each caller's voice by comparing the word or sentence inferred by the face recognition and mouth-shape analysis unit from the mouth-movement analysis with the word or sentence recognized by the two-or-more-mode speech recognition processing unit, and then displays the speech balloon shape retrieved from the speech balloon DB, together with each recognized word or sentence, at the mouth position of the caller whose voice was recognized.

10. A video call speech balloon providing method, comprising:
a first process in which, when a terminal that had established a session with a counterpart terminal releases that session and requests session setup in order to receive the speech balloon function, the video call speech balloon providing system establishes sessions with the terminal and the counterpart terminal, respectively;
a second process in which, when the speech balloon mode is set to manual speech balloon mode in a one-person mode in which the camera of the terminal captures and transmits a single caller, the speech balloon shape designated by the speech-balloon-shape selection key button value received from the terminal over a control logical channel is retrieved from the manual speech balloon DB, the retrieved shape is displayed at the mouth of the caller identified by the face recognition and mouth-shape analysis unit, the single voice received from the terminal over an audio logical channel is converted into a word or sentence through speech recognition by the one-person-mode speech recognition processing unit, and the recognized word or sentence is displayed inside the speech balloon shape and transmitted to the counterpart terminal; and
a third process in which, when the speech balloon mode is set to automatic speech balloon mode in the one-person mode, the one-person-mode speech recognition processing unit converts the single voice received from the terminal over the audio logical channel into a word or sentence through speech recognition, the speech balloon shape corresponding to the recognized word or sentence is retrieved from the automatic speech balloon DB, the retrieved shape is displayed at the mouth of the caller identified by the face recognition and mouth-shape analysis unit, and the recognized word or sentence is displayed inside the speech balloon shape and transmitted to the counterpart terminal.

11. The method of claim 10, wherein the first process comprises:
establishing, in the video call session processing unit that received the session setup request from the terminal, a session with the terminal according to the request; and
requesting session setup from the counterpart terminal using the identification information of the counterpart terminal provided by the terminal at the time of the request, and establishing a session with the counterpart terminal.
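The mouth-movement comparison used to untangle simultaneous speakers can be pictured as matching each speech-recognized utterance against the text inferred from each caller's lip movement and assigning it to the closest caller. A toy sketch using string similarity as a stand-in for that comparison (the patent does not specify a matching metric; `difflib` here is purely illustrative):

```python
import difflib

def attribute_voice(recognized_texts, lipread_by_caller):
    """Match each speech-recognized utterance to the caller whose
    lip-read (mouth-shape inferred) text is most similar to it.

    recognized_texts: list of strings produced by ASR on the separated voices.
    lipread_by_caller: dict caller_id -> text inferred from mouth movement.
    Returns dict caller_id -> recognized text. This only illustrates the
    comparison step the claims describe, not the patented algorithm.
    """
    assignment = {}
    for text in recognized_texts:
        best_caller = max(
            lipread_by_caller,
            key=lambda c: difflib.SequenceMatcher(
                None, text, lipread_by_caller[c]).ratio(),
        )
        assignment[best_caller] = text
    return assignment

# Lip-read text is noisy, but still closest to the matching utterance.
who_said = attribute_voice(
    ["see you tomorrow", "what time is it"],
    {"alice": "see yu tomorow", "bob": "wat time izit"},
)
```

Once each utterance is attributed, the balloon for it can be anchored at that caller's mouth position.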
12. The method of claim 10, wherein the second process comprises:
retrieving, in the one-person-mode speech balloon display processing unit, the speech balloon shape designated by the speech-balloon-shape selection key button value received from the terminal over the control logical channel from the manual speech balloon DB;
converting the single voice received from the terminal over the audio logical channel into a word or sentence through speech recognition in the one-person-mode speech recognition processing unit;
determining the mouth position by analyzing, in the face recognition and mouth-shape analysis unit, the pattern of the face image contained in the video received from the terminal over the video logical channel;
displaying, in the one-person-mode speech balloon display processing unit, the retrieved speech balloon shape at the mouth of the caller identified by the face recognition and mouth-shape analysis unit; and
displaying the word or sentence recognized by the one-person-mode speech recognition processing unit inside the speech balloon shape, and transmitting the result to the counterpart terminal over a video logical channel.
13. The method of claim 10, wherein the third process comprises:
converting the single voice received from the terminal over the audio logical channel into a word or sentence through speech recognition in the one-person-mode speech recognition processing unit;
determining the mouth position by analyzing, in the face recognition and mouth-shape analysis unit, the pattern of the face image contained in the video received from the terminal over the video logical channel;
retrieving, in the one-person-mode speech balloon display processing unit, the speech balloon shape matching the recognized word or sentence from the automatic speech balloon DB;
displaying, in the one-person-mode speech balloon display processing unit, the retrieved speech balloon shape at the mouth of the caller identified by the face recognition and mouth-shape analysis unit; and
displaying the word or sentence recognized by the one-person-mode speech recognition processing unit inside the speech balloon shape, and transmitting the result to the counterpart terminal over a video logical channel.

14. A video call speech balloon providing method, comprising:
a first process in which, when a terminal that had established a session with a counterpart terminal and was in a video call releases that session and requests session setup in order to receive the speech balloon function, sessions are established with the terminal and the counterpart terminal, respectively;
a second process in which, when the speech balloon mode is set to manual speech balloon mode in a two-or-more mode in which the camera of the terminal captures and transmits two or more callers, the speech balloon shape designated by the speech-balloon-shape selection key button value received from the terminal over a control logical channel is retrieved from the manual speech balloon DB, the two-or-more-mode speech recognition processing unit recognizes the voice received from the terminal over an audio logical channel and identifies the voice of each caller, and the retrieved speech balloon shape and the recognized word or sentence are displayed at the mouth of the caller whose voice was recognized and transmitted to the counterpart terminal; and
a third process in which, when the speech balloon mode is set to automatic speech balloon mode in the two-or-more mode, the two-or-more-mode speech recognition processing unit converts the voice received over the audio logical channel into a word or sentence through speech recognition, the speech balloon shape matching the recognized word or sentence is retrieved from the automatic speech balloon DB, the voice of each caller is identified, and the retrieved speech balloon shape and the recognized word or sentence are displayed at the mouth of the caller whose voice was recognized and transmitted to the counterpart terminal.

15. The method of claim 14, wherein the second process comprises:
retrieving, in the two-or-more-mode speech balloon display processing unit, the speech balloon shape designated by the speech-balloon-shape selection key button value received from the terminal over the control logical channel from the manual speech balloon DB;
when two or more voices are received simultaneously from the terminal over the audio logical channel and the system is set to identify the callers' voices through mouth-movement analysis, separating, in the two-or-more-mode speech recognition processing unit, the voices received from the terminal over the audio logical channel by waveform, and converting each separated voice into a word or sentence through speech recognition;
extracting, in the face recognition and mouth-shape analysis unit, the two or more face images contained in the video received from the terminal over the video logical channel, identifying each caller's mouth position through pattern analysis of each face image, and analyzing each caller's mouth movement to infer the word or sentence spoken by each caller;
identifying each caller's voice by comparing the inferred word or sentence with the recognized word or sentence; and
displaying the retrieved speech balloon shape and the recognized word or sentence at the mouth of each caller according to the identified voices, and transmitting the result to the counterpart terminal over a video logical channel.

16. The method of claim 15, wherein, when two or more voices are received simultaneously from the terminal over the audio logical channel and the system is set to identify the callers' voices through waveform analysis, the second process comprises:
separating, in the two-or-more-mode speech recognition processing unit, the voices received over the audio logical channel by waveform, identifying the terminal user's voice among the two or more voices by comparing each separated voice with the terminal user voice stored in the user voice and face image storage unit, and converting each separated voice into a word or sentence through speech recognition;
extracting, in the face recognition and mouth-shape analysis unit, the two or more face images contained in the video received from the terminal over the video logical channel, identifying the terminal user's face image among them by comparing each extracted face image with the terminal user face image stored in the user voice and face image storage unit, and identifying each caller's mouth position through pattern analysis of each face image;
identifying each caller's voice based on the results of identifying the terminal user's voice and face image; and
displaying the retrieved speech balloon shape and the recognized word or sentence at the mouth of each caller according to the identified voices, and transmitting the result to the counterpart terminal over a video logical channel.

17. The method of claim 15, further comprising, when only one voice from the two or more callers is received from the terminal over the audio logical channel:
converting the received voice into a word or sentence through speech recognition;
extracting, in the face recognition and mouth-shape analysis unit, the two or more face images contained in the video received over the video logical channel, and identifying each caller's mouth position through pattern analysis of each face image; and
identifying the currently speaking caller by analyzing the mouth movements, displaying the retrieved speech balloon shape and the recognized word or sentence at the speaking caller's mouth, and transmitting the result to the counterpart terminal over a video logical channel.

18. The method of claim 14, wherein, when two or more voices are received simultaneously from the terminal over the audio logical channel and the system is set to identify the callers' voices through mouth-movement analysis, the third process comprises:
separating, in the two-or-more-mode speech recognition processing unit, the voices of the two or more callers received over the audio logical channel by waveform, converting each separated voice into a word or sentence through speech recognition, and retrieving the speech balloon shape matching the recognized word or sentence from the automatic speech balloon DB;
extracting, in the face recognition and mouth-shape analysis unit, the two or more face images contained in the video received from the terminal over the video logical channel, identifying each caller's mouth position through pattern analysis of each face image, and analyzing each caller's mouth movement to infer the word or sentence spoken by each caller;
identifying each caller's voice by comparing the inferred word or sentence with the recognized word or sentence; and
displaying the retrieved speech balloon shape and the recognized word or sentence at the mouth of each caller according to the identified voices, and transmitting the result to the counterpart terminal over a video logical channel.

19. The method of claim 18, wherein, when two or more voices are received simultaneously from the terminal over the audio logical channel and the system is set to identify the callers' voices through waveform analysis, the third process comprises:
separating, in the two-or-more-mode speech recognition processing unit, the two or more voices received over the audio logical channel by waveform, and identifying the terminal user's voice by comparing each separated voice with the terminal user voice stored in the user voice and face image storage unit;
converting each separated voice into a word or sentence through speech recognition, and retrieving the speech balloon shape matching the recognized word or sentence from the automatic speech balloon DB;
extracting, in the face recognition and mouth-shape analysis unit, the two or more face images contained in the video received from the terminal over the video logical channel, identifying the terminal user's face image among them by comparing each extracted face image with the terminal user face image stored in the user voice and face image storage unit, and identifying each caller's mouth position through pattern analysis of each face image;
identifying each caller's voice based on the results of identifying the terminal user's voice and face image; and
displaying the retrieved speech balloon shape and the recognized word or sentence at the mouth of each caller according to the identified voices, and transmitting the result to the counterpart terminal over a video logical channel.

20. The method of claim 18, further comprising, when only one voice from the two or more callers is received from the terminal over the audio logical channel:
converting the received voice into a word or sentence through speech recognition, and retrieving the speech balloon shape matching the recognized word or sentence from the automatic speech balloon DB;
extracting, in the face recognition and mouth-shape analysis unit, the two or more face images contained in the video received from the terminal over the video logical channel, and identifying each caller's mouth position through pattern analysis of each face image; and
identifying the currently speaking caller by analyzing the mouth movements, displaying the retrieved speech balloon shape and the recognized word or sentence at the speaking caller's mouth, and transmitting the result to the counterpart terminal over a video logical channel.
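The waveform-analysis branch of the claims above attributes each separated voice by comparing it with the terminal user's stored voice. The sketch below stands in cosine similarity over invented feature vectors for that comparison; real speaker identification would use proper acoustic features, which the patent does not prescribe.

```python
# Sketch of the "compare each separated voice with the stored terminal-user
# voice" step. The feature vectors and similarity metric are placeholder
# assumptions, not the patented method.

def similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

def identify_terminal_user(separated_voices, stored_user_features):
    """Return the index of the separated voice closest to the enrolled
    terminal-user voice (the identification step of claim 5)."""
    scores = [similarity(v, stored_user_features) for v in separated_voices]
    return scores.index(max(scores))

idx = identify_terminal_user(
    [[0.9, 0.1, 0.2], [0.1, 0.8, 0.7]],  # two waveform-separated voices
    [0.88, 0.15, 0.25],                  # enrolled terminal-user features
)
```

With the terminal user's voice pinned down, the remaining separated voices can be assigned to the other callers in the frame.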
KR1020070078898A 2007-08-07 2007-08-07 System and method for balloon providing during video communication KR100868638B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020070078898A KR100868638B1 (en) 2007-08-07 2007-08-07 System and method for balloon providing during video communication

Publications (1)

Publication Number Publication Date
KR100868638B1 true KR100868638B1 (en) 2008-11-12

Family

ID=40284199

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020070078898A KR100868638B1 (en) 2007-08-07 2007-08-07 System and method for balloon providing during video communication

Country Status (1)

Country Link
KR (1) KR100868638B1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050055298A (en) * 2003-12-08 2005-06-13 에스케이텔레텍주식회사 Method for playing caption in mobile phone
KR20050067022A (en) * 2003-12-27 2005-06-30 삼성전자주식회사 Method for processing message using avatar in wireless phone
US20050216568A1 (en) 2004-03-26 2005-09-29 Microsoft Corporation Bubble messaging
KR20060107002A (en) * 2005-04-06 2006-10-13 주식회사 더블유알지 Method for displaying avatar in wireless terminal

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013027893A1 (en) * 2011-08-22 2013-02-28 Kang Jun-Kyu Apparatus and method for emotional content services on telecommunication devices, apparatus and method for emotion recognition therefor, and apparatus and method for generating and matching the emotional content using same
KR20130137283A (en) * 2012-06-07 2013-12-17 엘지전자 주식회사 Mobile terminal and controlling method thereof, and recording medium thereof
KR101978205B1 (en) 2012-06-07 2019-05-14 엘지전자 주식회사 Mobile terminal and controlling method thereof, and recording medium thereof
WO2018038277A1 (en) * 2016-08-22 2018-03-01 스노우 주식회사 Message sharing method for sharing image data reflecting status of each user via chat room and computer program for executing same method
US11025571B2 (en) 2016-08-22 2021-06-01 Snow Corporation Message sharing method for sharing image data reflecting status of each user via chat room and computer program for executing same method
WO2018066731A1 (en) * 2016-10-07 2018-04-12 삼성전자 주식회사 Terminal device and method for performing call function
US10652397B2 (en) 2016-10-07 2020-05-12 Samsung Electronics Co., Ltd. Terminal device and method for performing call function
KR20190133361A (en) * 2018-05-23 2019-12-03 카페24 주식회사 An apparatus for data input based on user video, system and method thereof, computer readable storage medium
KR102114368B1 (en) * 2018-05-23 2020-05-22 카페24 주식회사 An apparatus for data input based on user video, system and method thereof, computer readable storage medium
WO2020228383A1 (en) * 2019-05-14 2020-11-19 北京字节跳动网络技术有限公司 Mouth shape generation method and apparatus, and electronic device

Similar Documents

Publication Publication Date Title
US10885318B2 (en) Performing artificial intelligence sign language translation services in a video relay service environment
US11114091B2 (en) Method and system for processing audio communications over a network
KR100868638B1 (en) System and method for balloon providing during video communication
US6990179B2 (en) Speech recognition method of and system for determining the status of an answered telephone during the course of an outbound telephone call
DK1912474T3 (en) A method of operating a hearing assistance device and a hearing assistance device
NO326770B1 (en) Video conference method and system with dynamic layout based on word detection
CN109688276B (en) Incoming call filtering system and method based on artificial intelligence technology
US20150154960A1 (en) System and associated methodology for selecting meeting users based on speech
KR101542130B1 (en) Finger-language translation providing system for deaf person
KR102263154B1 (en) Smart mirror system and realization method for training facial sensibility expression
WO2011065686A2 (en) Communication interface apparatus and method for multi-user and system
KR20200092166A (en) Server, method and computer program for recognizing emotion
JP2007322523A (en) Voice translation apparatus and its method
CN110570847A (en) Man-machine interaction system and method for multi-person scene
CN110188364B (en) Translation method, device and computer readable storage medium based on intelligent glasses
US11700325B1 (en) Telephone system for the hearing impaired
WO2021066399A1 (en) Realistic artificial intelligence-based voice assistant system using relationship setting
KR20160149488A (en) Apparatus and method for turn-taking management using conversation situation and themes
US20210312143A1 (en) Real-time call translation system and method
CN207718803U (en) Multiple source speech differentiation identifying system
CN106791681A (en) Video monitoring and face identification method, apparatus and system
CN112507829B (en) Multi-person video sign language translation method and system
JP2000206983A (en) Device and method for information processing and providing medium
JP2014149571A (en) Content search device
US11848026B2 (en) Performing artificial intelligence sign language translation services in a video relay service environment

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right
GRNT Written decision to grant
FPAY Annual fee payment (payment date: 20121002; year of fee payment: 5)
FPAY Annual fee payment (payment date: 20131024; year of fee payment: 6)
FPAY Annual fee payment (payment date: 20141022; year of fee payment: 7)
LAPS Lapse due to unpaid annual fee