WO2013027893A1 - Apparatus and method for emotional content services on telecommunication devices, apparatus and method for emotion recognition therefor, and apparatus and method for generating and matching the emotional content using same - Google Patents

Apparatus and method for emotional content services on telecommunication devices, apparatus and method for emotion recognition therefor, and apparatus and method for generating and matching the emotional content using same

Info

Publication number
WO2013027893A1
Authority
WO
WIPO (PCT)
Prior art keywords
emotion
image
face
user
video call
Prior art date
Application number
PCT/KR2011/008399
Other languages
French (fr)
Korean (ko)
Inventor
강준규
Original Assignee
Kang Jun-Kyu
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kang Jun-Kyu filed Critical Kang Jun-Kyu
Publication of WO2013027893A1 publication Critical patent/WO2013027893A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/101 Collaborative creation, e.g. joint development of products or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/20 Scenes; Scene-specific elements in augmented reality scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G06V40/176 Dynamic expression
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/04 Protocols specially adapted for terminals or networks with limited capabilities; specially adapted for terminal portability
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/131 Protocols for games, networked simulations or virtual reality
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects

Definitions

  • The present invention relates to an emotional content service apparatus and method that analyzes the emotions and facial expressions recognized by the image pickup apparatus of a transmitting communication terminal and mixes virtual objects onto the screen of the receiving communication terminal in real time so as to deliver the analyzed emotion or communication effectively; to an emotion recognition apparatus and method for serving the emotional content; to an apparatus and method for generating and matching the emotional content through the emotion recognition; and to an apparatus and method for generating the emotional content.
  • Augmented reality, the second key keyword in the IT field, is a technology derived from virtual reality that combines the real world with virtual experiences. It is regarded as one of the top ten innovations that will lead the future, and it gives the user an enhanced sense of reality by letting the user interact with virtual objects grounded in the real world.
  • Augmented reality is a branch of virtual reality: a computer graphics technique that synthesizes virtual objects into the real environment so that they appear to exist in the original scene.
  • Unlike conventional virtual reality, which targets only virtual spaces and virtual objects, augmented reality synthesizes virtual objects on the basis of the real world, reinforcing it with additional information that is difficult to obtain from the real world alone.
  • Augmented reality technology is actively used in various forms in broadcasting, advertising, exhibitions, games, theme parks, military applications, education, and promotion.
  • Based on real-time processing, augmented reality differs from virtual reality technology, which excludes interaction with the real world and handles interaction only within a pre-built virtual space.
  • Because information is overlaid on images of the real world input through the terminal, augmented reality is distinguished from virtual reality, which provides only computer-generated images, in that it enables interaction with the real world.
  • Marker-based mobile augmented reality photographs a specific sign corresponding to a building together with the building and identifies the building by recognizing the sign. Sensor-based mobile augmented reality infers the current position and viewing direction of the terminal using its built-in GPS and digital compass, and overlays POI (Point of Interest) information corresponding to the image in the inferred direction.
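The sensor-based overlay just described reduces to simple geometry. Below is a minimal sketch, assuming the terminal reports GPS coordinates and a compass heading; the field of view, screen width, and coordinates are illustrative values, not taken from the patent.

```python
import math

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from the terminal to the POI, in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360.0

def poi_screen_x(poi_bearing, compass_heading, fov_deg=60.0, screen_w=1080):
    """Map the angle between the viewing direction and the POI to a pixel column.

    Returns None when the POI lies outside the camera's horizontal field of view.
    """
    offset = (poi_bearing - compass_heading + 180.0) % 360.0 - 180.0  # [-180, 180)
    if abs(offset) > fov_deg / 2:
        return None
    return int(screen_w * (0.5 + offset / fov_deg))

# Terminal looking due north; a POI to the north-east lands right of center.
x = poi_screen_x(bearing_deg(37.5663, 126.9779, 37.5700, 126.9820), compass_heading=0.0)
```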
  • 3D video content, the third key keyword in the IT field, is growing explosively in related industries thanks to James Cameron's 'Avatar', and the time when video calls are enjoyed together with 3D content is expected to arrive.
  • Recent Android phones such as Samsung's Galaxy S and S II support video calls, and since Android version 2.3 (Gingerbread) officially supports video calls, Android phones from mid-2011 onward are expected to ship with a video call function as standard.
  • Android-based tablet computers and the iPad 2 generation are also expected to provide video call services through front and rear cameras, so video calls are becoming more common.
  • 1:1 or multi-party video calls may grow into core services in addition to voice and data services.
  • interest in video calls through mobile terminals is gradually increasing.
  • An object of the present invention is to provide an emotional content service apparatus and method, and the emotional content itself, that can make video calls fun by presenting a virtual object representing the emotional state of both callers together with the video during a video call.
  • Another object of the present invention is to provide an emotional content service apparatus and method that superimposes a caller's emotional state onto the caller's video through a virtual object so that callers can experience augmented reality with a more realistic feeling; an emotion recognition apparatus and method for serving the emotional content; an apparatus and method for generating the emotional content through the emotion recognition; and the emotional content generated thereby.
  • To this end, the present invention provides an apparatus and method for an emotional content service that analyzes the emotions and facial expressions recognized from the image pickup device of the transmitting communication terminal and mixes a virtual object onto the screen of the receiving communication terminal in real time to deliver the analyzed emotion or communication effectively.
  • The present invention also provides an emotion recognition apparatus and method for providing this content service.
  • The present invention also provides an apparatus and method for generating the emotional content through the emotion recognition apparatus and method for providing this content service.
  • The present invention also provides the emotional content generated by the apparatus and method for generating the emotional content.
  • To achieve the above object, the present invention has at least one of the following features: avatar matching, in which an emotion is analyzed so that an avatar expresses the specific emotion in the user's place, and emoticon matching, in which an expression is analyzed, an effect is added to the specific expression, and the specific part of the face or body representing it is exaggerated.
  • The present invention proposes core technologies of face detection, face recognition, and emotion recognition for recognizing a user's emotions and expressions, and accordingly proposes a technology for generating and matching avatars for the recognized emotions and emoticons that maximize the recognized expressions, for a video call service on a smartphone.
  • The present invention can increase the effect of communication by recognizing changes in facial expression and matching the corresponding content (an avatar) onto a person's real face, thereby enabling, through facial recognition, the expression of emotions that are impossible in the real world.
  • The apparatus and method for the emotional content service of a communication terminal device, the apparatus and method for emotion recognition therefor, and the apparatus and method for generating and matching emotional content using the same build on voice recognition; the face region detection, face region normalization, and in-region feature extraction technologies of face recognition; the facial component (expression analysis) relationship technology of emotion recognition; object and hand gesture recognition; and action and behavior recognition. On this basis, real-time matching of real and virtual images is used to match mixed virtual objects (including characters), through gesture and facial expression analysis, onto both the face and body of the parties making a video call, so that a mixed reality is realized through the video call.
  • In the emotional content service of the communication terminal device, expression analysis relation functions for specific voices, facial expressions, and gestures of the face and body are registered in advance; when a similar voice, facial expression, or gesture is transmitted through the video, a virtual object responding to the voice, facial expression, or gesture is matched in real time onto the face and body on the output video screen, so that the users can enjoy the video call.
  • To achieve the above objects, the emotional content service of the communication terminal device extracts the emotional state of the user from at least one of a gesture and a facial expression of the user photographed through the image pickup means of a communication terminal device having at least an image pickup means and a display means.
  • the virtual object may further include a character.
  • the virtual object may be changed by the user.
  • the virtual object is changed in real time corresponding to the emotional state.
  • The position at which the virtual object is superimposed on the user's body and face is changed.
  • The communication terminal device further comprises voice input means and voice output means, and the emotional state of the user is further extracted from the user's voice input through the voice input means.
  • To achieve the above object, the emotional content service of the communication terminal device includes a video call service providing terminal having at least an imaging means and a display means, and extracts the emotional state of the user from the gesture and facial expression of the user photographed through the imaging means.
  • The emotional content service method of the communication terminal device for achieving the above object comprises: inputting a face image of the user to the communication terminal device; extracting face components from the input face image; preprocessing the extracted face components; extracting facial features from the preprocessed face components; registering the extracted facial features in a face database; and recognizing an emotion by comparing the features registered in the face database with the face components extracted in the feature extraction step.
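The register-then-compare flow of this method can be sketched as follows. The feature extractor is left abstract (any fixed-length facial descriptor would do), and the in-memory dict database and the similarity threshold are assumptions for illustration.

```python
import numpy as np

face_db: dict[str, np.ndarray] = {}  # label -> registered feature vector

def register(label: str, feature: np.ndarray) -> None:
    """Register an extracted facial feature vector under an emotion/user label."""
    face_db[label] = feature / np.linalg.norm(feature)

def recognize(feature: np.ndarray, threshold: float = 0.8) -> str | None:
    """Compare a new feature against the database by cosine similarity."""
    q = feature / np.linalg.norm(feature)
    best_label, best_score = None, threshold
    for label, stored in face_db.items():
        score = float(stored @ q)
        if score > best_score:  # keep the closest match above the threshold
            best_label, best_score = label, score
    return best_label
```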
  • The emotion recognition method for the emotional content service of the communication terminal device comprises: receiving the face image of the user from the camera module of the communication terminal device; preprocessing the input face image; detecting only valid data in the preprocessed face image; estimating the position of the face and the camera information from the detected valid data; and generating a 3D image from the camera information and the position information of the face and matching the generated 3D image with the face image of the user.
  • The method for generating and matching the emotional content through the emotion recognition method for the emotional content service of the communication terminal device comprises: outputting the 3D image matched with the face image of the user on the screen; and transmitting the 3D image matched with the face image of the user to a counterpart communication terminal through a network.
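The step of estimating the position of the face and the camera information can be sketched with OpenCV's solvePnP, assuming 2D landmark positions (nose tip, chin, eye corners, mouth corners) are already available from the detection step; the generic 3D face model coordinates and the pinhole approximation below are illustrative assumptions, not values from the patent.

```python
import numpy as np
import cv2

# Illustrative generic 3D face model (in mm): nose tip, chin, left/right eye
# outer corners, left/right mouth corners.
MODEL_3D = np.float32([
    [0.0, 0.0, 0.0], [0.0, -330.0, -65.0],
    [-225.0, 170.0, -135.0], [225.0, 170.0, -135.0],
    [-150.0, -150.0, -125.0], [150.0, -150.0, -125.0],
])

def estimate_head_pose(landmarks_2d, frame_w, frame_h):
    """Estimate rotation/translation of the face relative to the camera."""
    focal = frame_w  # rough pinhole approximation when no calibration exists
    cam_matrix = np.float32([[focal, 0, frame_w / 2],
                             [0, focal, frame_h / 2],
                             [0, 0, 1]])
    dist_coeffs = np.zeros((4, 1))  # assume no lens distortion
    ok, rvec, tvec = cv2.solvePnP(MODEL_3D, np.float32(landmarks_2d),
                                  cam_matrix, dist_coeffs)
    return (rvec, tvec) if ok else None  # pose at which to place the 3D image
```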
  • The emotional content service method of the communication terminal device for achieving the above object comprises: capturing a frame image to be analyzed from the video data source of the camera module of the communication terminal device; preprocessing the captured image into a state that is easy to analyze; detecting a face in the preprocessed image; estimating the posture and recognizing the facial expression based on the recognition information extracted through the face detection process, and selecting the avatar's posture and facial expression corresponding to those taken by the face; determining the position coordinates of the selected avatar in 3D space from the analyzed information, selecting an avatar animation for the corresponding expression and emotion, and transmitting a control signal to the 3D engine; composing the 3D space in which the avatar image and the video are to be represented and placing the analyzed avatar at the corresponding position; matching the avatar and the video source represented in the 3D space into a single image source; performing video encoding on the matched image together with a voice source; and transmitting the encoded image to the counterpart terminal through the network.
  • In the process of detecting a face in the preprocessed image, a learning algorithm or the like is applied to analyze facial feature points and to extract the positions of the facial components on the image and their relationship data.
  • The step of transmitting the matched image to the counterpart terminal through the network establishes a session via SIP and transmits the image over the Internet via RTP/RTCP.
  • To achieve the above object, the emotional content service apparatus of the communication terminal device includes a server communication unit interworking with the video call service providing terminal, and a server controller that recognizes the emotional state of the user from at least one of a gesture and an expression of the user in the image information received from the video call service providing terminal and compares the recognized emotional state with previously stored object-related information.
  • The emotional content service apparatus of the communication terminal device further comprises a server storage unit that stores the object-related data corresponding to the emotional state.
  • An emotion recognition apparatus for the emotional content service of a communication terminal device for achieving the above objects includes: a display unit that displays an image of the other party according to the video call and an object overlapping the image; a communication unit interworking with a video call service providing server; an imaging unit that acquires image information of the user during the video call; and a controller that recognizes the emotional state of the user from the image information obtained by the imaging unit, extracts emotion information related to the recognized emotional state, transmits the extracted emotion information to the video call service providing server, receives from the video call service providing server an object corresponding to the emotion information of the other party in the call, superimposes the received object at the position associated with it in the image of the other party, and outputs the result to the display unit.
  • The apparatus for generating and matching the emotional content through the emotion recognition method for the emotional content service of the communication terminal device for achieving the above object likewise includes: a display unit that displays the video of the other party according to the video call and objects overlapping the video; a communication unit interworking with a video call service providing server; an imaging unit that acquires image information of the user during the video call; and a controller that recognizes the emotional state of the user from the image information obtained by the imaging unit, extracts emotion information related to the recognized emotional state, transmits the extracted emotion information to the video call service providing server, receives from the video call service providing server an object corresponding to the emotion information of the other party in the call, superimposes the received object at the position associated with it in the image of the other party, and outputs the result to the display unit.
  • The display unit of the apparatus for generating and matching the emotional content through the emotion recognition method for the emotional content service of the communication terminal device according to the present invention further displays the image of the user according to the video call.
  • The apparatus for generating and matching the emotional content through the emotion recognition method for the emotional content service of the communication terminal device further includes a key input unit for determining whether to apply the object.
  • the apparatus for generating and matching the emotion content through the emotion recognition method for the emotion content service of the communication terminal device according to the present invention may further include a storage unit for storing the object.
  • To achieve the above objects, the emotional content service of the communication terminal device extracts the emotional state of the user from at least one of a gesture and a facial expression of the user photographed through the image pickup means of a communication terminal device having at least an image pickup means and a display means, exaggerates at least one of the body and the face of the user representing the gesture and expression corresponding to the extracted emotional state, and displays the result on the display means of the communication terminal device of the other party making a video call with the user.
  • The present invention can provide abundant sights during video calls by implementing, through the video call, various mixed realities not seen in the real world.
  • The present invention makes video calls fun by providing a virtual object representing the emotional state of both callers together with the video during the video call.
  • By superimposing the caller's emotional state onto the caller's video through a virtual object, the present invention lets callers experience augmented reality with a more realistic feeling, delivering the caller's emotional state in a fresh way.
  • The present invention enables the user to experience both the virtual and the real by shaping the voices, faces, and bodies of specific expressions and gestures into virtual objects on the video call screen.
  • FIG. 1 is a view showing a concept of avatar matching according to the present invention
  • FIG. 2 is a view showing a concept of emoticon matching according to the present invention
  • FIG. 3 is a view showing composite data matched to a change in facial expression of a user and standard data thereof according to an embodiment of the present invention
  • FIG. 4 is a control flow diagram illustrating an emotion and facial expression recognition and matching procedure of a 3D avatar
  • FIG. 6 is a diagram illustrating a service movement scenario according to user movement between access networks to which the emotion content service method according to the present invention is applied;
  • FIG. 7 is a control flowchart for face detection in an emotion recognition method for an emotion content service of a communication terminal device according to the present invention.
  • FIG. 9 is a diagram illustrating a basic message and status code scheme of a SIP
  • FIG. 10 is a diagram illustrating a SIP protocol stack
  • FIG. 11 is a diagram showing a basic procedure of call setup of a SIP protocol
  • FIG. 12 is a conceptual diagram of a content matching system according to the present invention.
  • FIG. 14 is a schematic diagram of an avatar video communication operation procedure through emotion recognition and image registration according to the present invention.
  • FIG. 15 is a view illustrating an avatar video communication operation procedure through emotion recognition and image registration of FIG. 14 using actual content
  • FIG. 16 is a view showing various fields to which the present invention is applicable.
  • The present invention has at least one of the features of avatar matching, which analyzes an emotion so that an avatar expresses it for a specific emotion, and emoticon matching, which analyzes an expression and adds an effect to the specific expression.
  • FIG. 1 is a view showing the concept of avatar matching according to the present invention
  • FIG. 2 is a view showing the concept of emoticon matching according to the present invention.
  • the avatar matching according to the present invention recognizes an emotion expressed in an actual image and replaces the actual image with an avatar expressed in augmented reality.
  • the entire screen of the actual image may be replaced or only a specific part of the face may be expressed in augmented reality.
  • The emoticon matching according to the present invention uses a variety of emoticons to increase the delivery effect of an expression recognized in the actual image.
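A minimal sketch of the overlay step, assuming an emoticon image with an alpha channel and a face position already obtained from detection; the file names and placement coordinates are illustrative, and the emoticon is assumed to fit inside the frame.

```python
import cv2
import numpy as np

def overlay_emoticon(frame, emoticon_bgra, x, y):
    """Alpha-blend a 4-channel (BGRA) emoticon onto a BGR video frame at (x, y)."""
    h, w = emoticon_bgra.shape[:2]
    roi = frame[y:y + h, x:x + w]
    alpha = emoticon_bgra[:, :, 3:4].astype(np.float32) / 255.0
    roi[:] = (alpha * emoticon_bgra[:, :, :3] + (1.0 - alpha) * roi).astype(np.uint8)

frame = cv2.imread("call_frame.png")                      # illustrative frame
emoticon = cv2.imread("tears.png", cv2.IMREAD_UNCHANGED)  # 4-channel emoticon
overlay_emoticon(frame, emoticon, x=200, y=80)            # e.g. beside the eyes
```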
  • the emotional content according to the present invention may be composed of, for example, ten male emotional recognition reactions, ten female emotional recognition reactions, and ten animal emotional recognition reactions.
  • emotion contents can express characters and motions linked with motion expression scripts through 3D modeling that can be matched to standard data.
  • FIG. 3 is a diagram showing synthetic data and standard data thereof matched to a change in facial expression of a user according to an exemplary embodiment of the present invention.
  • Standard data can be used to make video calls between users more interesting and pleasant by having an avatar stand in for emotions that are difficult to express, through avatar creation and its application to the video call.
  • The present invention can be implemented so as to be applicable to video calls by producing male, female, and anthropomorphic characters according to emotion and facial expression changes and each type of performance.
  • FIG. 4 is a control flowchart illustrating the process of recognizing and matching emotions and facial expressions of a 3D avatar. As illustrated in FIG. 4, the source of an avatar produced for each emotion and facial expression is reflected in the user application through the following recognition and matching steps.
  • In FIG. 4, the procedure indicated in blue is implemented by a commercial library, the procedure indicated in green is a content implementation procedure, gray indicates the system area, and red indicates the user's emotion and facial expression during a video call.
  • This is a procedure implemented by smartphone application content development technology in which an avatar acting out the corresponding emotion and an emoticon adding an effect to the corresponding expression are matched.
  • The present invention develops the core technologies of face recognition, emotion recognition, image registration, and video communication needed for a video call service supporting emotion and facial expression matching, together with the application content and an interworking server; avatars and emoticons for the user are produced and applied to the smartphone application.
  • The emotional content generated by the apparatus and method for generating emotional content through the emotion recognition apparatus and method for providing the content service exploits the creative, future-oriented value and smart image of the character and is used as content in mobile video calls as a new theme and a new trend.
  • the emotional content according to the present invention enhances the educational utility as a differentiated theme using advanced smart devices that will change the future life, and maximizes artistic value through differentiated content storytelling and the shaping of universal values.
  • It maximizes the differentiating factor that 3D animation is possible not only through a video player but also through video calls with others.
  • The emotional content according to the present invention can aid children's emotional development by conveying the didactic message that our lives are happy and precious and by inducing various facial expression changes, through smart video calls that accompany us and let us communicate in everyday life.
  • The visual image of the character according to the present invention is constructed as a character image that anyone can like and accept, using a three- to four-head-tall figure composition.
  • FIG. 5 is a table summarizing embodiments of the main characters of the storytelling according to the present invention; it shows the main characters, their images, and their personalities and roles in conveying the importance of living and breathing together, universal values, and communication, through imagination and adventure, friendship, and family.
  • The background story of the embodiment of FIG. 5 deals with the lighthearted episodes that occur as the main character 'Ava' (the user) meets 'Bata', a personified being of the virtual space living in the user's mobile phone.
  • the protagonist is a user who uses a video call
  • The character 'Bata', personified in the virtual space, is a friend of the protagonist whom the protagonist can meet at any time when making a video call with the other party.
  • When the emotional content according to the present invention is expressed as a 3D object in a video communication terminal such as a smartphone, the processing time for image processing, from face detection through recognition and emotion recognition during a video call, is expected to be large, and delays are also expected in matching the video with 3D objects based on the recognized information. In terms of 3D matching, therefore, minimizing the matching delay by using a 3D engine can be considered; for example, compared with representing 3D objects directly in OpenGL ES, development speed can be improved by using the Unity3D engine.
  • FIG. 6 is a diagram illustrating a service mobility scenario according to user movement between access networks to which the emotional content service method according to the present invention is applied; through it, a method for solving the network access problem in a mobile smartphone environment may be sought.
  • Face detection means finding the location of a face in an image.
  • A person's face varies with the frontal or side angle along the gaze direction, the degree to which the head is tilted, various expressions, and the distance from the camera.
  • The image may also vary with external changes such as morphological changes in the size of the face image, differences in brightness within the face due to lighting, complex backgrounds, or other objects whose color is hard to distinguish from the face, so face detection in multimedia images involves many difficulties.
  • Face detection is a preprocessing step that precedes face recognition; its methods are divided into knowledge-based methods, feature-based methods, template-matching methods, and appearance-based methods. This is summarized in Table 1 below.
  • Knowledge-based face detection methods use the constant distances and positional relationships between facial components such as the eyebrows, eyes, nose, and mouth. In these methods, contrast is concentrated in the central area of the face image, and a face is detected by comparing the contrast distribution of the face model with that of the image; a top-down approach is mainly used.
  • Knowledge-based detection methods have the disadvantage that it is hard to detect faces in images with varied changes, such as the tilt of the face, the angle at which it looks at the camera, and the expression, so they can be applied only in special cases.
  • Feature-based face detection methods detect a face using the sizes and shapes of the facial feature components (eyes, nose, mouth, outline, and contrast), their correlations, and the color, texture, and shape information of the face and its components.
  • A bottom-up approach is used, finding partial features of the face and integrating the candidate regions (face-specific components) to find the face.
  • Feature-based face detection methods have the advantage of short processing time, finding faces quickly and easily, and of being insensitive to pose or face orientation. However, backgrounds or objects similar to skin color can be mistaken for faces, the color and texture information of the face may be lost as the brightness of the lighting changes, and the feature components of the face may not be detectable depending on how much the face is inclined.
  • Template-matching-based face detection methods create a standard template for the target faces and then detect a face by comparing its similarity with the input image; they include predefined-template algorithms and deformable-template algorithms.
  • The template-matching-based face detection method generates information using partial regions or outlines from the prepared image data, then transforms the generated information through algorithms to increase the amount of similar information and uses it for face detection.
  • Template-matching-based face detection is sensitive to changes in the size of the face with distance and to the rotation angle and tilt of the face along the gaze direction, and, like the knowledge-based methods, it is difficult to define templates for different poses.
  • Appearance-based methods detect a face using a model trained on a set of training images by pattern recognition.
  • Appearance-based methods are among the most used in the face detection field and include the eigenface generated by principal component analysis (PCA), linear discriminant analysis (LDA), neural networks (NN), Adaboost, and support vector machines (SVM).
  • Appearance-based methods use existing face and non-face training data sets to detect face regions in complex images, generating learned eigenvectors to find faces. They have the advantage of a high recognition rate because the constraints of the other detection methods are overcome by learning.
  • However, appearance-based methods such as PCA, NN, and SVM require a lot of time to train on the database and must be retrained whenever the database changes.
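Among the appearance-based family, OpenCV ships a ready-made instance: the Viola-Jones Haar cascade, a cascade of Adaboost-selected features. A minimal detection sketch (the input file name is illustrative):

```python
import cv2

# Viola-Jones cascade: an appearance-based detector trained with Adaboost.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("input.jpg")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
gray = cv2.equalizeHist(gray)  # normalize lighting before detection
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                 minSize=(60, 60))
for (x, y, w, h) in faces:  # one bounding box per detected face
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```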
  • Face recognition technology identifies a face after it has been detected in a multimedia image. Face recognition technology can be classified as shown in Table 1 below; in the present invention it is used to identify the components of the face.
  • In holistic face recognition methods, the input to the face recognition system is the entire face area.
  • The holistic face recognition method has the advantage of easy implementation, but because it does not capture the fine details of the face, it is difficult to obtain sufficient results.
  • Holistic face recognition methods include principal component analysis (PCA), linear discriminant analysis (LDA), independent component analysis (ICA), tensor face and probabilistic decision-based neural networks (PDBNN).
  • Feature-based methods first extract local features (the eyes, nose, and mouth), and then the locations and local characteristics (geometry and appearance) of those features are input to the recognition system.
  • Feature-based methods are quite complicated because there is a wide variety of feature information on the face, so deciding how to select the best features is key to improving face recognition performance.
  • Typical feature-based methods, such as pure geometry methods, dynamic link architecture, and hidden Markov models, show much better performance than the holistic matching methods above and are widely utilized.
  • Hybrid methods are very complicated because they use the entire face area together with the local feature characteristics to recognize a face, but their recognition rate is far superior to both the holistic matching and feature-based matching methods.
  • Hybrid methods include linear feature analysis (LFA), shape-normalization, and component-based methods.
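The eigenface idea behind the holistic PCA method can be sketched compactly: aligned face images are flattened into vectors, a low-dimensional basis is computed by SVD, and recognition compares projections by nearest neighbour. The training faces are assumed here to be pre-aligned and equally sized.

```python
import numpy as np

def fit_eigenfaces(faces: np.ndarray, k: int = 32):
    """faces: (n_samples, h*w) array of aligned, flattened grayscale images."""
    mean = faces.mean(axis=0)
    # Rows of vt are the principal axes ("eigenfaces") of the centered data.
    _, _, vt = np.linalg.svd(faces - mean, full_matrices=False)
    return mean, vt[:k]

def project(image, mean, basis):
    return basis @ (image - mean)  # k-dimensional holistic face descriptor

def nearest(query, gallery, labels, mean, basis):
    """Identify a face by nearest neighbour in eigenface space."""
    q = project(query, mean, basis)
    dists = [np.linalg.norm(q - project(g, mean, basis)) for g in gallery]
    return labels[int(np.argmin(dists))]
```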
  • FIG. 8 is a diagram illustrating a configuration and operation (Req / Resp) sequence for a SIP service
  • FIG. 9 is a diagram illustrating a basic message and status code scheme of SIP
  • FIGS. 10 and 11 show the basic SIP protocol stack and the basic call setup procedure of the SIP protocol, respectively.
  • SIP is a protocol for managing sessions or calls in multimedia communication, and is a technique that focuses on multimedia communication management through signaling rather than multimedia data transmission itself.
  • Table 3 summarizes the components of SIP service and its main functions.
  • A caller sends an INVITE request message for creating a session to a callee. The message passes through several SIP servers on its way to the receiver.
  • A proxy server that receives the message parses it to identify the recipient and delivers it to the appropriate proxy server or to the recipient's user agent (UA).
  • the receiver receiving the INVITE message sends a response message to the INVITE message.
  • the response message has a status code indicating the result of processing. If the receiver receives and processes the message correctly, it sends a “200 OK” response message to the sender.
  • The sender that receives the response sends an ACK request message back to the receiver to confirm that the response message was correctly received.
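The INVITE / 200 OK / ACK exchange can be illustrated with minimal message skeletons; all addresses, tags, and identifiers below are placeholders, and real SIP messages carry additional headers (Via, Contact, Content-Length) plus an SDP body describing the media session.

```python
# Minimal SIP three-way handshake skeletons (placeholders throughout).
INVITE = (
    "INVITE sip:callee@example.com SIP/2.0\r\n"
    "From: <sip:caller@example.com>;tag=1928301774\r\n"
    "To: <sip:callee@example.com>\r\n"
    "Call-ID: a84b4c76e66710\r\n"
    "CSeq: 314159 INVITE\r\n\r\n")

OK_200 = (
    "SIP/2.0 200 OK\r\n"  # status code: request received and processed correctly
    "From: <sip:caller@example.com>;tag=1928301774\r\n"
    "To: <sip:callee@example.com>;tag=a6c85cf\r\n"
    "Call-ID: a84b4c76e66710\r\n"
    "CSeq: 314159 INVITE\r\n\r\n")

ACK = (
    "ACK sip:callee@example.com SIP/2.0\r\n"  # confirms the 200 OK was received
    "Call-ID: a84b4c76e66710\r\n"
    "CSeq: 314159 ACK\r\n\r\n")
```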
  • The wired/wireless convergence service environment is one in which terminal mobility is generalized: rather than access in the simple sense of connecting to an existing network, it is evolving into an environment in which terminals move between heterogeneous networks, selecting among various access networks on criteria such as service quality and user preference. Accordingly, mobility support technology between heterogeneous networks is required for a terminal to access them, and the terminal must be equipped with functions for this technology.
  • A multi-mode terminal with multiple communication interfaces for the access networks to which it can connect is required, and the need for such multi-mode terminals keeps growing.
  • The current approach takes the form of changing the communication mode to connect to a heterogeneous network, which requires resetting the terminal's power and services.
  • Therefore, an automatic access control technique for multi-mode terminals is required, in which handover between heterogeneous networks is controlled automatically without user configuration of the terminal and without service disconnection.
  • FIG. 12 is a conceptual diagram of a content matching system according to an embodiment of the present invention.
  • During a video call, an Avatar video call program included in each user's terminal recognizes the user's face and emotion, and avatars and emoticons are matched on the screen and transmitted to the counterpart.
  • Face and emotion recognition continues throughout the call; avatar matching and emoticon matching against the recognized emotion or facial expression enable a more effective and enjoyable video call with the other party.
  • The basic video call is made through SIP-based video conferencing and performs data transmission and reception using RTP/RTCP.
  • A switch to HTTP streaming may also be considered.
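Once the SIP session is up, media flows over RTP (RFC 3550). A minimal sketch of packing the fixed 12-byte RTP header around an encoded video payload follows; the payload type and SSRC are assumed values, not taken from the patent.

```python
import struct

def rtp_packet(payload: bytes, seq: int, timestamp: int,
               ssrc: int = 0x1234ABCD, payload_type: int = 96) -> bytes:
    """Build an RTP packet: fixed 12-byte header (RFC 3550) plus payload."""
    version, padding, extension, csrc_count, marker = 2, 0, 0, 0, 0
    byte0 = (version << 6) | (padding << 5) | (extension << 4) | csrc_count
    byte1 = (marker << 7) | payload_type  # 96 = first dynamic payload type
    header = struct.pack("!BBHII", byte0, byte1,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)
    return header + payload

pkt = rtp_packet(b"<encoded video frame>", seq=1, timestamp=90000)
```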
  • FIG. 13 is a diagram illustrating a basic operation procedure for image registration according to the present invention.
  • an avatar and a video are matched.
  • Images are input through the camera module of the smartphone (terminal), and image preprocessing is performed for face recognition, facial expression recognition, and emotion recognition.
  • Possible face candidates are extracted, and the components of the face are analyzed to extract information for posture estimation and emotion recognition.
  • The facial expression and motion of the avatar are then selected, its position in 3D space is calculated, it is matched with the video, and the result is displayed on the screen. This image is also encoded and sent over the network to the remote video call smartphone.
  • Step (1) captures a frame image to be analyzed from the video data source of the camera module.
  • Step (2) preprocesses the captured image into a state that is easy to analyze, so that the boundaries between objects in the image can be identified by an edge detection algorithm or the like.
  • Step (3) detects a face in the preprocessed image, applying a learning algorithm to analyze the facial feature points and extracting the positions of the facial components on the image and their relationship data. Step (4) estimates the posture and recognizes the facial expression based on the recognition information extracted in the face detection step, and selects the avatar's posture and facial expression corresponding to those taken by the face. Step (5) then determines the avatar (face) position coordinates in 3D space from the analyzed information, selects an avatar animation for the expression and emotion, and transmits a control signal (message) to the 3D engine.
  • In step (6), the 3D space in which the avatar image and video are represented is composed, and the analyzed avatar is placed at the corresponding position (the avatar and the 3D space are controlled through the 3D engine API).
  • In step (7), the avatar and the video source being represented are matched into a single image source, and in step (8), the video is encoded together with the voice source.
  • The audio source is extracted from the video source and processed.
  • In step (9), the result is sent over the network to the counterpart terminal configured for the video call; the session is set up through SIP, and transmission over the Internet uses RTP/RTCP. A sketch of the whole per-frame pipeline follows below.
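Putting steps (1)-(9) together, a minimal per-frame sketch: the pose/emotion analysis, avatar rendering, encoding, and network send are hypothetical helper functions standing in for the 3D engine and codec, since the patent does not fix those APIs.

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture(0)                            # camera module frame source

while True:
    ok, frame = cap.read()                           # (1) capture frame
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)                # (2) preprocessing, e.g. edges
    faces = cascade.detectMultiScale(gray, 1.1, 5)   # (3) face detection
    for box in faces:
        pose, emotion = analyze_face(gray, box)          # (4) hypothetical analysis
        anim = select_avatar_animation(pose, emotion)    # (5) hypothetical selection
        frame = render_and_match(frame, box, anim)       # (6)-(7) hypothetical 3D step
    send_rtp(encode_video(frame))                    # (8)-(9) hypothetical encode/send
```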
  • FIG. 15 illustrates an avatar video communication operation process through emotion recognition and image registration of FIG. 14 using actual content.
  • Face tracking is performed and the eyes, nose, and mouth are recognized in the captured image; through this, standard analysis and relationship technology, emotional inference, and real-time matching are applied to match a virtual model onto the real-world face, so that a real-time avatar is implemented on the user's face.
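Face tracking between frames can be sketched with Lucas-Kanade optical flow, carrying the eye/nose/mouth anchor points from one frame to the next; the frame files and point coordinates below are illustrative.

```python
import cv2
import numpy as np

# Track facial feature points between consecutive frames; prev_pts would come
# from the eye/nose/mouth recognition of the previous frame.
prev_gray = cv2.cvtColor(cv2.imread("frame0.png"), cv2.COLOR_BGR2GRAY)
next_gray = cv2.cvtColor(cv2.imread("frame1.png"), cv2.COLOR_BGR2GRAY)
prev_pts = np.float32([[220, 180], [300, 180], [260, 240],
                       [260, 290]]).reshape(-1, 1, 2)  # illustrative landmarks

next_pts, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray,
                                                 prev_pts, None)
tracked = next_pts[status.flatten() == 1]  # updated anchors for virtual-model matching
```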
  • As described above, the apparatus and method for the emotional content service of a communication terminal device, the apparatus and method for emotion recognition therefor, and the apparatus and method for generating and matching emotional content using the same build on voice recognition; the face region detection, face region normalization, and in-region feature extraction technologies of face recognition; the facial component (expression analysis) relationship technology of emotion recognition; object and hand gesture recognition; and action and behavior recognition. On this basis, real-time matching of real and virtual images is used to match mixed virtual objects (including characters), through gesture and facial expression analysis, onto both the faces and bodies of the parties making a video call, realizing through the video call a mixed reality that cannot be seen in the real world.
  • In the apparatus and method for the emotional content service of a communication terminal device, expression analysis relation functions for specific voices, facial expressions, and gestures of the face and body are registered in advance; when a similar voice, facial expression, or gesture is transmitted through the video, the virtual objects responding to the voice, facial expression, or gesture are matched onto the face and body in real time, making the video call a delight to enjoy.
  • The apparatus and method for the emotional content service of a communication terminal device can be realized through a mobile device. As a system that recognizes gestures and expresses them through avatars, it will become a foundation for a leap forward in cutting-edge video industries such as domestic film, animation, and cyber characters, and it will greatly contribute to the competitiveness of the domestic mobile content and video content industries by making more natural the process of transferring human emotions and facial expressions to a third object (an avatar) and expressing 3D virtual objects on real-world faces.
  • FIG. 16 is a diagram illustrating various fields to which the present invention is applicable. With the spread of smart devices, the protagonist of a movie and the viewer may communicate in the future: by recognizing the viewer through a camera installed at the top of the device and showing appropriate responses and gestures, the invention can be applied to a high-tech cultural industry that offers the diversity of watching the same video yet having different experiences.
  • The present invention can be applied to script-based expression and gesture representation technology used in movies, animation, and cyber characters, to interface design using users' emotional reactions, to facial recognition for security and surveillance, and to measuring consumers' emotional responses to products and designs; its fields of application are endless. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be determined by the scope of the following claims and their equivalents.

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Tourism & Hospitality (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Multimedia (AREA)
  • Strategic Management (AREA)
  • Signal Processing (AREA)
  • Marketing (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Primary Health Care (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Processing Or Creating Images (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

The present invention relates to augmented reality providing a user with mixed data from a real environment and a virtual environment. More particularly, the present invention relates to an apparatus and method for emotional content services on telecommunication devices, to an apparatus and method for emotion recognition therefor, and to an apparatus and method for generating and matching the emotional content using same, which may utilize a real-time matching technique of matching real images and virtual images on the basis of voice recognition, a technique of extracting facial characteristics, a technique of facial normalization, a technique of facial detection for a facial recognition technique for object recognition, a facial feature (expression analysis) relationship technique for an emotion recognition technique for object recognition, a hand motion recognition technique for object recognition, and a motion and behavior recognition technique for object recognition, and which may match mixed virtual objects (including characters) to the faces and bodies of both sides making a video call using gesture and expression analyses to implement various mixed realities, which cannot be seen in the real world, for the video call.

Description

[Correction under Rule 26, 16.11.2011] Apparatus and method for an emotional content service of a communication terminal device, apparatus and method for emotion recognition therefor, and apparatus and method for generating and matching emotional content using the same
The present invention relates to an emotional content service apparatus and method that analyzes the emotions and facial expressions recognized by the image pickup apparatus of a transmitting communication terminal and mixes virtual objects onto the screen of the receiving communication terminal in real time so as to deliver the analyzed emotion or communication effectively; to an emotion recognition apparatus and method for serving the emotional content; to an apparatus and method for generating and matching the emotional content through the emotion recognition; and to an apparatus and method for generating the emotional content.
Recent key keywords in the domestic and international IT fields can be summarized as smart devices, augmented reality, and contents. Among smart devices, the representative example is the smartphone, which spread in earnest through Apple's iPhone. Smartphones are growing rapidly over a short period, with subscribers surpassing ten million as of March 2011. According to Korea IDC survey data, the domestic tablet PC market is forecast at about 900,000 to 1,000,000 units this year, with the iPad expected to account for 35-45% of it.
Augmented reality, the second key keyword in the IT field, is a technology derived from virtual reality that combines the real world with virtual experiences. It is regarded as one of the top ten innovations that will lead the future, and it gives the user an enhanced sense of reality by letting the user interact with virtual objects grounded in the real world.
Augmented reality is a branch of virtual reality: a computer graphics technique that synthesizes virtual objects into the real environment so that they appear to exist in the original scene. Unlike conventional virtual reality, which targets only virtual spaces and virtual objects, it synthesizes virtual objects on the basis of the real world, reinforcing it with additional information that is difficult to obtain from the real world alone. Augmented reality technology is currently used and actively developed in various forms in broadcasting, advertising, exhibitions, games, theme parks, military applications, education, and promotion.
In other words, unlike virtual reality technology, which excludes interaction with the real world and handles interaction only within a pre-built virtual space, augmented reality is based on real-time processing: information about the real world obtained in advance is overlaid on images of the real world input through the terminal. It is thus distinguished from virtual reality, which provides only computer-generated images, in that it enables interaction with the real world.
Such augmented reality technology is in the spotlight particularly in the field of mobile augmented reality for communication terminals, and much research and investment currently goes into marker-based and sensor-based mobile augmented reality. Marker-based mobile augmented reality photographs a specific sign corresponding to a building together with the building and identifies the building by recognizing the sign. Sensor-based mobile augmented reality infers the current position and viewing direction of the terminal using its built-in GPS and digital compass, and overlays POI (Point of Interest) information corresponding to the image in the inferred direction.
These conventional techniques generally provide only information about buildings or places designated in advance by the service provider, so they cannot give the user appropriate information about objects the provider has not designated; they merely infer the current position and the direction the terminal is facing, without providing technology for accurately recognizing the image input through the terminal. Most current research is therefore confined to the accuracy and quantitative expansion of the information provided, such as research aimed at intuitive and convenient image-recognition-based augmented reality that accurately recognizes real objects in the acquired image and maps their local information, or research on displaying, in augmented reality form, icons at the positions of objects in the input image so that the user can conveniently locate an object of interest and access its detailed information.
Therefore, rather than concentrating only on augmented reality technology itself, it is desirable, alongside its advancement, to develop various applications that can delight users through augmented reality in the everyday use of communication terminals.
3D video content, the third of the key IT keywords, has seen explosive industry growth since James Cameron's 'Avatar', and the time when video calls are enjoyed together with 3D content is expected to arrive.
According to 'Trends and Prospects of Mobile Applications' by the Korea Information Society Development Institute, Telecom Asia selected automobiles, mobile video calling, social media, augmented reality, and adult content as the five app trends for 2011. A report by the marketing research firm In-Stat projects that mobile video call revenue will exceed USD 1 billion by 2015, and IDC forecasts that worldwide mobile app downloads will reach 10.9 billion this year and climb steeply to 76.9 billion in 2014, with mobile app revenue surpassing USD 35 billion in 2014. Such forecasts suggest the app ecosystem will easily carry on for the next decade.
Unlike in Korea, video call services spread slowly on overseas smartphones, but with Apple's iPhone 4 highlighting its FaceTime service, interest in video calling is growing rapidly.
In addition, recent Android phones, including Samsung's Galaxy S and Galaxy S II, support video calling, and since Android 2.3 (Gingerbread) officially supports it, Android phones are expected to ship with video call capability as standard from mid-2011 onward. Android-based tablet computers and the iPad 2 generation also appear certain to provide video call services through their front and rear cameras, making it increasingly likely that video calling will become commonplace.
Furthermore, once 4G services take off in earnest, one-to-one or multi-party video calling may grow into a core service alongside voice and data, and unlike in the domestic market, overseas interest in video calling on mobile terminals is steadily rising.
Meanwhile, in the social networking service (SNS) field, voice and video call services are spreading like a fashion; the voice call service combining Twitter and Jajah lies on the same trajectory, and online communication services using voice and video are expected to keep increasing. The service trends of social network and telecommunications companies are therefore expected to move gradually toward combining augmented reality, video calling, and specialized services.
Accordingly, there is an urgent need to develop the underlying technologies for a new kind of service, using emotion recognition and augmented reality techniques, that can take the spread of video call services a step further, and to develop services based on them.
Accordingly, an object of the present invention is to provide an emotional content service apparatus and method capable of adding fun to a video call by presenting, together with the call video, a virtual object expressing the emotional states of both parties; an emotion recognition apparatus and method for serving the emotional content; an apparatus and method for generating the emotional content through the emotion recognition; and the emotional content generated by that apparatus and method.
Another object of the present invention is to provide an emotional content service apparatus and method that superimpose the caller's emotional state on the caller's image through a virtual object, allowing callers to experience augmented reality that lends their video call greater realism; an emotion recognition apparatus and method for serving the emotional content; an apparatus and method for generating the emotional content through the emotion recognition; and the emotional content generated by that apparatus and method.
A further object of the present invention is to provide an emotional content service apparatus and method that analyze the emotion and facial expression recognized by the imaging device of a transmitting communication terminal and mix a virtual object in real time onto the screen of a receiving communication terminal so that the analyzed emotion or intended message is conveyed effectively.
A further object of the present invention is to provide an emotion recognition apparatus and method for providing an emotional content service in which the emotion and facial expression recognized by the imaging device of a transmitting communication terminal are analyzed and a virtual object is mixed in real time onto the screen of a receiving communication terminal so that the analyzed emotion or intended message is conveyed effectively.
A further object of the present invention is to provide an apparatus and method for generating the emotional content through the emotion recognition apparatus and method for providing such an emotional content service.
A further object of the present invention is to provide the emotional content generated by the apparatus and method for generating the emotional content through the emotion recognition apparatus and method for providing such an emotional content service.
To achieve the above objects, the present invention has at least one of the following features: avatar matching, in which emotion is analyzed so that an avatar expresses a specific emotion on the user's behalf; emoticon matching, in which facial expression is analyzed and an effect for the specific expression is added; and exaggeration of the specific part of the face or body that exhibits the expression of a specific emotion.
The present invention proposes core face detection, face recognition, and emotion recognition technologies for recognizing a user's emotions and expressions; technologies for generating and matching an avatar that acts out the recognized emotion and an emoticon that maximizes the expression; and a video call service on smartphones that uses them.
By recognizing changes in facial expression and matching the corresponding content (the avatar) onto the person's actual face, the present invention enables expressions of emotion that are impossible in the real world, thereby increasing the effectiveness of communication.
To achieve the above objects, the emotional content service apparatus and method of a communication terminal according to the present invention, the emotion recognition apparatus and method therefor, and the apparatus and method for generating and matching emotional content using the same are characterized in that, on the basis of voice recognition; the face region detection, face region normalization, and in-region feature extraction technologies of face recognition for object recognition; the facial-component (expression analysis) relationship technology of emotion recognition for object recognition; hand gesture recognition; and motion and behavior recognition, real-time registration of live video and virtual imagery is used to match mixed virtual objects (including text), derived from gesture and expression analysis, onto the faces and bodies of both video call parties, thereby realizing through the video call a variety of mixed realities that cannot be seen in the real world.
The emotional content service of a communication terminal according to the present invention is further characterized in that expression-analysis relation functions for specific expressions and gestures of the voice, face, and body are registered in advance, and when a similar voice, expression, or gesture is transmitted through the video, a virtual object responding to that voice, expression, or gesture is matched in real time onto the face and body in the output video screen, so that the parties can enjoy the video call.
Also, to achieve the above objects, the emotional content service of a communication terminal according to the present invention is characterized in that, in a communication terminal having at least imaging means and display means, the user's emotional state is extracted from at least one of the user's gestures and expressions captured through the imaging means, a virtual object corresponding to the extracted emotional state is generated and superimposed on at least one of the user's body and face, and the result is displayed on the display means of the communication terminal of the other party to the video call.
Here, in the emotional content service method of the communication terminal according to the present invention, the virtual object may further include text.
Also, in the emotional content service method of the communication terminal according to the present invention, the virtual object may be changeable by the user.
Also, in the emotional content service method of the communication terminal according to the present invention, the virtual object may change in real time in response to the emotional state.
Also, in the emotional content service method of the communication terminal according to the present invention, the position at which the virtual object is superimposed on the user's body and face may be changed.
Also, in the emotional content service method of the communication terminal according to the present invention, if the communication terminal further includes voice input means and voice output means, the user's emotional state may further be extracted from the user's voice entered through the voice input means.
Also, to achieve the above objects, the emotional content service of a communication terminal according to the present invention is characterized in that, in a video call service providing terminal having at least imaging means and display means, the user's emotional state is extracted from at least one of the user's gestures and expressions captured through the imaging means, a virtual object corresponding to the extracted emotional state is generated and superimposed on at least one of the user's body and face, and the result is displayed on the display means of the video call apparatus of the other party to the video call.
Also, to achieve the above objects, the emotional content service method of a communication terminal according to the present invention includes: inputting a face image of the user to the communication terminal; extracting facial components from the input face image; preprocessing the extracted facial components; extracting facial features from the preprocessed facial components; registering the extracted facial features in a face database; and recognizing an emotion by comparing the features registered in the face database with the facial components extracted in the feature extraction step.
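A minimal sketch of this recognition flow, assuming a toy feature descriptor and nearest-neighbour comparison, is given below; the functions `preprocess`, `extract_features`, `enroll`, and `recognize` are illustrative stand-ins, not the claimed algorithms.

```python
import numpy as np

face_db = {}   # emotion label -> list of enrolled feature vectors

def preprocess(component):
    # Illumination normalisation so enrolled and query features compare.
    img = component.astype(np.float32)
    return (img - img.mean()) / (img.std() + 1e-6)

def extract_features(components):
    # Toy descriptor: concatenate the flattened, normalised components.
    return np.concatenate([preprocess(c).ravel() for c in components])

def enroll(label, components):
    # Register the extracted features in the face database.
    face_db.setdefault(label, []).append(extract_features(components))

def recognize(components):
    # Compare against registered features; the nearest neighbour wins.
    query = extract_features(components)
    best = min(((np.linalg.norm(query - f), label)
                for label, feats in face_db.items() for f in feats),
               default=(float("inf"), "unknown"))
    return best[1]
```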
Also, to achieve the above objects, a method of generating and matching the emotional content through the emotion recognition method for the emotional content service of a communication terminal according to the present invention includes: receiving a face image of the user from the camera module of the communication terminal; preprocessing the input face image; detecting only valid data in the preprocessed face image; estimating the position of the face and the camera information from the detected valid data; and generating a 3D image from the camera information and the face position information and matching the generated 3D image to the user's face image.
Here, the method of generating and matching the emotional content through the emotion recognition method for the emotional content service of a communication terminal according to the present invention may further include: outputting the 3D image matched to the user's face image on the screen; and transmitting the 3D image matched to the user's face image to the other party's communication terminal over a network.
Also, to achieve the above objects, the emotional content service method of a communication terminal according to the present invention includes: capturing, from the camera module of the communication terminal, a frame image to be analyzed from the video data source; preprocessing the captured image into a state that is easy to analyze; detecting a face in the preprocessed image; recognizing pose and expression on the basis of the recognition information extracted in the face detection step, and selecting the avatar pose and expression corresponding to the pose and expression of the face; determining, from the analyzed information, the position coordinates of the selected avatar in 3D space, selecting the avatar animation for the given expression and emotion, and passing control signals to the 3D engine; constructing the 3D space in which the avatar image and video are to be rendered and placing the analyzed avatar at the corresponding position; merging the avatar rendered in the 3D space and the video source into a single image source; video-encoding the merged image together with the audio source; and transmitting the merged image to the far-end terminal over the network established for the video call.
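The per-frame data flow of this method can be sketched as follows. Every stage here (`detect_face`, `select_avatar`, and so on) is a hypothetical stub standing in for the real detector, 3D engine, and encoder, so only the ordering of the steps is meaningful.

```python
from dataclasses import dataclass

@dataclass
class Face:
    position: tuple        # (x, y) anchor of the detected face
    pose: str              # e.g. "frontal", "tilted_left"
    expression: str        # e.g. "joy", "anger"

def capture_frame(source):            # 1. grab a frame to analyse
    return source.read()

def preprocess(frame):                # 2. normalise into an analysable state
    return frame

def detect_face(frame):               # 3. locate the face and its features
    return Face(position=(120, 80), pose="frontal", expression="joy")

def select_avatar(face):              # 4-5. pick the avatar pose/animation
    return {"clip": face.expression, "pose": face.pose,
            "anchor": face.position}

def compose_scene(frame, avatar):     # 6-7. place the avatar in 3D space
    return {"frame": frame, "overlay": avatar}   # and flatten to one image

def encode_and_send(scene, audio, session):      # 8-9. encode and transmit
    session.send(("video+audio", scene, audio))  # over the call's network
```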
Here, in the emotional content service method of the communication terminal according to the present invention, the step of detecting a face in the preprocessed image is characterized by analyzing facial feature points by applying a learning algorithm or the like, and extracting the positions of the components in the image and their relationship data.
Also, in the emotional content service method of the communication terminal according to the present invention, the step of transmitting the merged image to the far-end terminal over the network is characterized by establishing a session through SIP and transmitting over the Internet through RTP/RTCP.
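For reference, the fixed 12-byte RTP header (RFC 3550) that would carry such media can be packed as in the sketch below; payload type 96 is a common dynamic value for video and is an assumption here.

```python
import struct

def rtp_header(seq, timestamp, ssrc, payload_type=96, marker=0):
    """Pack a minimal RTP fixed header (version 2, no CSRC/extension)."""
    b0 = (2 << 6)                      # version 2, no padding/extension
    b1 = (marker << 7) | payload_type  # marker bit + payload type
    return struct.pack("!BBHII", b0, b1, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
```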
Also, to achieve the above objects, an emotional content service apparatus of a communication terminal according to the present invention includes: a server communication unit interworking with a video call service providing terminal; and a server control unit that recognizes the user's emotional state from at least one of the user's gestures and expressions in the image information received from the video call service providing terminal, compares the recognized emotional state with previously stored object-related information to extract an object matching the recognized emotional state, superimposes the extracted object on at least one of the user's body and face, and transmits the result to the counterpart video call service providing terminal communicating with the video call service providing terminal.
Here, the emotional content service apparatus of the communication terminal according to the present invention may further include a server storage unit that stores object-related data corresponding to the emotional state.
Also, to achieve the above objects, an emotion recognition apparatus for the emotional content service of a communication terminal according to the present invention includes: a display unit that displays the other party's video in the video call and an object superimposed on that video; a communication unit interworking with a video call service providing server; an imaging unit that acquires the user's image information for the video call; and a control unit that recognizes the user's emotional state from the image information acquired by the imaging unit, extracts emotion information related to the recognized emotional state and transmits it to the video call service providing server, receives from the video call service providing server an object corresponding to the other party's emotion information in the video call, and superimposes the received object, at the position associated with it, on the other party's video and outputs the result to the display unit.
Also, to achieve the above objects, an apparatus for generating and matching the emotional content through the emotion recognition method for the emotional content service of a communication terminal according to the present invention includes: a display unit that displays the other party's video in the video call and an object superimposed on that video; a communication unit interworking with a video call service providing server; an imaging unit that acquires the user's image information for the video call; and a control unit that recognizes the user's emotional state from the image information acquired by the imaging unit, extracts emotion information related to the recognized emotional state and transmits it to the video call service providing server, receives from the video call service providing server an object corresponding to the other party's emotion information in the video call, and superimposes the received object, at the position associated with it, on the other party's video and outputs the result to the display unit.
Here, the display unit of the apparatus for generating and matching the emotional content through the emotion recognition method for the emotional content service of the communication terminal according to the present invention may further display the user's own video in the video call.
The apparatus for generating and matching the emotional content according to the present invention may further include a key input unit for determining whether to apply the object.
The apparatus for generating and matching the emotional content according to the present invention may further include a storage unit for storing the object.
Also, to achieve the above objects, the emotional content service of a communication terminal according to the present invention is characterized in that the user's emotional state is extracted from at least one of the user's gestures and expressions captured through the imaging means of a communication terminal having at least imaging means and display means, at least one part of the user's body or face exhibiting the gesture or expression corresponding to the extracted emotional state is exaggerated, and the result is displayed on the display means of the communication terminal of the other party to the video call.
As described above, the present invention can provide abundant viewing experiences in a video call by realizing, through the video call, a variety of mixed realities that cannot be seen in the real world.
In particular, the present invention has the effect of adding fun to a video call by providing, together with the call video, a virtual object expressing the emotional states of both parties.
The present invention also superimposes the caller's emotional state on the caller's video through a virtual object, allowing callers to experience augmented reality that lends their video call greater realism, and thereby conveying the callers' emotional states in a fresh way.
Moreover, by giving shape to the voice and to specific facial and bodily expressions and gestures as virtual objects on the video call screen, the present invention lets users experience both the virtual and the real.
FIG. 1 is a diagram illustrating the concept of avatar matching according to the present invention;
FIG. 2 is a diagram illustrating the concept of emoticon matching according to the present invention;
FIG. 3 is a diagram illustrating composite data matched to changes in a user's facial expression, and the corresponding standard data, according to an embodiment of the present invention;
FIG. 4 is a control flowchart illustrating the emotion and expression recognition and matching procedure for a 3D avatar;
FIG. 5 is a table summarizing an embodiment of the main characters of the storytelling according to the present invention;
FIG. 6 is a diagram illustrating a service mobility scenario for user movement between access networks to which the emotional content service method according to the present invention is applied;
FIG. 7 is a control flowchart for face detection in the emotion recognition method for the emotional content service of a communication terminal according to the present invention;
FIG. 8 is a diagram illustrating the configuration and request/response sequence for the SIP service;
FIG. 9 is a diagram illustrating the basic message and status code scheme of SIP;
FIG. 10 is a diagram illustrating the SIP protocol stack;
FIG. 11 is a diagram illustrating the basic call setup procedure of the SIP protocol;
FIG. 12 is a conceptual diagram of the content matching system according to the present invention;
FIG. 13 is a basic operational flowchart for image registration according to the present invention;
FIG. 14 is an overview of the avatar video communication procedure through emotion recognition and image registration according to the present invention;
FIG. 15 is a diagram illustrating the avatar video communication procedure of FIG. 14 using actual content;
FIG. 16 is a diagram illustrating various fields to which the present invention is applicable.
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. Note that, wherever possible, identical components are denoted by identical reference numerals throughout the drawings. Specific details appear in the following description to aid an overall understanding of the present invention, and detailed descriptions of well-known functions or configurations are omitted where they would unnecessarily obscure the subject matter of the invention.
The present invention has at least one of the features of avatar matching, in which emotion is analyzed so that an avatar expresses a specific emotion on the user's behalf, and emoticon matching, in which facial expression is analyzed and an effect for the specific expression is added.
FIG. 1 is a diagram illustrating the concept of avatar matching according to the present invention, and FIG. 2 is a diagram illustrating the concept of emoticon matching according to the present invention.
As shown in FIG. 1, avatar matching according to the present invention recognizes the emotion expressed in the live video and replaces the live video with an avatar expressing that emotion in augmented reality. The entire live video frame may be replaced, or only a specific part of the face may be rendered in augmented reality.
Meanwhile, as shown in FIG. 2, emoticon matching according to the present invention recognizes the emotion expressed in the live video and uses a variety of emoticons to heighten the conveyance of the recognized expression.
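As a hedged illustration of such an overlay, the sketch below alpha-blends an RGBA emoticon onto the face region of a live frame; the coordinates are assumed to come from a face detector, and the function is a toy stand-in rather than the matching method of the invention.

```python
import numpy as np

def overlay_emoticon(frame, emoticon_rgba, x, y):
    """Alpha-blend an RGBA emoticon onto `frame` at top-left (x, y)."""
    h, w = emoticon_rgba.shape[:2]
    roi = frame[y:y + h, x:x + w].astype(np.float32)
    icon = emoticon_rgba[..., :3].astype(np.float32)
    alpha = emoticon_rgba[..., 3:4].astype(np.float32) / 255.0
    frame[y:y + h, x:x + w] = (alpha * icon
                               + (1.0 - alpha) * roi).astype(np.uint8)
    return frame
```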
Accordingly, the emotional content according to the present invention may consist, for example, of ten masculine emotion-recognition reactions, ten feminine emotion-recognition reactions, and ten animal emotion-recognition reactions. Through 3D modeling that can be matched to the standard data, such emotional content can express characters and motions linked to motion-expression scripts.
FIG. 3 shows composite data matched to changes in a user's facial expression, and the corresponding standard data, according to an embodiment of the present invention. As shown in FIG. 3, 3D avatars are produced to match changes in the users' expressions during a video call, and an avatar-generation and video call application is provided that delegates hard-to-express emotions and expressions to an avatar standing in for the standard data, making video calls between users more entertaining and enjoyable. Here, the present invention may be implemented so that male, female, and anthropomorphic characters, together with per-type performances reflecting changes in emotion and expression, are produced and applied to video calls.
As shown in FIG. 3, standardized data are built for calm, joy, anger, sadness, surprise, fear, kisses (two kinds), and winks (two kinds), and the basic and applied forms of the characters 'Ava', 'Bata', 'Spawn', and 'Bath' of FIG. 5, described below, are produced as 3D avatars, with one basic form and at least two applied forms per character. Emotion and expression animations are then produced that implement the movements of 'Ava', 'Bata', 'Spawn', and 'Bath' in response to their emotional and expressive reactions. Accordingly, 3D object and animation data are produced for calm, joy, anger, sadness, surprise, fear, kisses (two kinds), and winks (two kinds) (for example, Ava: 10 kinds, Bata: 10 kinds, Spawn: 10 kinds).
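Purely as an illustration of how such per-character animation sets might be catalogued, the sketch below indexes ten expression clips for each character named in the embodiment; the asset-path scheme is an assumption.

```python
EMOTIONS = ["calm", "joy", "anger", "sadness", "surprise",
            "fear", "kiss_1", "kiss_2", "wink_1", "wink_2"]

# One animation set per character; the file layout is hypothetical.
avatar_catalog = {
    character: {e: f"assets/{character}/{e}.anim" for e in EMOTIONS}
    for character in ("ava", "bata", "spawn")
}

def animation_for(character, emotion):
    # e.g. animation_for("bata", "joy") -> "assets/bata/joy.anim"
    return avatar_catalog[character][emotion]
```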
FIG. 4 is a control flowchart of the emotion and expression recognition and matching procedure for a 3D avatar. As shown in FIG. 4, the avatar assets produced for each emotion and expression are reflected in the user application through the following recognition and matching steps.
In FIG. 4, the steps marked in blue are implemented with commercial libraries, those in green are content implementation steps, gray denotes the system domain, and red denotes the steps implemented by the technology for developing smartphone application content in which, during a video call, the user's emotion and expression are recognized and an avatar acting out the emotion and an emoticon adding effect to the expression are matched.
That is, the present invention develops the core face recognition, emotion recognition, image registration, and video communication technologies needed for an emotion- and expression-matching video call service, together with the application content and the interworking server. Avatars and emoticons to stand in for the user are also produced and applied to the smartphone application.
In the present invention, the emotional content generated by the apparatus and method for generating emotional content through the emotion recognition apparatus and method for providing an emotional content service, in which the emotion and expression recognized by the imaging device of a transmitting communication terminal are analyzed and a virtual object is mixed in real time onto the screen of a receiving communication terminal, exploits the creative, future-oriented value and smart image of the characters, and is used as content for a new theme and a new trend in mobile video calling.
In particular, the emotional content according to the present invention raises educational utility as a differentiated theme built on advanced smart devices that will change future life, and maximizes artistic value through differentiated content storytelling and the embodiment of universal values. In other words, it maximizes the differentiating factor that 3D animation can be enjoyed not only through a video player but also through video calls with others.
Accordingly, the emotional content according to the present invention offers a composition that can help cultivate our children's emotional development, conveying the didactic value that our lives are happy and precious, and inducing a variety of expression changes, through smart video calls that are with us and part of everyday exchange.
To this end, as an example of differentiated content storytelling, 'Ava', who overcomes loneliness and grows up with precious friends, encourages communication between parent and child, friend and friend, teacher and student, and workplace superior and colleague, addressing the problem of 'communication' between social strata, which looms as a problem today, through the protagonist 'Ava' and his friends, the 'Bata' race. The visual image of a character according to the present invention preferably builds a character image anyone can like and accept, using a three- to four-heads-tall figure.
FIG. 5 is a table summarizing an embodiment of the main characters of the storytelling according to the present invention, showing the main characters, their images, and their personalities and roles for conveying an appreciation of the preciousness of the life that lives and breathes alongside us, the universal values of imagination and adventure, friendship and family, and the importance of communication.
In the background story of the embodiment of FIG. 5, the protagonist 'Ava' (the user) runs into slapstick episodes as he meets 'Bata', a race living in the virtual space inside his mobile phone. The protagonist is a video call user, and 'Bata', the anthropomorphized character in the virtual space and the protagonist's friend, is a personal friend inside the phone who can be met whenever the protagonist makes a video call with someone. The only passage through which the race 'Bata' can meet the protagonist is the communication terminal of the mobile video call; fond of imitation and mischief, Bata behaves like the protagonist according to the protagonist's emotional reactions, and is portrayed as a race with the mysterious ability to display, on the protagonist's behalf, a variety of expression emoticons as if they were real.
Following the background story structure of FIG. 5, standardized data are built for calm, joy, anger, sadness, surprise, fear, kissing, and winking; the basic characters of Ava, Bata, Spawn, and Bath and numerous applied characters are produced; movements are implemented according to the emotional and expressive reactions of Ava, Bata, Spawn, and Bath; and 3D object and animation data are produced for each character's calm, joy, anger, sadness, surprise, fear, kiss, and wink.
In rendering the emotional content according to the present invention as 3D objects on a video communication terminal such as a smartphone, the processing time from face detection through recognition and emotion recognition during a video call is expected to be substantial, and delays are expected in rendering the registration between the video and the 3D objects on the basis of the recognized information. In terms of 3D registration, therefore, minimizing registration latency through the use of a 3D engine can be considered; for example, development speed can be improved by adopting the Unity3D engine in place of representing 3D objects directly with OpenGL ES.
FIG. 6 illustrates a service mobility scenario for user movement between access networks to which the emotional content service method according to the present invention is applied, from which measures can be sought to solve the problem of network attachment in a mobile smartphone environment.
As shown in FIG. 6, a network-environment problem arises in selecting an available access network as the user's position moves: the change of IP address assigned on attaching to a different access network can cause disconnection of, and errors in, the services in use. In particular, for real-time services such as video calls, quality degradation or outright service failure can occur, so it is necessary to explore the application of service mobility techniques and derive countermeasures to these problems.
FIG. 7 is a control flowchart for face detection in the emotion recognition method for the emotional content service of a communication terminal according to the present invention. Referring to FIG. 7, face detection means finding the position of a face in an image. Research on face detection from images involves many difficulties, because a human face can appear in extremely varied forms: morphological changes such as the frontal or profile angle depending on gaze direction, the degree to which the head tilts left or right, varied expressions, and the size of the face image depending on distance from the camera, as well as external changes such as differences in brightness within the face due to illumination, complex backgrounds, or other objects whose color is hard to distinguish from a face.
Face detection, a preprocessing step before face recognition, divides into knowledge-based methods, feature-based methods, template-matching methods, and appearance-based methods, as summarized in Table 1 below.
Table 1
[Table image: PCTKR2011008399-appb-T000001]
Knowledge-based face detection methods detect a face by exploiting the roughly constant distances and positional relationships among facial components such as the eyebrows, eyes, nose, and mouth. In these methods, a partial concentration of intensity appears in the central region of a face image, and detection proceeds by comparing the intensity distribution of the face image with that of the input image, mainly through a top-down approach. The drawback of knowledge-based detection is that it is applicable only where there is little variation or in special cases, because detection becomes difficult in images with diverse facial variations such as head tilt, camera angle, and expression.
Feature-based face detection methods detect a face using the size and shape of the facial feature components (eyes, nose, mouth, contour, intensity), their interrelations, facial color and texture information, and combined forms of these cues. They take a bottom-up approach, finding partial facial features and integrating the candidate regions (the facial components) to locate the face.
Feature-based face detection methods have the advantages of fast processing, easy face localization, and insensitivity to pose or facial orientation. However, they can misrecognize backgrounds or objects similar in color to skin, may lose facial color and texture information as illumination brightness changes, and may fail to detect the facial feature components depending on how far the face is tilted.
Template-matching-based face detection methods build standard templates for all target faces and then detect a face by comparing similarity correlation with the input image; they include predefined-template algorithms and deformable-template algorithms.
Template-matching-based face detection generates information from prepared image data using partial regions or contours, then transforms the generated information through the respective algorithms to increase the amount of similar information available for detection. However, this approach is sensitive to changes in face size with distance and to the rotation angle and tilt of the face with gaze direction, and, like the knowledge-based methods, suffers from the difficulty of defining templates for each different pose.
Appearance-based methods detect a face using a model learned from a set of training images by pattern recognition. Among the most widely used approaches in the face detection field, they include eigenfaces generated by principal component analysis (PCA), linear discriminant analysis (LDA), neural networks (NN), AdaBoost, and support vector machines (SVM).
To detect face regions in complex images, appearance-based methods derive eigenvectors learned from training data sets of face and non-face regions and use them to find faces. They offer the advantage of a high recognition rate, since the constraints noted for the other detection methods are overcome through learning. However, appearance-based methods such as PCA, NN, and SVM require a great deal of time for database training, and must be retrained whenever the database changes.
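As a concrete, hedged example of an appearance-based detector, OpenCV's bundled Haar cascade (an AdaBoost-trained classifier of the kind listed above) can be applied as follows; the parameter values are illustrative, not tuned for the invention.

```python
import cv2

# The frontal-face model ships with OpenCV itself.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(bgr_image):
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)  # soften illumination differences
    # Returns (x, y, w, h) rectangles for each detected face.
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
```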
Next, in the face recognition stage, face recognition technology is the method used to establish the identity of a face once it has been detected in the multimedia video. Face recognition technologies can be classified as in Table 2 below and are used in the present invention to identify the components of the face.
Table 2
[Table image: PCTKR2011008399-appb-T000002]
In face recognition by holistic methods, the input to the recognition system is the entire face region. Holistic face recognition methods are generally easy to implement, but because they do not consider the finer details of the face, it is hard to obtain adequate results. They include principal component analysis (PCA), linear discriminant analysis (LDA), independent component analysis (ICA), Tensorface, and probabilistic decision-based neural network (PDBNN) methods.
Feature-based matching methods first extract spatial features (the eyes, nose, and mouth), and then the locations and spatial characteristics (geometry and appearance) of those features are input to the recognition system. Feature-based methods are fairly complex, because a face carries many kinds of feature information and one must decide how to select the optimal features to raise recognition performance. Nevertheless, representative feature-based methods such as the pure geometry, dynamic link architecture, and hidden Markov model methods perform better than the holistic matching methods above and are widely used.
Hybrid methods are very complex because they use the entire face region, along with positional characteristics, to recognize a face, but their recognition rate is far superior to both the holistic and the feature-based matching methods. Hybrid methods include linear feature analysis (LFA), shape-normalized, and component-based methods.
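A compact sketch of the holistic PCA (eigenface) approach named above is given below, assuming aligned grayscale face images of equal size; it is a toy illustration of the technique, not the method of the invention.

```python
import numpy as np

def fit_eigenfaces(train_faces, labels, k=20):
    """Learn the top-k eigenfaces and project the gallery onto them."""
    X = np.stack([f.ravel().astype(np.float64) for f in train_faces])
    mean = X.mean(axis=0)
    # SVD of the centred data; the rows of Vt are the eigenfaces.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:k]
    gallery = (X - mean) @ basis.T        # (n_faces, k) projections
    return mean, basis, gallery, list(labels)

def identify(face, mean, basis, gallery, labels):
    # Project the query face and match by nearest neighbour in the subspace.
    w = (face.ravel().astype(np.float64) - mean) @ basis.T
    return labels[int(np.argmin(np.linalg.norm(gallery - w, axis=1)))]
```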
Meanwhile, although image registration is a technical element not directly related to the present invention, it can be applied to MoIP supplementary service technology, and so can be implemented as an element technology for a basic SIP (Session Initiation Protocol) based MoIP service.
FIG. 8 shows the configuration and request/response sequence for the SIP service, FIG. 9 the basic message and status code scheme of SIP, and FIGS. 10 and 11 the SIP protocol stack and the basic call setup procedure, respectively. SIP is a protocol for managing sessions or calls in multimedia communication, focusing on managing the communication through signaling rather than on the transport of the multimedia data itself. Table 3 below summarizes the components of the SIP service and their main functions.
TABLE 3
Figure PCTKR2011008399-appb-T000003
Referring to FIGS. 8 to 11, the basic SIP call-setup procedure is as follows. First, the caller sends an INVITE request message to the callee to create a session. This message passes through several SIP servers on its way to the callee. A proxy server that receives the message parses it to identify the recipient and forwards it to the appropriate proxy server or to the callee's user agent (UA).
On receiving the INVITE message, the callee returns a response message carrying a status code that indicates the processing result. If the callee received and processed the message correctly, it sends a "200 OK" response to the caller. The caller then sends an ACK request back to the callee to confirm that the response message was received, completing the creation of a session.
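The three-way exchange just described can be made concrete with the following sketch of the messages on the wire; the URIs, tag, and Call-ID values are placeholders invented for the example, and a production user agent would add the further headers, SDP bodies, and retransmission handling required by RFC 3261.

```python
# Illustrative rendering of the INVITE / 200 OK / ACK exchange described
# above. All addresses and identifiers are made-up placeholders.
CALL_ID = "a84b4c76e66710@caller.example.com"

invite = (
    "INVITE sip:callee@example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP caller.example.com:5060\r\n"
    "From: <sip:caller@example.com>;tag=1928301774\r\n"
    "To: <sip:callee@example.com>\r\n"
    f"Call-ID: {CALL_ID}\r\n"
    "CSeq: 314159 INVITE\r\n"
    "Content-Length: 0\r\n\r\n"  # SDP offer omitted in this sketch
)

# Callee: received and processed correctly -> "200 OK" with matching CSeq.
ok_200 = (
    "SIP/2.0 200 OK\r\n"
    f"Call-ID: {CALL_ID}\r\n"
    "CSeq: 314159 INVITE\r\n\r\n"
)

# Caller confirms the final response; the session (dialog) now exists,
# and media flows separately over RTP/RTCP.
ack = (
    "ACK sip:callee@example.com SIP/2.0\r\n"
    f"Call-ID: {CALL_ID}\r\n"
    "CSeq: 314159 ACK\r\n\r\n"
)
```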
Meanwhile, the wired/wireless convergence service environment is one in which terminal mobility is the norm: rather than simple access-network attachment for conventional communication, it is evolving into an inter-network terminal-mobility environment in which terminals selectively attach to various kinds of access networks according to criteria such as service quality and user preference. Terminal attachment across such heterogeneous networks therefore requires inter-network mobility-support technology, and that capability must be built into the terminal.
In addition, to communicate over different access networks, a multimode terminal with multiple communication interfaces for the networks it can attach to is required. There have recently been moves, in connection with femtocells, to promote the use of single-mode terminals, but the need for multimode terminals will only grow once services such as service migration and conversion are considered, in addition to inter-network mobility driven by service quality, cost, and user demand. Moreover, the current approach to using such multimode terminals is to switch communication modes when attaching to a different network, which requires resetting the terminal's power and services.
Therefore, in a future inter-network mobile communication environment using freely moving multimode terminals, an automatic inter-network access-control technology is needed in which handover between heterogeneous networks is controlled automatically, without user terminal configuration and without service interruption.
FIG. 12 is a conceptual diagram of the content matching system according to the present invention. In outline, an "avatar video call" program in each user's terminal recognizes the user's face and emotion during a video call, and avatar expressions and emoticons are matched onto the screen and transmitted to the other party.
While carrying on a video call with the other party through the video call application, the user's face and emotion are recognized continuously during the call. Avatar matching and emoticon matching against the recognized emotion or expression then make the video call with the other party more effective and enjoyable.
The basic video call is carried over a SIP-based video conference, with data sent and received over RTP/RTCP. Depending on the network environment and conditions, switching to HTTP streaming may also be considered.
FIG. 13 shows the basic operation procedure for image matching according to the present invention; through this procedure, the avatar is matched with the video. As shown in FIG. 13, video input is received through the camera module of the smartphone (terminal) and image preprocessing is performed for face recognition, expression recognition, and emotion recognition. Likely face candidates are then extracted, and the components of each face are analyzed to extract information for pose estimation and emotion recognition. From this information the avatar's expression and motion are selected, its position in 3D space is computed, and it is matched with the video and output to the screen. The resulting image is also video-encoded and transmitted over the network to the remote party's smartphone for the video call.
FIG. 14 outlines the avatar video communication procedure based on emotion recognition and image matching according to the present invention. In step ①, a frame image to be analyzed is captured from the video data source of the camera module. In step ②, the captured image is preprocessed into a state that is easy to analyze, so that the boundaries between objects in the image can be identified, for example by an edge-detection algorithm.
Next, in step ③, a face is detected in the preprocessed image: facial feature points are analyzed, for example by applying a learning algorithm, and the in-image positions of the facial components and their relationship data are extracted. In step ④, pose estimation and expression recognition are performed on the basis of the recognition information extracted in the face-detection step, and the avatar pose and expression corresponding to the pose and expression taken by the face are selected. In step ⑤, the position coordinates of the avatar (face) in 3D space are determined from the analyzed information, the avatar animation for the expression and emotion is selected, and a control signal (message) is passed to the 3D engine.
In step ⑥, the 3D space in which the avatar image and the video will be rendered is constructed and the analyzed avatar is placed at its position (avatar and 3D-space control through the 3D engine API). In step ⑦, the avatar rendered in the 3D space and the video source are matched into a single image source, and in step ⑧ the matched image is video-encoded together with the audio source. The audio source is extracted from the video source and processed.
Finally, step ⑨ is the network transmission step to the counterpart terminal set up for the video call: a session is established through SIP, and the data is transmitted over the Internet via RTP/RTCP. A condensed sketch of this whole per-frame loop follows.
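By way of illustration only, steps ① through ⑨ can be condensed into a single per-frame processing loop, sketched below with OpenCV; the `avatar_engine` and `rtp_sender` objects are hypothetical stubs standing in for the 3D engine API and the SIP/RTP network stack, not an implementation of the present invention.

```python
# Condensed, illustrative sketch of steps 1-9 above as one per-frame loop.
# OpenCV supplies preprocessing and face detection; `avatar_engine` and
# `rtp_sender` are hypothetical stubs for the 3D engine and network stack.
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def process_frame(frame, avatar_engine, rtp_sender):
    # Steps 1-2: frame already captured by the caller; preprocess so that
    # object boundaries can be found (edge detection as one option).
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)  # boundary cue, available to later stages
    # Step 3: detect candidate faces in the preprocessed image.
    for (x, y, w, h) in face_detector.detectMultiScale(gray, 1.1, 5):
        # Steps 4-5: pose/expression estimation and animation selection
        # (hypothetical 3D-engine calls).
        pose, expression = avatar_engine.estimate(gray[y:y+h, x:x+w])
        avatar_engine.select_animation(pose, expression)
        # Steps 6-7: place the avatar in 3D space and composite it with
        # the live frame into a single image source.
        frame = avatar_engine.composite(frame, (x, y, w, h))
    # Steps 8-9: encode and hand off to the session for transmission.
    ok, encoded = cv2.imencode(".jpg", frame)  # stand-in for a video codec
    if ok:
        rtp_sender.send(encoded.tobytes())
    return frame
```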
FIG. 15 illustrates, using actual content, the avatar video communication procedure based on the emotion recognition and image matching of FIG. 14. As shown in FIG. 15, the face is tracked and the eyes, nose, and mouth are recognized in the captured video; standard analysis and relationship description are applied to these; and through emotion inference and real-time matching a virtual model is matched onto the real-world face, so that when the user makes the corresponding expression during a video call, a real-time avatar is rendered over the user's face.
As described above, the apparatus and method for emotional content services on a communication terminal device according to the present invention, the emotion recognition apparatus and method therefor, and the apparatus and method for generating and matching emotional content using the same build on voice recognition; on the face-recognition techniques of object recognition, namely face-region detection, face-region normalization, and in-region feature extraction; on the facial-component (expression analysis) relationship techniques of emotion recognition; and on hand-gesture, motion, and behavior recognition. On this basis, real-time matching of live video with virtual imagery is used to match mixed virtual objects (including text), derived from gesture and expression analysis, onto the faces and bodies of both parties to a video call, realizing through the video call a variety of mixed realities that cannot be seen in the real world.
In addition, by pre-registering expression-analysis relationship functions for voice and for specific facial and bodily expressions and gestures, the apparatus and method match virtual objects that react to the voice, expressions, and gestures onto the face and body in real time in the output video whenever a similar voice, expression, or gesture is transmitted through the video, adding delightful fun to the user's video calls.
Furthermore, as the development of a system that recognizes a person's expressions and gestures through a mobile device and expresses them through an avatar, the present invention will serve as a foundation for taking advanced video industries such as domestic film, animation, and cyber characters a step forward; and by making the process of matching human emotions and expressions to a third-party object (an avatar) and naturally rendering a virtual 3D object over a real-world face more efficient, it will contribute greatly to the competitiveness of the domestic mobile-content and video-content industries.
Although specific embodiments have been described in the detailed description of the present invention, various modifications are of course possible without departing from the scope of the invention. FIG. 16 shows various fields to which the present invention is applicable. With the spread of smart devices, it is expected that a movie's protagonist and the audience will be able to interact: the protagonist recognizes a particular viewer's emotion and expression through a camera mounted on top of the device and responds with appropriate reactions and gestures, so that everyone watches the same video yet has a different experience. The invention is thus applicable to the advanced cultural-content industry, which can offer such diversity.
Moreover, the fields of application of the present invention are nearly limitless: script-based expression and gesture rendering used in film, animation, and cyber characters; interface design using users' emotional responses; expression recognition for security and surveillance; and measurement of consumers' emotional responses to products and designs. Therefore, the scope of the present invention should not be limited to the described embodiments but should be determined by the claims that follow and their equivalents.

Claims (20)

  1. An emotional content service method of a communication terminal device, the method comprising: extracting an emotional state of a user from at least one of a gesture and a facial expression of the user captured through imaging means of a communication terminal device provided with at least imaging means and display means; generating a virtual object corresponding to the extracted emotional state; and superimposing the virtual object on at least one of the user's body and face and displaying it on display means of a communication terminal device of a counterpart on a video call with the user.
  2. The method of claim 1, wherein the virtual object
    further includes text.
  3. The method of claim 1, wherein the virtual object
    is changeable by the user.
  4. The method of claim 1, wherein the virtual object
    changes in real time in accordance with the emotional state.
  5. The method of claim 1, wherein the virtual object
    has the position at which it is superimposed on the user's body and face changed.
  6. The method of claim 1, wherein, when the communication terminal device further includes voice input means and voice output means,
    the user's emotional state is further extracted from the user's voice input through the voice input means.
  7. An emotion recognition method for an emotional content service of a communication terminal device, the method comprising: inputting a face image of a user to the communication terminal device;
    extracting facial components from the input face image;
    preprocessing the extracted facial components;
    extracting facial features from the preprocessed facial components;
    registering the extracted facial features in a face database; and
    recognizing an emotion by comparing the features registered in the face database with the facial components extracted in the feature extraction step.
  8. A method of generating and matching emotional content through an emotion recognition method for an emotional content service of a communication terminal device, the method comprising: receiving a face image of a user from a camera module of the communication terminal device;
    preprocessing the input face image;
    detecting only valid data in the preprocessed face image;
    estimating the position of the face and camera information from the detected valid data; and
    generating a 3D image from the camera information and the face position information, and matching the generated 3D image to the user's face image.
  9. The method of claim 8, further comprising:
    outputting the 3D image matched to the user's face image on a screen; and
    transmitting the 3D image matched to the user's face image to a counterpart communication terminal over a network.
  10. An emotional content service method of a communication terminal device, the method comprising: capturing, from a camera module of the communication terminal device, a frame image to be analyzed from a video data source;
    preprocessing the captured image into a state that is easy to analyze;
    detecting a face in the preprocessed image;
    performing pose estimation and expression recognition on the basis of the recognition information extracted through the face detection step, and selecting an avatar pose and expression for the pose and expression taken by the face;
    determining position coordinates of the selected avatar in 3D space from the analyzed information, selecting an avatar animation for the corresponding expression and emotion, and passing a control signal to a 3D engine;
    constructing the 3D space in which the avatar image and the video are to be rendered, and placing the analyzed avatar at the corresponding position;
    matching the avatar rendered in the 3D space and the video source into a single image source;
    video-encoding the matched image together with an audio source; and
    transmitting the matched image to a counterpart terminal over a network configured for a video call.
  11. The method of claim 10, wherein the detecting of a face in the preprocessed image
    analyzes facial feature points by applying a learning algorithm or the like, and extracts the in-image positions of the facial components and their relationship data.
  12. The method of claim 10, wherein the transmitting of the matched image to the counterpart terminal over the network
    establishes a session through SIP and transmits over the Internet via RTP/RTCP.
  13. An emotional content service apparatus of a communication terminal device, comprising: a server communication unit interworking with a video call service providing terminal; and
    a server control unit that recognizes a user's emotional state from at least one of the user's gestures and facial expressions in the video information received from the video call service providing terminal, compares the recognized emotional state with pre-stored object-related information to extract an object matching the recognized emotional state, superimposes the extracted object on at least one of the user's body and face, and transmits the result to a counterpart video call service providing terminal communicating with the video call service providing terminal.
  14. The apparatus of claim 13, further comprising:
    a server storage unit that stores object-related data corresponding to the emotional state.
  15. An emotion recognition apparatus for an emotional content service of a communication terminal device, comprising: a display unit that displays the counterpart's video in a video call and an object superimposed on the video;
    a communication unit interworking with a video call service providing server;
    an imaging unit that acquires video information of the user in the video call; and
    a control unit that recognizes the user's emotional state from the video information acquired by the imaging unit, extracts emotion information related to the recognized emotional state and transmits it to the video call service providing server, receives from the video call service providing server an object corresponding to the emotion information of the counterpart in the video call, superimposes the received object at a position associated with it in the counterpart's video, and outputs the result to the display unit.
  16. An apparatus for generating and matching emotional content through an emotion recognition method for an emotional content service of a communication terminal device, comprising: a display unit that displays the counterpart's video in a video call and an object superimposed on the video;
    a communication unit interworking with a video call service providing server;
    an imaging unit that acquires video information of the user in the video call; and
    a control unit that recognizes the user's emotional state from the video information acquired by the imaging unit, extracts emotion information related to the recognized emotional state and transmits it to the video call service providing server, receives from the video call service providing server an object corresponding to the emotion information of the counterpart in the video call, superimposes the received object at a position associated with it in the counterpart's video, and outputs the result to the display unit.
  17. The apparatus of claim 16, wherein the display unit
    further displays the user's video in the video call.
  18. The apparatus of claim 16, further comprising:
    a key input unit that determines whether to apply the object.
  19. The apparatus of claim 16, further comprising:
    a storage unit that stores the object.
  20. An emotional content service method of a communication terminal device, the method comprising: extracting an emotional state of a user from at least one of a gesture and a facial expression of the user captured through imaging means of a communication terminal device provided with at least imaging means and display means; exaggerating at least one part of the user's body and face exhibiting a gesture or facial expression corresponding to the extracted emotional state; and displaying the result on display means of a communication terminal device of a counterpart on a video call with the user.
PCT/KR2011/008399 2011-08-22 2011-11-07 Apparatus and method for emotional content services on telecommunication devices, apparatus and method for emotion recognition therefor, and apparatus and method for generating and matching the emotional content using same WO2013027893A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020110083435A KR20130022434A (en) 2011-08-22 2011-08-22 Apparatus and method for servicing emotional contents on telecommunication devices, apparatus and method for recognizing emotion thereof, apparatus and method for generating and matching the emotional contents using the same
KR10-2011-0083435 2011-08-22

Publications (1)

Publication Number Publication Date
WO2013027893A1 true WO2013027893A1 (en) 2013-02-28

Family

ID=47746615

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2011/008399 WO2013027893A1 (en) 2011-08-22 2011-11-07 Apparatus and method for emotional content services on telecommunication devices, apparatus and method for emotion recognition therefor, and apparatus and method for generating and matching the emotional content using same

Country Status (2)

Country Link
KR (1) KR20130022434A (en)
WO (1) WO2013027893A1 (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9516259B2 (en) * 2013-10-22 2016-12-06 Google Inc. Capturing media content in accordance with a viewer expression
CN103945161B (en) * 2014-04-14 2017-06-27 联想(北京)有限公司 A kind of information processing method and electronic equipment
KR101535574B1 (en) * 2014-07-18 2015-07-10 오용운 System and method for providing social network emoticon using 3d character
KR101681501B1 (en) * 2016-06-28 2016-12-01 (주) 키글 System and method for creating face avatar
KR102616172B1 (en) * 2016-08-12 2023-12-19 주식회사 케이티 System for character providing and information gathering method using same
KR102120871B1 (en) * 2017-11-08 2020-06-09 주식회사 하이퍼커넥트 Terminal and server providing a video call service
WO2019103484A1 (en) * 2017-11-24 2019-05-31 주식회사 제네시스랩 Multi-modal emotion recognition device, method and storage medium using artificial intelligence
US11012389B2 (en) 2018-05-07 2021-05-18 Apple Inc. Modifying images with supplemental content for messaging
US10681310B2 (en) 2018-05-07 2020-06-09 Apple Inc. Modifying video streams with supplemental content for video conferencing
KR102647656B1 (en) * 2018-09-04 2024-03-15 삼성전자주식회사 Electronic device displaying additional object on augmented reality image and method for driving the electronic device
KR102611458B1 (en) * 2018-09-06 2023-12-11 주식회사 아이앤나 Method for Providing Augmented Reality by Baby's Emotional Sate using Baby's Peripheral Region
KR102648993B1 (en) * 2018-12-21 2024-03-20 삼성전자주식회사 Electronic device for providing avatar based on emotion state of user and method thereof
CN109831638B (en) * 2019-01-23 2021-01-08 广州视源电子科技股份有限公司 Video image transmission method and device, interactive intelligent panel and storage medium
JP6581742B1 (en) * 2019-03-27 2019-09-25 株式会社ドワンゴ VR live broadcast distribution system, distribution server, distribution server control method, distribution server program, and VR raw photo data structure
KR102236718B1 (en) * 2019-07-25 2021-04-06 주식회사 모두커뮤니케이션 Apparatus and method for creating personalized objects with emotion reflected
KR102114457B1 (en) * 2019-10-21 2020-05-22 (주)부즈 Method and apparatus for processing real-time character streaming contents
KR102260022B1 (en) * 2020-05-25 2021-06-02 전남대학교산학협력단 System and method for object classification in image based on deep running
KR102637373B1 (en) * 2021-01-26 2024-02-19 주식회사 플랫팜 Apparatus and method for generating emoticon
KR20240077627A (en) * 2022-11-24 2024-06-03 주식회사 피씨엔 User emotion interaction method and system for extended reality based on non-verbal elements

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008004844A1 (en) * 2006-07-06 2008-01-10 Ktfreetel Co., Ltd. Method and system for providing voice analysis service, and apparatus therefor
KR20080057030A (en) * 2006-12-19 2008-06-24 엘지전자 주식회사 Apparatus and method for image communication inserting emoticon
KR100868638B1 (en) * 2007-08-07 2008-11-12 에스케이 텔레콤주식회사 System and method for balloon providing during video communication
KR20110025721A (en) * 2009-09-05 2011-03-11 에스케이텔레콤 주식회사 System and method for delivering feeling during video call
US20110122219A1 (en) * 2009-11-23 2011-05-26 Samsung Electronics Co. Ltd. Method and apparatus for video call in a mobile terminal

Cited By (296)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11425068B2 (en) 2009-02-03 2022-08-23 Snap Inc. Interactive avatar in messaging environment
US11607616B2 (en) 2012-05-08 2023-03-21 Snap Inc. System and method for generating and displaying avatars
US11925869B2 (en) 2012-05-08 2024-03-12 Snap Inc. System and method for generating and displaying avatars
US11229849B2 (en) 2012-05-08 2022-01-25 Snap Inc. System and method for generating and displaying avatars
US9710970B2 (en) 2013-05-09 2017-07-18 Samsung Electronics Co., Ltd. Method and apparatus for providing contents including augmented reality information
WO2014182052A1 (en) * 2013-05-09 2014-11-13 Samsung Electronics Co., Ltd. Method and apparatus for providing contents including augmented reality information
US11651797B2 (en) 2014-02-05 2023-05-16 Snap Inc. Real time video processing for changing proportions of an object in the video
US10991395B1 (en) 2014-02-05 2021-04-27 Snap Inc. Method for real time video processing involving changing a color of an object on a human face in a video
US11443772B2 (en) 2014-02-05 2022-09-13 Snap Inc. Method for triggering events in a video
GB2529037B (en) * 2014-06-10 2018-05-23 2Mee Ltd Augmented reality apparatus and method
US10262195B2 (en) 2014-10-27 2019-04-16 Mattersight Corporation Predictive and responsive video analytics system and methods
US9437215B2 (en) 2014-10-27 2016-09-06 Mattersight Corporation Predictive video analytics system and methods
US9269374B1 (en) 2014-10-27 2016-02-23 Mattersight Corporation Predictive video analytics system and methods
WO2016077578A1 (en) * 2014-11-13 2016-05-19 Intel Corporation System and method for feature-based authentication
US9811649B2 (en) 2014-11-13 2017-11-07 Intel Corporation System and method for feature-based authentication
WO2016163565A1 (en) * 2015-04-05 2016-10-13 한신대학교 산학협력단 Multi-modal multi-agent-based emotional communication system
KR101652486B1 (en) * 2015-04-05 2016-08-30 주식회사 큐버 Sentiment communication system based on multiple multimodal agents
KR101743763B1 (en) * 2015-06-29 2017-06-05 (주)참빛솔루션 Method for providng smart learning education based on sensitivity avatar emoticon, and smart learning education device for the same
US11631276B2 (en) 2016-03-31 2023-04-18 Snap Inc. Automated avatar generation
US11048916B2 (en) 2016-03-31 2021-06-29 Snap Inc. Automated avatar generation
US11662900B2 (en) 2016-05-31 2023-05-30 Snap Inc. Application control using a gesture based trigger
CN106157262A (en) * 2016-06-28 2016-11-23 广东欧珀移动通信有限公司 The processing method of a kind of augmented reality, device and mobile terminal
CN106157262B (en) * 2016-06-28 2020-04-17 Oppo广东移动通信有限公司 Augmented reality processing method and device and mobile terminal
CN106157363A (en) * 2016-06-28 2016-11-23 广东欧珀移动通信有限公司 A kind of photographic method based on augmented reality, device and mobile terminal
CN106127828A (en) * 2016-06-28 2016-11-16 广东欧珀移动通信有限公司 The processing method of a kind of augmented reality, device and mobile terminal
CN106127829B (en) * 2016-06-28 2020-06-30 Oppo广东移动通信有限公司 Augmented reality processing method and device and terminal
CN106127829A (en) * 2016-06-28 2016-11-16 广东欧珀移动通信有限公司 The processing method of a kind of augmented reality, device and terminal
US10984569B2 (en) 2016-06-30 2021-04-20 Snap Inc. Avatar based ideogram generation
US11438288B2 (en) 2016-07-19 2022-09-06 Snap Inc. Displaying customized electronic messaging graphics
US10848446B1 (en) 2016-07-19 2020-11-24 Snap Inc. Displaying customized electronic messaging graphics
US11509615B2 (en) 2016-07-19 2022-11-22 Snap Inc. Generating customized electronic messaging graphics
US11418470B2 (en) 2016-07-19 2022-08-16 Snap Inc. Displaying customized electronic messaging graphics
US10855632B2 (en) 2016-07-19 2020-12-01 Snap Inc. Displaying customized electronic messaging graphics
US11962598B2 (en) 2016-10-10 2024-04-16 Snap Inc. Social media post subscribe requests for buffer user accounts
US11438341B1 (en) 2016-10-10 2022-09-06 Snap Inc. Social media post subscribe requests for buffer user accounts
US11100311B2 (en) 2016-10-19 2021-08-24 Snap Inc. Neural networks for facial modeling
US10880246B2 (en) 2016-10-24 2020-12-29 Snap Inc. Generating and displaying customized avatars in electronic messages
US11580700B2 (en) 2016-10-24 2023-02-14 Snap Inc. Augmented reality object manipulation
US11218433B2 (en) 2016-10-24 2022-01-04 Snap Inc. Generating and displaying customized avatars in electronic messages
US10938758B2 (en) 2016-10-24 2021-03-02 Snap Inc. Generating and displaying customized avatars in media overlays
US11876762B1 (en) 2016-10-24 2024-01-16 Snap Inc. Generating and displaying customized avatars in media overlays
US11843456B2 (en) 2016-10-24 2023-12-12 Snap Inc. Generating and displaying customized avatars in media overlays
US11616745B2 (en) 2017-01-09 2023-03-28 Snap Inc. Contextual generation and selection of customized media content
US11704878B2 (en) 2017-01-09 2023-07-18 Snap Inc. Surface aware lens
US11989809B2 (en) 2017-01-16 2024-05-21 Snap Inc. Coded vision system
US11544883B1 (en) 2017-01-16 2023-01-03 Snap Inc. Coded vision system
US11991130B2 (en) 2017-01-18 2024-05-21 Snap Inc. Customized contextual media content item generation
US10951562B2 (en) 2017-01-18 2021-03-16 Snap Inc. Customized contextual media content item generation
US11870743B1 (en) 2017-01-23 2024-01-09 Snap Inc. Customized digital avatar accessories
US10043406B1 (en) 2017-03-10 2018-08-07 Intel Corporation Augmented emotion display for autistic persons
CN110431838B (en) * 2017-03-22 2022-03-29 韩国斯诺有限公司 Method and system for providing dynamic content of face recognition camera
CN110431838A (en) * 2017-03-22 2019-11-08 韩国斯诺有限公司 Method and system for providing dynamic content of face recognition camera
US11593980B2 (en) 2017-04-20 2023-02-28 Snap Inc. Customized user interface for electronic communications
US11069103B1 (en) 2017-04-20 2021-07-20 Snap Inc. Customized user interface for electronic communications
US11392264B1 (en) 2017-04-27 2022-07-19 Snap Inc. Map-based graphical user interface for multi-type social media galleries
US11842411B2 (en) 2017-04-27 2023-12-12 Snap Inc. Location-based virtual avatars
US10963529B1 (en) 2017-04-27 2021-03-30 Snap Inc. Location-based search mechanism in a graphical user interface
US10952013B1 (en) 2017-04-27 2021-03-16 Snap Inc. Selective location-based identity communication
US11385763B2 (en) 2017-04-27 2022-07-12 Snap Inc. Map-based graphical user interface indicating geospatial activity metrics
US11451956B1 (en) 2017-04-27 2022-09-20 Snap Inc. Location privacy management on map-based social media platforms
US11418906B2 (en) 2017-04-27 2022-08-16 Snap Inc. Selective location-based identity communication
US11782574B2 (en) 2017-04-27 2023-10-10 Snap Inc. Map-based graphical user interface indicating geospatial activity metrics
US11474663B2 (en) 2017-04-27 2022-10-18 Snap Inc. Location-based search mechanism in a graphical user interface
US11893647B2 (en) 2017-04-27 2024-02-06 Snap Inc. Location-based virtual avatars
US11995288B2 (en) 2017-04-27 2024-05-28 Snap Inc. Location-based search mechanism in a graphical user interface
US11830209B2 (en) 2017-05-26 2023-11-28 Snap Inc. Neural network-based image stream modification
US11882162B2 (en) 2017-07-28 2024-01-23 Snap Inc. Software application manager for messaging applications
US11122094B2 (en) 2017-07-28 2021-09-14 Snap Inc. Software application manager for messaging applications
US11659014B2 (en) 2017-07-28 2023-05-23 Snap Inc. Software application manager for messaging applications
CN111183455A (en) * 2017-08-29 2020-05-19 互曼人工智能科技(上海)有限公司 Image data processing system and method
US11610354B2 (en) 2017-10-26 2023-03-21 Snap Inc. Joint audio-video facial animation system
US11120597B2 (en) 2017-10-26 2021-09-14 Snap Inc. Joint audio-video facial animation system
US11930055B2 (en) 2017-10-30 2024-03-12 Snap Inc. Animated chat presence
US11030789B2 (en) 2017-10-30 2021-06-08 Snap Inc. Animated chat presence
US11354843B2 (en) 2017-10-30 2022-06-07 Snap Inc. Animated chat presence
US11706267B2 (en) 2017-10-30 2023-07-18 Snap Inc. Animated chat presence
US11460974B1 (en) 2017-11-28 2022-10-04 Snap Inc. Content discovery refresh
CN109840009A (en) * 2017-11-28 2019-06-04 浙江思考者科技有限公司 Intelligent real-person advertising screen interactive system and implementation method
US10936157B2 (en) 2017-11-29 2021-03-02 Snap Inc. Selectable item including a customized graphic for an electronic messaging application
US11411895B2 (en) 2017-11-29 2022-08-09 Snap Inc. Generating aggregated media content items for a group of users in an electronic messaging application
US10554698B2 (en) 2017-12-28 2020-02-04 Hyperconnect, Inc. Terminal and server providing video call service
US11769259B2 (en) 2018-01-23 2023-09-26 Snap Inc. Region-based stabilized face tracking
US10949648B1 (en) 2018-01-23 2021-03-16 Snap Inc. Region-based stabilized face tracking
US11688119B2 (en) 2018-02-28 2023-06-27 Snap Inc. Animated expressive icon
US11120601B2 (en) 2018-02-28 2021-09-14 Snap Inc. Animated expressive icon
US10979752B1 (en) 2018-02-28 2021-04-13 Snap Inc. Generating media content items based on location information
US11523159B2 (en) 2018-02-28 2022-12-06 Snap Inc. Generating media content items based on location information
US11468618B2 (en) 2018-02-28 2022-10-11 Snap Inc. Animated expressive icon
US11880923B2 (en) 2018-02-28 2024-01-23 Snap Inc. Animated expressive icon
US11310176B2 (en) 2018-04-13 2022-04-19 Snap Inc. Content suggestion system
WO2019204464A1 (en) * 2018-04-18 2019-10-24 Snap Inc. Augmented expression system
US10719968B2 (en) 2018-04-18 2020-07-21 Snap Inc. Augmented expression system
US11875439B2 (en) 2018-04-18 2024-01-16 Snap Inc. Augmented expression system
KR20240027845A (en) 2018-04-18 2024-03-04 스냅 인코포레이티드 Augmented expression system
CN108830917A (en) * 2018-05-29 2018-11-16 努比亚技术有限公司 Information generation method, terminal, and computer-readable storage medium
CN108830917B (en) * 2018-05-29 2023-04-18 努比亚技术有限公司 Information generation method, terminal and computer readable storage medium
CN108961431A (en) * 2018-07-03 2018-12-07 百度在线网络技术(北京)有限公司 Facial expression generation method, device, and terminal device
US11074675B2 (en) 2018-07-31 2021-07-27 Snap Inc. Eye texture inpainting
US11030813B2 (en) 2018-08-30 2021-06-08 Snap Inc. Video clip object tracking
US11715268B2 (en) 2018-08-30 2023-08-01 Snap Inc. Video clip object tracking
CN110874137B (en) * 2018-08-31 2023-06-13 阿里巴巴集团控股有限公司 Interaction method and device
CN110874137A (en) * 2018-08-31 2020-03-10 阿里巴巴集团控股有限公司 Interaction method and device
US10896534B1 (en) 2018-09-19 2021-01-19 Snap Inc. Avatar style transformation using neural networks
US11348301B2 (en) 2018-09-19 2022-05-31 Snap Inc. Avatar style transformation using neural networks
US11868590B2 (en) 2018-09-25 2024-01-09 Snap Inc. Interface to display shared user groups
US10895964B1 (en) 2018-09-25 2021-01-19 Snap Inc. Interface to display shared user groups
US11294545B2 (en) 2018-09-25 2022-04-05 Snap Inc. Interface to display shared user groups
US11477149B2 (en) 2018-09-28 2022-10-18 Snap Inc. Generating customized graphics having reactions to electronic message content
US11610357B2 (en) 2018-09-28 2023-03-21 Snap Inc. System and method of generating targeted user lists using customizable avatar characteristics
US11704005B2 (en) 2018-09-28 2023-07-18 Snap Inc. Collaborative achievement interface
US11455082B2 (en) 2018-09-28 2022-09-27 Snap Inc. Collaborative achievement interface
US10904181B2 (en) 2018-09-28 2021-01-26 Snap Inc. Generating customized graphics having reactions to electronic message content
US11824822B2 (en) 2018-09-28 2023-11-21 Snap Inc. Generating customized graphics having reactions to electronic message content
US11171902B2 (en) 2018-09-28 2021-11-09 Snap Inc. Generating customized graphics having reactions to electronic message content
US11189070B2 (en) 2018-09-28 2021-11-30 Snap Inc. System and method of generating targeted user lists using customizable avatar characteristics
US11245658B2 (en) 2018-09-28 2022-02-08 Snap Inc. System and method of generating private notifications between users in a communication session
US10872451B2 (en) 2018-10-31 2020-12-22 Snap Inc. 3D avatar rendering
US11103795B1 (en) 2018-10-31 2021-08-31 Snap Inc. Game drawer
US11321896B2 (en) 2018-10-31 2022-05-03 Snap Inc. 3D avatar rendering
US11836859B2 (en) 2018-11-27 2023-12-05 Snap Inc. Textured mesh building
US11176737B2 (en) 2018-11-27 2021-11-16 Snap Inc. Textured mesh building
US20220044479A1 (en) 2018-11-27 2022-02-10 Snap Inc. Textured mesh building
US11620791B2 (en) 2018-11-27 2023-04-04 Snap Inc. Rendering 3D captions within real-world environments
US11887237B2 (en) 2018-11-28 2024-01-30 Snap Inc. Dynamic composite user identifier
US10902661B1 (en) 2018-11-28 2021-01-26 Snap Inc. Dynamic composite user identifier
US11199957B1 (en) 2018-11-30 2021-12-14 Snap Inc. Generating customized avatars based on location information
US11315259B2 (en) 2018-11-30 2022-04-26 Snap Inc. Efficient human pose tracking in videos
US11698722B2 (en) 2018-11-30 2023-07-11 Snap Inc. Generating customized avatars based on location information
US10861170B1 (en) 2018-11-30 2020-12-08 Snap Inc. Efficient human pose tracking in videos
US11783494B2 (en) 2018-11-30 2023-10-10 Snap Inc. Efficient human pose tracking in videos
US11055514B1 (en) 2018-12-14 2021-07-06 Snap Inc. Image face manipulation
US11798261B2 (en) 2018-12-14 2023-10-24 Snap Inc. Image face manipulation
CN111353842A (en) * 2018-12-24 2020-06-30 阿里巴巴集团控股有限公司 Push information processing method and system
US11516173B1 (en) 2018-12-26 2022-11-29 Snap Inc. Message composition interface
CN109727303B (en) * 2018-12-29 2023-07-25 广州方硅信息技术有限公司 Video display method, system, computer equipment, storage medium and terminal
CN109727303A (en) * 2018-12-29 2019-05-07 广州华多网络科技有限公司 Video display method, system, computer equipment, storage medium and terminal
US11032670B1 (en) 2019-01-14 2021-06-08 Snap Inc. Destination sharing in location sharing system
US11877211B2 (en) 2019-01-14 2024-01-16 Snap Inc. Destination sharing in location sharing system
US10939246B1 (en) 2019-01-16 2021-03-02 Snap Inc. Location-based context information sharing in a messaging system
US11751015B2 (en) 2019-01-16 2023-09-05 Snap Inc. Location-based context information sharing in a messaging system
US10945098B2 (en) 2019-01-16 2021-03-09 Snap Inc. Location-based context information sharing in a messaging system
US11693887B2 (en) 2019-01-30 2023-07-04 Snap Inc. Adaptive spatial density based clustering
US11294936B1 (en) 2019-01-30 2022-04-05 Snap Inc. Adaptive spatial density based clustering
US11557075B2 (en) 2019-02-06 2023-01-17 Snap Inc. Body pose estimation
US10984575B2 (en) 2019-02-06 2021-04-20 Snap Inc. Body pose estimation
US11714524B2 (en) 2019-02-06 2023-08-01 Snap Inc. Global event-based avatar
US11010022B2 (en) 2019-02-06 2021-05-18 Snap Inc. Global event-based avatar
US11275439B2 (en) 2019-02-13 2022-03-15 Snap Inc. Sleep detection in a location sharing system
US10936066B1 (en) 2019-02-13 2021-03-02 Snap Inc. Sleep detection in a location sharing system
US11809624B2 (en) 2019-02-13 2023-11-07 Snap Inc. Sleep detection in a location sharing system
US10964082B2 (en) 2019-02-26 2021-03-30 Snap Inc. Avatar based on weather
US11574431B2 (en) 2019-02-26 2023-02-07 Snap Inc. Avatar based on weather
US10852918B1 (en) 2019-03-08 2020-12-01 Snap Inc. Contextual information in chat
US11301117B2 (en) 2019-03-08 2022-04-12 Snap Inc. Contextual information in chat
US11868414B1 (en) 2019-03-14 2024-01-09 Snap Inc. Graph-based prediction for contact suggestion in a location sharing system
US11852554B1 (en) 2019-03-21 2023-12-26 Snap Inc. Barometer calibration in a location sharing system
US11638115B2 (en) 2019-03-28 2023-04-25 Snap Inc. Points of interest in a location sharing system
US11166123B1 (en) 2019-03-28 2021-11-02 Snap Inc. Grouped transmission of location data in a location sharing system
US11039270B2 (en) 2019-03-28 2021-06-15 Snap Inc. Points of interest in a location sharing system
CN110046336A (en) * 2019-04-15 2019-07-23 南京孜博汇信息科技有限公司 Position-encoded sheet processing method and system
US11973732B2 (en) 2019-04-30 2024-04-30 Snap Inc. Messaging system with avatar generation
US10992619B2 (en) 2019-04-30 2021-04-27 Snap Inc. Messaging system with avatar generation
CN111918015A (en) * 2019-05-07 2020-11-10 阿瓦亚公司 Video call routing and management based on artificial-intelligence-determined facial emotion
USD916810S1 (en) 2019-05-28 2021-04-20 Snap Inc. Display screen or portion thereof with a graphical user interface
USD916811S1 (en) 2019-05-28 2021-04-20 Snap Inc. Display screen or portion thereof with a transitional graphical user interface
USD916809S1 (en) 2019-05-28 2021-04-20 Snap Inc. Display screen or portion thereof with a transitional graphical user interface
USD916872S1 (en) 2019-05-28 2021-04-20 Snap Inc. Display screen or portion thereof with a graphical user interface
USD916871S1 (en) 2019-05-28 2021-04-20 Snap Inc. Display screen or portion thereof with a transitional graphical user interface
US11917495B2 (en) 2019-06-07 2024-02-27 Snap Inc. Detection of a physical collision between two client devices in a location sharing system
US11601783B2 (en) 2019-06-07 2023-03-07 Snap Inc. Detection of a physical collision between two client devices in a location sharing system
US10893385B1 (en) 2019-06-07 2021-01-12 Snap Inc. Detection of a physical collision between two client devices in a location sharing system
US11188190B2 (en) 2019-06-28 2021-11-30 Snap Inc. Generating animation overlays in a communication session
US11443491B2 (en) 2019-06-28 2022-09-13 Snap Inc. 3D object camera customization system
US11676199B2 (en) 2019-06-28 2023-06-13 Snap Inc. Generating customizable avatar outfits
US11189098B2 (en) 2019-06-28 2021-11-30 Snap Inc. 3D object camera customization system
US11823341B2 (en) 2019-06-28 2023-11-21 Snap Inc. 3D object camera customization system
US11714535B2 (en) 2019-07-11 2023-08-01 Snap Inc. Edge gesture interface with smart interactions
US11307747B2 (en) 2019-07-11 2022-04-19 Snap Inc. Edge gesture interface with smart interactions
US11455081B2 (en) 2019-08-05 2022-09-27 Snap Inc. Message thread prioritization interface
US10911387B1 (en) 2019-08-12 2021-02-02 Snap Inc. Message reminder interface
US11588772B2 (en) 2019-08-12 2023-02-21 Snap Inc. Message reminder interface
US11956192B2 (en) 2019-08-12 2024-04-09 Snap Inc. Message reminder interface
CN110705356A (en) * 2019-08-31 2020-01-17 深圳市大拿科技有限公司 Function control method and related equipment
CN110705356B (en) * 2019-08-31 2023-12-29 深圳市大拿科技有限公司 Function control method and related equipment
US11662890B2 (en) 2019-09-16 2023-05-30 Snap Inc. Messaging system with battery level sharing
US11320969B2 (en) 2019-09-16 2022-05-03 Snap Inc. Messaging system with battery level sharing
US11822774B2 (en) 2019-09-16 2023-11-21 Snap Inc. Messaging system with battery level sharing
US11425062B2 (en) 2019-09-27 2022-08-23 Snap Inc. Recommended content viewed by friends
US11270491B2 (en) 2019-09-30 2022-03-08 Snap Inc. Dynamic parameterized user avatar stories
US11676320B2 (en) 2019-09-30 2023-06-13 Snap Inc. Dynamic media collection generation
US11080917B2 (en) 2019-09-30 2021-08-03 Snap Inc. Dynamic parameterized user avatar stories
US11218838B2 (en) 2019-10-31 2022-01-04 Snap Inc. Focused map-based context information surfacing
US11063891B2 (en) 2019-12-03 2021-07-13 Snap Inc. Personalized avatar notification
US11563702B2 (en) 2019-12-03 2023-01-24 Snap Inc. Personalized avatar notification
WO2021114710A1 (en) * 2019-12-09 2021-06-17 上海幻电信息科技有限公司 Live streaming video interaction method and apparatus, and computer device
US11582176B2 (en) 2019-12-09 2023-02-14 Snap Inc. Context sensitive avatar captions
US11778263B2 (en) 2019-12-09 2023-10-03 Shanghai Hode Information Technology Co., Ltd. Live streaming video interaction method and apparatus, and computer device
US11128586B2 (en) 2019-12-09 2021-09-21 Snap Inc. Context sensitive avatar captions
US11594025B2 (en) 2019-12-11 2023-02-28 Snap Inc. Skeletal tracking using previous frames
US11036989B1 (en) 2019-12-11 2021-06-15 Snap Inc. Skeletal tracking using previous frames
US11227442B1 (en) 2019-12-19 2022-01-18 Snap Inc. 3D captions with semantic graphical elements
US11810220B2 (en) 2019-12-19 2023-11-07 Snap Inc. 3D captions with face tracking
US11908093B2 (en) 2019-12-19 2024-02-20 Snap Inc. 3D captions with semantic graphical elements
US11263817B1 (en) 2019-12-19 2022-03-01 Snap Inc. 3D captions with face tracking
US11636657B2 (en) 2019-12-19 2023-04-25 Snap Inc. 3D captions with semantic graphical elements
CN111191564A (en) * 2019-12-26 2020-05-22 三盟科技股份有限公司 Multi-pose face emotion recognition method and system based on multi-angle neural network
US11128715B1 (en) 2019-12-30 2021-09-21 Snap Inc. Physical friend proximity in chat
US11140515B1 (en) 2019-12-30 2021-10-05 Snap Inc. Interfaces for relative device positioning
US11169658B2 (en) 2019-12-31 2021-11-09 Snap Inc. Combined map icon with action indicator
US11893208B2 (en) 2019-12-31 2024-02-06 Snap Inc. Combined map icon with action indicator
CN113099150A (en) * 2020-01-08 2021-07-09 华为技术有限公司 Image processing method, device and system
US11263254B2 (en) 2020-01-30 2022-03-01 Snap Inc. Video generation system to render frames on demand using a fleet of servers
US11651539B2 (en) 2020-01-30 2023-05-16 Snap Inc. System for generating media content items on demand
US11651022B2 (en) 2020-01-30 2023-05-16 Snap Inc. Video generation system to render frames on demand using a fleet of servers
US11036781B1 (en) 2020-01-30 2021-06-15 Snap Inc. Video generation system to render frames on demand using a fleet of servers
US11284144B2 (en) 2020-01-30 2022-03-22 Snap Inc. Video generation system to render frames on demand using a fleet of GPUs
US11729441B2 (en) 2020-01-30 2023-08-15 Snap Inc. Video generation system to render frames on demand
US11831937B2 (en) 2020-01-30 2023-11-28 Snap Inc. Video generation system to render frames on demand using a fleet of GPUs
US11991419B2 (en) 2020-01-30 2024-05-21 Snap Inc. Selecting avatars to be included in the video being generated on demand
US11356720B2 (en) 2020-01-30 2022-06-07 Snap Inc. Video generation system to render frames on demand
US11619501B2 (en) 2020-03-11 2023-04-04 Snap Inc. Avatar based on trip
US11775165B2 (en) 2020-03-16 2023-10-03 Snap Inc. 3D cutout image modification
US11217020B2 (en) 2020-03-16 2022-01-04 Snap Inc. 3D cutout image modification
US11625873B2 (en) 2020-03-30 2023-04-11 Snap Inc. Personalized media overlay recommendation
US11818286B2 (en) 2020-03-30 2023-11-14 Snap Inc. Avatar recommendation and reply
US11978140B2 (en) 2020-03-30 2024-05-07 Snap Inc. Personalized media overlay recommendation
US11969075B2 (en) 2020-03-31 2024-04-30 Snap Inc. Augmented reality beauty product tutorials
US11956190B2 (en) 2020-05-08 2024-04-09 Snap Inc. Messaging system with a carousel of related entities
US11822766B2 (en) 2020-06-08 2023-11-21 Snap Inc. Encoded image based messaging system
US11543939B2 (en) 2020-06-08 2023-01-03 Snap Inc. Encoded image based messaging system
US11922010B2 (en) 2020-06-08 2024-03-05 Snap Inc. Providing contextual information with keyboard interface for messaging system
US11683280B2 (en) 2020-06-10 2023-06-20 Snap Inc. Messaging system including an external-resource dock and drawer
US11580682B1 (en) 2020-06-30 2023-02-14 Snap Inc. Messaging system with augmented reality makeup
CN111773676A (en) * 2020-07-23 2020-10-16 网易(杭州)网络有限公司 Method and device for determining virtual character actions
US11863513B2 (en) 2020-08-31 2024-01-02 Snap Inc. Media content playback and comments management
US11360733B2 (en) 2020-09-10 2022-06-14 Snap Inc. Colocated shared augmented reality without shared backend
US11893301B2 (en) 2020-09-10 2024-02-06 Snap Inc. Colocated shared augmented reality without shared backend
US11452939B2 (en) 2020-09-21 2022-09-27 Snap Inc. Graphical marker generation system for synchronizing users
US11888795B2 (en) 2020-09-21 2024-01-30 Snap Inc. Chats with micro sound clips
US11833427B2 (en) 2020-09-21 2023-12-05 Snap Inc. Graphical marker generation system for synchronizing users
US11910269B2 (en) 2020-09-25 2024-02-20 Snap Inc. Augmented reality content items including user avatar to share location
CN112215929A (en) * 2020-10-10 2021-01-12 珠海格力电器股份有限公司 Virtual social data processing method, device and system
US11660022B2 (en) 2020-10-27 2023-05-30 Snap Inc. Adaptive skeletal joint smoothing
US11615592B2 (en) 2020-10-27 2023-03-28 Snap Inc. Side-by-side character animation from realtime 3D body motion capture
US12002175B2 (en) 2020-11-18 2024-06-04 Snap Inc. Real-time motion transfer for prosthetic limbs
US11748931B2 (en) 2020-11-18 2023-09-05 Snap Inc. Body animation sharing and remixing
US11734894B2 (en) 2020-11-18 2023-08-22 Snap Inc. Real-time motion transfer for prosthetic limbs
US11450051B2 (en) 2020-11-18 2022-09-20 Snap Inc. Personalized avatar real-time motion capture
CN114630135A (en) * 2020-12-11 2022-06-14 北京字跳网络技术有限公司 Live broadcast interaction method and device
WO2022143128A1 (en) * 2020-12-29 2022-07-07 华为技术有限公司 Video call method and apparatus based on avatar, and terminal
US12008811B2 (en) 2020-12-30 2024-06-11 Snap Inc. Machine learning-based selection of a representative video frame within a messaging application
CN113014471A (en) * 2021-01-18 2021-06-22 腾讯科技(深圳)有限公司 Session processing method, device, terminal and storage medium
CN113014471B (en) * 2021-01-18 2022-08-19 腾讯科技(深圳)有限公司 Session processing method, device, terminal and storage medium
US11790531B2 (en) 2021-02-24 2023-10-17 Snap Inc. Whole body segmentation
US11809633B2 (en) 2021-03-16 2023-11-07 Snap Inc. Mirroring device with pointing based navigation
US11978283B2 (en) 2021-03-16 2024-05-07 Snap Inc. Mirroring device with a hands-free mode
US11798201B2 (en) 2021-03-16 2023-10-24 Snap Inc. Mirroring device with whole-body outfits
US11734959B2 (en) 2021-03-16 2023-08-22 Snap Inc. Activating hands-free mode on mirroring device
US11908243B2 (en) 2021-03-16 2024-02-20 Snap Inc. Menu hierarchy navigation on electronic mirroring devices
US11544885B2 (en) 2021-03-19 2023-01-03 Snap Inc. Augmented reality experience based on physical items
US11562548B2 (en) 2021-03-22 2023-01-24 Snap Inc. True size eyewear in real time
US11636654B2 (en) 2021-05-19 2023-04-25 Snap Inc. AR-based connected portal shopping
US11941767B2 (en) 2021-05-19 2024-03-26 Snap Inc. AR-based connected portal shopping
US11941227B2 (en) 2021-06-30 2024-03-26 Snap Inc. Hybrid search system for customizable media
US11854069B2 (en) 2021-07-16 2023-12-26 Snap Inc. Personalized try-on ads
US11908083B2 (en) 2021-08-31 2024-02-20 Snap Inc. Deforming custom mesh based on body mesh
US11983462B2 (en) 2021-08-31 2024-05-14 Snap Inc. Conversation guided augmented reality experience
US11670059B2 (en) 2021-09-01 2023-06-06 Snap Inc. Controlling interactive fashion based on body gestures
US11673054B2 (en) 2021-09-07 2023-06-13 Snap Inc. Controlling AR games on fashion items
US11663792B2 (en) 2021-09-08 2023-05-30 Snap Inc. Body fitted accessory with physics simulation
US11900506B2 (en) 2021-09-09 2024-02-13 Snap Inc. Controlling interactive fashion based on facial expressions
US11734866B2 (en) 2021-09-13 2023-08-22 Snap Inc. Controlling interactive fashion based on voice
US11798238B2 (en) 2021-09-14 2023-10-24 Snap Inc. Blending body mesh into external mesh
US11836866B2 (en) 2021-09-20 2023-12-05 Snap Inc. Deforming real-world object using an external mesh
US11636662B2 (en) 2021-09-30 2023-04-25 Snap Inc. Body normal network light and rendering control
US11983826B2 (en) 2021-09-30 2024-05-14 Snap Inc. 3D upper garment tracking
US11651572B2 (en) 2021-10-11 2023-05-16 Snap Inc. Light and rendering of garments
US11836862B2 (en) 2021-10-11 2023-12-05 Snap Inc. External mesh with vertex attributes
US11790614B2 (en) 2021-10-11 2023-10-17 Snap Inc. Inferring intent from pose and speech input
US11763481B2 (en) 2021-10-20 2023-09-19 Snap Inc. Mirror-based augmented reality experience
US11995757B2 (en) 2021-10-29 2024-05-28 Snap Inc. Customized animation from video
US12020358B2 (en) 2021-10-29 2024-06-25 Snap Inc. Animated custom sticker creation
US11996113B2 (en) 2021-10-29 2024-05-28 Snap Inc. Voice notes with changing effects
US11960784B2 (en) 2021-12-07 2024-04-16 Snap Inc. Shared augmented reality unboxing experience
US11748958B2 (en) 2021-12-07 2023-09-05 Snap Inc. Augmented reality unboxing experience
US11880947B2 (en) 2021-12-21 2024-01-23 Snap Inc. Real-time upper-body garment exchange
US11928783B2 (en) 2021-12-30 2024-03-12 Snap Inc. AR position and orientation along a plane
US11887260B2 (en) 2021-12-30 2024-01-30 Snap Inc. AR position indicator
US11823346B2 (en) 2022-01-17 2023-11-21 Snap Inc. AR body part tracking system
US11954762B2 (en) 2022-01-19 2024-04-09 Snap Inc. Object replacement system
US12002146B2 (en) 2022-03-28 2024-06-04 Snap Inc. 3D modeling based on neural light field
US12020384B2 (en) 2022-06-21 2024-06-25 Snap Inc. Integrating augmented reality experiences with other components
US12020386B2 (en) 2022-06-23 2024-06-25 Snap Inc. Applying pregenerated virtual experiences in new location
US11870745B1 (en) 2022-06-28 2024-01-09 Snap Inc. Media gallery sharing and management
US11893166B1 (en) 2022-11-08 2024-02-06 Snap Inc. User avatar movement control using an augmented reality eyewear device
US12020377B2 (en) 2023-05-09 2024-06-25 Snap Inc. Textured mesh building

Also Published As

Publication number Publication date
KR20130022434A (en) 2013-03-07

Similar Documents

Publication Publication Date Title
WO2013027893A1 (en) Apparatus and method for emotional content services on telecommunication devices, apparatus and method for emotion recognition therefor, and apparatus and method for generating and matching the emotional content using same
US11736756B2 (en) Producing realistic body movement using body images
US20230283748A1 (en) Communication using interactive avatars
WO2020204000A1 (en) Communication assistance system, communication assistance method, communication assistance program, and image control program
US20190197755A1 (en) Producing realistic talking face with expression using images, text and voice
JP6616288B2 (en) Method, user terminal, and server for information exchange in communication
CN116797694A (en) Emotion symbol doll
US20190222806A1 (en) Communication system and method
US20090016617A1 (en) Sender dependent messaging viewer
CN109691054A (en) Animation user identifier
CN108874114B (en) Method and device for realizing emotion expression of virtual object, computer equipment and storage medium
US11151796B2 (en) Systems and methods for providing real-time composite video from multiple source devices featuring augmented reality elements
CN112199016B (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
KR102148151B1 (en) Intelligent chat based on digital communication network
US11423627B2 (en) Systems and methods for providing real-time composite video from multiple source devices featuring augmented reality elements
KR20120018479A (en) Server and method for providing avatar using facial expression and gesture recognition
US11553009B2 (en) Information processing device, information processing method, and computer program for switching between communications performed in real space and virtual space
US20220291752A1 (en) Distributed Application Platform Projected on a Secondary Display for Entertainment, Gaming and Learning with Intelligent Gesture Interactions and Complex Input Composition for Control
KR20130082693A (en) Apparatus and method for video chatting using avatar
CN115396390B (en) Interaction method, system and device based on video chat and electronic equipment
JP2023099309A (en) Method, computer device, and computer program for interpreting voice of video into sign language through avatar
JP5894505B2 (en) Image communication system, image generation apparatus, and program
KR100736541B1 (en) System for unifying a personal character in online networks
Jang et al. Mobile video communication based on augmented reality
CN111461005A (en) Gesture recognition method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 11871115

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: PCT application non-entry in European phase

Ref document number: 11871115

Country of ref document: EP

Kind code of ref document: A1