KR20220077390A

KR20220077390A - Electronic device and Method for controlling the electronic device thereof

Info

Publication number: KR20220077390A
Application number: KR1020200166292A
Authority: KR
Inventors: 김영욱; 길태호; 김경수; 김대훈; 김현한; 백서현; 손규빈; 정호진
Original assignee: 삼성전자주식회사
Priority date: 2020-12-02
Filing date: 2020-12-02
Publication date: 2022-06-09
Also published as: WO2022119179A1

Abstract

An electronic device and a method for controlling the same are provided. The control method acquires an image including a person and an object; inputs the acquired image to a first neural network, trained to obtain feature values for the affordances corresponding to each of a plurality of regions of an object, to obtain a first feature value for the affordances corresponding to the plurality of regions of the object in the image; and recognizes the behavior of the person using the object based on the obtained first feature value.

Description

Electronic device and method for controlling the same

The present disclosure relates to an electronic device and a control method thereof, and more particularly, to an electronic device capable of recognizing the behavior of a person included in a captured image and a control method thereof.

Recently, techniques have emerged for recognizing the behavior of a person included in an image using a neural network model. In particular, in the prior art, a neural network model was trained to derive a class corresponding to the behavior of a person included in an image.

Training a neural network model to derive a class corresponding to a person's behavior requires training data for each class. In addition, obtaining more precise behavior recognition results from the neural network model requires increasing the number of behavior classes. That is, there is a limitation in that the amount of training data grows exponentially as more precise behavior recognition results are sought.

Moreover, a person may perform various behaviors with a single object. For example, when a person in the image holds the handle of a knife, the behavior recognition result may be "a person holds a knife", but when the person holds the blade of the knife, the behavior recognition result may be "a person is cut by a knife".

That is, there is a need to derive different behavior recognition results depending on which of the object's regions is involved in the person's behavior.

An object of the present invention is to provide an electronic device, and a control method thereof, capable of recognizing a person's behavior based on the affordance corresponding to the region related to the person among a plurality of regions of an object included in an image, by means of a first neural network trained to obtain feature values for the affordances corresponding to each of a plurality of regions included in an object.

According to an embodiment of the present disclosure, a method of controlling an electronic device includes: acquiring an image including a person and an object; obtaining a first feature value for the affordances corresponding to a plurality of regions of the object included in the image by inputting the acquired image to a first neural network trained to obtain feature values for the affordances corresponding to each of a plurality of regions included in an object; and recognizing, based on the obtained first feature value, the behavior of the person using the object included in the image.

According to an embodiment of the present disclosure, an electronic device includes: a memory storing at least one instruction; and a processor. By executing the at least one instruction, the processor may acquire an image including a person and an object, obtain a first feature value for the affordances corresponding to a plurality of regions of the object included in the image by inputting the acquired image to a first neural network trained to obtain feature values for the affordances corresponding to each of a plurality of regions included in an object, and recognize, based on the obtained first feature value, the behavior of the person using the object included in the image.

According to the embodiments of the present disclosure described above, the electronic device can more accurately recognize the behavior of a person included in an image.

FIG. 1 is a block diagram illustrating a configuration for recognizing the behavior of a person, according to an embodiment of the present disclosure;
FIG. 2 is a diagram for explaining a method of building a knowledge database, according to an embodiment of the present disclosure;
FIG. 3 is a diagram for explaining a knowledge graph stored in a knowledge database, according to an embodiment of the present disclosure;
FIG. 4 is a diagram for explaining training of a first neural network, according to an embodiment of the present disclosure;
FIGS. 5A and 5B are diagrams for explaining the affordance of each region of an object included in affordance label data, according to an embodiment of the present disclosure;
FIG. 6 is a diagram for explaining training of a first neural network, according to an embodiment of the present disclosure;
FIG. 7 is a block diagram illustrating a configuration for recognizing the behavior of a person, according to another embodiment of the present disclosure;
FIG. 8 is a diagram for explaining a method of recognizing the behavior of a person by a method according to an embodiment of the present disclosure;
FIG. 9 is a flowchart for explaining a method of controlling an electronic device, according to an embodiment of the present disclosure; and
FIG. 10 is a block diagram for describing in detail the configuration of an electronic device, according to an embodiment of the present disclosure.

Since the present embodiments may be variously modified and may take several forms, specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the scope to the specific embodiments, and the disclosure should be understood to include various modifications, equivalents, and/or alternatives of the embodiments. In connection with the description of the drawings, like reference numerals may be used for like components.

In describing the present disclosure, a detailed description of a related known function or configuration is omitted where it is determined that it may unnecessarily obscure the subject matter of the present disclosure.

In addition, the following embodiments may be modified in various other forms, and the scope of the technical idea of the present disclosure is not limited to them. Rather, these embodiments are provided to make the present disclosure more thorough and complete, and to fully convey its technical idea to those skilled in the art.

The terms used in the present disclosure are used only to describe specific embodiments and are not intended to limit the scope of rights. A singular expression includes the plural unless the context clearly dictates otherwise.

In the present disclosure, expressions such as "have," "may have," "include," or "may include" indicate the presence of the corresponding feature (e.g., a numerical value, function, operation, or a component such as a part) and do not exclude the presence of additional features.

In the present disclosure, expressions such as "A or B," "at least one of A and/or B," or "one or more of A and/or B" may include all possible combinations of the items listed together. For example, "A or B," "at least one of A and B," or "at least one of A or B" may refer to any of the following cases: (1) including at least one A, (2) including at least one B, or (3) including both at least one A and at least one B.

Expressions such as "first" and "second" used in the present disclosure may modify various components regardless of order and/or importance; they are used only to distinguish one component from another and do not limit the components.

When a component (e.g., a first component) is said to be "(operatively or communicatively) coupled with/to" or "connected to" another component (e.g., a second component), it should be understood that the component may be connected to the other component directly or through yet another component (e.g., a third component).

On the other hand, when a component (e.g., a first component) is said to be "directly coupled" or "directly connected" to another component (e.g., a second component), it may be understood that no other component (e.g., a third component) exists between the two.

The expression "configured to (or set to)" used in the present disclosure may be used interchangeably with, for example, "suitable for," "having the capacity to," "designed to," "adapted to," "made to," or "capable of," depending on the context. The term "configured to (or set to)" does not necessarily mean only "specifically designed to" in hardware.

Instead, in some circumstances, the expression "a device configured to" may mean that the device is "capable of" an operation together with other devices or parts. For example, the phrase "a processor configured (or set) to perform A, B, and C" may mean a dedicated processor (e.g., an embedded processor) for performing those operations, or a generic-purpose processor (e.g., a CPU or an application processor) capable of performing those operations by executing one or more software programs stored in a memory device.

In the embodiments, a 'module' or 'unit' performs at least one function or operation and may be implemented in hardware, in software, or in a combination of hardware and software. In addition, a plurality of 'modules' or 'units' may be integrated into at least one module and implemented with at least one processor, except for a 'module' or 'unit' that needs to be implemented with specific hardware.

Meanwhile, the various elements and regions in the drawings are drawn schematically. Accordingly, the technical idea of the present invention is not limited by the relative sizes or spacings shown in the accompanying drawings.

Meanwhile, the electronic device according to various embodiments of the present disclosure may include, for example, at least one of a smartphone, a tablet PC, a desktop PC, a laptop PC, or a wearable device. The wearable device may include at least one of an accessory type (e.g., a watch, ring, bracelet, anklet, necklace, glasses, contact lens, or head-mounted device (HMD)), a fabric- or clothing-integrated type (e.g., electronic clothing), a body-attached type (e.g., a skin pad or tattoo), or a bio-implantable circuit.

In some embodiments, the electronic device may include, for example, at least one of a television, a digital video disk (DVD) player, an audio system, a refrigerator, an air conditioner, a vacuum cleaner, an oven, a microwave oven, a washing machine, an air purifier, a set-top box, a home automation control panel, a security control panel, a media box (e.g., Samsung HomeSync™, Apple TV™, or Google TV™), a game console (e.g., Xbox™, PlayStation™), an electronic dictionary, an electronic key, a camcorder, or an electronic picture frame.

Hereinafter, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can easily implement them.

Hereinafter, the present disclosure will be described in more detail with reference to the drawings. FIG. 1 is a block diagram illustrating a configuration of an electronic device according to an embodiment of the present disclosure. The electronic device 100 includes a camera 110, a memory 120, and a processor 130. The electronic device 100 may be implemented as a smartphone. However, the electronic device 100 according to the present disclosure is not limited to a specific type of device and may be implemented as various types of electronic devices, such as a tablet PC or a notebook PC.

The camera 110 may capture an image. In particular, the camera 110 may capture an image including an object and a person. The image may be a still image or a video.

In addition, the camera 110 may include a plurality of lenses that differ from one another. Here, the lenses differing from one another may include cases in which their fields of view (FOV) differ, cases in which the positions at which they are disposed differ, and the like.

The memory 120 may store data that the modules for recognizing the behavior of a person included in an image need in order to perform various operations. The modules for recognizing the behavior of the person may include an image processing module 131, an affordance feature acquisition module 132, a behavior feature acquisition module 134, and a behavior recognition module 136. In addition, the memory 120 may store first to third neural networks 133, 135, and 137 for recognizing the behavior of the person included in the image.

Meanwhile, the memory 120 may include a non-volatile memory, which retains stored information even when power is interrupted, and a volatile memory, which requires a continuous power supply to retain stored information. The data that the modules for recognizing the behavior of a person need in order to perform various operations may be stored in the non-volatile memory. The first neural network 133, which obtains information on the affordances corresponding to each of a plurality of regions included in an object, the second neural network 135, which obtains information on the behavioral features of a person, and the third neural network 137, which recognizes the behavior of the person based on the affordance information and the behavioral feature information, may also be stored in the non-volatile memory.

In addition, the memory 120 may include at least one buffer that temporarily stores image frames acquired through the camera 110.

The processor 130 may be electrically connected to the memory 120 and control the overall functions and operations of the electronic device 100.

When a user command for recognizing the behavior of a person included in an image is input, the processor 130 may load, from the non-volatile memory into the volatile memory, the data that the modules for recognizing the behavior of the person need in order to perform various operations. The processor 130 may also load the first to third neural networks into the volatile memory. The processor 130 may then perform various operations through the modules and neural networks based on the data loaded into the volatile memory. Here, loading refers to the operation of reading data stored in the non-volatile memory and storing it in the volatile memory so that the processor 130 can access it.

Specifically, the processor 130 may acquire an image. The processor 130 may acquire the image through the camera 110, but this is only one embodiment, and the processor 130 may instead acquire the image from an external source (e.g., an external device or an external server). The acquired image may be a video, but this is only one embodiment, and it may be a still image. The acquired image may include at least one person and at least one object. In addition to a person, the acquired image may include a movable animal, object, or the like.

In addition, the processor 130 may acquire an image composed of RGB information, but this is only one embodiment; the processor 130 may acquire an image composed of depth information, and may acquire optical flow information, sound information, and the like through various sensors.

The processor 130 may process the acquired image through the image processing module 131. In particular, the processor 130 may perform image sampling on the acquired video through the image processing module 131. That is, the processor 130 may extract, from among the plurality of image frames included in the acquired video, at least one image frame to be used for behavior recognition. Alternatively, the processor 130 may extract the image frames corresponding to a specific section from among the plurality of image frames included in the acquired video.
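The two sampling options described above can be sketched as follows. This is a minimal illustration only: the disclosure does not specify a sampling strategy, so the even spacing used here, and the names `sample_frames` and `extract_section`, are assumptions.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def sample_frames(frames: Sequence[Frame], num_samples: int) -> List[Frame]:
    """Pick up to num_samples frames spread evenly across the video.

    Even spacing is an assumption; the disclosure only says that at least
    one frame is extracted for behavior recognition.
    """
    if num_samples <= 0 or not frames:
        return []
    if num_samples >= len(frames):
        return list(frames)
    step = len(frames) / num_samples
    # Take the first frame of each of num_samples equal-length segments.
    return [frames[int(i * step)] for i in range(num_samples)]


def extract_section(frames: Sequence[Frame], start: int, end: int) -> List[Frame]:
    """Return the frames in the half-open index range [start, end)."""
    return list(frames[max(start, 0):end])
```

In this sketch a frame can be any object (for example, a decoded image array), so the same helpers work regardless of how the video is decoded.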

Through the affordance feature acquisition module 132, the processor 130 may obtain, from the sampled image frames, a first feature value corresponding to the affordances of a plurality of regions included in the object. Here, an affordance may represent a characteristic of a behavior that a person can perform using an object, or a characteristic inherent in the object. For example, the affordance of "scissors" may be "cutting", and the affordance of "the handle of a cup" may be "graspable".

In particular, the affordance feature acquisition module 132 may obtain the first feature values for the plurality of regions included in the object by using the first neural network 133. The first neural network 133 may be trained to obtain feature values for the affordances corresponding to each of a plurality of regions included in an object. For example, when the object is "scissors", the first neural network 133 may obtain, as the feature value for the "handle region of the scissors", a feature value corresponding to "gripping", the affordance of that region, and may obtain, as the feature value for the "blade region of the scissors", a feature value corresponding to "cutting", the affordance of that region. That is, even for a single object, the first neural network 133 may obtain feature values for different affordances depending on the regions that make up the object.
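As a rough picture of what per-region affordance output might look like, the sketch below represents each region of an object with a set of affordance scores. The `RegionAffordance` type, the score dictionary, and the numeric values are all hypothetical and chosen only to mirror the scissors example; the disclosure does not define the form of the first feature value.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RegionAffordance:
    """Hypothetical container for one object region and its affordance scores."""
    region: str               # e.g. "handle" or "blade"
    scores: Dict[str, float]  # affordance label -> score (illustrative values)

    def top_affordance(self) -> str:
        # The affordance with the highest score for this region.
        return max(self.scores, key=self.scores.get)


# Made-up scores for the scissors example: one object, two regions,
# each with a different dominant affordance.
scissors = [
    RegionAffordance("handle", {"gripping": 0.9, "cutting": 0.1}),
    RegionAffordance("blade", {"gripping": 0.2, "cutting": 0.8}),
]

by_region = {r.region: r.top_affordance() for r in scissors}
```

The point of the structure is that affordances are attached to regions rather than to the object as a whole, which is what lets a later stage distinguish a hand on the handle from a hand on the blade.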

The first neural network 133 may be trained using affordance label data, in which an image of a training object is matched with information on the affordances of a plurality of regions of that object, and a knowledge database, which stores images of general objects matched with a plurality of affordances of those objects. This will be described later with reference to FIGS. 2 to 5.

The processor 130 may obtain a second feature value for the behavior of the person through the behavior feature acquisition module 134. In particular, the behavior feature acquisition module 134 may obtain the second feature value by using the second neural network 135. The second neural network 135 is a neural network trained to obtain information on the behavior of a person, and may obtain a feature value corresponding to a class (or type) of the person's behavior.

Through the behavior recognition module 136, the processor 130 may recognize the behavior of the person included in the image using the first feature value obtained from the affordance feature acquisition module 132 and the second feature value obtained from the behavior feature acquisition module 134.

Specifically, the behavior recognition module 136 may obtain a third feature value from the first feature value obtained from the affordance feature acquisition module 132 and the second feature value obtained from the behavior feature acquisition module 134. For example, the behavior recognition module 136 may obtain the third feature value by performing at least one of sum, concatenate, and pooling operations on the first feature value and the second feature value.
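The three fusion operations named above can be illustrated on plain feature vectors. This is a sketch under assumptions: the disclosure does not state the shape of the feature values or the kind of pooling, so flat lists of floats and element-wise max pooling are used here, and the function names are hypothetical.

```python
from typing import List

Vector = List[float]


def _check_same_length(a: Vector, b: Vector) -> None:
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")


def fuse_sum(first: Vector, second: Vector) -> Vector:
    """Element-wise sum; the output keeps the input dimension."""
    _check_same_length(first, second)
    return [x + y for x, y in zip(first, second)]


def fuse_concat(first: Vector, second: Vector) -> Vector:
    """Concatenation; the output dimension is the sum of both input dimensions."""
    return list(first) + list(second)


def fuse_max_pool(first: Vector, second: Vector) -> Vector:
    """Element-wise max pooling across the two vectors (one possible pooling)."""
    _check_same_length(first, second)
    return [max(x, y) for x, y in zip(first, second)]
```

Sum and pooling require both feature values to share a dimension, while concatenation does not, which is one reason a design might prefer it when the two networks produce features of different sizes.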

The behavior recognition module 136 may then recognize the behavior of the person by inputting the third feature value to the third neural network 137. The third neural network 137 is a neural network model trained to receive the third feature value, which is derived from the first feature value obtained from the affordance feature acquisition module 132 and the second feature value obtained from the behavior feature acquisition module 134, and to output information on the behavior of the person. The behavior recognition module 136 may obtain the final behavior recognition result through the third neural network 137.

According to an embodiment of the present disclosure, the first to third neural networks 133, 135, and 137 may be implemented as convolutional neural network (CNN) models, but this is only one embodiment, and they may be implemented as at least one artificial neural network model among a deep neural network (DNN), a recurrent neural network (RNN), and a generative adversarial network (GAN).

That is, whereas the behavior of a person was conventionally recognized using only the second neural network 135, the present invention recognizes the behavior of the person using the first to third neural networks 133, 135, and 137, which enables more accurate behavior recognition. For example, when only the second neural network 135 is used, as in the prior art, the behavior recognition result "a person holds a knife" is derived both when the person is holding the blade and when the person is holding the handle. According to an embodiment of the present disclosure, however, the result "a person holds a knife" is derived when the person is holding the handle, while the result "a person's hand is cut" may be derived when the person is holding the blade.
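The overall flow through the three networks can be sketched with stand-in functions. Everything here is a toy: the three `toy_*` functions are hypothetical hand-written rules, not trained networks; the dictionary "image" with a `touched_region` key is an assumed stand-in for real pixel data; and concatenation is just one of the fusion options the description mentions. The output strings follow the knife example from the background section.

```python
from typing import Callable, Dict, List

Vector = List[float]
Image = Dict[str, str]  # stand-in for real image data


def recognize_behavior(
    image: Image,
    affordance_net: Callable[[Image], Vector],
    behavior_net: Callable[[Image], Vector],
    recognition_net: Callable[[Vector], str],
) -> str:
    first = affordance_net(image)       # first feature value (region affordances)
    second = behavior_net(image)        # second feature value (behavior class)
    third = list(first) + list(second)  # third feature value via concatenation
    return recognition_net(third)


def toy_affordance_net(image: Image) -> Vector:
    # One-hot over [graspable, cutting] for the region the hand touches.
    return [1.0, 0.0] if image["touched_region"] == "handle" else [0.0, 1.0]


def toy_behavior_net(image: Image) -> Vector:
    # Sees only "holding", whichever region is touched.
    return [1.0]


def toy_recognition_net(features: Vector) -> str:
    graspable, cutting, holding = features
    if holding > 0 and cutting > 0:
        return "a person is cut by a knife"
    return "a person holds a knife"
```

With these stand-ins, the behavior branch alone cannot tell the two cases apart, and the affordance branch supplies the region information that changes the final result.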

Hereinafter, a method of training the first neural network 133 according to an embodiment of the present disclosure will be described with reference to FIGS. 2 to 6.

FIG. 2 is a diagram for explaining a method of constructing a knowledge database used for training the first neural network 133 of the electronic device, according to an embodiment of the present disclosure. In an embodiment, the knowledge database may be built by an external server. However, this is only an example, and the knowledge database may be built by the electronic device 100.

First, the external server may acquire an image of an object (210). The external server may then analyze the acquired image to extract semantic features of the object included in the image (230). Here, the semantic features are the semantic characteristics of the object and may include, but are not limited to, the group of the object, the use of the object, the action of the object, the property of the object, the location of the object, and the association of the object.

In addition, the external server may search the web or documents related to the object (220). The external server may analyze affordances related to the object through the web or the documents (240). Specifically, the external server may extract sentence components (for example, subject, predicate, and object) from the large number of sentences included in the web or the documents. The external server may then obtain, from the extracted sentence components, information about the object and the affordances that describe it. For example, from sentences included in the web or the documents, the external server may acquire "cutting" and "gripping" as affordances related to the object "scissors".
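The affordance analysis step (240) can be illustrated with a toy example. The following is a minimal sketch, assuming that affordances appear in sentences of the form "<object> is/are used for <verb>ing"; the pattern, the function name `extract_affordances`, and the sample corpus are hypothetical and are not taken from the disclosure, which does not specify how sentence components are parsed.

```python
import re
from collections import defaultdict

# Hypothetical surface pattern: "<object> is/are used for <verb>ing".
# A real system would likely use a syntactic parser instead of a regex.
PATTERN = re.compile(r"\b(\w+) (?:is|are) used for (\w+ing)\b", re.IGNORECASE)


def extract_affordances(sentences):
    """Map each object noun to the set of affordance words found with it."""
    affordances = defaultdict(set)
    for sentence in sentences:
        for obj, verb in PATTERN.findall(sentence):
            affordances[obj.lower()].add(verb.lower())
    return dict(affordances)


# Illustrative sentences standing in for web or document text.
corpus = [
    "Scissors are used for cutting paper.",
    "Scissors are used for gripping with two fingers.",
    "A knife is used for cutting food.",
]
result = extract_affordances(corpus)
```

Here `result` associates "scissors" with the affordances "cutting" and "gripping", mirroring the example in the description.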

The external server may generate a knowledge graph based on the acquired image, the semantic features, and the affordances (250). Here, the knowledge graph is a graph representing relationships between pieces of information. For example, as shown in FIG. 3, the knowledge graph may indicate that a scissors image 310 is associated with the affordances "cutting" 320 and "gripping" 330, that a pruning shears image 340 is associated with the affordances "cutting" 320, "gripping" 330, and "locking" 350, and that a utility knife image 360 is associated with the affordances "cutting" 320, "gripping" 330, "retracting" 370, and "blade" 380.

The external server may build a knowledge database from the acquired knowledge graph (260). Here, the knowledge database may store, in the form of a knowledge graph, images of general objects together with information about a plurality of affordances of each object, obtained from information about the general objects found on the web or in documents. The external server may build the knowledge database by associating a plurality of acquired knowledge graphs with one another, or by expanding an acquired knowledge graph.
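The knowledge graph of FIG. 3 and the merging of several graphs into a knowledge database can be sketched as follows. This is only an illustration, assuming a simple mapping from an object image identifier to a set of affordance labels; the class `KnowledgeGraph` and its methods are hypothetical names, and the disclosure does not prescribe a storage format.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy graph: object image identifier -> set of affordance labels."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add(self, image_id, affordances):
        self.edges[image_id].update(affordances)

    def merge(self, other):
        """Associate another graph with this one (union of edges)."""
        for image_id, affordances in other.edges.items():
            self.edges[image_id].update(affordances)

    def objects_with(self, affordance):
        """Return all object images that share the given affordance."""
        return sorted(i for i, a in self.edges.items() if affordance in a)


# Contents follow the example of FIG. 3.
graph_a = KnowledgeGraph()
graph_a.add("scissors", {"cutting", "gripping"})
graph_a.add("pruning_shears", {"cutting", "gripping", "locking"})

graph_b = KnowledgeGraph()
graph_b.add("utility_knife", {"cutting", "gripping", "retracting", "blade"})

graph_a.merge(graph_b)  # build a larger database from several graphs
cutters = graph_a.objects_with("cutting")
```

Looking up objects by a shared affordance, as `objects_with` does, is the kind of query the training step relies on when it pairs a labeled object with general objects that have the same affordance.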

FIG. 4 is a diagram for explaining training of the first neural network, according to an embodiment of the present disclosure. In an embodiment, the first neural network 133 may be trained by an external server. However, this is only an example, and the first neural network 133 may be trained by the electronic device 100.

Specifically, the external server may acquire, in the manner described with reference to FIGS. 2 and 3, a knowledge database that stores images of general objects matched with a plurality of affordances of the general objects (410).

In addition, the external server may acquire affordance label data for a learning object (420). Here, the affordance label data may be data that stores an image of the learning object matched with information about the affordances corresponding to a plurality of regions included in the learning object. The information about the affordances may be expressed in forms such as an image, text, and a multidimensional vector.

In particular, when the information about affordances is expressed as an image, a region of the learning object that has an affordance may be expressed as a bounding box, as shown in FIG. 5A, or as a segmentation map, as shown in FIG. 5B. For example, as shown in FIG. 5A, the affordance information for "scissors" may be expressed by matching a first bounding box 510 containing the blade region with the text "cutting", and a second bounding box 520 containing the handle region with the text "gripping". As another example, as shown in FIG. 5B, the affordance information for "scissors" may be expressed by matching a first segment region 530 containing the blade region with the text "cutting", and a second segment region 540 containing the handle region with the text "gripping". Here, the amount of affordance label data may be smaller than the amount of data included in the knowledge database.
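The bounding-box form of the affordance label data can be pictured with a small data structure. The following is a minimal sketch, assuming pixel coordinates in (x_min, y_min, x_max, y_max) order; the class names, the file name, and the coordinates are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RegionLabel:
    box: tuple       # assumed layout: (x_min, y_min, x_max, y_max) in pixels
    affordance: str  # text label such as "cutting" or "gripping"


@dataclass
class AffordanceLabel:
    image_path: str
    regions: list = field(default_factory=list)


def affordance_at(label, x, y):
    """Return the affordance of the first region containing point (x, y)."""
    for region in label.regions:
        x_min, y_min, x_max, y_max = region.box
        if x_min <= x <= x_max and y_min <= y <= y_max:
            return region.affordance
    return None


# Illustrative label for a scissors image, in the spirit of FIG. 5A.
scissors = AffordanceLabel(
    image_path="scissors.png",
    regions=[
        RegionLabel((0, 0, 60, 40), "cutting"),     # blade region
        RegionLabel((61, 0, 120, 40), "gripping"),  # handle region
    ],
)
```

A segmentation-map label would replace each box with a per-pixel mask; the matching between a region and its affordance text would stay the same.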

The external server may train the first neural network 133 using the knowledge database and the affordance label data (430). Specifically, the external server may input a general object image included in the knowledge database into the first neural network 133 to obtain feature values for a plurality of regions included in the general object. The external server may also input a learning object image included in the affordance label data into the first neural network 133 to obtain feature values for a plurality of regions included in the learning object. Here, the external server may train the first neural network 133 so that the feature value output when the image of the learning object of the affordance label data is input and the feature value output when a general object image from the knowledge database having the same affordance as the learning object is input fall within a threshold range of each other. Falling within the threshold range means that the two feature values are identical or similar.

For example, as shown in FIG. 6, the external server may train the first neural network 133 so that the feature value obtained when an image 610 of the learning object "scissors" is input to the first neural network 133 becomes identical or similar to the feature values obtained when images 620, 630, and 640 of "pruning shears", a "knife", and a "chainsaw", which have the same affordance (that is, cutting) as the blade region of the scissors, are input to the first neural network 133. That is, the first neural network 133 may be trained so that the feature values of objects containing regions with the same affordance become similar to one another. Through this, the first neural network may be trained to obtain, from an image, information about the affordance corresponding to each of a plurality of regions constituting an object.
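One way to read the "threshold range" condition is as a penalty on feature distances that exceed a threshold. The sketch below is only an assumption about how such an objective could be written: the Euclidean distance, the hinge form, and the names `distance` and `affordance_alignment_loss` are illustrative choices, and the disclosure does not specify the loss function actually used.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def affordance_alignment_loss(label_feature, kb_features, threshold):
    """Mean hinge penalty on distances that exceed `threshold`.

    label_feature: feature of a labeled region (e.g. scissors blade).
    kb_features:   features of knowledge-database objects that share the
                   same affordance (e.g. pruning shears, knife, chainsaw).
    The loss is zero once every pair lies within the threshold range.
    """
    penalties = [
        max(0.0, distance(label_feature, feature) - threshold)
        for feature in kb_features
    ]
    return sum(penalties) / len(penalties)


# Toy 2-D features: the first is already close, the second is far away.
loss = affordance_alignment_loss(
    label_feature=[0.0, 0.0],
    kb_features=[[0.3, 0.4], [3.0, 4.0]],
    threshold=1.0,
)
```

Minimizing a loss of this shape pulls features of regions with the same affordance toward one another, which matches the stated goal that objects sharing an affordance yield similar feature values.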

Through the method described above, the external server can train the first neural network 133, which is capable of obtaining affordance information for a plurality of regions included in an object, using only a small amount of affordance label data.

Meanwhile, the embodiment above describes a method of recognizing a person's behavior using the first to third neural networks. However, this is only an example, and a person's behavior may also be recognized using the first neural network and an object recognition model. Hereinafter, a method of recognizing a person's behavior using the first neural network and an object recognition model will be described with reference to FIGS. 7 and 8.

FIG. 7 is a block diagram illustrating a configuration of an electronic device according to an embodiment of the present disclosure. The electronic device 700 includes a camera 710, a memory 720, and a processor 730. Since the camera 710 and the memory 720 have the same configuration as the camera 110 and the memory 120 described with reference to FIG. 1, overlapping descriptions are omitted.

Specifically, the processor 730 may acquire an image. In particular, the processor 730 may acquire the image through the camera 710. However, this is only an example, and the image may be acquired from an external source (for example, an external device or an external server). The acquired image may be a video, but this is only an example, and it may be a still image. The acquired image may include at least one person and at least one object. The acquired image may also include, besides a person, an animal or an object that can move. For example, the processor 730 may acquire an image including a human hand 810, a knife 820, and an apple 830, as shown in FIG. 8.

The processor 730 may process the acquired image through the image processing module 731. In particular, the processor 730 may perform image sampling on the acquired video through the image processing module 731.

Through the affordance feature acquisition module 732, the processor 730 may obtain, from a sampled image frame, first feature values corresponding to the affordances of a plurality of regions included in an object. In particular, the affordance feature acquisition module 732 may obtain the first feature values for the plurality of regions included in the object by using the first neural network 733.

For example, when an image such as that of FIG. 8 is input, the first neural network 733 may obtain, as the feature value for the handle region of the knife, a feature value corresponding to "gripping", which is the affordance of the handle region, and may obtain, as the feature value for the blade region of the knife, a feature value corresponding to "cutting", which is the affordance of the blade region. The affordance feature acquisition module 732 may then obtain "gripping" and "cutting" as the affordances of the knife from the feature values obtained through the first neural network 733.

The processor 730 may obtain information about the objects (including persons) in the image by using the object information acquisition module 734. In particular, the object information acquisition module 734 may input the acquired image into the fourth neural network 735 to obtain information about the person and the objects included in the image. Here, the fourth neural network 735 may be a neural network model trained to recognize a person or an object. For example, when the image shown in FIG. 8 is input, the object information acquisition module 734 may obtain, as the object recognition result, information about the human hand 810, the knife 820-1 and 820-2, and the apple 830.

Through the behavior recognition module 736, the processor 730 may recognize the behavior of the person in the image based on the affordance information obtained through the affordance feature acquisition module 732 and the object information obtained through the object information acquisition module 734. Specifically, the behavior recognition module 736 may recognize the person's behavior by using the relationship between each region of the object and the person or other object related to that region. Here, the behavior recognition module 736 may determine the relationship based on the positional relationship between the object and the person or the other object.

For example, suppose that "gripping" is obtained through the affordance feature acquisition module 732 as the affordance of the knife handle, "cutting" is obtained as the affordance of the knife blade, and information about the human hand 810, the knife 820-1 and 820-2, and the apple 830 is obtained as the object information. In this case, the behavior recognition module 736 may obtain the recognition result "a person holds a knife" based on the person related to the knife handle and the corresponding affordance, and may obtain the recognition result "an apple is cut with a knife" based on the other object related to the knife blade and the corresponding affordance. Accordingly, the behavior recognition module 736 may output "a person holds a knife and cuts an apple with the knife" as the final behavior recognition result.
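The use of positional relationships in the example above can be sketched with axis-aligned bounding boxes. This is a minimal illustration, assuming that a region and an object are "related" when their boxes overlap; the function names, the overlap rule, and all coordinates are hypothetical, and the disclosure does not limit the positional relationship to box overlap.

```python
def overlaps(a, b):
    """True if two boxes (x_min, y_min, x_max, y_max) intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def relate_regions(regions, objects):
    """Pair each affordance region with every recognized object it overlaps.

    regions: list of (box, affordance) for one tool, e.g. a knife.
    objects: list of (name, box) from the object recognition model.
    Returns a list of (affordance, object_name) relations.
    """
    relations = []
    for region_box, affordance in regions:
        for name, object_box in objects:
            if overlaps(region_box, object_box):
                relations.append((affordance, name))
    return relations


# Illustrative coordinates loosely following FIG. 8.
knife_regions = [
    ((0, 0, 50, 20), "gripping"),   # handle region
    ((50, 0, 120, 20), "cutting"),  # blade region
]
recognized = [
    ("hand", (10, 5, 45, 30)),
    ("apple", (90, 10, 130, 50)),
]
relations = relate_regions(knife_regions, recognized)
```

The resulting relations, a "gripping" region touching the hand and a "cutting" region touching the apple, correspond to the two partial results that the behavior recognition module combines into the final output.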

FIG. 9 is a diagram for explaining a method of controlling an electronic device according to an embodiment of the present disclosure.

First, the electronic device 100 may acquire an image including a person and an object (S910). Here, the electronic device 100 may acquire the image through a camera. However, this is only an example, and the image may be received from an external device or an external server.

The electronic device 100 may input the acquired image into the first neural network to obtain first feature values for the affordances corresponding to a plurality of regions of the object included in the image (S920). Here, the first neural network may be trained to obtain feature values for the affordances corresponding to each of a plurality of regions included in an object. In particular, the first neural network may be trained using affordance label data, which matches an image of a learning object with information about the affordances of a plurality of regions of the learning object, and a knowledge database, which stores images of general objects matched with a plurality of affordances of the general objects.

The electronic device 100 may recognize the behavior of the person using the object based on the obtained first feature values (S930). In one embodiment, the electronic device 100 may input the acquired image into a second neural network trained to obtain information about a person's behavior, thereby obtaining a second feature value corresponding to the person's behavior. The electronic device 100 may then obtain a third feature value based on the first feature value and the second feature value, and may recognize the person's behavior by inputting the third feature value into a third neural network trained to recognize a person's behavior. In another example, the electronic device 100 may input the image into a fourth neural network trained to recognize objects, thereby recognizing the person or other object associated with the object, and may recognize the person's behavior using the affordance corresponding to the first feature value and the recognized person or other object.
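The data flow of the first embodiment in step S930 can be sketched with stand-in functions in place of trained networks. This is only an illustration: combining the first and second feature values by concatenation is an assumption (the disclosure says only that the third feature value is obtained "based on" the other two), and the toy networks, their outputs, and the `touched_region` field are all hypothetical.

```python
def recognize_action(image, affordance_net, behavior_net, classifier_net):
    """Run the three-network flow: first + second features -> third -> action."""
    first = affordance_net(image)        # first feature value (affordance)
    second = behavior_net(image)         # second feature value (behavior)
    third = list(first) + list(second)   # assumed fusion: concatenation
    return classifier_net(third)


# Stand-ins for trained networks, for illustration only.
def toy_affordance_net(image):
    # [gripping score, cutting score] of the region the hand touches.
    return [1.0, 0.0] if image["touched_region"] == "handle" else [0.0, 1.0]


def toy_behavior_net(image):
    return [1.0]  # 1.0 means "a hand is in contact with a knife"


def toy_classifier_net(features):
    gripping, cutting, contact = features
    if contact and cutting > gripping:
        return "a person cuts their hand"
    return "a person holds a knife"


on_handle = recognize_action(
    {"touched_region": "handle"},
    toy_affordance_net, toy_behavior_net, toy_classifier_net,
)
on_blade = recognize_action(
    {"touched_region": "blade"},
    toy_affordance_net, toy_behavior_net, toy_classifier_net,
)
```

With the behavior feature alone, both inputs look identical; the affordance feature is what lets the final stage tell the handle case from the blade case.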

As described above, by recognizing a person's behavior based on the affordances of a plurality of regions of an object, the electronic device can recognize the person's behavior more accurately.

FIG. 10 is a block diagram for describing in detail the configuration of an electronic device according to an embodiment of the present disclosure. As shown in FIG. 10, the electronic device 1000 according to the present disclosure may include a display 1010, a speaker 1020, a camera 1030, a memory 1040, a communication interface 1050, an input interface 1060, a sensor 1070, and a processor 1080. However, this configuration is illustrative, and in practicing the present disclosure new components may of course be added to it or some components may be omitted. Since the camera 1030, the memory 1040, and the processor 1080 have the same configuration as the camera 110, the memory 120, and the processor 130 described with reference to FIG. 1, overlapping descriptions are omitted.

The display 1010 may display an image captured through the camera 1030. The display 1010 may also display, within the captured image, a bounding box surrounding a document whose shape is deformed. In addition, the display 1010 may display a UI for receiving a user command to perform recognition of a person's behavior.

The display 1010 may be implemented as a liquid crystal display (LCD) panel, organic light emitting diodes (OLED), or the like, and may in some cases be implemented as a flexible display, a transparent display, or the like. However, the display 1010 according to the present disclosure is not limited to a specific type.

The speaker 1020 may output a voice message. In particular, the speaker 1020 may be included inside the electronic device 1000. However, this is only an example, and the speaker may be located outside the electronic device 1000 while being electrically connected to it. Here, the speaker 1020 may output a voice message announcing the result of the behavior recognition performed on the captured image.

The communication interface 1050 includes circuitry and may communicate with an external device. Specifically, the processor 1080 may receive various data or information from an external device connected through the communication interface 1050, and may also transmit various data or information to the external device.

The communication interface 1050 may include at least one of a WiFi module, a Bluetooth module, a wireless communication module, and an NFC module. Specifically, the WiFi module and the Bluetooth module may communicate using the WiFi method and the Bluetooth method, respectively. When the WiFi module or the Bluetooth module is used, various connection information such as an SSID may first be exchanged, a communication connection may be established using that information, and various information may then be transmitted and received.

In addition, the wireless communication module may communicate according to various communication standards such as IEEE, Zigbee, 3G (3rd Generation), 3GPP (3rd Generation Partnership Project), LTE (Long Term Evolution), and 5G (5th Generation). The NFC module may communicate using the NFC (Near Field Communication) method, which uses the 13.56 MHz band among various RF-ID frequency bands such as 135 kHz, 13.56 MHz, 433 MHz, 860 to 960 MHz, and 2.45 GHz.

In particular, in various embodiments according to the present disclosure, the communication interface 1050 may receive various types of information from an external device, such as data related to the first to fourth neural networks 133, 135, 137, and 735. The communication interface 1050 may also receive an image including a person and an object from an external terminal or server.

The input interface 1060 includes circuitry, and the processor 1080 may receive, through the input interface 1060, a user command for controlling the operation of the electronic device 1000. Specifically, the input interface 1060 may be implemented as a touch screen included in the display 1010. However, this is only an example, and it may be composed of components such as a button, a microphone, and a remote control signal receiver (not shown).

In particular, in various embodiments according to the present disclosure, the input interface 1060 may receive various user commands, such as a user command to run a camera application, a user command to capture an image, and a user command, entered through the UI, to recognize a person's behavior.

The sensor 1070 may acquire various information related to the electronic device 1000. In particular, the sensor 1070 may include a GPS capable of acquiring location information of the electronic device 1000, and may include various sensors such as a biometric sensor (for example, a heart rate sensor or a PPG sensor) for acquiring biometric information of a user of the electronic device 1000 and a motion sensor for detecting motion of the electronic device 1000.

Meanwhile, the functions related to the neural network models described above may be performed through a memory and a processor. The processor may consist of one or more processors. Here, the one or more processors may be general-purpose processors such as a CPU or an AP, graphics-dedicated processors such as a GPU or a VPU, or artificial-intelligence-dedicated processors such as an NPU. The one or more processors control the processing of input data according to a predefined operation rule or an artificial intelligence model stored in non-volatile memory and volatile memory. The predefined operation rule or artificial intelligence model is characterized in that it is created through learning.

Here, being created through learning means that a predefined operation rule or an artificial intelligence model with desired characteristics is created by applying a learning algorithm to a large amount of training data. Such learning may be performed in the device itself on which the artificial intelligence according to the present disclosure runs, or may be performed through a separate server or system.

An artificial intelligence model may consist of a plurality of neural network layers. Each layer has a plurality of weight values and performs its operation by combining the operation result of the previous layer with those weights. Examples of neural networks include a CNN (Convolutional Neural Network), a DNN (Deep Neural Network), an RNN (Recurrent Neural Network), an RBM (Restricted Boltzmann Machine), a DBN (Deep Belief Network), a BRDNN (Bidirectional Recurrent Deep Neural Network), a GAN (Generative Adversarial Network), and Deep Q-Networks, and the neural network in the present disclosure is not limited to these examples unless otherwise specified.
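The statement that each layer combines the previous layer's result with its own weights can be made concrete with a fully connected layer. The sketch below is a generic textbook illustration, not the networks of the disclosure; the bias term, the ReLU activation, and the sample numbers are assumptions added for the example.

```python
def dense_layer(inputs, weights, biases):
    """Fully connected layer: ReLU(weights . inputs + biases).

    inputs:  output of the previous layer (list of floats).
    weights: one row of weights per output unit.
    biases:  one bias per output unit (an assumption; the text mentions
             only weights).
    """
    outputs = []
    for row, bias in zip(weights, biases):
        pre_activation = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(max(0.0, pre_activation))  # ReLU activation
    return outputs


# Two stacked layers: the second consumes the result of the first.
hidden = dense_layer([1.0, 2.0], [[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0])
output = dense_layer(hidden, [[2.0, 1.0]], [0.5])
```

Stacking calls in this way, with each call taking the previous result as input, is the layer-by-layer computation the paragraph describes.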

The learning algorithm is a method of training a predetermined target device (e.g., a robot) using a plurality of pieces of training data so that the target device can make decisions or predictions by itself. Examples of learning algorithms include supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning, and the learning algorithm in the present disclosure is not limited to the above-described examples unless otherwise specified.
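As one illustration of supervised learning, the sketch below fits a one-variable linear model to labeled examples by gradient descent on the mean squared error. The model, the loss, the hyperparameters, and the names `train_linear_model` and `samples` are assumptions chosen for brevity; they are not the training procedure of the disclosure.

```python
from typing import List, Tuple


def train_linear_model(
    training_data: List[Tuple[float, float]],
    learning_rate: float = 0.05,
    epochs: int = 2000,
) -> Tuple[float, float]:
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(training_data)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in training_data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in training_data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical labeled examples generated from y = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_linear_model(samples)

# After training, the model predicts an output for an input it has not seen.
prediction = w * 4.0 + b
```

The trained parameters stand in for the "predefined operation rule or artificial intelligence model" that results from applying a learning algorithm to training data.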

The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, 'non-transitory storage medium' only means that the medium is a tangible device and does not include a signal (e.g., an electromagnetic wave); the term does not distinguish between a case where data is stored semi-permanently in the storage medium and a case where data is stored temporarily. For example, a 'non-transitory storage medium' may include a buffer in which data is temporarily stored.

According to an embodiment, the methods according to the various embodiments disclosed in this document may be provided as part of a computer program product. The computer program product may be traded between a seller and a buyer as a commodity. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., compact disc read only memory (CD-ROM)), or may be distributed online (e.g., downloaded or uploaded) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product (e.g., a downloadable app) may be at least temporarily stored in, or temporarily created in, a machine-readable storage medium such as a memory of a manufacturer's server, an application store's server, or a relay server.

Each of the components (e.g., a module or a program) according to the various embodiments of the present disclosure described above may be composed of a single entity or a plurality of entities; some of the aforementioned sub-components may be omitted, or other sub-components may be further included in various embodiments. Alternatively or additionally, some components (e.g., modules or programs) may be integrated into a single entity that performs, identically or similarly, the functions performed by each of the corresponding components before integration.

Operations performed by a module, a program, or another component according to various embodiments may be executed sequentially, in parallel, repetitively, or heuristically; at least some of the operations may be executed in a different order or omitted, or other operations may be added.

Meanwhile, the term "unit" or "module" used in the present disclosure includes a unit composed of hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit. A "unit" or "module" may be an integrally formed component, or a minimum unit, or a part thereof, that performs one or more functions. For example, a module may be configured as an application-specific integrated circuit (ASIC).

Various embodiments of the present disclosure may be implemented as software including instructions stored in a machine-readable storage medium, that is, a storage medium readable by a machine (e.g., a computer). The machine is a device that can call the stored instructions from the storage medium and operate according to the called instructions, and may include the electronic device (e.g., the electronic device 100) according to the disclosed embodiments.

When an instruction is executed by a processor, the processor may perform the function corresponding to the instruction directly, or by using other components under the control of the processor. The instructions may include code generated or executed by a compiler or an interpreter.

Although preferred embodiments of the present disclosure have been illustrated and described above, the present disclosure is not limited to the specific embodiments described above. Various modifications may of course be made by those of ordinary skill in the art to which the disclosure pertains without departing from the gist of the disclosure as claimed in the claims, and such modifications should not be understood separately from the technical idea or perspective of the present disclosure.

100: electronic device 110: camera
120: memory 140: processor

Claims

A method for controlling an electronic device, the method comprising:
obtaining an image including a person and an object;
obtaining a first feature value for affordances corresponding to a plurality of regions of the object included in the image by inputting the obtained image to a first neural network trained to obtain a feature value for an affordance corresponding to each of a plurality of regions included in an object; and
recognizing a behavior of the person using the object based on the obtained first feature value.

The method of claim 1, further comprising:
obtaining a second feature value corresponding to the behavior of the person by inputting the obtained image to a second neural network trained to obtain information on a behavior of a person,
wherein the recognizing comprises
recognizing the behavior of the person using the object based on the first feature value and the second feature value.

The method of claim 2,
wherein the recognizing comprises:
obtaining a third feature value based on the first feature value and the second feature value; and
recognizing the behavior of the person by inputting the third feature value to a third neural network trained to recognize a behavior of a person.

The method of claim 1, further comprising:
recognizing a person or another object associated with the object by inputting the image to a fourth neural network trained to recognize objects,
wherein the recognizing of the behavior of the person comprises
recognizing the behavior of the person using the affordance corresponding to the first feature value and the recognized person or other object.

The method of claim 1,
wherein the first neural network is
trained using a knowledge database that stores an image of a learning object matched with information on affordances for a plurality of regions of the learning object, and affordance label data that stores an image of a general object matched with a plurality of affordances for the general object.

The method of claim 5,
wherein the first neural network is
trained such that a feature value output when an image of a learning object in the affordance label data is input to the first neural network, and a feature value output when an image of a general object that is included in the knowledge database and has the same affordance as the learning object is input to the first neural network, are within a threshold range.

The method of claim 5,
wherein the knowledge database
stores, in the form of a knowledge graph, information on a plurality of affordances of the object obtained based on the image of the general object and on information related to the general object included in the web or in documents.

The method of claim 5,
wherein the number of pieces of the affordance label data is smaller than the number of pieces of data included in the knowledge database.

An electronic device comprising:
a memory storing at least one instruction; and
a processor,
wherein the processor, by executing the at least one instruction, is configured to:
obtain an image including a person and an object,
obtain a first feature value for affordances corresponding to a plurality of regions of the object included in the image by inputting the obtained image to a first neural network trained to obtain a feature value for an affordance corresponding to each of a plurality of regions included in an object, and
recognize a behavior of the person using the object based on the obtained first feature value.

The electronic device of claim 9,
wherein the processor is configured to:
obtain a second feature value corresponding to the behavior of the person by inputting the obtained image to a second neural network trained to obtain information on a behavior of a person, and
recognize the behavior of the person using the object based on the first feature value and the second feature value.

The electronic device of claim 10,
wherein the processor is configured to:
obtain a third feature value based on the first feature value and the second feature value, and
recognize the behavior of the person by inputting the third feature value to a third neural network trained to recognize a behavior of a person.

The electronic device of claim 9,
wherein the processor is configured to:
recognize a person or another object associated with the object by inputting the image to a fourth neural network trained to recognize objects, and
recognize the behavior of the person using the affordance corresponding to the first feature value and the recognized person or other object.

The electronic device of claim 9,
wherein the first neural network is
trained using a knowledge database that stores an image of a learning object matched with information on affordances for a plurality of regions of the learning object, and affordance label data that stores an image of a general object matched with a plurality of affordances for the general object.

The electronic device of claim 13,
wherein the first neural network is
trained such that a feature value output when an image of a learning object in the affordance label data is input to the first neural network, and a feature value output when an image of a general object that is included in the knowledge database and has the same affordance as the learning object is input to the first neural network, are within a threshold range.

The electronic device of claim 14,
wherein the knowledge database
stores, in the form of a knowledge graph, information on a plurality of affordances of the object obtained based on the image of the general object and on information related to the general object included in the web or in documents.

The electronic device of claim 14,
wherein the number of pieces of the affordance label data is smaller than the number of pieces of data included in the knowledge database.