WO2022119179A1

WO2022119179A1 - Electronic device and method for controlling same

Info

Publication number: WO2022119179A1
Application number: PCT/KR2021/016776
Authority: WO
Inventors: 김영욱; 길태호; 김경수; 김대훈; 김현한; 백서현; 손규빈; 정호진
Original assignee: 삼성전자주식회사
Priority date: 2020-12-02
Filing date: 2021-11-16
Publication date: 2022-06-09
Also published as: KR20220077390A

Abstract

Provided are an electronic device and a method for controlling same. The present method for controlling the electronic device comprises obtaining an image including a person and an object, inputting the obtained image into a first neural network, which is trained to obtain characteristic values for affordances respectively corresponding to a plurality of regions included in the object, to obtain first characteristic values for affordances corresponding to the plurality of regions of the object included in the image, and recognizing a behavior of the person using the object, on the basis of the obtained first characteristic values.

Description

Electronic device and control method thereof

The present disclosure relates to an electronic device and a control method thereof, and more particularly, to an electronic device capable of recognizing a behavior of a person included in a photographed image and a control method thereof.

Recently, techniques have been developed for recognizing the behavior of a person included in an image using a neural network model. In particular, in the prior art, a neural network model is trained to derive a class corresponding to the behavior of a person included in an image.

In order to train a neural network model to derive a class corresponding to a person's behavior, training data corresponding to the class is required. In addition, in order to obtain a more precise human behavior recognition result through the neural network model, it is necessary to increase the number of classes corresponding to human behaviors. That is, there is a limitation in that the training data increases exponentially in order to obtain a more precise human behavior recognition result through the neural network model.

In addition, a person may perform various actions on a single object. For example, when a person in the image holds a knife handle, the behavior recognition result may be “a person is holding a knife”, but when a person in the image holds a knife blade, the behavior recognition result may be “a person is cut by a knife”.

That is, there is a need to derive different behavior recognition results according to a region related to a person's behavior among a plurality of regions of an object.

It is an object of the present invention to provide an electronic device, and a method for controlling the same, capable of obtaining an affordance corresponding to a person-related region among a plurality of regions of an object included in an image, through a first neural network trained to obtain a feature value of an affordance corresponding to each of a plurality of regions included in an object, and of recognizing the person's behavior based on the obtained affordance.

According to an embodiment of the present disclosure, a method of controlling an electronic device includes: acquiring an image including a person and an object; obtaining a first feature value for affordances corresponding to a plurality of regions of the object included in the image by inputting the acquired image to a first neural network trained to obtain a feature value of an affordance corresponding to each of a plurality of regions included in an object; and recognizing the behavior of the person using the object included in the image based on the obtained first feature value.

According to an embodiment of the present disclosure, an electronic device includes: a memory storing at least one instruction; and a processor, wherein the processor, by executing the at least one instruction, acquires an image including a person and an object, inputs the acquired image to a first neural network trained to obtain a feature value of an affordance corresponding to each of a plurality of regions included in an object to obtain a first feature value for affordances corresponding to the plurality of regions of the object included in the image, and recognizes the behavior of the person using the object included in the image based on the obtained first feature value.

According to the embodiment of the present disclosure as described above, the electronic device can more accurately recognize the behavior of a person included in an image.

FIG. 1 is a block diagram showing a configuration for recognizing a behavior of a person, according to an embodiment of the present disclosure;

FIG. 2 is a diagram for explaining a method of building a knowledge database, according to an embodiment of the present disclosure;

FIG. 3 is a diagram for explaining a knowledge graph stored in a knowledge database, according to an embodiment of the present disclosure;

FIG. 4 is a diagram for explaining training of a first neural network, according to an embodiment of the present disclosure;

FIGS. 5A and 5B are diagrams for explaining the affordance for each region of an object included in affordance label data, according to an embodiment of the present disclosure;

FIG. 6 is a diagram for explaining training of a first neural network, according to an embodiment of the present disclosure;

FIG. 7 is a block diagram illustrating a configuration for recognizing a behavior of a person, according to another embodiment of the present disclosure;

FIG. 8 is a diagram for explaining a method for recognizing a person's behavior, according to an embodiment of the present disclosure;

FIG. 9 is a flowchart for explaining a method of controlling an electronic device, according to an embodiment of the present disclosure;

FIG. 10 is a block diagram for describing in detail the configuration of an electronic device according to an embodiment of the present disclosure.

Since the present embodiments may be variously modified and may take various forms, specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the scope to the specific embodiments, and the disclosure should be understood to include various modifications, equivalents, and/or alternatives of the embodiments of the present disclosure. In connection with the description of the drawings, like reference numerals may be used for like components.

In describing the present disclosure, if it is determined that a detailed description of a related known function or configuration may unnecessarily obscure the subject matter of the present disclosure, a detailed description thereof will be omitted.

In addition, the following embodiments may be modified in various other forms, and the scope of the technical spirit of the present disclosure is not limited to the following embodiments. Rather, these embodiments are provided to make the present disclosure more thorough and complete, and to fully convey the technical spirit of the present disclosure to those skilled in the art.

The terms used in the present disclosure are used only to describe specific embodiments, and are not intended to limit the scope of rights. The singular expression includes the plural expression unless the context clearly dictates otherwise.

In the present disclosure, expressions such as “have,” “may have,” “include,” or “may include” indicate the presence of a corresponding feature (eg, a numerical value, function, operation, or component such as a part) and do not exclude the presence of additional features.

In this disclosure, expressions such as "A or B," "at least one of A or/and B," or "one or more of A or/and B" may include all possible combinations of the items listed together. For example, "A or B," "at least one of A and B," or "at least one of A or B" may refer to all of the cases of (1) including at least one A, (2) including at least one B, or (3) including both at least one A and at least one B.

As used in the present disclosure, expressions such as “first” or “second” may modify various elements regardless of order and/or importance, and are used only to distinguish one element from another, without limiting the elements.

When a component (eg, a first component) is referred to as being "(operatively or communicatively) coupled with/to" or "connected to" another component (eg, a second component), it will be understood that the component may be directly connected to the other component or may be connected through yet another component (eg, a third component).

On the other hand, when a component (eg, a first component) is referred to as being "directly coupled" or "directly connected" to another component (eg, a second component), it may be understood that no other component (eg, a third component) exists between the component and the other component.

The expression “configured to (or set to)” as used in this disclosure may, depending on the context, be used interchangeably with, for example, “suitable for,” “having the capacity to,” “designed to,” “adapted to,” “made to,” or “capable of.” The term “configured to (or set to)” does not necessarily mean only “specifically designed to” in hardware.

Instead, in some circumstances, the expression “a device configured to” may mean that the device is “capable of” operating together with other devices or parts. For example, the phrase "a processor configured (or set) to perform A, B, and C" may mean a dedicated processor (eg, an embedded processor) for performing the corresponding operations, or a general-purpose processor (eg, a CPU or an application processor) capable of performing the corresponding operations by executing one or more software programs stored in a memory device.

In an embodiment, a 'module' or 'unit' performs at least one function or operation, and may be implemented as hardware or software, or a combination of hardware and software. In addition, a plurality of 'modules' or a plurality of 'units' may be integrated into at least one module and implemented with at least one processor, except for 'modules' or 'units' that need to be implemented with specific hardware.

Meanwhile, various elements and regions in the drawings are schematically drawn. Accordingly, the technical spirit of the present invention is not limited by the relative size or spacing drawn in the accompanying drawings.

Meanwhile, the electronic device according to various embodiments of the present disclosure may include, for example, at least one of a smart phone, a tablet PC, a desktop PC, a laptop PC, and a wearable device. A wearable device may include at least one of an accessory type (eg, watch, ring, bracelet, anklet, necklace, eyewear, contact lens, or head-mounted device (HMD)), a textile or clothing integrated type (eg, electronic garment), a body-attached type (eg, skin pad or tattoo), or a bio-implantable circuit.

In some embodiments, the electronic device may include, for example, at least one of a television, digital video disk (DVD) player, audio system, refrigerator, air conditioner, vacuum cleaner, oven, microwave oven, washing machine, air purifier, set-top box, home automation control panel, security control panel, media box (eg, Samsung HomeSync™, Apple TV™, or Google TV™), game console (eg, Xbox™, PlayStation™), electronic dictionary, electronic key, camcorder, or electronic picture frame.

Hereinafter, embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can easily implement them.

Hereinafter, the present disclosure will be described in more detail with reference to the drawings. FIG. 1 is a block diagram illustrating a configuration of an electronic device according to an embodiment of the present disclosure. The electronic device 100 includes a camera 110, a memory 120, and a processor 130. In this case, the electronic device 100 may be implemented as a smart phone. However, the electronic device 100 according to the present disclosure is not limited to a specific type of device, and may be implemented as various types of electronic devices such as a tablet PC and a notebook PC.

The camera 110 may capture an image. In particular, the camera 110 may capture an image including an object and a person. In this case, the image may be a still image or a moving image.

Also, the camera 110 may include a plurality of lenses different from each other. Here, the lenses being different from each other may include a case in which the fields of view (FOV) of the lenses differ from each other, a case in which the positions at which the lenses are disposed differ, and the like.

The memory 120 may store data necessary for the modules for recognizing the behavior of a person included in an image to perform various operations. The modules for recognizing the behavior of the person may include an image processing module 131, an affordance feature acquisition module 132, a behavior feature acquisition module 134, and a behavior recognition module 136. Also, the memory 120 may store the first to third neural networks 133, 135, and 137 for recognizing the behavior of a person included in the image.

Meanwhile, the memory 120 may include a non-volatile memory capable of maintaining stored information even when power supply is interrupted, and a volatile memory requiring continuous power supply to maintain the stored information. Data for the modules for recognizing the behavior of a person to perform various operations may be stored in the non-volatile memory. In addition, the first neural network 133 for acquiring information on affordances corresponding to each of the plurality of regions included in the object, the second neural network 135 for acquiring information on the behavioral characteristics of a person, and the third neural network 137 for recognizing the behavior of the person based on the information on the affordances and the information on the behavioral characteristics of the person may also be stored in the non-volatile memory.

Also, the memory 120 may include at least one buffer for temporarily storing an image frame acquired through the camera 110 .

The processor 130 may be electrically connected to the memory 120 to control overall functions and operations of the electronic device 100 .

When a user command for recognizing the behavior of a person included in an image is input, the processor 130 may load, into the volatile memory, the data stored in the non-volatile memory that the modules for recognizing the behavior of the person need to perform various operations. In addition, the processor 130 may load the first to third neural networks into the volatile memory. The processor 130 may perform various operations through the modules and neural networks based on the data loaded into the volatile memory. Here, loading refers to an operation of reading data stored in the non-volatile memory and storing it in the volatile memory so that the processor 130 can access it.

Specifically, the processor 130 may acquire an image. In particular, the processor 130 may acquire an image through the camera 110 , but this is only an exemplary embodiment, and the processor 130 may acquire the image externally (eg, an external device, an external server, etc.). In this case, the acquired image may be a moving image, but this is only an example and may be a still image. Also, the acquired image may include at least one person and at least one object. In addition, the acquired image may include a movable animal or object other than a person.

In addition, the processor 130 may acquire an image composed of RGB information, but this is only an example; the processor 130 may acquire an image composed of depth information, and may obtain optical flow information and sound information through various sensors.

The processor 130 may process the acquired image through the image processing module 131. In particular, the processor 130 may perform image sampling on an acquired moving image through the image processing module 131. That is, the processor 130 may extract at least one image frame for behavior recognition from among a plurality of image frames included in the acquired moving image. Alternatively, the processor 130 may extract the image frames corresponding to a specific section from among the plurality of image frames included in the acquired moving image.
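The two sampling options described above can be illustrated with a minimal sketch. The function names `sample_uniform` and `sample_section`, and the choice of evenly spaced sampling, are assumptions made for illustration; the disclosure does not specify how frames are selected.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def sample_uniform(frames: Sequence[Frame], num_samples: int) -> List[Frame]:
    """Pick num_samples frames spread evenly across the moving image (assumed policy)."""
    if num_samples <= 0 or not frames:
        return []
    if num_samples >= len(frames):
        return list(frames)
    step = len(frames) / num_samples
    # Take the frame at the start of each equal-length bucket.
    return [frames[int(i * step)] for i in range(num_samples)]


def sample_section(frames: Sequence[Frame], start: int, end: int) -> List[Frame]:
    """Return the frames in the half-open index range [start, end)."""
    return list(frames[max(start, 0):end])


video = list(range(10))             # stand-in for 10 decoded image frames
print(sample_uniform(video, 3))     # [0, 3, 6]
print(sample_section(video, 4, 7))  # [4, 5, 6]
```

In practice the frames would be decoded images rather than integers; the integers here only make the selected indices easy to read.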

The processor 130 may acquire, through the affordance feature acquisition module 132, a first feature value corresponding to affordances of a plurality of regions included in an object from a sampled image frame. In this case, the affordance may indicate a characteristic of an action that a person can perform using the object or a characteristic inherent in the object. For example, the affordance of "scissors" may be "cutting," and the affordance of "handle of a cup" may be "graspable."

In particular, the affordance feature acquisition module 132 may acquire first feature values for a plurality of regions included in the object by using the first neural network 133. In this case, the first neural network 133 may be trained to obtain a feature value for affordances corresponding to each of a plurality of regions included in the object. For example, if the object is "scissors", the first neural network 133 may obtain, as a feature value for the "handle region of the scissors", a feature value corresponding to "gripping", the affordance corresponding to the handle region, and may obtain, as a feature value for the "blade region of the scissors", a feature value corresponding to "cutting", the affordance corresponding to the blade region. That is, even for one object, the first neural network 133 may acquire feature values for different affordances according to the regions constituting the object.

At this time, the first neural network 133 may be trained using affordance label data, in which an image of a learning object is matched with information on affordances for a plurality of regions of the learning object, and a knowledge database, in which images of general objects and a plurality of affordances of the general objects are matched and stored. This will be described later with reference to FIGS. 2 to 5.

The processor 130 may acquire a second feature value for the behavior of the person through the behavior feature acquisition module 134. In particular, the behavior feature acquisition module 134 may acquire the second feature value for the behavior of the person using the second neural network 135. In this case, the second neural network 135 is a neural network trained to acquire information on the behavior of a person, and may acquire a feature value corresponding to a class (or type) of the behavior of the person.

The processor 130 may recognize, through the behavior recognition module 136, the behavior of the person included in the image by using the first feature value acquired from the affordance feature acquisition module 132 and the second feature value acquired from the behavior feature acquisition module 134.

Specifically, the behavior recognition module 136 may acquire the third feature value from the first feature value acquired from the affordance feature acquisition module 132 and the second feature value acquired from the behavior feature acquisition module 134 . For example, the behavior recognition module 136 may obtain the third feature value by performing at least one of sum, concatenate, and pooling on the first feature value and the second feature value.
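As a rough sketch of the three combination options named above, the following assumes that the first and second feature values are plain lists of floats and that "pooling" means element-wise max pooling; both are assumptions, since the disclosure does not fix the representation or the pooling type. The function names are hypothetical.

```python
from typing import List

Vector = List[float]


def fuse_sum(first: Vector, second: Vector) -> Vector:
    """Element-wise sum; both feature values must have the same length."""
    if len(first) != len(second):
        raise ValueError("sum fusion needs equal-length vectors")
    return [a + b for a, b in zip(first, second)]


def fuse_concat(first: Vector, second: Vector) -> Vector:
    """Concatenation; the result has len(first) + len(second) entries."""
    return first + second


def fuse_max_pool(first: Vector, second: Vector) -> Vector:
    """Element-wise max pooling over the two feature values (assumed pooling type)."""
    if len(first) != len(second):
        raise ValueError("pooling fusion needs equal-length vectors")
    return [max(a, b) for a, b in zip(first, second)]


first_feature = [0.5, 1.0, -2.0]   # hypothetical affordance feature value
second_feature = [1.5, 0.0, 3.0]   # hypothetical behavior feature value

print(fuse_sum(first_feature, second_feature))       # [2.0, 1.0, 1.0]
print(fuse_concat(first_feature, second_feature))    # [0.5, 1.0, -2.0, 1.5, 0.0, 3.0]
print(fuse_max_pool(first_feature, second_feature))  # [1.5, 1.0, 3.0]
```

Sum and pooling keep the dimensionality of the inputs, while concatenation preserves both feature values intact at the cost of a larger input to the third neural network.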

Then, the behavior recognition module 136 may recognize the behavior of the person by inputting the third feature value into the third neural network 137. Here, the third neural network 137 is a neural network model trained to acquire information on the behavior of the person when it receives the third feature value, which is obtained from the first feature value acquired from the affordance feature acquisition module 132 and the second feature value acquired from the behavior feature acquisition module 134. The behavior recognition module 136 may acquire a final behavior recognition result through the third neural network 137.

According to an embodiment of the present disclosure, the first to third neural networks 133, 135, and 137 may be implemented as a convolutional neural network (CNN) model, but this is only an embodiment, and they may be implemented as at least one artificial neural network model among a deep neural network (DNN), a recurrent neural network (RNN), and a generative adversarial network (GAN).

That is, in the past, the behavior of a person was recognized using only the second neural network 135, but in the present invention, recognizing the behavior of a person using the first to third neural networks 133, 135, and 137 enables more accurate behavior recognition. For example, in the prior art, when a person's behavior is recognized using only the second neural network 135, the behavior recognition result "a person holds a knife" is derived both when the person is holding the blade and when the person is holding the knife handle. According to an embodiment of the present disclosure, however, the behavior recognition result "a person holds a knife" is derived when the person is holding the knife handle, whereas the behavior recognition result "a person is cut by a knife" can be derived when the person is holding the blade.

Hereinafter, a method of learning the first neural network 133 according to an embodiment of the present disclosure will be described with reference to FIGS. 2 to 6 .

FIG. 2 is a diagram for explaining a method of constructing a knowledge database used for learning the first neural network 133 of an electronic device, according to an embodiment of the present disclosure. According to an embodiment of the present disclosure, the knowledge database may be built by an external server, but this is only an embodiment, and the knowledge database may be built by the electronic device 100 .

First, the external server may acquire an image of the object (210). Then, the external server may analyze the acquired image to extract semantic features of the object included in the acquired image (230). In this case, the semantic features are semantic characteristics of the object in the acquired image, and may include, but are not limited to, object type (group), object use (use), object action (action), object property (property), object location (location), and association between objects.

Also, the external server may search for web pages or documents related to the object (220). The external server may analyze the affordances related to the object through the web pages or documents (240). Specifically, the external server may extract sentence components (eg, subject, predicate, object, etc.) from a large number of sentences included in the web pages or documents. In addition, the external server may obtain information about the object and the affordances describing the object from the extracted sentence components. For example, the external server may acquire "cutting", "gripping", etc. as affordances related to the object "scissors" through sentences included in the web pages or documents.

The external server may generate a knowledge graph based on the acquired image, semantic features, and affordances (250). Here, the knowledge graph is a graph indicating relationships between pieces of information. For example, as shown in FIG. 3, the knowledge graph may indicate that the scissors image 310 is associated with the affordances "cutting" 320 and "gripping" 330, that the pruning shears image 340 is associated with the affordances "cutting" 320, "gripping" 330, and "locking" 350, and that the utility knife image 360 is associated with the affordances "cutting" 320, "gripping" 330, "retracting" 370, and "blade" 380.

The external server may build a knowledge database through the acquired knowledge graph ( 260 ). In this case, the knowledge database may store information about a plurality of affordances of the object obtained based on the image of the general object and information related to the general object included in the web or document in the form of a knowledge graph. Meanwhile, the external server may build a knowledge database by associating a plurality of acquired knowledge graphs, and may build a knowledge database by expanding the acquired knowledge graphs.
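One simple way to picture the knowledge graph of FIG. 3 is as a mapping from each object to the set of affordances it is linked to. The sketch below assumes that representation; the names `build_graph` and `objects_sharing` are hypothetical, and object names stand in for the object images.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (object, affordance) edges taken from the FIG. 3 description.
EDGES: List[Tuple[str, str]] = [
    ("scissors", "cutting"), ("scissors", "gripping"),
    ("pruning shears", "cutting"), ("pruning shears", "gripping"),
    ("pruning shears", "locking"),
    ("utility knife", "cutting"), ("utility knife", "gripping"),
    ("utility knife", "retracting"), ("utility knife", "blade"),
]


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group affordances by object so each object node lists its linked affordances."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for obj, affordance in edges:
        graph[obj].add(affordance)
    return dict(graph)


def objects_sharing(graph: Dict[str, Set[str]], affordance: str) -> List[str]:
    """Return, in sorted order, every object linked to the given affordance."""
    return sorted(obj for obj, affs in graph.items() if affordance in affs)


graph = build_graph(EDGES)
print(objects_sharing(graph, "cutting"))  # ['pruning shears', 'scissors', 'utility knife']
print(objects_sharing(graph, "locking"))  # ['pruning shears']
```

A lookup such as `objects_sharing` is the kind of query that lets training find general objects with the same affordance as a labeled learning object.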

FIG. 4 is a diagram for explaining training of a first neural network, according to an embodiment of the present disclosure. In an embodiment of the present disclosure, the first neural network 133 may be trained by an external server, but this is only an embodiment, and the first neural network 133 may be trained by the electronic device 100.

Specifically, the external server may acquire a knowledge database that matches and stores an image of a general object and a plurality of affordances of the general object in the method described with reference to FIGS. 2 and 3 ( 410 ).

Also, the external server may acquire affordance label data for the learning object ( 420 ). In this case, the affordance label data may be data stored by matching the image of the learning object with information on affordances corresponding to a plurality of regions included in the learning object. In this case, the affordance information may be expressed in the form of an image, text, and a multidimensional vector.

In particular, when information on affordance is expressed as an image, a region having an affordance in the learning object may be expressed in the form of a bounding box as shown in FIG. 5A or in the form of a segmentation map as shown in FIG. 5B. For example, as shown in FIG. 5A, information on the affordances of “scissors” may be expressed by matching the first bounding box 510 including the blade region with the text information “cutting”, and by matching the second bounding box 520 including the handle region with the text information “gripping”. As another example, as shown in FIG. 5B, the information on the affordances of “scissors” may be expressed by matching the first segment region 530 including the blade region with the text information “cutting”, and by matching the second segment region 540 including the handle region with the text information “gripping”. In this case, the amount of affordance label data may be smaller than the amount of data included in the knowledge database.
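For illustration only, the bounding-box form of FIG. 5A could be stored as a small record per region. The class names, the image path, and the pixel coordinates below are hypothetical; only the pairing of a blade box with “cutting” and a handle box with “gripping” comes from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class RegionLabel:
    box: Box
    affordance: str


@dataclass
class AffordanceLabel:
    image_path: str
    regions: List[RegionLabel]


def affordance_at(label: AffordanceLabel, x: int, y: int) -> Optional[str]:
    """Return the affordance of the first labeled region containing point (x, y)."""
    for region in label.regions:
        x_min, y_min, x_max, y_max = region.box
        if x_min <= x <= x_max and y_min <= y <= y_max:
            return region.affordance
    return None


# Hypothetical label for a scissors image; coordinates are made up.
scissors_label = AffordanceLabel(
    image_path="scissors.png",
    regions=[
        RegionLabel(box=(0, 0, 100, 40), affordance="cutting"),     # blade region
        RegionLabel(box=(110, 0, 180, 60), affordance="gripping"),  # handle region
    ],
)

print(affordance_at(scissors_label, 20, 10))   # cutting
print(affordance_at(scissors_label, 150, 30))  # gripping
```

The segmentation-map form of FIG. 5B would replace each box with a per-pixel mask, but the matching of a region to affordance text stays the same.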

The external server may train the first neural network 133 using the knowledge database and the affordance label data (430). Specifically, the external server may input a general object image included in the knowledge database into the first neural network 133 to acquire feature values for a plurality of regions included in the general object. Also, the external server may input the learning object image included in the affordance label data to the first neural network 133 to obtain feature values for a plurality of regions included in the learning object. In this case, the external server may train the first neural network 133 so that the feature value output when the image of the learning object of the affordance label data is input to the first neural network 133 and the feature value output when a general object image that is included in the knowledge database and has the same affordance as the learning object is input to the first neural network 133 fall within a threshold range of each other. Here, being within the threshold range means that the two feature values are the same or similar.

For example, as shown in FIG. 6, the external server may train the first neural network 133 so that the feature value obtained when an image 610 of the learning object “scissors” is input to the first neural network 133 and the feature values obtained when images 620, 630, and 640 of “full length scissors”, “knife”, and “chainsaw”, which have the same affordance (ie, cutting) as the “scissor blade region”, are input to the first neural network 133 become the same or similar. That is, the first neural network 133 may be trained so that feature values of objects including regions having the same affordance become similar to each other. Through this, the first neural network 133 may be trained to acquire, from an image, information on affordances corresponding to each of a plurality of regions constituting an object.
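The "within a threshold range" objective can be sketched as a penalty that is zero when two feature values are close enough and grows with their distance otherwise. The use of Euclidean distance, the hinge form, and the name `within_threshold_loss` are assumptions for illustration; the disclosure does not state the distance measure or the loss function.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length feature values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def within_threshold_loss(anchor: Vector, same_affordance: List[Vector],
                          threshold: float) -> float:
    """Mean hinge penalty on how far each same-affordance feature lies beyond threshold.

    anchor: feature value of the labeled learning-object region (eg, scissor blade).
    same_affordance: feature values of general objects sharing that affordance.
    """
    if not same_affordance:
        raise ValueError("need at least one same-affordance feature value")
    penalties = [max(0.0, euclidean(anchor, f) - threshold) for f in same_affordance]
    return sum(penalties) / len(penalties)


blade = [0.0, 0.0]                  # hypothetical feature value for the scissor blade
others = [[3.0, 4.0], [0.5, 0.0]]   # hypothetical features of other cutting objects
print(within_threshold_loss(blade, others, threshold=1.0))  # 2.0
```

Here the first general object lies at distance 5.0 and is penalized by 4.0, while the second lies inside the threshold and contributes nothing, so minimizing the loss pulls only the outlying feature value toward the anchor.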

Through the method described above, the external server can train, using a small amount of affordance label data, the first neural network 133 capable of acquiring affordance information for a plurality of regions included in an object.

Meanwhile, in the above-described embodiment, a method of recognizing a person's behavior using the first to third neural networks has been described, but this is only an example, and the person's behavior may be recognized using the first neural network and an object recognition model. Hereinafter, a method of recognizing a person's behavior using a first neural network and an object recognition model will be described with reference to FIGS. 7 and 8.

FIG. 7 is a block diagram illustrating a configuration of an electronic device according to an embodiment of the present disclosure. The electronic device 700 includes a camera 710, a memory 720, and a processor 730. Meanwhile, since the camera 710 and the memory 720 have the same configuration as the camera 110 and the memory 120 described with reference to FIG. 1, overlapping descriptions will be omitted.

Specifically, the processor 730 may acquire an image. In particular, the processor 730 may acquire an image through the camera 710 , but this is only an exemplary embodiment and may acquire the image externally (eg, an external device, an external server, etc.). In this case, the acquired image may be a moving image, but this is only an example and may be a still image. Also, the acquired image may include at least one person and at least one object. In addition, the acquired image may include a movable animal or object other than a person. For example, the processor 730 may acquire an image including a human hand 810 , a knife 820 , and an apple 830 as shown in FIG. 8 .

The processor 730 may process the acquired image through the image processing module 731. In particular, the processor 730 may perform image sampling on an acquired moving image through the image processing module 731.

The processor 730 may acquire a first feature value corresponding to affordances of a plurality of regions included in an object from an image frame sampled through the affordance feature acquisition module 732 . In particular, the affordance feature acquisition module 732 may acquire first feature values for a plurality of regions included in the object by using the first neural network 733 .

For example, when an image as shown in FIG. 8 is input, the first neural network 733 may obtain, as a feature value for the “handle region of the scissors”, a feature value corresponding to “gripping”, the affordance corresponding to the “handle region of the scissors”, and may obtain, as a feature value for the “scissors blade region”, a feature value corresponding to “cutting”, the affordance corresponding to the “scissors blade region”. In addition, the affordance feature acquisition module 732 may acquire “grasp” and “cut” as affordances of the knife through the feature values acquired through the first neural network 733.

The processor 730 may acquire information on the objects (including a person) included in the image by using the object information acquisition module 734. In particular, the object information acquisition module 734 may input the acquired image to the fourth neural network 735 to acquire information on the person and the object included in the image. In this case, the fourth neural network 735 may be a neural network model trained to recognize a person or an object. For example, when the image shown in FIG. 8 is input, the object information acquisition module 734 may acquire information on the human hand 810, the knife 820-1 and 820-2, and the apple 830 as an object recognition result.

The processor 730 may recognize, through the behavior recognition module 736, the behavior of the person included in the image by using the information on the affordances acquired through the affordance feature acquisition module 732 and the information on the objects acquired through the object information acquisition module 734. In detail, the behavior recognition module 736 may recognize the behavior of the person by using the relationship between each region of the object and the person or other object related to that region. In this case, the behavior recognition module 736 may determine the relationship based on the positional relationship between the object and the person or other object.

For example, suppose that “grasp” is obtained as information on the affordance of the knife handle and “cut” is obtained as information on the affordance of the knife blade through the affordance feature acquisition module 732, and that information on the human hand 810, the knife 820-1, 820-2, and the apple 830 is obtained as information on the objects. In this case, the behavior recognition module 736 may obtain a behavior recognition result of “a person holds a knife” based on the affordance and the person related to the knife handle, and a behavior recognition result of “cutting an apple with a knife” based on the affordance and the other object related to the knife blade. Accordingly, the behavior recognition module 736 may output “a person holds a knife and cuts an apple with a knife” as a final behavior recognition result.
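The following is a minimal sketch of how region affordances and recognized objects could be combined through a positional relationship. It assumes each region and each recognized entity comes with an axis-aligned bounding box and that "related" means the boxes overlap; the disclosure does not define the positional test. All names (`Region`, `Entity`, `boxes_overlap`, `recognize_behavior`) and all coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


@dataclass
class Region:
    name: str        # e.g. "knife handle"
    affordance: str  # e.g. "grasp", taken from the first feature value
    box: Box


@dataclass
class Entity:
    name: str        # e.g. "hand" or "apple", from the object recognizer
    box: Box


def boxes_overlap(a: Box, b: Box) -> bool:
    """Assumed positional relationship: the two boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def recognize_behavior(regions: List[Region],
                       entities: List[Entity]) -> List[Tuple[str, str, str]]:
    """Pair each region's affordance with every entity positioned on it."""
    results = []
    for region in regions:
        for entity in entities:
            if boxes_overlap(region.box, entity.box):
                results.append((entity.name, region.affordance, region.name))
    return results


# Made-up coordinates loosely following the hand / knife / apple example.
regions = [
    Region("knife handle", "grasp", (0.0, 0.0, 4.0, 2.0)),
    Region("knife blade", "cut", (4.0, 0.0, 10.0, 2.0)),
]
entities = [
    Entity("hand", (1.0, 0.0, 3.0, 3.0)),
    Entity("apple", (8.0, 1.0, 12.0, 5.0)),
]
result = recognize_behavior(regions, entities)
# result == [("hand", "grasp", "knife handle"), ("apple", "cut", "knife blade")]
```

Each resulting triple corresponds to one partial recognition result, which a later step could compose into a sentence such as the final result described above.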

FIG. 9 is a diagram for explaining a method of controlling an electronic device according to an embodiment of the present disclosure.

First, the electronic device 100 may obtain an image including a person and an object (S910). In this case, the electronic device 100 may acquire the image through the camera, but this is only an example, and the electronic device 100 may instead receive the image from an external device or an external server.

The electronic device 100 may obtain a first feature value for affordances corresponding to a plurality of regions of the object included in the image by inputting the acquired image to the first neural network (S920). In this case, the first neural network may be trained to acquire feature values for affordances corresponding to each of a plurality of regions included in an object. In particular, the first neural network may be trained using affordance label data, in which an image of a learning object is matched with information on affordances for a plurality of regions of the learning object, and a knowledge database, which matches and stores an image of a general object and a plurality of affordances for the general object.

The electronic device 100 may recognize the behavior of the person using the object based on the acquired first feature value (S930). As an embodiment, the electronic device 100 may acquire a second feature value corresponding to the behavior of the person by inputting the acquired image to a second neural network trained to acquire information on the behavior of a person. The electronic device 100 may then obtain a third feature value based on the first feature value and the second feature value, and may recognize the behavior of the person by inputting the third feature value to a third neural network trained to recognize the behavior of a person. As another example, the electronic device 100 may recognize a person or another object related to the object by inputting the image to a fourth neural network trained to recognize objects, and may recognize the behavior of the person by using the affordance corresponding to the first feature value and the recognized person or other object.
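The first embodiment of step S930 can be sketched as follows. The sketch assumes that the third feature value is formed by concatenating the first and second feature values and that the third neural network is reduced to a single linear scoring layer followed by an arg-max; neither choice is stated in the disclosure. The function names, weights, and behavior labels are hypothetical.

```python
from typing import List, Sequence


def fuse_features(first: Sequence[float], second: Sequence[float]) -> List[float]:
    """Build the third feature value from the first and second ones.

    Concatenation is an assumption; the disclosure only says the third
    feature value is obtained "based on" the other two.
    """
    return list(first) + list(second)


def classify_behavior(third: Sequence[float],
                      weights: Sequence[Sequence[float]],
                      labels: Sequence[str]) -> str:
    """Stand-in for the third neural network: linear scores, then arg-max."""
    if len(weights) != len(labels):
        raise ValueError("one weight row is required per label")
    scores = []
    for row in weights:
        if len(row) != len(third):
            raise ValueError("weight row length must match the feature length")
        scores.append(sum(w * x for w, x in zip(row, third)))
    return labels[scores.index(max(scores))]


# Hypothetical values for illustration only.
first_feature = [1.0, 0.0]    # affordance feature from the first network
second_feature = [0.0, 1.0]   # behavior feature from the second network
third_feature = fuse_features(first_feature, second_feature)

weights = [[1.0, 0.0, 0.0, 1.0],
           [0.0, 1.0, 1.0, 0.0]]
labels = ["cutting", "pouring"]
behavior = classify_behavior(third_feature, weights, labels)
```

Concatenation keeps both sources of evidence intact and leaves it to the third network to learn how to weigh them; other fusion schemes (for example, element-wise operations) would fit the same description.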

As described above, by recognizing the behavior of a person based on affordances for a plurality of regions of the object, the electronic device can more accurately recognize the behavior of the person.

FIG. 10 is a block diagram for describing in detail the configuration of an electronic device according to an embodiment of the present disclosure. As shown in FIG. 10, the electronic device 1000 according to the present disclosure may include a display 1010, a speaker 1020, a camera 1030, a memory 1040, a communication interface 1050, an input interface 1060, a sensor 1070, and a processor 1080. However, such a configuration is an example, and it goes without saying that, in carrying out the present disclosure, a new component may be added to this configuration or some components may be omitted. Meanwhile, the camera 1030, the memory 1040, and the processor 1080 have the same configuration as the camera 110, the memory 120, and the processor 130 described with reference to FIG. 1, and thus overlapping descriptions are omitted.

The display 1010 may display an image captured by the camera 1030. Also, the display 1010 may display a bounding box surrounding a document whose shape is deformed in the captured image. In addition, the display 1010 may display a UI for receiving a user command for recognizing a person's behavior.

Meanwhile, the display 1010 may be implemented as a liquid crystal display (LCD) panel, organic light emitting diodes (OLED), or the like, and in some cases the display 1010 may be implemented as a flexible display, a transparent display, or the like. However, the display 1010 according to the present disclosure is not limited to a specific type.

The speaker 1020 may output a voice message. In particular, the speaker 1020 may be included in the electronic device 1000, but this is only an example, and the speaker 1020 may be located outside the electronic device 1000 and electrically connected to it. In this case, the speaker 1020 may output a voice message announcing the result of recognizing the behavior included in the captured image.

The communication interface 1050 includes a circuit and may communicate with an external device. Specifically, the processor 1080 may receive various data or information from an external device connected through the communication interface 1050 and may transmit various data or information to the external device.

The communication interface 1050 may include at least one of a WiFi module, a Bluetooth module, a wireless communication module, and an NFC module. Specifically, the WiFi module and the Bluetooth module may perform communication using a WiFi method and a Bluetooth method, respectively. When a WiFi module or a Bluetooth module is used, various types of connection information such as an SSID may first be transmitted and received, and after a communication connection is established using this information, various types of information may be transmitted and received.

In addition, the wireless communication module may perform communication according to various communication standards such as IEEE, Zigbee, 3rd Generation (3G), 3rd Generation Partnership Project (3GPP), Long Term Evolution (LTE), 5th Generation (5G), and the like. In addition, the NFC module may perform communication using a Near Field Communication (NFC) method using a 13.56 MHz band among various RF-ID frequency bands such as 135 kHz, 13.56 MHz, 433 MHz, 860 to 960 MHz, and 2.45 GHz.

In particular, according to various embodiments of the present disclosure, the communication interface 1050 may receive various types of information, such as data related to the first to fourth neural networks 133, 135, 137, and 735, from an external device. Also, the communication interface 1050 may receive an image including a person and an object from an external terminal or server.

The input interface 1060 includes a circuit, and the processor 1080 may receive a user command for controlling the operation of the electronic device 1000 through the input interface 1060. Specifically, the input interface 1060 may be implemented as a touch screen included in the display 1010, but this is only an example, and the input interface 1060 may be composed of a button, a microphone, a remote control signal receiver (not shown), and the like.

In particular, in various embodiments of the present disclosure, the input interface 1060 may receive various user commands, such as a user command for executing a camera application, a user command for capturing an image, and a user command for recognizing a person's behavior through the UI.

The sensor 1070 may acquire various information related to the electronic device 1000. In particular, the sensor 1070 may include various sensors, such as a GPS capable of acquiring location information of the electronic device 1000, a biometric sensor (e.g., a heart rate sensor, a PPG sensor, etc.) for acquiring biometric information of a user using the electronic device 1000, and a motion sensor for detecting the motion of the electronic device 1000.

Meanwhile, the functions related to the neural network model as described above may be performed through a memory and a processor. The processor may consist of one or a plurality of processors. In this case, the one or the plurality of processors may be general-purpose processors such as CPUs and APs, graphics-dedicated processors such as GPUs and VPUs, or artificial intelligence-dedicated processors such as NPUs. The one or more processors control input data to be processed according to a predefined operation rule or artificial intelligence model stored in the non-volatile memory and the volatile memory. The predefined operation rule or artificial intelligence model is characterized in that it is created through learning.

Here, being made through learning means that a predefined operation rule or artificial intelligence model of a desired characteristic is created by applying a learning algorithm to a plurality of learning data. Such learning may be performed in the device itself on which artificial intelligence according to the present disclosure is performed, or may be performed through a separate server/system.

The artificial intelligence model may be composed of a plurality of neural network layers. Each layer has a plurality of weight values, and the operation of a layer is performed through an operation between the operation result of the previous layer and the plurality of weight values. Examples of neural networks include a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), a Recurrent Neural Network (RNN), a Restricted Boltzmann Machine (RBM), a Deep Belief Network (DBN), a Bidirectional Recurrent Deep Neural Network (BRDNN), a Generative Adversarial Network (GAN), and a Deep Q-Network, and the neural network in the present disclosure is not limited to the above-described examples unless otherwise specified.
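The layer operation described above can be illustrated with a minimal sketch of a fully connected layer. The sketch assumes a weighted sum of the previous layer's result plus a bias, followed by a ReLU activation; the bias term and the activation are assumptions for illustration, since the disclosure only states that each layer combines the previous result with its weight values. `dense_layer` and all numeric values are hypothetical.

```python
from typing import List, Sequence


def dense_layer(prev_output: Sequence[float],
                weights: Sequence[Sequence[float]],
                biases: Sequence[float]) -> List[float]:
    """Combine the previous layer's result with this layer's weight values.

    Each output is ReLU(weighted sum + bias); bias and ReLU are assumed.
    """
    outputs = []
    for row, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(row, prev_output)) + bias
        outputs.append(max(0.0, weighted_sum))
    return outputs


# Two stacked layers: the second consumes the result of the first.
hidden = dense_layer([1.0, 2.0], [[0.5, -1.0], [1.0, 1.0]], [0.0, 0.5])
output = dense_layer(hidden, [[2.0, 1.0]], [0.0])
```

Stacking such layers, so that each one takes the previous layer's result as input, gives the layered structure that the paragraph above refers to.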

The learning algorithm is a method of training a predetermined target device (e.g., a robot) using a plurality of learning data so that the predetermined target device can make a decision or a prediction by itself. Examples of the learning algorithm include supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning, and the learning algorithm in the present disclosure is not limited to the above-described examples unless otherwise specified.

The device-readable storage medium may be provided in the form of a non-transitory storage medium. Here, 'non-transitory storage medium' only means that the medium is a tangible device and does not contain a signal (e.g., an electromagnetic wave); the term does not distinguish between a case where data is stored semi-permanently in the storage medium and a case where data is stored temporarily. For example, the 'non-transitory storage medium' may include a buffer in which data is temporarily stored.

According to an embodiment, the method according to various embodiments disclosed in this document may be provided by being included in a computer program product. Computer program products may be traded between sellers and buyers as commodities. The computer program product may be distributed in the form of a device-readable storage medium (e.g., compact disc read only memory (CD-ROM)), or may be distributed (e.g., downloaded or uploaded) online through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product (e.g., a downloadable app) may be at least temporarily stored in, or temporarily created in, a machine-readable storage medium such as a memory of a manufacturer's server, a server of an application store, or a relay server.

As described above, each of the components (e.g., a module or a program) according to various embodiments of the present disclosure may be composed of a single entity or a plurality of entities, and some of the above-described sub-components may be omitted, or other sub-components may be further included in various embodiments. Alternatively or additionally, some components (e.g., a module or a program) may be integrated into a single entity that performs the same or similar functions as those performed by each corresponding component prior to integration.

According to various embodiments, operations performed by a module, program, or other component may be executed sequentially, in parallel, repetitively, or heuristically, or at least some operations may be executed in a different order or omitted, or other operations may be added.

Meanwhile, the term “unit” or “module” used in the present disclosure includes a unit composed of hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, part, or circuit. A “unit” or “module” may be an integrally formed component, or a minimum unit or a part thereof that performs one or more functions. For example, the module may be configured as an application-specific integrated circuit (ASIC).

Various embodiments of the present disclosure may be implemented as software including instructions stored in a machine-readable storage medium readable by a machine (e.g., a computer). The machine is a device capable of calling the stored instructions from the storage medium and operating according to the called instructions, and may include the electronic device (e.g., the electronic device 100) according to the disclosed embodiments.

When the instruction is executed by the processor, the processor may perform a function corresponding to the instruction directly, or by using other components under the control of the processor. Instructions may include code generated by a compiler or code executable by an interpreter.

In the above, preferred embodiments of the present disclosure have been illustrated and described, but the present disclosure is not limited to the specific embodiments described above, and various modifications may be made by those of ordinary skill in the technical field to which the disclosure pertains without departing from the gist of the disclosure as claimed in the claims.

Claims

1. A method for controlling an electronic device, the method comprising:

obtaining an image including a person and an object;

obtaining a first feature value for affordances corresponding to a plurality of regions of the object included in the image by inputting the obtained image into a first neural network trained to obtain feature values for affordances respectively corresponding to a plurality of regions included in an object; and

recognizing a behavior of the person using the object based on the obtained first feature value.
2. The method of claim 1, further comprising:

obtaining a second feature value corresponding to the behavior of the person by inputting the obtained image into a second neural network trained to obtain information on a behavior of a person,

wherein the recognizing comprises recognizing the behavior of the person using the object based on the first feature value and the second feature value.
3. The method of claim 2, wherein the recognizing comprises:

obtaining a third feature value based on the first feature value and the second feature value; and

recognizing the behavior of the person by inputting the third feature value into a third neural network trained to recognize a behavior of a person.
4. The method of claim 1, further comprising:

recognizing a person or another object related to the object by inputting the image into a fourth neural network trained to recognize objects,

wherein the recognizing of the behavior of the person comprises recognizing the behavior of the person using an affordance corresponding to the first feature value and the recognized person or other object.
5. The method of claim 1, wherein the first neural network is trained using affordance label data, in which an image of a learning object is matched with information on affordances for a plurality of regions of the learning object, and a knowledge database, which matches and stores an image of a general object and a plurality of affordances for the general object.
6. The method of claim 5, wherein the first neural network is trained such that a feature value output when an image of a learning object of the affordance label data is input to the first neural network and a feature value output when an image of a general object, which is included in the knowledge database and has the same affordance as the learning object, is input to the first neural network fall within a threshold range.
7. The method of claim 5, wherein the knowledge database stores, in the form of a knowledge graph, information on a plurality of affordances of the general object obtained based on the image of the general object and on information related to the general object included in a web or a document.
8. The method of claim 5, wherein the number of affordance label data is less than the number of data included in the knowledge database.
9. An electronic device comprising:

a memory storing at least one instruction; and

a processor,

wherein the processor is configured to, by executing the at least one instruction:

obtain an image including a person and an object;

obtain a first feature value for affordances corresponding to a plurality of regions of the object included in the image by inputting the obtained image into a first neural network trained to obtain feature values for affordances respectively corresponding to a plurality of regions included in an object; and

recognize a behavior of the person using the object based on the obtained first feature value.
10. The electronic device of claim 9, wherein the processor is configured to:

obtain a second feature value corresponding to the behavior of the person by inputting the obtained image into a second neural network trained to obtain information on a behavior of a person; and

recognize the behavior of the person using the object based on the first feature value and the second feature value.
11. The electronic device of claim 10, wherein the processor is configured to:

obtain a third feature value based on the first feature value and the second feature value; and

recognize the behavior of the person by inputting the third feature value into a third neural network trained to recognize a behavior of a person.
12. The electronic device of claim 9, wherein the processor is configured to:

recognize a person or another object related to the object by inputting the image into a fourth neural network trained to recognize objects; and

recognize the behavior of the person using an affordance corresponding to the first feature value and the recognized person or other object.
13. The electronic device of claim 9, wherein the first neural network is trained using affordance label data, in which an image of a learning object is matched with information on affordances for a plurality of regions of the learning object, and a knowledge database, which matches and stores an image of a general object and a plurality of affordances for the general object.
14. The electronic device of claim 13, wherein the first neural network is trained such that a feature value output when an image of a learning object of the affordance label data is input to the first neural network and a feature value output when an image of a general object, which is included in the knowledge database and has the same affordance as the learning object, is input to the first neural network fall within a threshold range.
15. The electronic device of claim 14, wherein the knowledge database stores, in the form of a knowledge graph, information on a plurality of affordances of the general object obtained based on the image of the general object and on information related to the general object included in a web or a document.