WO2018163555A1 - Image processing device, image processing method, and image processing program - Google Patents

Image processing device, image processing method, and image processing program Download PDF

Info

Publication number
WO2018163555A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
image
peripheral
posture
unit
Prior art date
Application number
PCT/JP2017/045011
Other languages
French (fr)
Japanese (ja)
Inventor
宏 大和
義満 青木
鈴木 智之
Original Assignee
コニカミノルタ株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by コニカミノルタ株式会社 filed Critical コニカミノルタ株式会社
Publication of WO2018163555A1 publication Critical patent/WO2018163555A1/en

Classifications

    • A - HUMAN NECESSITIES
    • A61 - MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B - DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00 - Measuring for diagnostic purposes; Identification of persons
    • A61B 5/103 - Detecting, measuring or recording devices for testing the shape, pattern, colour, size or movement of the body or parts thereof, for diagnostic purposes
    • A61B 5/107 - Measuring physical dimensions, e.g. size of the entire body or parts thereof
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G - PHYSICS
    • G08 - SIGNALLING
    • G08B - SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B 25/00 - Alarm systems in which the location of the alarm condition is signalled to a central station, e.g. fire or police telegraphic systems

Definitions

  • the present disclosure relates to an image processing device, an image processing method, and an image processing program.
  • Conventionally, techniques for recognizing a human action from an acquired image are known.
  • Persons whose behavior is to be recognized include, for example, elderly people and their caregivers, considering mechanisms for recognizing the living conditions of elderly people, and the elderly themselves, in the field of elderly-care monitoring.
  • Specifically, for an elderly person, the behaviors to be recognized include, for example, basic activities of daily living such as going to bed, getting up, leaving the bed, sitting down, squatting, walking, eating, using the toilet, going out, and picking things up, as well as behaviors that occur in accidents such as tumbles and falls.
  • Many of these behaviors can be recognized by capturing changes in the posture of the person. For example, the action of going to bed may consist of a person walking up to the bed, sitting down once, and then lying down. In this case, the posture of the person changes in the order of standing, sitting, and lying. In order to recognize such behavior, it is important to recognize the posture accurately.
  • An example of a technology for recognizing behavior is a technology for estimating a human joint position from an acquired image.
  • the posture of a person is estimated from the relationship between the estimated joint positions, and the behavior of the person is recognized from changes in the estimated posture and position of the person.
  • Non-Patent Document 1 discloses a technique for estimating a human posture using a convolutional neural network (Convolutional Neural Network: hereinafter, “CNN”).
  • Non-Patent Document 2 discloses a technique for estimating human behavior using a recurrent neural network (hereinafter referred to as “RNN”).
  • Patent Document 1 discloses a technique for performing action recognition on a rule basis based on the positional relationship between a human posture estimated from an image and object information.
  • For example, as in the prior art of Patent Document 1, a conceivable method is to specify the object to be monitored in advance and perform rule-based action recognition using the positional relationship between the monitored object and the person.
  • Alternatively, a method that uses a convolutional neural network or the like to extract the features of surrounding objects, in the same way as human posture features, is also conceivable.
  • However, while any of these methods can easily recognize actions under conditions where the type, shape, position, and appearance of the objects to be noted are fixed, in environments where these vary, the number of patterns to be recognized becomes enormous, leading to erroneous recognition and an increased processing load. The amount of data that must be prepared in advance also becomes enormous.
  • the present disclosure has been made in view of the above-described problems, and an object thereof is to provide an image processing device, an image processing method, and an image processing program that enable action recognition with higher accuracy.
  • An image processing device according to a main aspect of the present disclosure includes:
  • an image acquisition unit that acquires an image generated by an imaging device;
  • a human body feature extraction unit that extracts a posture feature of a person shown in the image;
  • a peripheral feature extraction unit that extracts a peripheral feature indicating the shape, position, or type of an object around the person shown in the image;
  • a peripheral feature filter unit that filters the peripheral feature based on the posture feature and the importance of the peripheral feature set in association with the posture feature; and
  • a behavior determination unit that estimates an action class of the person shown in the image based on the posture feature and the peripheral feature filtered by the peripheral feature filter unit.
  • In another aspect, an image processing method includes: acquiring an image generated by an imaging device; extracting a posture feature of a person shown in the image; extracting a peripheral feature indicating the shape, position, or type of an object around the person shown in the image; filtering the peripheral feature based on the posture feature and the importance of the peripheral feature set in association with the posture feature; and estimating an action class of the person shown in the image based on the posture feature and the filtered peripheral feature.
  • In yet another aspect, an image processing program causes a computer to execute: a process of acquiring an image generated by an imaging device; a process of extracting a posture feature of a person shown in the image; a process of extracting a peripheral feature indicating the shape, position, or type of an object around the person shown in the image; a process of filtering the peripheral feature based on the posture feature and the importance of the peripheral feature set in association with the posture feature; and a process of estimating an action class of the person shown in the image based on the posture feature and the filtered peripheral feature.
  • the image processing apparatus enables more accurate action recognition.
  • FIG. 1 is a diagram illustrating an example of an action recognition system according to the embodiment.
  • FIG. 2 is a diagram illustrating an example of a hardware configuration of the image processing apparatus according to the embodiment.
  • FIG. 3 is a diagram illustrating an example of functional blocks of the image processing apparatus according to the embodiment.
  • FIG. 4 is a diagram illustrating an example of each configuration of the image processing apparatus according to the embodiment.
  • FIG. 5 is a diagram illustrating an example of a human region in an image detected by the human region detection unit according to the embodiment.
  • FIG. 6 is a diagram illustrating an example of posture feature data extracted by the human body feature extraction unit according to the embodiment.
  • FIG. 7 is a diagram illustrating filtering processing of the peripheral feature filter unit according to the embodiment.
  • FIG. 8 is a diagram illustrating an example of the configuration of the hierarchical LSTM of the behavior determination unit according to the embodiment.
  • FIG. 9 is a diagram illustrating the learning process of the learning unit according to the embodiment.
  • FIG. 10 is a flowchart of operations performed by the image processing apparatus according to the embodiment.
  • FIG. 11 is a flowchart of operations performed by the image processing apparatus according to the embodiment.
  • FIG. 12 is a flowchart of operations performed by the image processing apparatus according to the embodiment.
  • FIGS. 13A, 13B, and 13C are diagrams schematically illustrating each process of image processing performed by the image processing apparatus according to the embodiment.
  • FIGS. 14A, 14B, and 14C are diagrams schematically illustrating each process of image processing performed by the image processing apparatus according to the embodiment.
  • FIG. 1 is a diagram illustrating an example of an action recognition system according to the present embodiment.
  • the action recognition system includes an image processing device 100, an imaging device 200, and a communication network 300.
  • the imaging device 200 is, for example, a general camera or a wide-angle camera, and generates image data by performing AD conversion on an image signal generated by an imaging element of the camera.
  • the imaging apparatus 200 according to the present embodiment is configured to continuously generate image data in units of frames and to capture a moving image (hereinafter also referred to as “moving image data”). Note that the imaging device 200 is installed at an appropriate position in the room so that the person B1 to be recognized for action is reflected in the image.
  • the imaging apparatus 200 transmits moving image data to the image processing apparatus 100 via the communication network 300.
  • the image processing apparatus 100 is an apparatus that determines the behavior of the person B1 shown in the image based on the moving image data generated by the imaging apparatus 200 and outputs the result.
  • FIG. 2 is a diagram illustrating an example of a hardware configuration of the image processing apparatus 100 according to the present embodiment.
  • The image processing apparatus 100 is a computer including, as main components, a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, an external storage device (for example, a flash memory) 104, a communication interface 105, and the like.
  • Each function of the image processing apparatus 100 described later is realized, for example, by the CPU 101 referring to a control program (for example, an image processing program) and various data (for example, learned network parameters) stored in the ROM 102, the RAM 103, the external storage device 104, and the like.
  • Part or all of the processing may be realized by a DSP (Digital Signal Processor) or by a dedicated hardware circuit, instead of or together with processing by software.
  • FIG. 3 is a diagram illustrating an example of functional blocks of the image processing apparatus 100 according to the present embodiment.
  • the image processing apparatus 100 includes an image acquisition unit 10, a human region detection unit 20, a human body feature extraction unit 30, a peripheral feature extraction unit 40, a peripheral feature filter unit 50, a behavior determination unit 60, and a learning unit 70.
  • the image acquisition unit 10 acquires image data D1 of an image (here, a moving image) generated by the imaging device 200.
  • the human area detection unit 20 detects a human area from the image of the image data D1.
  • the human body feature extraction unit 30 extracts the posture feature of the person shown in the image based on the image data D1 and the data D2 indicating the human region.
  • the peripheral feature extraction unit 40 extracts peripheral features of a human peripheral object shown in the image based on the image data D1 and the data D2 indicating the human region.
  • the peripheral feature filter unit 50 filters the peripheral feature data D4 based on the pose feature data D3 and the importance data Da of the peripheral features set in association with the pose feature.
  • the behavior discriminating unit 60 discriminates the behavior class of the person shown in the image based on the posture feature data D3 and the filtered peripheral feature data D4a, and outputs the result data D5.
  • the learning unit 70 performs a learning process based on the teacher data D6 so that network parameters (for example, a weighting factor and a bias of a neural network described later) are optimized.
  • the image processing apparatus 100 acquires moving image data D1 from the imaging apparatus 200, and the data D2 to D5 are continuously generated for each frame or at a plurality of frame intervals.
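  • As a minimal sketch of how the data flow of FIG. 3 could be wired in code, the following Python fragment mirrors the order in which the data D1 to D5 are produced. The class and method names (ActionRecognitionPipeline, next_frame, and so on) are illustrative assumptions and do not appear in the embodiment, and the time-series handling inside the behavior determination unit is omitted here.

```python
# Illustrative wiring of the functional blocks of FIG. 3 (names are assumptions,
# not part of the embodiment). Each component is assumed to expose a single call.
class ActionRecognitionPipeline:
    def __init__(self, acquirer, region_detector, body_extractor,
                 peripheral_extractor, peripheral_filter, behavior_classifier):
        self.acquirer = acquirer                          # image acquisition unit 10
        self.region_detector = region_detector            # human region detection unit 20
        self.body_extractor = body_extractor              # human body feature extraction unit 30
        self.peripheral_extractor = peripheral_extractor  # peripheral feature extraction unit 40
        self.peripheral_filter = peripheral_filter        # peripheral feature filter unit 50
        self.behavior_classifier = behavior_classifier    # behavior determination unit 60

    def process_frame(self):
        d1 = self.acquirer.next_frame()                   # image data D1
        d2 = self.region_detector.detect(d1)              # human region data D2
        d3 = self.body_extractor.extract(d1, d2)          # posture feature data D3
        d4 = self.peripheral_extractor.extract(d1, d2)    # peripheral feature data D4
        d4a = self.peripheral_filter.apply(d3, d4)        # filtered peripheral feature D4a
        d5 = self.behavior_classifier.classify(d3, d4a)   # action class result D5
        return d5
```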
  • FIG. 4 is a diagram illustrating an example of each configuration of the image processing apparatus 100 according to the present embodiment.
  • The arrows in FIG. 4 represent transmission and reception of data. An example of the operation of the image processing apparatus 100 will be described later with reference to FIGS. 10 to 12.
  • The image acquisition unit 10 acquires the moving image data D1 generated by the imaging device 200 and outputs it to the human region detection unit 20.
  • the image acquisition unit 10 may be configured to acquire the image data D1 stored in the external storage device 104 or the image data D1 provided via the Internet line or the like.
  • the human region detection unit 20 acquires the image data D1 from the image acquisition unit 10, performs predetermined arithmetic processing on the image data D1, and detects a human region including a person in the image. Then, the human region detection unit 20 outputs the detected human region data D2 to the human body feature extraction unit 30 and the peripheral feature extraction unit 40 together with the image data D1.
  • FIG. 5 is a diagram illustrating an example of a human region in an image detected by the human region detection unit 20.
  • In FIG. 5, T1 represents the entire area of the image, T2 represents the human region in the image, and T2a represents the peripheral region of the person in the image.
  • The method by which the human region detection unit 20 detects the human region T2 is arbitrary. For example, a difference image may be computed from the moving image of the image T1, and the human region T2 detected from the difference image.
  • The human region detection unit 20 may also use a learned neural network, template matching, a combination of HOG (Histogram of Oriented Gradients) features and an SVM (Support Vector Machine), or a method such as background subtraction.
  • the human region detection unit 20 may be integrated with the human body feature extraction unit 30.
  • the processing of the human region detection unit 20 may be executed in a series of processing when the human body feature extraction unit 30 detects the posture feature of the person shown in the image.
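  • As one hedged illustration of the detection alternatives listed above, the following sketch uses OpenCV's default pedestrian HOG + linear SVM detector to obtain a bounding box corresponding to the human region T2. This is not the implementation used in the embodiment, and keeping only the most confident detection is an assumption.

```python
# Sketch: human region detection with OpenCV's default HOG + linear SVM people
# detector (one of the alternatives mentioned above, not necessarily the one
# used in the embodiment).
import cv2
import numpy as np

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def detect_human_region(frame_bgr):
    """Return (x, y, w, h) of the most confident person box, or None."""
    rects, weights = hog.detectMultiScale(frame_bgr, winStride=(8, 8))
    if len(rects) == 0:
        return None
    i = int(np.argmax(weights))        # keep the most confident detection
    x, y, w, h = rects[i]
    return int(x), int(y), int(w), int(h)
```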
  • The human body feature extraction unit 30 acquires the image data D1 and the data D2 indicating the human region from the human region detection unit 20, performs predetermined arithmetic processing on the image of the human region T2, and extracts the posture feature of the person shown in the image. The human body feature extraction unit 30 then outputs the extracted posture feature data D3 to the peripheral feature filter unit 50 and the behavior determination unit 60.
  • The “posture feature of a person” is a feature extracted from the posture of the human body, such as walking or sitting.
  • The “posture feature of a person” is expressed by, for example, the joint positions of the human body, the positions of parts of the human body (for example, the head or the feet), the type of posture (for example, standing upright or bending forward), or temporal changes in these.
  • The data format for representing the “posture feature of a person” is arbitrary, such as a type format, a coordinate format, or relative positions between parts.
  • The “posture feature of a person” may also be abstracted data (for example, HOG feature values). Since the posture feature may be expressed by a type or the like rather than only by numerical feature values, the broader word “feature” is used here (the same applies to the peripheral feature).
  • FIG. 5 shows the joint positions of the human body as an example of the posture characteristics of the person.
  • the right ankle p0, right knee p1, right waist p2, left waist p3, left knee p4, left ankle p5, right wrist p6, right elbow p7, right shoulder p8, left shoulder p9, The left elbow p10, the left wrist p11, the neck p12, and the crown p13 are shown.
  • the human body feature extraction unit 30 extracts a human posture feature from an image using, for example, a learned CNN.
  • The CNN constituting the human body feature extraction unit 30 is, for example, one that has been trained using teacher data indicating the correspondence between an image of a human body and the coordinates (two-dimensional positions or estimated three-dimensional positions) of the joint positions of the human body in the image (such a network is generally also referred to as an R-CNN).
  • the human body feature extraction unit 30 includes, for example, a preprocessing unit 31, a convolution processing unit 32, an all coupling unit 33, and an all coupling unit 34.
  • the pre-processing unit 31 normalizes the image by cutting out the image T2 of the human area from the image T1 of the entire area and converting it into a predetermined size and aspect ratio based on the data D2 indicating the human area.
  • the preprocessing unit 31 may perform area setting according to the viewing distance of the human area or may perform color division processing.
  • the convolution processing unit 32 is configured by hierarchically connecting a plurality of feature quantity extraction layers.
  • the convolution processing unit 32 performs convolution operation processing, activation processing, and pooling processing on input data input from the previous layer in each feature amount extraction layer.
  • By repeating the processing in each feature quantity extraction layer in this way, the convolution processing unit 32 extracts, in a high dimension, feature quantities of a plurality of viewpoints included in the image (for example, edges, regions, and distributions), and outputs the result to the all coupling unit 33.
  • the total coupling portion 33 is constituted by, for example, a multilayer perceptron that fully couples a plurality of feature quantities (the same applies to all the other coupling portions 34, 43, 51, 54, and 63).
  • the fully combining unit 33 fully combines the plurality of intermediate calculation result data obtained from the convolution processing unit 32 to generate data D3 indicating the posture characteristics of the person. Then, the all combination unit 33 outputs data D3 indicating the posture characteristics of the person to the all combination unit 34 and the peripheral feature filter unit 50.
  • FIG. 6 is a diagram illustrating an example of posture feature data D3 extracted by the human body feature extraction unit 30.
  • the feature amount of each joint position of the human body is represented in 4096 dimensions.
  • the all coupling unit 34 is an output layer for coupling all the outputs of the all coupling unit 33 and outputting the data D3 indicating the posture characteristics of the person to the action determining unit 60.
  • By inputting the output of the all combination unit 33 to the peripheral feature filter unit 50, the human body feature extraction unit 30 provides sparse, high-dimensional features that are correlated with the learned posture feature.
  • Note that the data input to the peripheral feature filter unit 50 may instead be acquired from the all coupling unit 34.
  • For details, see Non-Patent Document 1, for example.
  • the posture feature extraction processing performed by the human body feature extraction unit 30 is not limited to the above method, and any method may be used.
  • the human body feature extraction unit 30 may use, for example, silhouette extraction processing, region division processing, skin color extraction processing, luminance gradient extraction processing, motion extraction processing, shape model fitting, or a combination thereof. Also, a method of performing extraction processing for each part of the human body and integrating them may be used.
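  • The following is a minimal PyTorch sketch of a convolution stack followed by two fully connected layers, in the spirit of the convolution processing unit 32 and the all coupling units 33 and 34. The layer sizes (including the 4096-dimensional feature of FIG. 6 and the 14 joints of FIG. 5) are assumptions; the network actually trained for the embodiment is not specified here.

```python
# Sketch (assumed layer sizes): CNN that maps a normalized human-region crop to a
# 4096-dimensional posture feature D3 and, through a second fully connected
# layer, to joint-position coordinates (14 joints x 2 coordinates, as in FIG. 5).
import torch
import torch.nn as nn

class PostureFeatureExtractor(nn.Module):
    def __init__(self, num_joints=14):
        super().__init__()
        self.conv = nn.Sequential(                   # "convolution processing unit 32"
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((7, 7)),
        )
        self.fc33 = nn.Linear(128 * 7 * 7, 4096)     # "all coupling unit 33" -> D3
        self.fc34 = nn.Linear(4096, num_joints * 2)  # "all coupling unit 34" (output layer)

    def forward(self, crop):                         # crop: (B, 3, H, W), normalized
        h = self.conv(crop).flatten(1)
        d3 = torch.relu(self.fc33(h))                # high-dimensional posture feature
        joints = self.fc34(d3)                       # joint coordinates (x, y) per joint
        return d3, joints
```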
  • The peripheral feature extraction unit 40 acquires the image data D1 and the data D2 indicating the human region from the human region detection unit 20, performs predetermined arithmetic processing on the image around the human region T2, and extracts peripheral features of the objects around the person shown in the image. The peripheral feature extraction unit 40 then outputs the extracted peripheral feature data D4 to the peripheral feature filter unit 50.
  • the “peripheral feature” represents the shape of the peripheral object existing around the person, the position of the peripheral object, the type of the peripheral object, or the like.
  • the data format for representing “peripheral features” is arbitrary, such as a type format, a coordinate format, or a relative position.
  • the “peripheral feature” may be abstracted data (for example, HOG feature amount).
  • the “peripheral feature” includes information indicating a positional relationship between the peripheral object and each part of the human body (for example, a positional relationship with the human hand).
  • the “peripheral feature” more preferably includes information indicating the type of the peripheral object (for example, bed, chair, etc.). This is because the information can be an important determinant when the action determination unit 60 determines a person's action.
  • the peripheral feature has a role of complementing the interaction between the person and the object.
  • the human behavior that is difficult to discriminate only by the human posture feature is complemented by the peripheral feature.
  • the peripheral feature extraction unit 40 extracts peripheral features from the image using CNN or the like, for example, in the same manner as the human body feature extraction unit 30.
  • The CNN constituting the peripheral feature extraction unit 40 is, for example, one that has been trained using teacher data indicating the correspondence between an image of an object and its shape, type, the position of each part, and the like. More preferably, one trained using images of the surroundings of a human body (including the human body itself) and teacher data indicating the correspondence between the shapes of objects and the coordinates of their positional relationships in the image is used.
  • the peripheral feature extraction unit 40 includes, for example, a preprocessing unit 41, a convolution processing unit 42, and an all combination unit 43.
  • Based on the image data D1 and the data D2 indicating the human region, the preprocessing unit 41 cuts out, with the human region T2 as a reference, the peripheral region image T2a, which is larger than the human region T2 (see FIG. 5). The preprocessing unit 41 then normalizes the image, for example by converting it to a predetermined size and aspect ratio.
  • the processing of the convolution processing unit 42 and the total combining unit 43 is as described above.
  • the convolution processing unit 42 extracts feature amounts (for example, edges, regions, distributions, etc.) of a plurality of viewpoints of peripheral features in a high dimension by convolution processing or the like.
  • the full combining unit 43 fully combines the plurality of intermediate calculation result data obtained from the convolution processing unit 42, and outputs peripheral features as final calculation result data.
  • the peripheral feature extraction unit 40 may use, for example, HOG feature amount extraction processing, silhouette extraction processing, region division processing, luminance gradient extraction processing, motion extraction processing, shape model fitting, or a combination thereof. Further, a method of performing feature extraction processing for each predetermined area and integrating them may be used.
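  • A hedged sketch of the cropping and normalization described above for the preprocessing units 31 and 41 (the human region T2 and the enlarged peripheral region T2a) is shown below. The enlargement ratio and the output size are assumptions, not values taken from the embodiment.

```python
# Sketch: crop the human region T2 and an enlarged peripheral region T2a from the
# whole image T1 and resize them to a fixed size (margin ratio and 224x224 output
# are assumptions, not values from the embodiment).
import cv2

def crop_regions(image_t1, human_box, margin=0.5, out_size=(224, 224)):
    ih, iw = image_t1.shape[:2]
    x, y, w, h = human_box                           # human region T2
    mx, my = int(w * margin), int(h * margin)
    px, py = max(0, x - mx), max(0, y - my)          # enlarged peripheral region T2a
    pw, ph = min(iw, x + w + mx) - px, min(ih, y + h + my) - py
    t2 = cv2.resize(image_t1[y:y + h, x:x + w], out_size)
    t2a = cv2.resize(image_t1[py:py + ph, px:px + pw], out_size)
    return t2, t2a
```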
  • The peripheral feature filter unit 50 acquires the posture feature data D3 from the human body feature extraction unit 30 and the peripheral feature data D4 from the peripheral feature extraction unit 40.
  • the peripheral feature filter unit 50 filters the peripheral features based on the importance level data Da of the peripheral features set in association with the posture feature. Then, the peripheral feature filter unit 50 outputs the filtered peripheral feature data D4a to the behavior determination unit 60.
  • The peripheral features to be noted are not necessarily the same for every human action.
  • The important peripheral feature elements and the unnecessary peripheral feature elements change depending on the posture that the person takes as a result of the action. For example, for some actions the peripheral features around the waist (for example, the rounded edge of a chair seat, the vertical edge of a bed, and the space behind the person) are important, whereas for other actions the peripheral features near the hand are important.
  • In view of this, in the image processing apparatus 100, the peripheral feature filter unit 50 filters the peripheral features extracted by the peripheral feature extraction unit 40.
  • importance means the position, shape, type, etc. of the peripheral feature to be noted.
  • importance makes it possible to narrow down actions that are likely to be related according to the posture characteristics of a person and to specify the position, shape, type, or the like of a peripheral object to be noted.
  • the data format for representing “importance” may be any format as long as it is uniquely converted from the posture feature data D3 and can filter the peripheral feature data D4.
  • More preferably, information on temporal changes in the posture feature of the person (for example, the difference from the posture feature a predetermined number of frames earlier) is also used for the “importance”. This makes it easier to narrow down the peripheral features to be focused on. For example, when a temporal change in a person's posture feature indicates that a hand is being extended in a certain direction, it can be predicted that the person is trying to pick up an object in that direction, so the behavior determination unit 60 described later can be configured to focus only on the features of that object.
  • the “importance” is set by machine learning (learning unit 70 to be described later) for each action class, for example, using the teacher data associating the posture feature with the peripheral feature to be noted.
  • the “importance” is not limited to machine learning, and may be set by the user or the like.
  • the peripheral feature filter unit 50 includes, for example, an input-side full coupling unit 51, an activation processing unit 52, a filtering unit 53, and an output-side full coupling unit 54.
  • The input-side full combining unit 51 performs full combining processing on the posture feature output from the full combining unit 33 of the human body feature extraction unit 30, thereby converting the posture feature data D3 (here, a 4096-dimensional feature vector) into the importance of the peripheral features. The input-side full combining unit 51 then outputs an importance vector having the same number of dimensions as the peripheral feature (here, 200 dimensions).
  • In the input-side full combining unit 51, the weighting factors applied to the posture feature vector are set by machine learning for each action class so that the posture feature can be converted into the importance of the peripheral features.
  • the activation processing unit 52 performs activation processing on the importance of the peripheral features output from the input-side full combining unit 51 using, for example, a ReLU function.
  • the ReLU function is a function that outputs 0 when a negative number is input and returns the input as it is when a number greater than 0 is input.
  • the degree of importance output from the activation processing unit 52 is expressed by, for example, the following expression (1).
  • The filtering unit 53 applies the importance of the peripheral features output from the activation processing unit 52 to the peripheral features output from the peripheral feature extraction unit 40, thereby filtering the peripheral features.
  • the peripheral feature output from the filtering unit 53 is expressed by the following equation (2), for example.
  • FIG. 7 is a diagram for explaining the filtering process of the peripheral feature filter unit 50.
  • the data Da related to the importance is represented as a vector that reacts only to a specific dimension vector. Accordingly, the feature amount of the peripheral feature after filtering (right diagram) is a sparse feature vector compared to the feature amount of the peripheral feature before filtering (left diagram).
  • the output-side all combining unit 54 further abstracts the filtered peripheral feature data D4a and outputs the data to the combining unit 60a of the behavior determining unit 60.
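  • Since expressions (1) and (2) are not reproduced in this text, the following PyTorch sketch only mirrors the verbal description of the peripheral feature filter unit 50: a fully connected layer maps the 4096-dimensional posture feature to a 200-dimensional importance vector, a ReLU keeps it non-negative, the importance is applied element-wise to the 200-dimensional peripheral feature so that the result becomes sparse, and a second fully connected layer abstracts the result. The element-wise multiplication is an assumption consistent with FIG. 7, not a reproduction of equation (2).

```python
# Sketch of the peripheral feature filter unit 50 (element-wise gating is an
# assumption consistent with FIG. 7; equations (1) and (2) are not reproduced here).
import torch
import torch.nn as nn

class PeripheralFeatureFilter(nn.Module):
    def __init__(self, posture_dim=4096, peripheral_dim=200):
        super().__init__()
        self.fc_in = nn.Linear(posture_dim, peripheral_dim)      # input-side full coupling unit 51
        self.fc_out = nn.Linear(peripheral_dim, peripheral_dim)  # output-side full coupling unit 54

    def forward(self, d3, d4):
        importance = torch.relu(self.fc_in(d3))  # activation processing unit 52 (ReLU)
        d4_filtered = d4 * importance            # filtering unit 53: gate the peripheral feature
        return self.fc_out(d4_filtered)          # abstracted filtered peripheral feature D4a
```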
  • the calculation method of the peripheral feature filter unit 50 is not limited to the above.
  • the peripheral feature filter unit 50 may associate the importance with the posture type on a one-to-one basis.
  • the behavior determination unit 60 acquires the human posture feature data D3 from the human body feature extraction unit 30, and acquires the filtered peripheral feature data D4a from the peripheral feature filter unit 50. Then, the action determination unit 60 determines the action class of the person shown in the image based on the time series data of the human posture feature data D3 and the filtered peripheral feature data D4a.
  • Since human behavior has temporal continuity and behaviors have deep chronological relationships with one another, when discriminating a human behavior class it is desirable to use not single image data for each frame but time-series data indicating temporal changes in the person's posture, the positional relationship with objects, and the like.
  • In discriminating a human behavior class, it is also desirable to consider not only the behavior between successive frames but also behavior in the somewhat distant past (for example, one minute earlier). This is because, for example, when determining the action of "getting up from the chair", data indicating that the action of "sitting on the chair" was performed in the past is also a big clue.
  • the behavior determination unit 60 performs time series analysis using a hierarchical LSTM (Long Short-Term Memory) which is a kind of recursive neural network.
  • the hierarchical LSTM can recognize a relationship in a long time interval (for example, one minute before) in addition to a relationship in a short time interval (for example, the immediately preceding image frame).
  • FIG. 8 is a diagram illustrating an example of the configuration of the hierarchical LSTM of the behavior determination unit 60.
  • In the behavior determination unit 60, the combining unit 60a combines the human posture feature data D3 acquired from the human body feature extraction unit 30 with the filtered peripheral feature data D4a acquired from the peripheral feature filter unit 50.
  • A(t), A(t-1), A(t-2), and A(t-3) in FIG. 8 represent the combined data of the posture feature data D3 and the peripheral feature data D4a at the respective times.
  • The combining unit 60a combines the two feature vectors as feature vectors of different dimensions.
  • More preferably, the combining unit 60a acquires the posture feature data D3 and the peripheral feature data D4a derived from the same image data D1. It is therefore desirable that the combining unit 60a acquire the posture feature data D3 and the peripheral feature data D4a, for example, by performing synchronization processing or by using an identification code attached to each frame.
  • The hierarchical LSTM includes intermediate layers 61 of a plurality of layers (only three layers are shown in FIG. 8), a normalization unit 62, an all combination unit 63, and an action class determination unit 64.
  • A(t) represents the input at the current time (the time to be processed; the same applies hereinafter), A(t-1) represents the input at a first time before the current time (for example, one frame before), A(t-2) represents the input at a second time before the current time (for example, 10 frames before), and A(t-3) represents the input at a third time before the current time (for example, 20 frames before).
  • Each of the intermediate layers 61 (61a to 61l) having a plurality of layers is constituted by LSTM units.
  • To the first-layer LSTM 61a, the current input data A(t) weighted by a predetermined weighting coefficient Wa1(t) (representing a matrix of weighting coefficients; the same applies hereinafter) is input, and the output data Z1(t-1) of the first-layer LSTM 61d at the first time before (for example, one frame before), weighted by a predetermined weighting coefficient Wb1(t-1), is recursively input.
  • The first-layer LSTM 61a performs predetermined arithmetic processing on these data and outputs data Z1(t) to the second-layer LSTM 61b and to the normalization unit 62.
  • Similarly, to the second-layer LSTM 61b, the current input data A(t) weighted by a predetermined weighting coefficient Wa2(t) is input, and the output data Z2(t-2) of the second-layer LSTM 61h at the second time before (for example, 10 frames before), weighted by a predetermined weighting coefficient Wb2(t-2), is recursively input.
  • The second-layer LSTM 61b performs predetermined arithmetic processing on these data and outputs data Z2(t) to the third-layer LSTM 61c and to the normalization unit 62.
  • Likewise, to the third-layer LSTM 61c, the current input data A(t) weighted by a predetermined weighting coefficient Wa3(t) is input, and the output data Z3(t-3) of the third-layer LSTM 61l at the third time before (for example, 20 frames before), weighted by a predetermined weighting coefficient Wb3(t-3), is recursively input.
  • The third-layer LSTM 61c performs predetermined arithmetic processing on these data and outputs data Z3(t) to a lower-layer LSTM 61 (not shown) and to the normalization unit 62.
  • In this way, the hierarchical LSTM lengthens the span of real time considered by a lower-layer LSTM 61 by increasing the frame interval of the past data input to it, so that time-series analysis over various frame intervals can be performed.
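  • The following sketch illustrates the idea of running parallel LSTM streams whose recurrent connections skip different numbers of frames (1, 10, and 20 here, matching the examples above). It is a simplified reading of FIG. 8, and the way past hidden states are buffered in a dictionary is an assumption made for illustration.

```python
# Sketch: three LSTM streams recurring over different frame intervals
# (1, 10, 20 frames, as in the examples above). Buffering past states in a
# dict keyed by frame index is an assumption for illustration.
import torch
import torch.nn as nn

class MultiIntervalLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, intervals=(1, 10, 20)):
        super().__init__()
        self.intervals = intervals
        self.hidden_dim = hidden_dim
        self.cells = nn.ModuleList([nn.LSTMCell(input_dim, hidden_dim) for _ in intervals])
        self.history = [dict() for _ in intervals]   # past (h, c) per stream, keyed by t

    def forward(self, a_t, t):
        """a_t: combined feature A(t) of shape (B, input_dim); t: frame index."""
        outputs = []
        for i, (cell, step) in enumerate(zip(self.cells, self.intervals)):
            past = self.history[i].get(t - step)
            if past is None:                          # no state that far back yet
                past = (a_t.new_zeros(a_t.size(0), self.hidden_dim),
                        a_t.new_zeros(a_t.size(0), self.hidden_dim))
            h, c = cell(a_t, past)                    # recursion over this stream's interval
            self.history[i][t] = (h, c)
            outputs.append(h)                         # Z1(t), Z2(t), Z3(t)
        return outputs
```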
  • Each LSTM unit 61 includes, for example, a memory cell that holds the data of the immediately preceding (past-time) LSTM unit 61 and three gates (generally referred to as an input gate, a forgetting gate, and an output gate), which are not shown.
  • the three gates are input with the current data A and the output data Z of the LSTM unit 61 immediately before (the past time).
  • Each of the three gates outputs a value from 0 to 1 based on the current data A, the output data Z of the LSTM unit 61 immediately before (the past time), and the weighting factor set separately. .
  • the outputs from the three gates are respectively integrated into the data input to the LSTM unit 61, the data output from the LSTM unit 61, and the data held in the memory cell. With such a unit structure, each LSTM unit 61 can appropriately reflect the characteristics at the past time and output the current characteristics.
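  • As a reminder of what each LSTM unit 61 computes internally, the following is a standard LSTM cell written out so that the input, forgetting, and output gates and the memory cell described above are visible. This is the textbook formulation; any deviation in the embodiment's units is not reflected here.

```python
# Sketch: a standard LSTM cell spelled out so the three gates and the memory
# cell described above are visible (textbook formulation, not the embodiment's
# exact unit).
import torch
import torch.nn as nn

class SimpleLSTMUnit(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.gates = nn.Linear(input_dim + hidden_dim, 4 * hidden_dim)

    def forward(self, a_t, h_prev, c_prev):
        z = self.gates(torch.cat([a_t, h_prev], dim=-1))
        i, f, o, g = z.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # gates output 0..1
        c_t = f * c_prev + i * torch.tanh(g)     # memory cell keeps or forgets past information
        h_t = o * torch.tanh(c_t)                # output reflects both past and current features
        return h_t, c_t
```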
  • The normalization unit 62 applies L2 normalization processing to the output data Z1(t), Z2(t), and Z3(t) of each of the plurality of intermediate layers 61 (here, the intermediate layers 61a, 61b, and 61c).
  • The all combining unit 63 performs full combining processing on all of the normalized output data Z1(t), Z2(t), and Z3(t) of the intermediate layers 61 (here, the intermediate layers 61a, 61b, and 61c). The all combining unit 63 then outputs a probability for each behavior class (for example, for each behavior such as sitting on a chair or getting up from a bed) to the behavior class determination unit 64.
  • the probability for each action class output from all the coupling units 63 is expressed by the following equation (3) using, for example, the softmax function.
  • Note that intermediate layers 61 and an all coupling unit 63 whose weighting factors have been appropriately set in advance by machine learning (by the learning unit 70 described later) for each action class are used. As a result, an appropriate probability is output from the all coupling unit 63 for each action class.
  • The action class determination unit 64 acquires the output from the all coupling unit 63, determines that the action class having the maximum probability is the action class being performed by the person shown in the image, and outputs the determination result.
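  • A hedged sketch of the steps described for the normalization unit 62, the all combining unit 63, and the action class determination unit 64 follows. Equation (3) is not reproduced in this text, so the softmax here is the standard formulation.

```python
# Sketch: L2-normalize each stream output, fully connect, take a softmax over
# action classes, and pick the most probable class (standard softmax; equation
# (3) itself is not reproduced here).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionClassHead(nn.Module):
    def __init__(self, hidden_dim, num_streams, num_classes):
        super().__init__()
        self.fc = nn.Linear(hidden_dim * num_streams, num_classes)  # all coupling unit 63

    def forward(self, stream_outputs):           # [Z1(t), Z2(t), Z3(t)]
        normed = [F.normalize(z, p=2, dim=-1) for z in stream_outputs]  # normalization unit 62
        logits = self.fc(torch.cat(normed, dim=-1))
        probs = F.softmax(logits, dim=-1)        # probability per action class
        return probs, probs.argmax(dim=-1)       # action class determination unit 64
```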
  • The configuration of the behavior determination unit 60 is not limited to the above.
  • a recursive neural network having another structure may be used instead of the hierarchical LSTM structure.
  • the human posture feature data D3 and the peripheral feature data D4a for each frame may be used without using the time series data in order to reduce the processing load.
  • the learning unit 70 performs machine learning using teacher data so that the human body feature extraction unit 30, the peripheral feature extraction unit 40, the peripheral feature filter unit 50, and the behavior determination unit 60 can execute the above-described processing.
  • For example, using teacher data in which normalized human-region images are associated with human posture features (for example, joint positions), the learning unit 70 adjusts the network parameters (for example, weighting factors and biases) of the convolution processing unit 32, the all coupling unit 33, and the all coupling unit 34 of the human body feature extraction unit 30.
  • Further, using teacher data in which normalized images of objects around a person are associated with object features (for example, the positional relationship with the human body), the learning unit 70 adjusts the network parameters (for example, weighting factors and biases) of the convolution processing unit 42 and the all coupling unit 43 of the peripheral feature extraction unit 40.
  • Further, using teacher data in which posture features of a person are associated with the importance of surrounding objects, the learning unit 70 adjusts the network parameters (for example, weighting factors and biases) of the all coupling unit 51 and the all coupling unit 54 of the peripheral feature filter unit 50.
  • Further, using teacher data in which time-series data of human posture features and peripheral features are associated with the correct behavior class, the learning unit 70 adjusts the network parameters (for example, weighting factors and biases) of the intermediate layers 61 and the all coupling unit 63 of the behavior determination unit 60.
  • the learning unit 70 may perform these learning processes using, for example, a known error back propagation method. Then, the learning unit 70 stores the network parameter adjusted by the learning process in the storage unit (for example, the external storage device 104).
  • Note that there is a possibility that, during the learning process, the peripheral feature data comes to represent only features specialized to the environment of the teacher data; the peripheral feature data should therefore only complement the posture feature data of the human body.
  • For this reason, the learning unit 70 executes the learning process so that, at least in the behavior determination unit 60, the posture feature of the human body serves as a more dominant determinant of the behavior than the peripheral feature.
  • FIG. 9 is a diagram for explaining the learning process of the learning unit 70.
  • Learning is performed using teacher data in which time-series data D6a of human posture features, time-series data D6b of peripheral features, and the correct behavior class D6c are associated with one another, so that the loss Loss, which indicates the error of the output data (here, the output of the all coupling unit 63) with respect to the correct answer, is reduced.
  • the loss function is expressed as in the following equation (4) using, for example, a softmax cross entropy function.
  • Here, the learning unit 70 prepares two losses: a loss Loss1 for the case where only the time-series data D6a of the human posture feature is input to the hierarchical LSTM, and a loss Loss2 for the case where both the time-series data D6a of the human posture feature and the time-series data D6b of the peripheral feature are input to the hierarchical LSTM.
  • The learning unit 70 then adjusts the weighting coefficients and biases of the network parameters of the hierarchical LSTM, for example by an error backpropagation method, so that the sum of the loss Loss1 and the loss Loss2 (see the following equation (5)) is minimized.
  • This makes it possible to adjust the network parameters of the hierarchical LSTM so that the posture feature of the human body is a more dominant determinant in the action discrimination than the peripheral feature.
  • Note that the loss function may be expressed as in the following equation (6) by adding a regularization term for the importance. Doing so can suppress overfitting.
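  • A sketch of the training objective as it is described verbally follows. Since equations (4) to (6) are not reproduced in this text, the cross entropy below is the standard softmax cross entropy, and the squared-norm penalty on the importance is an assumed form of the regularization term.

```python
# Sketch of the combined loss: Loss1 (posture-only input) + Loss2 (posture +
# peripheral input) + a regularization term on the importance. Standard softmax
# cross entropy; the squared-norm form of the regularizer is an assumption.
import torch
import torch.nn.functional as F

def combined_loss(logits_posture_only, logits_with_peripheral, target_class,
                  importance, reg_weight=1e-4):
    loss1 = F.cross_entropy(logits_posture_only, target_class)      # posture feature only
    loss2 = F.cross_entropy(logits_with_peripheral, target_class)   # posture + filtered peripheral
    reg = reg_weight * importance.pow(2).sum(dim=-1).mean()         # keeps importance from overfitting
    return loss1 + loss2 + reg
```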
  • Note that instead of performing the learning processing for each functional block separately, the learning unit 70 may perform the learning processing of the functional blocks collectively, using the moving image data and the correct action class as teacher data.
  • FIGS. 10 to 12 are examples of flowcharts of operations performed by the image processing apparatus 100.
  • FIG. 13 and FIG. 14 are diagrams schematically illustrating each process of image processing performed by the image processing apparatus 100.
  • In FIGS. 13 and 14, the process of determining the behavior of the person B1 shown in the image of FIG. 1 is illustrated pictorially.
  • the image captured by the imaging apparatus 200 includes a bed B2, a trash can B3, a television B4, and an illumination B5 in addition to the person B1.
  • the image processing apparatus 100 acquires image data from the imaging apparatus 200 (step S1).
  • the image processing apparatus 100 (human region detection unit 20, human body feature extraction unit 30, peripheral feature extraction unit 40, and peripheral feature filter unit 50) performs feature extraction processing (step S2).
  • In step S2, the image processing apparatus 100 (human region detection unit 20) first detects a human region from the image of the acquired image data (step S21).
  • Next, the image processing apparatus 100 (human body feature extraction unit 30) extracts the posture feature of the person B1 shown in the image of the human region, as shown in FIG. 13A (step S22).
  • Next, the image processing apparatus 100 (peripheral feature extraction unit 40) extracts the peripheral features (here, the bed B2, the trash can B3, and the lighting B5) of the person B1 shown in the image of the human region, as shown in FIG. 13B (step S23).
  • Then, the image processing apparatus 100 (peripheral feature filter unit 50) filters the peripheral features, as shown in FIG. 13C (step S24).
  • Here, the bed B2 is a peripheral object closely related to the action of the person B1, while the peripheral objects other than the bed B2 (the trash can B3 and the lighting B5) can be regarded as peripheral objects unrelated to the action of the person B1.
  • FIG. 13C shows a state in which the peripheral objects other than the bed B2 (the trash can B3 and the lighting B5) have been removed by the filtering of the peripheral feature filter unit 50.
  • the image processing apparatus 100 determines the action of the person shown in the image based on the feature data extracted in the feature extraction process in step S2 (step S3).
  • In step S3, the image processing apparatus 100 (behavior determination unit 60) first inputs the human posture feature data D3 and the filtered peripheral feature data D4a (step S31).
  • Next, the image processing apparatus 100 (behavior determination unit 60) calculates the probability for each action class using the hierarchical LSTM (step S32).
  • Then, the image processing apparatus 100 (behavior determination unit 60) determines that the action class having the maximum probability is the action of the person shown in the image, and outputs data indicating that action class (step S33).
  • In step S3, the image processing apparatus 100 (behavior determination unit 60) extracts, for example, the temporal change in which the posture of the person B1 changes from a state of lying on the bed B2 to a state of rising from it, in the order of FIG. 14A, FIG. 14B, and FIG. 14C.
  • From such a temporal change, the image processing apparatus 100 can determine that the action class of the person B1 corresponds to getting up.
  • In step S4, when the image processing apparatus 100 determines to end the series of action recognition processes (step S4: YES), the processing ends. On the other hand, when it determines to continue the action recognition processing (step S4: NO), the image processing apparatus 100 returns to step S1 and continues the processing.
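  • The flow of steps S1 to S4 can be summarized as the loop below, reusing the hypothetical pipeline object sketched earlier; stop_requested() is a placeholder assumption for the end condition of step S4.

```python
# Sketch of the overall loop of FIGS. 10 to 12 (stop_requested() is a placeholder
# assumption for the step S4 end condition; pipeline is the hypothetical
# ActionRecognitionPipeline sketched above).
def run(pipeline, stop_requested):
    while True:
        result = pipeline.process_frame()   # S1: acquire, S2: feature extraction, S3: determination
        print(result)                       # output the determined action class
        if stop_requested():                # S4: end?
            break
```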
  • As described above, in the image processing apparatus 100 according to the present embodiment, the importance of the peripheral features is set in association with the posture feature of the human body and the peripheral features are filtered, so that only the peripheral objects related to the person's behavior can be extracted. Accordingly, the image processing apparatus 100 according to the present embodiment can estimate a person's action class with high accuracy even in environments in which the types, positions, and appearances of surrounding objects differ.
  • In particular, since the image processing apparatus 100 is configured to extract, based on time-series data of the posture features and the peripheral features, temporal changes in the positional relationship between the posture of the human body and the related peripheral objects and to estimate the person's action class from them, it can estimate the action class with higher accuracy.
  • In addition, since the image processing apparatus 100 uses a recursive neural network (in particular, a hierarchical LSTM) and is configured to perform temporal analysis of the time-series data of the posture features and the peripheral features from the viewpoints of both long and short time intervals, it can estimate a person's action class with still higher accuracy.
  • In the above embodiment, as an example of the configuration of the image processing apparatus 100, the functions of the image acquisition unit 10, the human region detection unit 20, the human body feature extraction unit 30, the peripheral feature extraction unit 40, the peripheral feature filter unit 50, the behavior determination unit 60, and the learning unit 70 have been described as being realized by one computer; needless to say, they may be realized by a plurality of computers.
  • the program and data read by the computer may be distributed and stored in a plurality of computers.
  • In the above embodiment, as an example of the operation of the image processing apparatus 100, the processing of the image acquisition unit 10, the human region detection unit 20, the human body feature extraction unit 30, the peripheral feature extraction unit 40, the peripheral feature filter unit 50, and the behavior determination unit 60 has been described as being executed in a series of flows; needless to say, part or all of this processing may be executed in parallel.
  • the image processing apparatus enables more accurate action recognition without increasing the processing load.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Pathology (AREA)
  • Molecular Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Emergency Management (AREA)
  • Biophysics (AREA)
  • Business, Economics & Management (AREA)
  • Biomedical Technology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Medical Informatics (AREA)
  • Dentistry (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Veterinary Medicine (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides an image processing device provided with: an image acquisition unit (10) which acquires an image generated by an image capturing device (200); a human body feature extraction unit (30) which extracts a posture feature of a human shown in the image; a surrounding feature extraction unit (40) which extracts a surrounding feature that represents the shape, position, or category of an object surrounding the human shown in the image; a surrounding feature filtering unit (50) which filters the surrounding feature on the basis of the posture feature and the importance of the surrounding feature set in association with the posture feature; and a behavior determination unit (60) which estimates a behavior class of the human shown in the image on the basis of the posture feature and the surrounding feature filtered by the surrounding feature filtering unit.

Description

Image processing device, image processing method, and image processing program
 The present disclosure relates to an image processing device, an image processing method, and an image processing program.
 Conventionally, techniques for recognizing a human action from an acquired image are known. Persons whose behavior is to be recognized include, for example, elderly people and their caregivers, considering mechanisms for recognizing the living conditions of elderly people, and the elderly themselves, in the field of elderly-care monitoring. Specifically, for an elderly person, for example, the behaviors to be recognized include basic activities of daily living such as going to bed, getting up, leaving the bed, sitting down, squatting, walking, eating, using the toilet, going out, and picking things up, as well as behaviors that occur in accidents such as tumbles and falls.
 Many of these behaviors can be recognized by capturing changes in the posture of the person. For example, the action of going to bed may consist of a person walking up to the bed, sitting down once, and then lying down. In this case, the posture of the person changes in the order of standing, sitting, and lying. In order to recognize such behavior, it is important to recognize the posture accurately.
 One example of a technique for recognizing behavior is a technique for estimating the joint positions of a person from an acquired image. In this technique, the posture of the person is estimated from the relationships between the estimated joint positions, and the behavior of the person is recognized from changes in the estimated posture and position of the person.
 For example, Non-Patent Document 1 discloses a technique for estimating a human posture using a convolutional neural network (Convolutional Neural Network: hereinafter, “CNN”).
 Non-Patent Document 2 discloses a technique for estimating human behavior using a recurrent neural network (Recurrent Neural Network: hereinafter, “RNN”).
 Patent Document 1 discloses a technique for performing rule-based action recognition based on the positional relationship between a human posture estimated from an image and object information.
International Publication No. 2016/181837
 In an action recognition system that recognizes the behavior of a person shown in an image, there is a problem that, depending on the positional relationship between the camera and the person, the posture features of the person appearing in the image differ in size, orientation, distance, and so on even when the person performs the same action. In particular, for images captured with a wide-angle camera, it is difficult to recognize the positional relationship of each part of the person in the depth direction.
 In this regard, focusing on the fact that human behavior often occurs as an interaction with an object, methods that use information on surrounding objects in addition to the person's posture features when recognizing the behavior have been studied.
 For example, as in the prior art of Patent Document 1, a conceivable method is to specify the object to be monitored in advance and perform rule-based action recognition using the positional relationship between the monitored object and the person. Alternatively, a method that uses a convolutional neural network or the like to extract the features of surrounding objects, in the same way as human posture features, is also conceivable.
 However, while any of these methods can easily recognize actions under conditions where the type, shape, position, and appearance of the objects to be noted are fixed, in environments where these vary, the number of patterns to be recognized becomes enormous, leading to erroneous recognition and an increased processing load. The amount of data that must be prepared in advance also becomes enormous.
 The present disclosure has been made in view of the above problems, and an object thereof is to provide an image processing device, an image processing method, and an image processing program that enable action recognition with higher accuracy.
 A main aspect of the present disclosure that solves the above-described problems is an image processing device including:
 an image acquisition unit that acquires an image generated by an imaging device;
 a human body feature extraction unit that extracts a posture feature of a person shown in the image;
 a peripheral feature extraction unit that extracts a peripheral feature indicating the shape, position, or type of an object around the person shown in the image;
 a peripheral feature filter unit that filters the peripheral feature based on the posture feature and the importance of the peripheral feature set in association with the posture feature; and
 a behavior determination unit that estimates an action class of the person shown in the image based on the posture feature and the peripheral feature filtered by the peripheral feature filter unit.
In another aspect, the present disclosure is an image processing method including:
acquiring an image generated by an imaging device;
extracting posture features of a person appearing in the image;
extracting peripheral features indicating the shape, position, or type of objects around the person appearing in the image;
filtering the peripheral features based on the posture features and the importance of the peripheral features set in association with the posture features; and
estimating the action class of the person appearing in the image based on the posture features and the filtered peripheral features.
In yet another aspect, the present disclosure is an image processing program that causes a computer to execute:
processing for acquiring an image generated by an imaging device;
processing for extracting posture features of a person appearing in the image;
processing for extracting peripheral features indicating the shape, position, or type of objects around the person appearing in the image;
processing for filtering the peripheral features based on the posture features and the importance of the peripheral features set in association with the posture features; and
processing for estimating the action class of the person appearing in the image based on the posture features and the filtered peripheral features.
The image processing device according to the present disclosure enables more accurate action recognition.
FIG. 1 is a diagram illustrating an example of an action recognition system according to the embodiment.
FIG. 2 is a diagram illustrating an example of the hardware configuration of the image processing device according to the embodiment.
FIG. 3 is a diagram illustrating an example of the functional blocks of the image processing device according to the embodiment.
FIG. 4 is a diagram illustrating an example of each component of the image processing device according to the embodiment.
FIG. 5 is a diagram illustrating an example of a person region in an image detected by the person region detection unit according to the embodiment.
FIG. 6 is a diagram illustrating an example of posture feature data extracted by the human body feature extraction unit according to the embodiment.
FIG. 7 is a diagram illustrating the filtering processing of the peripheral feature filter unit according to the embodiment.
FIG. 8 is a diagram illustrating an example of the configuration of the hierarchical LSTM of the action determination unit according to the embodiment.
FIG. 9 is a diagram illustrating the learning processing of the learning unit according to the embodiment.
FIG. 10 is a flowchart of operations performed by the image processing device according to the embodiment.
FIG. 11 is a flowchart of operations performed by the image processing device according to the embodiment.
FIG. 12 is a flowchart of operations performed by the image processing device according to the embodiment.
FIGS. 13A, 13B, and 13C are diagrams schematically illustrating processes of the image processing performed by the image processing device according to the embodiment.
FIGS. 14A, 14B, and 14C are diagrams schematically illustrating processes of the image processing performed by the image processing device according to the embodiment.
(Configuration of the action recognition system)
Hereinafter, the configuration of the action recognition system according to one embodiment and an outline of the configuration of the image processing device 100 applied to the action recognition system will be described with reference to FIGS. 1 to 3.
FIG. 1 is a diagram illustrating an example of the action recognition system according to the present embodiment.
The action recognition system according to the present embodiment includes an image processing device 100, an imaging device 200, and a communication network 300.
The imaging device 200 is, for example, a general camera or a wide-angle camera, and generates image data by AD-converting the image signal generated by the camera's image sensor. The imaging device 200 according to the present embodiment is configured to continuously generate image data in units of frames so as to capture a moving image (hereinafter also referred to as "moving image data"). The imaging device 200 is installed at an appropriate position in the room so that the person B1 whose actions are to be recognized appears in the image.
The imaging device 200 transmits the moving image data to the image processing device 100 via the communication network 300.
The image processing device 100 is a device that determines the action of the person B1 appearing in the image based on the moving image data generated by the imaging device 200 and outputs the result.
FIG. 2 is a diagram illustrating an example of the hardware configuration of the image processing device 100 according to the present embodiment.
The image processing device 100 is a computer that includes, as its main components, a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, an external storage device (for example, a flash memory) 104, and a communication interface 105.
Each function of the image processing device 100 described later is realized, for example, by the CPU 101 referring to a control program (for example, the image processing program) and various data (for example, learned network parameters) stored in the ROM 102, the RAM 103, the external storage device 104, and the like. However, some or all of the functions may be realized by processing by a DSP (Digital Signal Processor) instead of, or together with, processing by the CPU. Similarly, some or all of the functions may be realized by processing by dedicated hardware circuits instead of, or together with, processing by software.
FIG. 3 is a diagram illustrating an example of the functional blocks of the image processing device 100 according to the present embodiment.
The image processing device 100 includes an image acquisition unit 10, a person region detection unit 20, a human body feature extraction unit 30, a peripheral feature extraction unit 40, a peripheral feature filter unit 50, an action determination unit 60, and a learning unit 70.
The image acquisition unit 10 acquires the image data D1 of the image (here, a moving image) generated by the imaging device 200.
The person region detection unit 20 detects a person region from the image of the image data D1.
The human body feature extraction unit 30 extracts the posture features of the person appearing in the image based on the image data D1 and the data D2 indicating the person region.
The peripheral feature extraction unit 40 extracts the peripheral features of objects around the person appearing in the image based on the image data D1 and the data D2 indicating the person region.
The peripheral feature filter unit 50 filters the peripheral feature data D4 based on the posture feature data D3 and the importance data Da of the peripheral features set in association with the posture features.
The action determination unit 60 determines the action class of the person appearing in the image based on the posture feature data D3 and the filtered peripheral feature data D4a, and outputs the resulting data D5.
The learning unit 70 performs learning processing based on the teacher data D6 so that the network parameters of the action determination unit 60 and the other units (for example, the weight coefficients and biases of the neural networks described later) are optimized.
Since the image processing device 100 according to the present embodiment acquires the moving image data D1 from the imaging device 200, the data D2 to D5 are generated continuously, for each frame or at intervals of several frames.
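The dataflow among these functional blocks can be summarized with a short sketch. The following is a minimal, hypothetical Python outline of the pipeline described above; the class and function names (units, person_detector, and so on) are illustrative assumptions, not part of the disclosure.

    # Hypothetical top-level pipeline corresponding to the functional blocks of FIG. 3.
    def recognize_actions(frames, units):
        """frames: iterable of image frames (D1); units: an object holding the trained sub-modules."""
        for frame in frames:                                        # image acquisition unit 10
            region = units.person_detector(frame)                   # person region detection unit 20 -> D2
            pose_feat = units.pose_extractor(frame, region)         # human body feature extraction unit 30 -> D3
            peri_feat = units.peripheral_extractor(frame, region)   # peripheral feature extraction unit 40 -> D4
            peri_filtered = units.peripheral_filter(pose_feat, peri_feat)  # peripheral feature filter unit 50 -> D4a
            action_class = units.action_classifier(pose_feat, peri_filtered)  # action determination unit 60 -> D5
            yield action_class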
(Configuration of the image processing device)
Hereinafter, details of each component of the image processing device 100 according to the present embodiment will be described with reference to FIGS. 4 to 9.
FIG. 4 is a diagram illustrating an example of each component of the image processing device 100 according to the present embodiment. The arrows in FIG. 4 represent the exchange of data. An example of the operation of the image processing device 100 will be described later with reference to FIGS. 10 to 14.
[Image acquisition unit]
The image acquisition unit 10 acquires the moving image data D1 generated by the imaging device 200 and outputs it to the person region detection unit 20. Of course, the image acquisition unit 10 may instead be configured to acquire image data D1 stored in the external storage device 104 or image data D1 provided via an Internet connection or the like.
[Person region detection unit]
The person region detection unit 20 acquires the image data D1 from the image acquisition unit 10, performs predetermined arithmetic processing on the image data D1, and detects a person region containing a person in the image. The person region detection unit 20 then outputs the detected person region data D2, together with the image data D1, to the human body feature extraction unit 30 and the peripheral feature extraction unit 40.
FIG. 5 is a diagram illustrating an example of a person region in an image detected by the person region detection unit 20. In FIG. 5, T1 represents the entire area of the image, T2 represents the person region in the image, and T2a represents the region surrounding the person in the image.
The method by which the person region detection unit 20 detects the person region T2 is arbitrary; for example, a difference image of the image T1 may be computed from the moving image and the person region T2 detected from that difference image. The person region detection unit 20 may also use other methods such as a trained neural network, template matching, a combination of HOG (Histograms of Oriented Gradients) features and an SVM (Support Vector Machine), or background subtraction.
Note that the person region detection unit 20 may be integrated with the human body feature extraction unit 30. In other words, the processing of the person region detection unit 20 may be executed as part of the series of processes in which the human body feature extraction unit 30 detects the posture features of the person appearing in the image.
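As one concrete possibility for the HOG + SVM option mentioned above, the built-in people detector of OpenCV could serve as a simple person region detector. The following sketch is only illustrative of that option and is not the implementation disclosed here; the choice of keeping only the single most confident detection is an assumption.

    import cv2

    def detect_person_region(frame):
        """Illustrative person region detection using OpenCV's default HOG + linear SVM people detector."""
        hog = cv2.HOGDescriptor()
        hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
        # detectMultiScale returns bounding boxes (x, y, w, h) and confidence weights
        rects, weights = hog.detectMultiScale(frame, winStride=(8, 8), padding=(8, 8), scale=1.05)
        if len(rects) == 0:
            return None
        # keep the most confident detection as the person region T2 (data D2 in the text)
        best_rect, _ = max(zip(rects, weights), key=lambda rw: float(rw[1]))
        return best_rect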
[Human body feature extraction unit]
The human body feature extraction unit 30 acquires the image data D1 and the data D2 indicating the person region from the person region detection unit 20, performs predetermined arithmetic processing on the image of the person region T2, and extracts the posture features of the person appearing in the image. The human body feature extraction unit 30 then outputs the extracted posture feature data D3 to the peripheral feature filter unit 50 and the action determination unit 60.
Here, the "posture features of a person" are features extracted from the posture of the human body, such as a walking state or a sitting state. The posture features are expressed, for example, by the joint positions of the human body, the positions of individual body parts (for example, the head or the feet), the type of posture (for example, about to stand up, or bending forward), or temporal changes in these. The data format used to express the posture features is arbitrary, such as a category format, a coordinate format, or relative positions between body parts. The posture features may also be abstracted data (for example, HOG feature values). Since the posture features may be expressed by categories and the like, the more general term "feature" is used here rather than "feature value" (the same applies to the peripheral features).
FIG. 5 shows the joint positions of the human body as an example of posture features. In FIG. 5, the joint positions of the human body are the right ankle p0, right knee p1, right hip p2, left hip p3, left knee p4, left ankle p5, right wrist p6, right elbow p7, right shoulder p8, left shoulder p9, left elbow p10, left wrist p11, neck p12, and top of the head p13.
As shown in FIG. 4, the human body feature extraction unit 30 extracts the posture features of the person from the image using, for example, a trained CNN. The CNN constituting the human body feature extraction unit 30 is, for example, one trained with teacher data indicating the correspondence between images of the human body and the coordinates (two-dimensional positions or estimated three-dimensional positions) of the joint positions of the human body in those images (such a network is also generally referred to as an R-CNN).
The human body feature extraction unit 30 includes, for example, a preprocessing unit 31, a convolution processing unit 32, a fully connected unit 33, and a fully connected unit 34.
The preprocessing unit 31 normalizes the image based on the data D2 indicating the person region, for example by cutting out the person region image T2 from the full image T1 and converting it to a predetermined size and aspect ratio. The preprocessing unit 31 may also set the region according to the viewing distance of the person region, or may perform color segmentation processing.
The convolution processing unit 32 is configured by hierarchically connecting a plurality of feature extraction layers. In each feature extraction layer, the convolution processing unit 32 performs convolution, activation, and pooling on the input data supplied from the preceding layer.
By repeating the processing in each feature extraction layer in this way, the convolution processing unit 32 extracts high-dimensional feature values of multiple aspects of the image (for example, edges, regions, and distributions) and outputs the result to the fully connected unit 33.
The fully connected unit 33 is composed of, for example, a multilayer perceptron that fully connects a plurality of feature values (the same applies to the other fully connected units 34, 43, 51, 54, and 63). The fully connected unit 33 fully connects the plurality of intermediate calculation results obtained from the convolution processing unit 32 to generate the data D3 indicating the posture features of the person. The fully connected unit 33 then outputs the data D3 indicating the posture features of the person to the fully connected unit 34 and the peripheral feature filter unit 50.
FIG. 6 is a diagram illustrating an example of the posture feature data D3 extracted by the human body feature extraction unit 30. In FIG. 6, the feature values of the joint positions of the human body are represented as a 4096-dimensional vector.
The fully connected unit 34 is an output layer that fully connects the output of the fully connected unit 33 and outputs the data D3 indicating the posture features of the person to the action determination unit 60.
By feeding the output of the fully connected unit 33 into the peripheral feature filter unit 50 (its input-side fully connected unit 51), the human body feature extraction unit 30 according to the present embodiment extracts sparse, high-dimensional features that are correlated with the learned posture features. However, the peripheral feature filter unit 50 may instead obtain this input from the fully connected unit 34.
The above method of extracting the posture features of a person is similar to a known method; see, for example, Non-Patent Document 1 for details.
The posture feature extraction processing performed by the human body feature extraction unit 30 is not limited to the above method, and any method may be used. The human body feature extraction unit 30 may use, for example, silhouette extraction, region segmentation, skin color extraction, luminance gradient extraction, motion extraction, shape model fitting, or a combination of these. A method in which extraction processing is performed for each part of the human body and the results are integrated may also be used.
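The architecture outlined above (convolutional feature extraction followed by fully connected layers producing a 4096-dimensional posture feature and a joint-coordinate output) could be sketched as follows. This is only a minimal illustration under assumed layer sizes and a 2D joint regression head, not the configuration actually disclosed.

    import torch
    import torch.nn as nn

    class PoseFeatureExtractor(nn.Module):
        """Minimal sketch of the human body feature extraction unit 30 (assumed layer sizes)."""
        def __init__(self, num_joints=14):
            super().__init__()
            # convolution processing unit 32: stacked convolution / ReLU / pooling layers
            self.conv = nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((8, 8)),
            )
            # fully connected unit 33: produces the 4096-dimensional posture feature D3
            self.fc33 = nn.Linear(128 * 8 * 8, 4096)
            # fully connected unit 34: output layer (here regressing 2D joint coordinates)
            self.fc34 = nn.Linear(4096, num_joints * 2)

        def forward(self, person_crop):
            x = self.conv(person_crop).flatten(1)
            pose_feature = torch.relu(self.fc33(x))   # D3, also passed to the peripheral feature filter unit 50
            joints = self.fc34(pose_feature)          # joint coordinates (p0 .. p13)
            return pose_feature, joints

The choice of num_joints=14 simply mirrors the fourteen joint positions p0 to p13 listed above.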
[Peripheral feature extraction unit]
The peripheral feature extraction unit 40 acquires the image data D1 and the data D2 indicating the person region from the person region detection unit 20, performs predetermined arithmetic processing on the image around the person region T2, and extracts the peripheral features of objects around the person appearing in the image. The peripheral feature extraction unit 40 then outputs the extracted peripheral feature data D4 to the peripheral feature filter unit 50.
Here, the "peripheral features" represent the shapes of objects around the person, the positions of those objects, the types of those objects, and the like. The data format used to express the peripheral features is arbitrary, such as a category format, a coordinate format, or relative positions. The peripheral features may also be abstracted data (for example, HOG feature values).
More preferably, the peripheral features include information indicating the positional relationship between the surrounding objects and each part of the human body (for example, the positional relationship with the person's hands). More preferably, the peripheral features also include information indicating the types of the surrounding objects (for example, bed, chair). This is because such information can be an important determining factor when the action determination unit 60 determines the person's action.
In the present embodiment, the peripheral features play a role in complementing the interaction between the person and objects. In other words, actions that are difficult to determine from the person's posture features alone are complemented by the peripheral features.
As shown in FIG. 4, the peripheral feature extraction unit 40 extracts the peripheral features from the image using, for example, a CNN, in the same manner as the human body feature extraction unit 30. The CNN constituting the peripheral feature extraction unit 40 is, for example, one trained with teacher data indicating the correspondence between images of objects and their shapes, types, or the positions of their parts. More preferably, it is trained with teacher data indicating the correspondence between images of the area around the human body (including the human body) and the coordinates of the shapes and positional relationships of the objects in those images.
The peripheral feature extraction unit 40 includes, for example, a preprocessing unit 41, a convolution processing unit 42, and a fully connected unit 43.
Based on the image data D1 and the data D2 indicating the person region, the preprocessing unit 41 cuts out, from the full image T1, the surrounding-region image T2a, which is obtained by enlarging the person region T2 with that region as a reference (see FIG. 5). The preprocessing unit 41 then normalizes the image, for example by converting it to a predetermined size and aspect ratio.
The processing of the convolution processing unit 42 and the fully connected unit 43 is as described above. The convolution processing unit 42 extracts high-dimensional feature values of multiple aspects of the peripheral features (for example, edges, regions, and distributions) by convolution processing and the like. The fully connected unit 43 fully connects the plurality of intermediate calculation results obtained from the convolution processing unit 42 and outputs the peripheral features as the final calculation result.
The extraction processing performed by the peripheral feature extraction unit 40 is not limited to the above method, and any method may be used. The peripheral feature extraction unit 40 may use, for example, HOG feature extraction, silhouette extraction, region segmentation, luminance gradient extraction, motion extraction, shape model fitting, or a combination of these. A method in which feature extraction is performed for each predetermined region and the results are integrated may also be used.
[Peripheral feature filter unit]
The peripheral feature filter unit 50 acquires the posture feature data D3 from the human body feature extraction unit 30 and the peripheral feature data D4 from the peripheral feature extraction unit 40. The peripheral feature filter unit 50 then filters the peripheral features based on the importance data Da of the peripheral features set in association with the posture features, and outputs the filtered peripheral feature data D4a to the action determination unit 60.
When determining the action of a person appearing in an image, as described above, in environments where the types, positions, and appearances of objects in the image vary widely, the number of patterns to be recognized becomes enormous unless the surrounding objects to be noted are narrowed down, leading to erroneous recognition and an increased processing load.
Moreover, the peripheral features to be noted are not necessarily the same for every action. In other words, which elements of the peripheral features are important and which are unnecessary changes with the posture produced by the action. For example, when distinguishing the action of "sitting on a bed (getting into bed)" from the action of "sitting on a chair", the peripheral features around the hips (for example, the rounded edge of a chair seat, the vertical edge of a bed, and the space behind the back) are important. When distinguishing the action of "picking something up" from the action of "putting something down", the peripheral features near the hands are important.
From this point of view, the image processing device 100 according to the present embodiment uses the peripheral feature filter unit 50 to filter the peripheral features extracted by the peripheral feature extraction unit 40.
Here, the "importance" refers to the position, shape, type, and the like of the peripheral features to be noted. In other words, the importance makes it possible to narrow down the actions that are likely to be relevant from the person's posture features and to specify the positions, shapes, or types of the surrounding objects to be noted. The data format used to express the importance is arbitrary, as long as it can be uniquely converted from the posture feature data D3 and used to filter the peripheral feature data D4.
More preferably, information on the temporal change of the person's posture features (for example, the difference from the posture features a predetermined number of frames earlier) is also used for the importance. This makes it easier to narrow down the peripheral features to be noted. For example, if the temporal change of the posture features indicates that a hand is reaching out in a certain direction, it can be predicted that the person is about to pick up the object in that direction, so the action determination unit 60 described later can be configured to focus only on the features of that object.
The importance is set, for example, by machine learning for each action class (by the learning unit 70 described later) using teacher data in which posture features are associated with the peripheral features to be noted. However, the importance is not limited to being set by machine learning and may, of course, be set by a user or the like.
As shown in FIG. 4, the peripheral feature filter unit 50 includes, for example, an input-side fully connected unit 51, an activation processing unit 52, a filtering unit 53, and an output-side fully connected unit 54.
The input-side fully connected unit 51 applies full connection processing to the posture features output from the fully connected unit 33 of the human body feature extraction unit 30, thereby converting the posture feature data D3 (here, a 4096-dimensional feature vector) into the importance of the peripheral features (here, a 200-dimensional feature vector). The input-side fully connected unit 51 then outputs an importance vector with the same number of dimensions as the peripheral features (here, 200 dimensions).
In the input-side fully connected unit 51, weight coefficients for each element of the posture feature vector are set, for example by machine learning for each action class, so that the posture features can be converted into the importance of the peripheral features.
The activation processing unit 52 applies activation processing to the importance output from the input-side fully connected unit 51, using, for example, a ReLU function. The ReLU function outputs 0 when a negative number is input and returns the input unchanged when a number of 0 or more is input.
The importance output from the activation processing unit 52 is expressed, for example, as in the following equation (1), that is, by applying the ReLU function to the fully connected transformation of the posture feature:

    a = ReLU(W_a · x_pose + b_a)    (1)

where x_pose is the posture feature vector (D3), W_a and b_a are the weight matrix and bias of the input-side fully connected unit 51, and a is the 200-dimensional importance vector.
The filtering unit 53 filters the peripheral features by multiplying the peripheral features output from the peripheral feature extraction unit 40, element by element, by the importance output from the activation processing unit 52. The peripheral features output from the filtering unit 53 are expressed, for example, as in the following equation (2):

    f' = a ⊙ f_peripheral    (2)

where f_peripheral is the 200-dimensional peripheral feature vector (D4), a is the importance vector of equation (1), ⊙ denotes element-wise multiplication, and f' is the filtered peripheral feature.
FIG. 7 is a diagram explaining the filtering processing of the peripheral feature filter unit 50. In the present embodiment, the importance data Da is expressed as a vector that responds only to specific dimensions. The filtered peripheral feature values (right side of the figure) therefore form a sparser feature vector than the peripheral feature values before filtering (left side of the figure).
The output-side fully connected unit 54 further abstracts the filtered peripheral feature data D4a and outputs it to the combining unit 60a of the action determination unit 60.
The calculation method of the peripheral feature filter unit 50 is not limited to the above. For example, when the posture feature data D3 indicates a posture type (for example, half-crouching), the peripheral feature filter unit 50 may associate an importance with each posture type on a one-to-one basis.
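A minimal sketch of the filtering described above (a fully connected layer mapping the 4096-dimensional posture feature to a 200-dimensional importance vector, ReLU activation, element-wise multiplication with the peripheral feature, and a further fully connected abstraction) might look as follows. The layer sizes and module names are assumptions taken from the description, not a definitive implementation.

    import torch
    import torch.nn as nn

    class PeripheralFeatureFilter(nn.Module):
        """Sketch of the peripheral feature filter unit 50 (assumed dimensions: 4096 -> 200)."""
        def __init__(self, pose_dim=4096, peri_dim=200, out_dim=200):
            super().__init__()
            self.fc51 = nn.Linear(pose_dim, peri_dim)   # input-side fully connected unit 51
            self.relu52 = nn.ReLU()                     # activation processing unit 52
            self.fc54 = nn.Linear(peri_dim, out_dim)    # output-side fully connected unit 54

        def forward(self, pose_feature, peripheral_feature):
            importance = self.relu52(self.fc51(pose_feature))   # Eq. (1): a = ReLU(W_a x_pose + b_a)
            filtered = importance * peripheral_feature          # Eq. (2): element-wise filtering -> D4a
            return self.fc54(filtered)                          # abstracted filtered feature for unit 60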
[Action determination unit]
The action determination unit 60 acquires the posture feature data D3 of the person from the human body feature extraction unit 30 and the filtered peripheral feature data D4a from the peripheral feature filter unit 50. The action determination unit 60 then determines the action class of the person appearing in the image based on the time-series data of the posture feature data D3 and the filtered peripheral feature data D4a.
Because human actions have temporal continuity and deep time-series relationships between one action and the next, it is desirable, when determining a person's action class, not to process the image data of each frame in isolation but also to take into account time-series data indicating temporal changes such as the person's posture and the positional relationships with objects. In addition, when determining a person's action class, it is desirable to consider not only the behavior across consecutive frames but also behavior from a somewhat distant past (for example, one minute earlier). For example, when determining the action of "standing up from a chair", data indicating that the past action "sitting on a chair" occurred is also a strong clue.
The action determination unit 60 according to the present embodiment therefore performs time-series analysis using a hierarchical LSTM (Long Short-Term Memory), a type of recurrent neural network. The hierarchical LSTM can recognize relationships over long time intervals (for example, one minute earlier) in addition to relationships over short time intervals (for example, the immediately preceding image frame).
FIG. 8 is a diagram illustrating an example of the configuration of the hierarchical LSTM of the action determination unit 60.
As shown in FIG. 4, the action determination unit 60 receives the posture feature data D3 acquired from the human body feature extraction unit 30 and the filtered peripheral feature data D4a from the peripheral feature filter unit 50, combined by the combining unit 60a (A(t), A(t-1), A(t-2), and A(t-3) in FIG. 8 represent the combination of the posture feature data D3 and the peripheral feature data D4a; the same applies below). Here, the combining unit 60a concatenates the two feature vectors as feature vectors of different dimensions.
More preferably, the combining unit 60a acquires the posture feature data D3 and the peripheral feature data D4a derived from the same image data D1. For that purpose, it is desirable that the combining unit 60a acquire the posture feature data D3 and the peripheral feature data D4a by performing synchronization processing, or by using an identification code attached to each frame.
As shown in FIG. 8, the hierarchical LSTM includes intermediate layers 61 in a plurality of tiers (only three tiers are shown in FIG. 8), a normalization unit 62, a fully connected unit 63, and an action class determination unit 64.
In FIG. 8, A(t) represents the input at the current time (the time to be processed; the same applies below), A(t-1) represents the input a first interval before the current time (for example, 1 frame earlier), A(t-2) represents the input a second interval before the current time (for example, 10 frames earlier), and A(t-3) represents the input a third interval before the current time (for example, 20 frames earlier).
Each of the intermediate layers 61 (61a to 61l) is composed of an LSTM unit.
The first-tier LSTM 61a receives the current input data A(t) weighted by a predetermined weight coefficient Wa1(t) (representing a matrix of weight coefficients; the same applies below), and also recursively receives the output data Z1(t-1) of the first-tier LSTM 61d at the first interval before (for example, 1 frame earlier), weighted by a predetermined weight coefficient Wb1(t-1). The first-tier LSTM 61a applies predetermined arithmetic processing to these data and outputs the data Z1(t) to the second-tier LSTM 61b and the normalization unit 62.
The second-tier LSTM 61b receives the current data A(t) weighted by a predetermined weight coefficient Wa2(t), and also recursively receives the output data Z2(t-2) of the second-tier LSTM 61h at the second interval before (for example, 10 frames earlier), weighted by a predetermined weight coefficient Wb2(t-2). The second-tier LSTM 61b applies predetermined arithmetic processing to these data and outputs the data Z2(t) to the third-tier LSTM 61c and the normalization unit 62.
The third-tier LSTM 61c receives the current data A(t) weighted by a predetermined weight coefficient Wa3(t), and also recursively receives the output data Z3(t-3) of the third-tier LSTM 61l at the third interval before (for example, 20 frames earlier), weighted by a predetermined weight coefficient Wb3(t-3). The third-tier LSTM 61c applies predetermined arithmetic processing to these data and outputs the data Z3(t) to the lower-tier LSTM 61 (not shown) and the normalization unit 62.
In this way, in the hierarchical LSTM, the lower the tier of the LSTM 61, the larger the frame interval of the past data it receives, which increases the length of real time taken into account in the lower tiers and enables time-series analysis over a variety of frame intervals.
Each LSTM unit 61 includes, for example, a memory cell that holds the data of the immediately preceding (past) LSTM unit 61 and three gates (generally referred to as the input gate, the forget gate, and the output gate) (not shown). Each of the three gates receives the current data A described above and the output data Z of the immediately preceding (past) LSTM unit 61, and outputs a value between 0 and 1 based on the current data A, the output data Z of the preceding LSTM unit 61, and individually set weight coefficients. The outputs of the three gates are multiplied, respectively, into the data input to the LSTM unit 61, the data output from the LSTM unit 61, and the data held in the memory cell. With this unit structure, each LSTM unit 61 can output the current features while appropriately reflecting features from past times.
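For reference, the gate behavior described above corresponds to the standard LSTM update. One common formulation, written in the notation of FIG. 8 (this is the usual textbook form and not necessarily the exact variant used in the disclosure), is:

    i_t = σ(W_i A(t) + U_i Z(t-1) + b_i)    (input gate)
    f_t = σ(W_f A(t) + U_f Z(t-1) + b_f)    (forget gate)
    o_t = σ(W_o A(t) + U_o Z(t-1) + b_o)    (output gate)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c A(t) + U_c Z(t-1) + b_c)
    Z(t) = o_t ⊙ tanh(c_t)

where σ is the sigmoid function, c_t is the memory cell state, and ⊙ denotes element-wise multiplication; in the hierarchical configuration, Z(t-1) and c_{t-1} are replaced by the states from the tier-specific interval (1, 10, or 20 frames earlier).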
The calculation method of the hierarchical LSTM described above is similar to a known method; see, for example, Non-Patent Document 2 for details.
The normalization unit 62 applies L2 normalization to the output data Z1(t), Z2(t), and Z3(t) of the intermediate layers 61 in the respective tiers (here, the intermediate layers 61a, 61b, and 61c).
The fully connected unit 63 applies full connection processing to the normalized output data Z1(t), Z2(t), and Z3(t) of the intermediate layers 61 in the respective tiers (here, the intermediate layers 61a, 61b, and 61c). The fully connected unit 63 then outputs the probability of each action class (for example, each action such as sitting on a chair or getting up from a bed) to the action class determination unit 64. The probability of each action class output from the fully connected unit 63 is expressed, for example, using the softmax function, as in the following equation (3):

    p_c = exp(z_c) / Σ_j exp(z_j)    (3)

where z_c is the fully connected output for action class c and p_c is the probability assigned to that class.
For the intermediate layers 61 and the fully connected unit 63, weight coefficients appropriately set in advance by machine learning for each action class (by the learning unit 70 described later) are used. As a result, the fully connected unit 63 outputs an appropriate probability for each action class.
The action class determination unit 64 acquires the output of the fully connected unit 63, determines that the action class with the highest probability is the action class being performed by the person appearing in the image, and outputs the determination result.
The calculation method of the action determination unit 60 is not limited to the above. For example, a recurrent neural network with another structure may be used instead of the hierarchical LSTM structure. In some cases, with the aim of reducing the processing load, the posture feature data D3 and the peripheral feature data D4a of each frame may be used without using time-series data.
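A minimal sketch of such a hierarchical LSTM (three tiers receiving the combined feature A(t) with recurrent inputs at strides of 1, 10, and 20 frames, followed by L2 normalization, a fully connected layer, and a softmax over action classes) could look like the following. The input dimension (assumed to be 4096 + 200 for the concatenated features), the hidden size, and the state-buffering scheme are assumptions for illustration only.

    import torch
    import torch.nn as nn

    class HierarchicalLSTM(nn.Module):
        """Sketch of the action determination unit 60: tiers recurring at different frame strides."""
        def __init__(self, in_dim=4296, hidden=256, num_classes=10, strides=(1, 10, 20)):
            super().__init__()
            self.strides = strides
            self.cells = nn.ModuleList([nn.LSTMCell(in_dim, hidden) for _ in strides])
            self.fc63 = nn.Linear(hidden * len(strides), num_classes)  # fully connected unit 63
            self.history = [[] for _ in strides]   # past (h, c) states per tier

        def forward(self, a_t):
            """a_t: combined posture + filtered peripheral feature A(t), shape (batch, in_dim)."""
            outputs = []
            for i, (cell, stride) in enumerate(zip(self.cells, self.strides)):
                hist = self.history[i]
                if len(hist) >= stride:
                    h_prev, c_prev = hist[-stride]          # recurrent state from `stride` frames earlier
                    h, c = cell(a_t, (h_prev, c_prev))
                else:
                    h, c = cell(a_t)                        # zero initial state at the start of the sequence
                hist.append((h.detach(), c.detach()))
                if len(hist) > max(self.strides):           # keep only as much history as the largest stride needs
                    hist.pop(0)
                outputs.append(nn.functional.normalize(h, p=2, dim=1))  # L2 normalization (unit 62)
            logits = self.fc63(torch.cat(outputs, dim=1))
            return torch.softmax(logits, dim=1)             # Eq. (3): probability of each action class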
[Learning unit]
The learning unit 70 performs machine learning using teacher data so that the human body feature extraction unit 30, the peripheral feature extraction unit 40, the peripheral feature filter unit 50, and the action determination unit 60 can execute the processing described above.
For example, the learning unit 70 adjusts the network parameters (for example, weight coefficients and biases) of the convolution processing unit 32, the fully connected unit 33, and the fully connected unit 34 of the human body feature extraction unit 30 using teacher data in which normalized person region images are associated with the posture features of the person (for example, joint positions).
The learning unit 70 also adjusts the network parameters (for example, weight coefficients and biases) of the convolution processing unit 42 and the fully connected unit 43 of the peripheral feature extraction unit 40 using, for example, teacher data in which normalized images of objects around a person are associated with the features of those objects (for example, their positional relationship with the human body).
The learning unit 70 also adjusts the network parameters (for example, weight coefficients and biases) of the fully connected unit 51 and the fully connected unit 54 of the peripheral feature filter unit 50 using, for example, teacher data in which the posture features of a person are associated with the importance of surrounding objects.
The learning unit 70 also adjusts the network parameters (for example, weight coefficients and biases) of the intermediate layers 61 and the fully connected unit 63 of the action determination unit 60 using, for example, teacher data in which time-series data of posture features and peripheral features are associated with the correct action class.
The learning unit 70 may perform these learning processes using, for example, the known error backpropagation method. The learning unit 70 then stores the network parameters adjusted by the learning processes in a storage unit (for example, the external storage device 104).
However, in order to perform highly generalizable action recognition, it is necessary to prevent a drop in accuracy caused by changes in the surrounding environment. In other words, while the peripheral feature data complements the posture feature data of the human body, there is a risk that it ends up representing only features specific to the environment of the teacher data used in the learning process.
The learning unit 70 according to the present embodiment therefore performs the learning process, at least for the action determination unit 60, so that the posture features of the human body, rather than the peripheral features, become the main determining factor in action determination.
FIG. 9 is a diagram explaining the learning processing of the learning unit 70.
When adjusting the network parameters (for example, weight coefficients and biases) of the intermediate layers 61 and the fully connected unit 63 of the action determination unit 60, the learning unit 70 uses teacher data in which the time-series data D6a of the person's posture features, the time-series data D6b of the peripheral features, and the correct action class D6c are associated, and performs learning so that the loss Loss, which indicates the error of the output data (here, the output of the fully connected unit 63) with respect to the correct answer, becomes small.
The loss function is expressed, for example, using a softmax cross-entropy function, as in the following equation (4):

    Loss = - Σ_c t_c log(p_c)    (4)

where t_c is 1 for the correct action class and 0 for the other classes, and p_c is the probability of action class c output by the fully connected unit 63.
At this time, the learning unit 70 according to the present embodiment prepares two losses: the loss Loss1 obtained when only the time-series data D6a of the person's posture features is input to the hierarchical LSTM, and the loss Loss2 obtained when both the time-series data D6a of the posture features and the time-series data D6b of the peripheral features are input to the hierarchical LSTM.
The learning unit 70 then adjusts the weight coefficients, biases, and other network parameters of the hierarchical LSTM, for example using the error backpropagation method, so as to minimize the sum of the losses Loss1 and Loss2, as in the following equation (5):

    Loss = Loss1 + Loss2    (5)

In this way, the network parameters of the hierarchical LSTM can be adjusted so that the posture features of the human body, rather than the peripheral features, become the main determining factor in action determination.
The loss function may further include a regularization term on the importance, as in the following equation (6); doing so suppresses overfitting:

    Loss = Loss1 + Loss2 + λ·R(a)    (6)

where R(a) is a regularization term (for example, a norm) on the importance vector a and λ is a regularization coefficient.
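Under the assumptions above, the combined training objective for the action determination unit could be sketched as follows. The posture-only branch (here emulated by zeroing the peripheral input), the hypothetical model interface (classify, classify_with_importance), and the regularization weight are illustrative and not taken from the original disclosure.

    import torch
    import torch.nn.functional as F

    def combined_loss(model, pose_seq, peri_seq, target, lam=1e-4):
        """Sketch of the learning unit 70's loss: Loss1 (posture only) + Loss2 (posture + peripheral)
        + a regularization term on the importance (cf. Eqs. (4) to (6))."""
        zeros = torch.zeros_like(peri_seq)
        logits1 = model.classify(pose_seq, zeros)                               # posture features only -> Loss1
        logits2, importance = model.classify_with_importance(pose_seq, peri_seq)  # both inputs -> Loss2
        loss1 = F.cross_entropy(logits1, target)    # softmax cross entropy, Eq. (4)
        loss2 = F.cross_entropy(logits2, target)
        reg = importance.abs().sum()                # regularization term on the importance, Eq. (6)
        return loss1 + loss2 + lam * reg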
Instead of performing the learning process for each functional block separately, the learning unit 70 may perform the learning process on all the functional blocks at once, using moving image data and the correct action classes as teacher data.
(Operation of the image processing device)
Hereinafter, an example of the operation of the image processing device 100 according to the present embodiment will be described with reference to FIGS. 10 to 14.
FIGS. 10 to 12 are examples of flowcharts of operations performed by the image processing device 100. The operation flows shown in FIGS. 10 to 12 are executed, for example, by the CPU in accordance with a computer program.
FIGS. 13 and 14 are diagrams schematically illustrating the processes of the image processing performed by the image processing device 100. FIGS. 13 and 14 depict, as images, the processing performed when determining the action of the person B1 appearing in the image shown in FIG. 1. Here, it is assumed that, in addition to the person B1, a bed B2, a trash can B3, a television B4, and a light B5 appear in the image captured by the imaging device 200.
First, the image processing device 100 (the image acquisition unit 10) acquires image data from the imaging device 200 (step S1).
Next, the image processing device 100 (the person region detection unit 20, the human body feature extraction unit 30, the peripheral feature extraction unit 40, and the peripheral feature filter unit 50) performs feature extraction processing (step S2).
In step S2, the image processing device 100 (the person region detection unit 20) detects a person region from the image of the acquired image data (step S21). Next, the image processing device 100 (the human body feature extraction unit 30) extracts the posture features of the person B1 appearing in the person region image, as shown in FIG. 13A (step S22). Next, the image processing device 100 (the peripheral feature extraction unit 40) extracts the peripheral features of the person B1 appearing in the person region image (here, the bed B2, the trash can B3, and the light B5), as shown in FIG. 13B (step S23). Next, the image processing device 100 (the peripheral feature filter unit 50) filters the peripheral features, as shown in FIG. 13C (step S24).
In FIG. 13C, because the person B1 is in a lying posture, the bed B2 is a surrounding object closely related to the action of the person B1, while the surrounding objects other than the bed B2 (the trash can B3 and the light B5) can be regarded as unrelated to the action of the person B1. From this point of view, FIG. 13C shows a state in which the image processing device 100 (the peripheral feature filter unit 50) has removed the surrounding objects other than the bed B2 (the trash can B3 and the light B5) by filtering.
Next, the image processing apparatus 100 (behavior determination unit 60) determines the behavior of the person shown in the image, based on the feature data extracted in the feature extraction processing of step S2 (step S3).
In step S3, the image processing apparatus 100 (behavior determination unit 60) receives the data D3 relating to the posture feature of the person and the data D4a relating to the peripheral features (step S31). Next, the image processing apparatus 100 (behavior determination unit 60) calculates the probability of each action class using the hierarchical LSTM (step S32). Next, the image processing apparatus 100 (behavior determination unit 60) determines that the action class with the maximum probability is the behavior of the person shown in the image, and outputs data indicating that action class (step S33).
In this step S3, the image processing apparatus 100 (behavior determination unit 60) extracts, for example, a state in which the posture of the person B1 changes over time from lying on the bed B2 to sitting up on the bed B2, in the order of FIG. 14A, FIG. 14B, and FIG. 14C. From this change over time, the image processing apparatus 100 (behavior determination unit 60) can determine that the action class of the person B1 corresponds to getting up.
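The decision in steps S31 to S33 can be pictured with the short Python (PyTorch) sketch below: the posture feature data D3 and the filtered peripheral feature data D4a are fed to a classifier, the logits are converted into a probability for each action class, and the class with the maximum probability is returned. The class list and the classifier interface are hypothetical assumptions for illustration.

```python
import torch

ACTION_CLASSES = ["getting_up", "lying_down", "sitting", "walking", "falling"]  # illustrative

def determine_action(classifier, posture_seq, peripheral_seq):
    """Steps S31-S33 (sketch): feed posture features D3 and filtered peripheral
    features D4a to the classifier, turn its logits into per-class probabilities,
    and return the class with the maximum probability."""
    logits = classifier(posture_seq, peripheral_seq)   # assumed shape: (1, num_classes)
    probs = torch.softmax(logits, dim=-1)              # probability per action class (step S32)
    best = int(torch.argmax(probs, dim=-1))            # most probable class (step S33)
    return ACTION_CLASSES[best], probs.squeeze(0).tolist()
```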
In step S4, when the image processing apparatus 100 determines to end the series of action recognition processing (step S4: YES), the processing ends. On the other hand, when the image processing apparatus 100 determines to continue the action recognition processing (step S4: NO), the processing returns to step S1 and continues.
As described above, according to the image processing apparatus 100 of the present embodiment, the importance of each peripheral feature is set in association with the posture feature of the human body, and the peripheral features are filtered, so that only the peripheral objects related to the person's behavior can be extracted. Accordingly, the image processing apparatus 100 according to the present embodiment can estimate a person's action class with high accuracy even in environments in which the types, positions, or appearances of peripheral objects vary widely.
In particular, the image processing apparatus 100 according to the present embodiment extracts temporal changes in the positional relationship between the posture of the human body and the related peripheral objects, based on the time-series data of the posture feature and the peripheral features, and estimates the person's action class from these changes, so that the action class can be estimated with higher accuracy.
Further, the image processing apparatus 100 according to the present embodiment uses a recurrent neural network (in particular, a hierarchical LSTM scheme) to temporally analyze the time-series data of the posture feature and the peripheral features from the viewpoints of both long time intervals and short time intervals, so that the person's action class can be estimated with still higher accuracy.
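The following Python (PyTorch) sketch is one possible reading of such a hierarchical LSTM: one LSTM processes the feature sequence at its original (short) time interval, a second LSTM processes a temporally subsampled (long-interval) copy, and a fully connected layer combines their final outputs into action-class logits. The layer sizes, the subsampling stride, and the per-frame concatenation of posture and peripheral features are assumptions for illustration, not the apparatus's actual configuration.

```python
import torch
import torch.nn as nn

class HierarchicalLSTMClassifier(nn.Module):
    """Hypothetical two-time-scale LSTM; sizes and stride are illustrative."""
    def __init__(self, feat_dim=256, hidden=128, num_classes=5, long_stride=4):
        super().__init__()
        self.long_stride = long_stride
        self.short_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)  # short time interval
        self.long_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)   # long time interval
        self.fc = nn.Linear(2 * hidden, num_classes)                   # fully connects both outputs

    def forward(self, posture_seq, peripheral_seq):
        # posture_seq, peripheral_seq: (batch, time, dim); their concatenated
        # per-frame dimension is assumed to equal feat_dim.
        x = torch.cat([posture_seq, peripheral_seq], dim=-1)
        _, (h_short, _) = self.short_lstm(x)                           # final state, short interval
        _, (h_long, _) = self.long_lstm(x[:, ::self.long_stride, :])   # final state, long interval
        combined = torch.cat([h_short[-1], h_long[-1]], dim=-1)
        return self.fc(combined)                                       # (batch, num_classes) logits
```

Under these assumptions, an instance of this class could serve as the classifier passed to determine_action in the earlier sketch.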
(Other embodiments)
The present invention is not limited to the above embodiment, and various modifications are conceivable.
In the above embodiment, as an example of the configuration of the image processing apparatus 100, the functions of the image acquisition unit 10, the human region detection unit 20, the human body feature extraction unit 30, the peripheral feature extraction unit 40, the peripheral feature filter unit 50, the behavior determination unit 60, and the learning unit 70 have been described as being realized by a single computer; needless to say, however, they may be realized by a plurality of computers. Likewise, the programs and data read by the computer may be distributed and stored across a plurality of computers.
Further, in the above embodiment, as an example of the operation of the image processing apparatus 100, the processing of the image acquisition unit 10, the human region detection unit 20, the human body feature extraction unit 30, the peripheral feature extraction unit 40, the peripheral feature filter unit 50, and the behavior determination unit 60 has been shown as being executed in a single sequential flow; needless to say, however, some or all of this processing may be executed in parallel.
Although specific examples of the present invention have been described in detail above, they are merely illustrative and do not limit the scope of the claims. The technology described in the claims includes various modifications and changes of the specific examples illustrated above.
The disclosure of the specification, drawings, and abstract included in Japanese Patent Application No. 2017-043072 filed on March 7, 2017 is incorporated herein by reference in its entirety.
The image processing apparatus according to the present disclosure enables more accurate action recognition without increasing the processing load.
DESCRIPTION OF SYMBOLS
10 Image acquisition unit
20 Human region detection unit
30 Human body feature extraction unit
40 Peripheral feature extraction unit
50 Peripheral feature filter unit
60 Behavior determination unit
70 Learning unit
100 Image processing apparatus
200 Imaging apparatus
300 Network
D1 Image data
D2 Human region data
D3 Posture feature data
D4 Peripheral feature data
D4a Peripheral feature data after filtering
D5 Action class result data
D6 Teacher data
Da Importance data

Claims (14)

1.  An image processing apparatus comprising:
     an image acquisition unit that acquires an image generated by an imaging device;
     a human body feature extraction unit that extracts a posture feature of a person shown in the image;
     a peripheral feature extraction unit that extracts a peripheral feature indicating a shape, a position, or a type of a peripheral object of the person shown in the image;
     a peripheral feature filter unit that filters the peripheral feature based on the posture feature and an importance of the peripheral feature set in association with the posture feature; and
     a behavior determination unit that estimates an action class of the person shown in the image based on the posture feature and the peripheral feature filtered by the peripheral feature filter unit.
2.  The image processing apparatus according to claim 1, wherein
     the behavior determination unit estimates the action class of the person shown in the image based on time-series data of the posture feature and of the peripheral feature filtered by the peripheral feature filter unit.
3.  The image processing apparatus according to claim 2, wherein
     the behavior determination unit estimates the action class of the person shown in the image using a recurrent neural network.
4.  The image processing apparatus according to claim 3, wherein
     the recurrent neural network comprises: a first LSTM that receives, as inputs, current data and data from a first time earlier; a second LSTM that receives, as inputs, current data and data from a second time earlier; and a full connection unit that fully connects the outputs of the first and second LSTMs.
5.  The image processing apparatus according to claim 3 or 4, further comprising
     a learning unit that adjusts network parameters of the recurrent neural network based on teacher data of images input in association with action classes.
6.  The image processing apparatus according to claim 5, wherein
     the learning unit calculates a first loss of the recurrent neural network, obtained when the posture feature and the peripheral feature extracted from the images of the teacher data are used, and a second loss of the recurrent neural network, obtained when only the posture feature out of the posture feature and the peripheral feature extracted from the images of the teacher data is used, and
     adjusts the network parameters of the recurrent neural network so that the sum of the first loss and the second loss becomes small.
7.  The image processing apparatus according to any one of claims 1 to 6, further comprising
     a human region detection unit that detects a human region from the image, wherein
     the human body feature extraction unit sets the human region detected by the human region detection unit as a region from which the posture feature is extracted, and
     the peripheral feature extraction unit sets a region obtained by enlarging the human region detected by the human region detection unit as a region from which the peripheral feature is extracted.
8.  The image processing apparatus according to claim 7, wherein
     the human body feature extraction unit normalizes the image of the human region detected by the human region detection unit to a predetermined shape and extracts the posture feature of the person shown in the image.
9.  The image processing apparatus according to any one of claims 1 to 8, wherein
     the posture feature includes joint positions of the person shown in the image, positions of respective parts of the person shown in the image, or a type of posture of the person shown in the image.
10.  The image processing apparatus according to any one of claims 1 to 9, wherein
     the peripheral feature includes a positional relationship between the person shown in the image and the peripheral object.
11.  The image processing apparatus according to any one of claims 1 to 10, wherein
     the human body feature extraction unit extracts the posture feature using a convolutional neural network.
12.  The image processing apparatus according to any one of claims 1 to 11, wherein
     the importance of the peripheral feature is set in association with a temporal change of the posture feature.
13.  An image processing method comprising:
     acquiring an image generated by an imaging device;
     extracting a posture feature of a person shown in the image;
     extracting a peripheral feature indicating a shape, a position, or a type of a peripheral object of the person shown in the image;
     filtering the peripheral feature based on the posture feature and an importance of the peripheral feature set in association with the posture feature; and
     estimating an action class of the person shown in the image based on the posture feature and the filtered peripheral feature.
14.  An image processing program for causing a computer to execute:
     processing of acquiring an image generated by an imaging device;
     processing of extracting a posture feature of a person shown in the image;
     processing of extracting a peripheral feature indicating a shape, a position, or a type of a peripheral object of the person shown in the image;
     processing of filtering the peripheral feature based on the posture feature and an importance of the peripheral feature set in association with the posture feature; and
     processing of estimating an action class of the person shown in the image based on the posture feature and the filtered peripheral feature.
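As a hedged illustration of the two-loss learning recited in claim 6 above, the Python (PyTorch) sketch below computes a first loss using both the posture and peripheral features and a second loss using the posture feature alone, then adjusts the parameters so that the sum of the two losses decreases. Replacing the peripheral input with zeros is only one conceivable way of "using only the posture feature"; the claim does not specify the mechanism, and the function and variable names are assumptions.

```python
import torch
import torch.nn as nn

def two_loss_training_step(model, optimizer, posture_seq, peripheral_seq, labels):
    """Hypothetical realization of the two-loss learning of claim 6:
    first loss  = loss when both the posture and peripheral features are used,
    second loss = loss when only the posture feature is used (here, the peripheral
                  input is replaced by zeros as an illustrative assumption),
    and the parameters are adjusted so that the sum of the two losses decreases."""
    criterion = nn.CrossEntropyLoss()

    logits_both = model(posture_seq, peripheral_seq)
    first_loss = criterion(logits_both, labels)

    logits_posture_only = model(posture_seq, torch.zeros_like(peripheral_seq))
    second_loss = criterion(logits_posture_only, labels)

    total = first_loss + second_loss
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return first_loss.item(), second_loss.item()
```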

PCT/JP2017/045011 2017-03-07 2017-12-15 Image processing device, image processing method, and image processing program WO2018163555A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017-043072 2017-03-07
JP2017043072 2017-03-07

Publications (1)

Publication Number Publication Date
WO2018163555A1 true WO2018163555A1 (en) 2018-09-13

Family

ID=63449074

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/045011 WO2018163555A1 (en) 2017-03-07 2017-12-15 Image processing device, image processing method, and image processing program

Country Status (1)

Country Link
WO (1) WO2018163555A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000293685A (en) * 1999-04-06 2000-10-20 Toyota Motor Corp Scene recognizing device
JP2005242759A (en) * 2004-02-27 2005-09-08 National Institute Of Information & Communication Technology Action/intention presumption system, action/intention presumption method, action/intention pesumption program and computer-readable recording medium with program recorded thereon
JP2010123019A (en) * 2008-11-21 2010-06-03 Fujitsu Ltd Device and method for recognizing motion
WO2015186436A1 (en) * 2014-06-06 2015-12-10 コニカミノルタ株式会社 Image processing device, image processing method, and image processing program

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020086819A (en) * 2018-11-22 2020-06-04 コニカミノルタ株式会社 Image processing program and image processing device
JP7271915B2 (en) 2018-11-22 2023-05-12 コニカミノルタ株式会社 Image processing program and image processing device
JP2020119507A (en) * 2019-01-25 2020-08-06 富士通株式会社 Deep learning model used for driving behavior recognition, training device and method
JP7500958B2 (en) 2019-01-25 2024-06-18 富士通株式会社 Deep learning model, training device and method for driving behavior recognition
JPWO2020175692A1 (en) * 2019-02-28 2021-10-07 旭化成株式会社 Learning device and judgment device
JP7213948B2 (en) 2019-02-28 2023-01-27 旭化成株式会社 Learning device and judgment device
JP7048540B2 (en) 2019-05-22 2022-04-05 株式会社東芝 Recognition device, recognition method and program
US11620498B2 (en) 2019-05-22 2023-04-04 Kabushiki Kaisha Toshiba Recognition apparatus, recognition method, and program product
JP2020190960A (en) * 2019-05-22 2020-11-26 株式会社東芝 Recognition device, recognition method, and program
US11847859B2 (en) 2019-09-02 2023-12-19 Nec Corporation Information processing device, method, and program recording medium
JP7452832B2 (en) 2019-09-10 2024-03-19 i-PRO株式会社 Surveillance camera and detection method
JP2021043666A (en) * 2019-09-10 2021-03-18 パナソニックi−PROセンシングソリューションズ株式会社 Monitoring camera and detection method
JP2021076903A (en) * 2019-11-05 2021-05-20 日本電信電話株式会社 Behavior recognition learning apparatus, behavior recognition learning method, behavior recognition apparatus, and program
WO2021090777A1 (en) * 2019-11-05 2021-05-14 日本電信電話株式会社 Behavior recognition learning device, behavior recognition learning method, behavior recognition device, and program
JP7188359B2 (en) 2019-11-05 2022-12-13 日本電信電話株式会社 Action recognition learning device, action recognition learning method, action recognition device, and program
CN110889335B (en) * 2019-11-07 2023-11-24 辽宁石油化工大学 Human skeleton double interaction behavior identification method based on multichannel space-time fusion network
CN110889335A (en) * 2019-11-07 2020-03-17 辽宁石油化工大学 Human skeleton double-person interaction behavior recognition method based on multi-channel space-time fusion network
JP7396517B2 (en) 2020-06-12 2023-12-12 日本電気株式会社 Intention detection device, intention detection method and program
WO2021250901A1 (en) * 2020-06-12 2021-12-16 Nec Corporation Intention detection device, intention detection method computer-readable storage medium
JP7467300B2 (en) 2020-09-17 2024-04-15 京セラ株式会社 SYSTEM, ELECTRONIC DEVICE, CONTROL METHOD FOR ELECTRONIC DEVICE, AND PROGRAM
WO2022249635A1 (en) * 2021-05-26 2022-12-01 コニカミノルタ株式会社 Action sensing system and action sensing program

Similar Documents

Publication Publication Date Title
WO2018163555A1 (en) Image processing device, image processing method, and image processing program
Qian et al. Artificial intelligence internet of things for the elderly: From assisted living to health-care monitoring
JP2018206321A (en) Image processing device, image processing method and image processing program
Planinc et al. Introducing the use of depth data for fall detection
EP3689236A1 (en) Posture estimation device, behavior estimation device, posture estimation program, and posture estimation method
Ghazal et al. Human posture classification using skeleton information
WO2019064375A1 (en) Information processing device, control method, and program
JP7185805B2 (en) Fall risk assessment system
US11334759B2 (en) Information processing apparatus, information processing method, and medium
JP2019003565A (en) Image processing apparatus, image processing method and image processing program
JP2019016268A (en) Image processing apparatus, image processing method and image processing program
JP2016170605A (en) Posture estimation device
Lu et al. Design of a multistage radar-based human fall detection system
Hafeez et al. Multi-Sensor-Based Action Monitoring and Recognition via Hybrid Descriptors and Logistic Regression
Rastogi et al. Human fall detection and activity monitoring: a comparative analysis of vision-based methods for classification and detection techniques
CN111144167A (en) Gait information identification optimization method, system and storage medium
CN117593792A (en) Abnormal gesture detection method and device based on video frame
Jain et al. Privacy-Preserving Human Activity Recognition System for Assisted Living Environments
Raja et al. Design and implementation of facial recognition system for visually impaired using image processing
Kim et al. Continuous gesture recognition using HLAC and low-dimensional space
Rege et al. Vision-based approach to senior healthcare: Depth-based activity recognition with convolutional neural networks
Singh et al. Vision based patient fall detection using deep learning in smart hospitals
Gharghabi et al. Person recognition based on face and body information for domestic service robots
Paul Ijjina Human fall detection in depth-videos using temporal templates and convolutional neural networks
KR102636549B1 (en) Apparatus and method for recognizing gait using noise reduction network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17899866

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17899866

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP