WO2018163555A1 - Image processing device, image processing method, and image processing program - Google Patents

Image processing device, image processing method, and image processing program Download PDF

Info

Publication number
WO2018163555A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
image
peripheral
posture
unit
Prior art date
Application number
PCT/JP2017/045011
Other languages
French (fr)
Japanese (ja)
Inventor
宏 大和
義満 青木
鈴木 智之
Original Assignee
コニカミノルタ株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by コニカミノルタ株式会社 filed Critical コニカミノルタ株式会社
Publication of WO2018163555A1 publication Critical patent/WO2018163555A1/en

Classifications

    • A - HUMAN NECESSITIES
    • A61 - MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B - DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00 - Measuring for diagnostic purposes; Identification of persons
    • A61B 5/103 - Detecting, measuring or recording devices for testing the shape, pattern, colour, size or movement of the body or parts thereof, for diagnostic purposes
    • A61B 5/107 - Measuring physical dimensions, e.g. size of the entire body or parts thereof
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G - PHYSICS
    • G08 - SIGNALLING
    • G08B - SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B 25/00 - Alarm systems in which the location of the alarm condition is signalled to a central station, e.g. fire or police telegraphic systems

Definitions

  • the present disclosure relates to an image processing device, an image processing method, and an image processing program.
  • Conventionally, techniques for recognizing a human action from an acquired image are known.
  • Persons whose behavior is to be recognized include, for example, elderly people and their caregivers, considering mechanisms for recognizing the living conditions of elderly people, and the elderly themselves, in the field of elderly-care monitoring.
  • Specifically, for an elderly person, the behaviors to be recognized include, for example, basic activities of daily living such as going to bed, getting up, leaving the bed, sitting down, squatting, walking, eating, using the toilet, going out, and picking things up, as well as behaviors that occur in accidents such as tumbles and falls.
  • Many of these behaviors can be recognized by capturing changes in the posture of the person. For example, the action of going to bed may consist of a person walking up to the bed, sitting down once, and then lying down. In this case, the posture of the person changes in the order of standing, sitting, and lying. In order to recognize such behavior, it is important to recognize the posture accurately.
  • An example of a technology for recognizing behavior is a technology for estimating a human joint position from an acquired image.
  • the posture of a person is estimated from the relationship between the estimated joint positions, and the behavior of the person is recognized from changes in the estimated posture and position of the person.
  • Non-Patent Document 1 discloses a technique for estimating a human posture using a convolutional neural network (Convolutional Neural Network: hereinafter, “CNN”).
  • Non-Patent Document 2 discloses a technique for estimating human behavior using a recurrent neural network (hereinafter referred to as “RNN”).
  • Patent Document 1 discloses a technique for performing action recognition on a rule basis based on the positional relationship between a human posture estimated from an image and object information.
  • For example, as in the prior art of Patent Document 1, a conceivable method is to specify the object to be monitored in advance and perform rule-based action recognition using the positional relationship between the monitored object and the person.
  • Alternatively, a method that uses a convolutional neural network or the like to extract the features of surrounding objects, in the same way as human posture features, is also conceivable.
  • However, while any of these methods can easily recognize actions under conditions where the type, shape, position, and appearance of the objects to be noted are fixed, in environments where these vary, the number of patterns to be recognized becomes enormous, leading to erroneous recognition and an increased processing load. The amount of data that must be prepared in advance also becomes enormous.
  • the present disclosure has been made in view of the above-described problems, and an object thereof is to provide an image processing device, an image processing method, and an image processing program that enable action recognition with higher accuracy.
  • An image processing device according to a main aspect of the present disclosure includes:
  • an image acquisition unit that acquires an image generated by an imaging device;
  • a human body feature extraction unit that extracts a posture feature of a person shown in the image;
  • a peripheral feature extraction unit that extracts a peripheral feature indicating the shape, position, or type of an object around the person shown in the image;
  • a peripheral feature filter unit that filters the peripheral feature based on the posture feature and the importance of the peripheral feature set in association with the posture feature; and
  • a behavior determination unit that estimates an action class of the person shown in the image based on the posture feature and the peripheral feature filtered by the peripheral feature filter unit.
  • In another aspect, an image processing method includes: acquiring an image generated by an imaging device; extracting a posture feature of a person shown in the image; extracting a peripheral feature indicating the shape, position, or type of an object around the person shown in the image; filtering the peripheral feature based on the posture feature and the importance of the peripheral feature set in association with the posture feature; and estimating an action class of the person shown in the image based on the posture feature and the filtered peripheral feature.
  • In yet another aspect, an image processing program causes a computer to execute: a process of acquiring an image generated by an imaging device; a process of extracting a posture feature of a person shown in the image; a process of extracting a peripheral feature indicating the shape, position, or type of an object around the person shown in the image; a process of filtering the peripheral feature based on the posture feature and the importance of the peripheral feature set in association with the posture feature; and a process of estimating an action class of the person shown in the image based on the posture feature and the filtered peripheral feature.
  • the image processing apparatus enables more accurate action recognition.
  • FIG. 1 is a diagram illustrating an example of an action recognition system according to the embodiment.
  • FIG. 2 is a diagram illustrating an example of a hardware configuration of the image processing apparatus according to the embodiment.
  • FIG. 3 is a diagram illustrating an example of functional blocks of the image processing apparatus according to the embodiment.
  • FIG. 4 is a diagram illustrating an example of each configuration of the image processing apparatus according to the embodiment.
  • FIG. 5 is a diagram illustrating an example of a human region in an image detected by the human region detection unit according to the embodiment.
  • FIG. 6 is a diagram illustrating an example of posture feature data extracted by the human body feature extraction unit according to the embodiment.
  • FIG. 7 is a diagram illustrating filtering processing of the peripheral feature filter unit according to the embodiment.
  • FIG. 8 is a diagram illustrating an example of the configuration of the hierarchical LSTM of the behavior determination unit according to the embodiment.
  • FIG. 9 is a diagram illustrating the learning process of the learning unit according to the embodiment.
  • FIG. 10 is a flowchart of operations performed by the image processing apparatus according to the embodiment.
  • FIG. 11 is a flowchart of operations performed by the image processing apparatus according to the embodiment.
  • FIG. 12 is a flowchart of operations performed by the image processing apparatus according to the embodiment.
  • FIGS. 13A, 13B, and 13C are diagrams schematically illustrating each process of image processing performed by the image processing apparatus according to the embodiment.
  • FIGS. 14A, 14B, and 14C are diagrams schematically illustrating each process of image processing performed by the image processing apparatus according to the embodiment.
  • FIG. 1 is a diagram illustrating an example of an action recognition system according to the present embodiment.
  • the action recognition system includes an image processing device 100, an imaging device 200, and a communication network 300.
  • the imaging device 200 is, for example, a general camera or a wide-angle camera, and generates image data by performing AD conversion on an image signal generated by an imaging element of the camera.
  • the imaging apparatus 200 according to the present embodiment is configured to continuously generate image data in units of frames and to capture a moving image (hereinafter also referred to as “moving image data”). Note that the imaging device 200 is installed at an appropriate position in the room so that the person B1 to be recognized for action is reflected in the image.
  • the imaging apparatus 200 transmits moving image data to the image processing apparatus 100 via the communication network 300.
  • the image processing apparatus 100 is an apparatus that determines the behavior of the person B1 shown in the image based on the moving image data generated by the imaging apparatus 200 and outputs the result.
  • FIG. 2 is a diagram illustrating an example of a hardware configuration of the image processing apparatus 100 according to the present embodiment.
  • The image processing apparatus 100 is a computer including, as main components, a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, an external storage device (for example, a flash memory) 104, a communication interface 105, and the like.
  • Each function of the image processing apparatus 100 described later is realized, for example, by the CPU 101 referring to a control program (for example, an image processing program) and various data (for example, learned network parameters) stored in the ROM 102, the RAM 103, the external storage device 104, and the like.
  • Part or all of the processing may be realized by a DSP (Digital Signal Processor) or by a dedicated hardware circuit, instead of or together with processing by software.
  • FIG. 3 is a diagram illustrating an example of functional blocks of the image processing apparatus 100 according to the present embodiment.
  • the image processing apparatus 100 includes an image acquisition unit 10, a human region detection unit 20, a human body feature extraction unit 30, a peripheral feature extraction unit 40, a peripheral feature filter unit 50, a behavior determination unit 60, and a learning unit 70.
  • the image acquisition unit 10 acquires image data D1 of an image (here, a moving image) generated by the imaging device 200.
  • the human area detection unit 20 detects a human area from the image of the image data D1.
  • the human body feature extraction unit 30 extracts the posture feature of the person shown in the image based on the image data D1 and the data D2 indicating the human region.
  • the peripheral feature extraction unit 40 extracts peripheral features of a human peripheral object shown in the image based on the image data D1 and the data D2 indicating the human region.
  • the peripheral feature filter unit 50 filters the peripheral feature data D4 based on the pose feature data D3 and the importance data Da of the peripheral features set in association with the pose feature.
  • the behavior discriminating unit 60 discriminates the behavior class of the person shown in the image based on the posture feature data D3 and the filtered peripheral feature data D4a, and outputs the result data D5.
  • the learning unit 70 performs a learning process based on the teacher data D6 so that network parameters (for example, a weighting factor and a bias of a neural network described later) are optimized.
  • the image processing apparatus 100 acquires moving image data D1 from the imaging apparatus 200, and the data D2 to D5 are continuously generated for each frame or at a plurality of frame intervals.
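  • As a minimal sketch of how the data flow of FIG. 3 could be wired in code, the following Python fragment mirrors the order in which the data D1 to D5 are produced. The class and method names (ActionRecognitionPipeline, next_frame, and so on) are illustrative assumptions and do not appear in the embodiment, and the time-series handling inside the behavior determination unit is omitted here.

```python
# Illustrative wiring of the functional blocks of FIG. 3 (names are assumptions,
# not part of the embodiment). Each component is assumed to expose a single call.
class ActionRecognitionPipeline:
    def __init__(self, acquirer, region_detector, body_extractor,
                 peripheral_extractor, peripheral_filter, behavior_classifier):
        self.acquirer = acquirer                          # image acquisition unit 10
        self.region_detector = region_detector            # human region detection unit 20
        self.body_extractor = body_extractor              # human body feature extraction unit 30
        self.peripheral_extractor = peripheral_extractor  # peripheral feature extraction unit 40
        self.peripheral_filter = peripheral_filter        # peripheral feature filter unit 50
        self.behavior_classifier = behavior_classifier    # behavior determination unit 60

    def process_frame(self):
        d1 = self.acquirer.next_frame()                   # image data D1
        d2 = self.region_detector.detect(d1)              # human region data D2
        d3 = self.body_extractor.extract(d1, d2)          # posture feature data D3
        d4 = self.peripheral_extractor.extract(d1, d2)    # peripheral feature data D4
        d4a = self.peripheral_filter.apply(d3, d4)        # filtered peripheral feature D4a
        d5 = self.behavior_classifier.classify(d3, d4a)   # action class result D5
        return d5
```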
  • FIG. 4 is a diagram illustrating an example of each configuration of the image processing apparatus 100 according to the present embodiment.
  • The arrows in FIG. 4 represent transmission and reception of data. An example of the operation of the image processing apparatus 100 will be described later with reference to FIGS. 10 to 12.
  • The image acquisition unit 10 acquires the moving image data D1 generated by the imaging device 200 and outputs it to the human region detection unit 20.
  • the image acquisition unit 10 may be configured to acquire the image data D1 stored in the external storage device 104 or the image data D1 provided via the Internet line or the like.
  • the human region detection unit 20 acquires the image data D1 from the image acquisition unit 10, performs predetermined arithmetic processing on the image data D1, and detects a human region including a person in the image. Then, the human region detection unit 20 outputs the detected human region data D2 to the human body feature extraction unit 30 and the peripheral feature extraction unit 40 together with the image data D1.
  • FIG. 5 is a diagram illustrating an example of a human region in an image detected by the human region detection unit 20.
  • In FIG. 5, T1 represents the entire area of the image, T2 represents the human region in the image, and T2a represents the peripheral region of the person in the image.
  • The method by which the human region detection unit 20 detects the human region T2 is arbitrary. For example, a difference image may be computed from the moving image of the image T1, and the human region T2 detected from the difference image.
  • The human region detection unit 20 may also use a learned neural network, template matching, a combination of HOG (Histogram of Oriented Gradients) features and an SVM (Support Vector Machine), or a method such as background subtraction.
  • the human region detection unit 20 may be integrated with the human body feature extraction unit 30.
  • the processing of the human region detection unit 20 may be executed in a series of processing when the human body feature extraction unit 30 detects the posture feature of the person shown in the image.
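  • As one hedged illustration of the detection alternatives listed above, the following sketch uses OpenCV's default pedestrian HOG + linear SVM detector to obtain a bounding box corresponding to the human region T2. This is not the implementation used in the embodiment, and keeping only the most confident detection is an assumption.

```python
# Sketch: human region detection with OpenCV's default HOG + linear SVM people
# detector (one of the alternatives mentioned above, not necessarily the one
# used in the embodiment).
import cv2
import numpy as np

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def detect_human_region(frame_bgr):
    """Return (x, y, w, h) of the most confident person box, or None."""
    rects, weights = hog.detectMultiScale(frame_bgr, winStride=(8, 8))
    if len(rects) == 0:
        return None
    i = int(np.argmax(weights))        # keep the most confident detection
    x, y, w, h = rects[i]
    return int(x), int(y), int(w), int(h)
```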
  • The human body feature extraction unit 30 acquires the image data D1 and the data D2 indicating the human region from the human region detection unit 20, performs predetermined arithmetic processing on the image of the human region T2, and extracts the posture feature of the person shown in the image. The human body feature extraction unit 30 then outputs the extracted posture feature data D3 to the peripheral feature filter unit 50 and the behavior determination unit 60.
  • The “posture feature of a person” is a feature extracted from the posture of the human body, such as walking or sitting.
  • The “posture feature of a person” is expressed by, for example, the joint positions of the human body, the positions of parts of the human body (for example, the head or the feet), the type of posture (for example, standing upright or bending forward), or temporal changes in these.
  • The data format for representing the “posture feature of a person” is arbitrary, such as a type format, a coordinate format, or relative positions between parts.
  • The “posture feature of a person” may also be abstracted data (for example, HOG feature values). Since the posture feature may be expressed by a type or the like rather than only by numerical feature values, the broader word “feature” is used here (the same applies to the peripheral feature).
  • FIG. 5 shows the joint positions of the human body as an example of the posture characteristics of the person.
  • the right ankle p0, right knee p1, right waist p2, left waist p3, left knee p4, left ankle p5, right wrist p6, right elbow p7, right shoulder p8, left shoulder p9, The left elbow p10, the left wrist p11, the neck p12, and the crown p13 are shown.
  • the human body feature extraction unit 30 extracts a human posture feature from an image using, for example, a learned CNN.
  • The CNN constituting the human body feature extraction unit 30 is, for example, one that has been trained using teacher data indicating the correspondence between an image of a human body and the coordinates (two-dimensional positions or estimated three-dimensional positions) of the joint positions of the human body in the image (such a network is generally also referred to as an R-CNN).
  • the human body feature extraction unit 30 includes, for example, a preprocessing unit 31, a convolution processing unit 32, an all coupling unit 33, and an all coupling unit 34.
  • the pre-processing unit 31 normalizes the image by cutting out the image T2 of the human area from the image T1 of the entire area and converting it into a predetermined size and aspect ratio based on the data D2 indicating the human area.
  • the preprocessing unit 31 may perform area setting according to the viewing distance of the human area or may perform color division processing.
  • the convolution processing unit 32 is configured by hierarchically connecting a plurality of feature quantity extraction layers.
  • the convolution processing unit 32 performs convolution operation processing, activation processing, and pooling processing on input data input from the previous layer in each feature amount extraction layer.
  • By repeating the processing in each feature quantity extraction layer in this way, the convolution processing unit 32 extracts, in a high dimension, feature quantities of a plurality of viewpoints included in the image (for example, edges, regions, and distributions), and outputs the result to the all coupling unit 33.
  • the total coupling portion 33 is constituted by, for example, a multilayer perceptron that fully couples a plurality of feature quantities (the same applies to all the other coupling portions 34, 43, 51, 54, and 63).
  • the fully combining unit 33 fully combines the plurality of intermediate calculation result data obtained from the convolution processing unit 32 to generate data D3 indicating the posture characteristics of the person. Then, the all combination unit 33 outputs data D3 indicating the posture characteristics of the person to the all combination unit 34 and the peripheral feature filter unit 50.
  • FIG. 6 is a diagram illustrating an example of posture feature data D3 extracted by the human body feature extraction unit 30.
  • the feature amount of each joint position of the human body is represented in 4096 dimensions.
  • the all coupling unit 34 is an output layer for coupling all the outputs of the all coupling unit 33 and outputting the data D3 indicating the posture characteristics of the person to the action determining unit 60.
  • By inputting the output of the all combination unit 33 to the peripheral feature filter unit 50, the human body feature extraction unit 30 provides sparse, high-dimensional features that are correlated with the learned posture feature.
  • Note that the data input to the peripheral feature filter unit 50 may instead be acquired from the all coupling unit 34.
  • For details, see Non-Patent Document 1, for example.
  • the posture feature extraction processing performed by the human body feature extraction unit 30 is not limited to the above method, and any method may be used.
  • the human body feature extraction unit 30 may use, for example, silhouette extraction processing, region division processing, skin color extraction processing, luminance gradient extraction processing, motion extraction processing, shape model fitting, or a combination thereof. Also, a method of performing extraction processing for each part of the human body and integrating them may be used.
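  • The following is a minimal PyTorch sketch of a convolution stack followed by two fully connected layers, in the spirit of the convolution processing unit 32 and the all coupling units 33 and 34. The layer sizes (including the 4096-dimensional feature of FIG. 6 and the 14 joints of FIG. 5) are assumptions; the network actually trained for the embodiment is not specified here.

```python
# Sketch (assumed layer sizes): CNN that maps a normalized human-region crop to a
# 4096-dimensional posture feature D3 and, through a second fully connected
# layer, to joint-position coordinates (14 joints x 2 coordinates, as in FIG. 5).
import torch
import torch.nn as nn

class PostureFeatureExtractor(nn.Module):
    def __init__(self, num_joints=14):
        super().__init__()
        self.conv = nn.Sequential(                   # "convolution processing unit 32"
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((7, 7)),
        )
        self.fc33 = nn.Linear(128 * 7 * 7, 4096)     # "all coupling unit 33" -> D3
        self.fc34 = nn.Linear(4096, num_joints * 2)  # "all coupling unit 34" (output layer)

    def forward(self, crop):                         # crop: (B, 3, H, W), normalized
        h = self.conv(crop).flatten(1)
        d3 = torch.relu(self.fc33(h))                # high-dimensional posture feature
        joints = self.fc34(d3)                       # joint coordinates (x, y) per joint
        return d3, joints
```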
  • The peripheral feature extraction unit 40 acquires the image data D1 and the data D2 indicating the human region from the human region detection unit 20, performs predetermined arithmetic processing on the image around the human region T2, and extracts peripheral features of the objects around the person shown in the image. The peripheral feature extraction unit 40 then outputs the extracted peripheral feature data D4 to the peripheral feature filter unit 50.
  • the “peripheral feature” represents the shape of the peripheral object existing around the person, the position of the peripheral object, the type of the peripheral object, or the like.
  • the data format for representing “peripheral features” is arbitrary, such as a type format, a coordinate format, or a relative position.
  • the “peripheral feature” may be abstracted data (for example, HOG feature amount).
  • the “peripheral feature” includes information indicating a positional relationship between the peripheral object and each part of the human body (for example, a positional relationship with the human hand).
  • the “peripheral feature” more preferably includes information indicating the type of the peripheral object (for example, bed, chair, etc.). This is because the information can be an important determinant when the action determination unit 60 determines a person's action.
  • the peripheral feature has a role of complementing the interaction between the person and the object.
  • the human behavior that is difficult to discriminate only by the human posture feature is complemented by the peripheral feature.
  • the peripheral feature extraction unit 40 extracts peripheral features from the image using CNN or the like, for example, in the same manner as the human body feature extraction unit 30.
  • The CNN constituting the peripheral feature extraction unit 40 is, for example, one that has been trained using teacher data indicating the correspondence between an image of an object and its shape, type, the position of each part, and the like. More preferably, one trained using images of the surroundings of a human body (including the human body itself) and teacher data indicating the correspondence between the shapes of objects and the coordinates of their positional relationships in the image is used.
  • the peripheral feature extraction unit 40 includes, for example, a preprocessing unit 41, a convolution processing unit 42, and an all combination unit 43.
  • Based on the image data D1 and the data D2 indicating the human region, the preprocessing unit 41 cuts out, with the human region T2 as a reference, the peripheral region image T2a, which is larger than the human region T2 (see FIG. 5). The preprocessing unit 41 then normalizes the image, for example by converting it to a predetermined size and aspect ratio.
  • the processing of the convolution processing unit 42 and the total combining unit 43 is as described above.
  • the convolution processing unit 42 extracts feature amounts (for example, edges, regions, distributions, etc.) of a plurality of viewpoints of peripheral features in a high dimension by convolution processing or the like.
  • the full combining unit 43 fully combines the plurality of intermediate calculation result data obtained from the convolution processing unit 42, and outputs peripheral features as final calculation result data.
  • the peripheral feature extraction unit 40 may use, for example, HOG feature amount extraction processing, silhouette extraction processing, region division processing, luminance gradient extraction processing, motion extraction processing, shape model fitting, or a combination thereof. Further, a method of performing feature extraction processing for each predetermined area and integrating them may be used.
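  • A hedged sketch of the cropping and normalization described above for the preprocessing units 31 and 41 (the human region T2 and the enlarged peripheral region T2a) is shown below. The enlargement ratio and the output size are assumptions, not values taken from the embodiment.

```python
# Sketch: crop the human region T2 and an enlarged peripheral region T2a from the
# whole image T1 and resize them to a fixed size (margin ratio and 224x224 output
# are assumptions, not values from the embodiment).
import cv2

def crop_regions(image_t1, human_box, margin=0.5, out_size=(224, 224)):
    ih, iw = image_t1.shape[:2]
    x, y, w, h = human_box                           # human region T2
    mx, my = int(w * margin), int(h * margin)
    px, py = max(0, x - mx), max(0, y - my)          # enlarged peripheral region T2a
    pw, ph = min(iw, x + w + mx) - px, min(ih, y + h + my) - py
    t2 = cv2.resize(image_t1[y:y + h, x:x + w], out_size)
    t2a = cv2.resize(image_t1[py:py + ph, px:px + pw], out_size)
    return t2, t2a
```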
  • The peripheral feature filter unit 50 acquires the posture feature data D3 from the human body feature extraction unit 30 and the peripheral feature data D4 from the peripheral feature extraction unit 40.
  • the peripheral feature filter unit 50 filters the peripheral features based on the importance level data Da of the peripheral features set in association with the posture feature. Then, the peripheral feature filter unit 50 outputs the filtered peripheral feature data D4a to the behavior determination unit 60.
  • The peripheral features to be noted are not necessarily the same for every human action.
  • The important peripheral feature elements and the unnecessary peripheral feature elements change depending on the posture that the person takes as a result of the action. For example, for some actions the peripheral features around the waist (for example, the rounded edge of a chair seat, the vertical edge of a bed, and the space behind the person) are important, whereas for other actions the peripheral features near the hand are important.
  • In view of this, in the image processing apparatus 100, the peripheral feature filter unit 50 filters the peripheral features extracted by the peripheral feature extraction unit 40.
  • importance means the position, shape, type, etc. of the peripheral feature to be noted.
  • importance makes it possible to narrow down actions that are likely to be related according to the posture characteristics of a person and to specify the position, shape, type, or the like of a peripheral object to be noted.
  • the data format for representing “importance” may be any format as long as it is uniquely converted from the posture feature data D3 and can filter the peripheral feature data D4.
  • More preferably, information on temporal changes in the posture feature of the person (for example, the difference from the posture feature a predetermined number of frames earlier) is also used for the “importance”. This makes it easier to narrow down the peripheral features to be focused on. For example, when a temporal change in a person's posture feature indicates that a hand is being extended in a certain direction, it can be predicted that the person is trying to pick up an object in that direction, so the behavior determination unit 60 described later can be configured to focus only on the features of that object.
  • the “importance” is set by machine learning (learning unit 70 to be described later) for each action class, for example, using the teacher data associating the posture feature with the peripheral feature to be noted.
  • the “importance” is not limited to machine learning, and may be set by the user or the like.
  • the peripheral feature filter unit 50 includes, for example, an input-side full coupling unit 51, an activation processing unit 52, a filtering unit 53, and an output-side full coupling unit 54.
  • The input-side full combining unit 51 performs full combining processing on the posture feature output from the full combining unit 33 of the human body feature extraction unit 30, thereby converting the posture feature data D3 (here, a 4096-dimensional feature vector) into the importance of the peripheral features. The input-side full combining unit 51 then outputs an importance vector having the same number of dimensions as the peripheral feature (here, 200 dimensions).
  • In the input-side full combining unit 51, the weighting factors applied to the posture feature vector are set by machine learning for each action class so that the posture feature can be converted into the importance of the peripheral features.
  • the activation processing unit 52 performs activation processing on the importance of the peripheral features output from the input-side full combining unit 51 using, for example, a ReLU function.
  • the ReLU function is a function that outputs 0 when a negative number is input and returns the input as it is when a number greater than 0 is input.
  • the degree of importance output from the activation processing unit 52 is expressed by, for example, the following expression (1).
  • The filtering unit 53 applies the importance of the peripheral features output from the activation processing unit 52 to the peripheral features output from the peripheral feature extraction unit 40, thereby filtering the peripheral features.
  • the peripheral feature output from the filtering unit 53 is expressed by the following equation (2), for example.
  • FIG. 7 is a diagram for explaining the filtering process of the peripheral feature filter unit 50.
  • the data Da related to the importance is represented as a vector that reacts only to a specific dimension vector. Accordingly, the feature amount of the peripheral feature after filtering (right diagram) is a sparse feature vector compared to the feature amount of the peripheral feature before filtering (left diagram).
  • the output-side all combining unit 54 further abstracts the filtered peripheral feature data D4a and outputs the data to the combining unit 60a of the behavior determining unit 60.
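  • Since expressions (1) and (2) are not reproduced in this text, the following PyTorch sketch only mirrors the verbal description of the peripheral feature filter unit 50: a fully connected layer maps the 4096-dimensional posture feature to a 200-dimensional importance vector, a ReLU keeps it non-negative, the importance is applied element-wise to the 200-dimensional peripheral feature so that the result becomes sparse, and a second fully connected layer abstracts the result. The element-wise multiplication is an assumption consistent with FIG. 7, not a reproduction of equation (2).

```python
# Sketch of the peripheral feature filter unit 50 (element-wise gating is an
# assumption consistent with FIG. 7; equations (1) and (2) are not reproduced here).
import torch
import torch.nn as nn

class PeripheralFeatureFilter(nn.Module):
    def __init__(self, posture_dim=4096, peripheral_dim=200):
        super().__init__()
        self.fc_in = nn.Linear(posture_dim, peripheral_dim)      # input-side full coupling unit 51
        self.fc_out = nn.Linear(peripheral_dim, peripheral_dim)  # output-side full coupling unit 54

    def forward(self, d3, d4):
        importance = torch.relu(self.fc_in(d3))  # activation processing unit 52 (ReLU)
        d4_filtered = d4 * importance            # filtering unit 53: gate the peripheral feature
        return self.fc_out(d4_filtered)          # abstracted filtered peripheral feature D4a
```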
  • the calculation method of the peripheral feature filter unit 50 is not limited to the above.
  • the peripheral feature filter unit 50 may associate the importance with the posture type on a one-to-one basis.
  • the behavior determination unit 60 acquires the human posture feature data D3 from the human body feature extraction unit 30, and acquires the filtered peripheral feature data D4a from the peripheral feature filter unit 50. Then, the action determination unit 60 determines the action class of the person shown in the image based on the time series data of the human posture feature data D3 and the filtered peripheral feature data D4a.
  • Since human behavior has temporal continuity and behaviors have deep chronological relationships with one another, when discriminating a human behavior class it is desirable to use not single image data for each frame but time-series data indicating temporal changes in the person's posture, the positional relationship with objects, and the like.
  • In discriminating a human behavior class, it is also desirable to consider not only the behavior between successive frames but also behavior in the somewhat distant past (for example, one minute earlier). This is because, for example, when determining the action of "getting up from the chair", data indicating that the action of "sitting on the chair" was performed in the past is also a big clue.
  • the behavior determination unit 60 performs time series analysis using a hierarchical LSTM (Long Short-Term Memory) which is a kind of recursive neural network.
  • the hierarchical LSTM can recognize a relationship in a long time interval (for example, one minute before) in addition to a relationship in a short time interval (for example, the immediately preceding image frame).
  • FIG. 8 is a diagram illustrating an example of the configuration of the hierarchical LSTM of the behavior determination unit 60.
  • In the behavior determination unit 60, the combining unit 60a combines the human posture feature data D3 acquired from the human body feature extraction unit 30 with the filtered peripheral feature data D4a acquired from the peripheral feature filter unit 50.
  • A(t), A(t-1), A(t-2), and A(t-3) in FIG. 8 represent the combined data of the posture feature data D3 and the peripheral feature data D4a at the respective times.
  • The combining unit 60a combines the two feature vectors as feature vectors of different dimensions.
  • More preferably, the combining unit 60a acquires the posture feature data D3 and the peripheral feature data D4a derived from the same image data D1. It is therefore desirable that the combining unit 60a acquire the posture feature data D3 and the peripheral feature data D4a, for example, by performing synchronization processing or by using an identification code attached to each frame.
  • The hierarchical LSTM includes intermediate layers 61 of a plurality of layers (only three layers are shown in FIG. 8), a normalization unit 62, an all combination unit 63, and an action class determination unit 64.
  • A(t) represents the input at the current time (the time to be processed; the same applies hereinafter), A(t-1) represents the input at a first time before the current time (for example, one frame before), A(t-2) represents the input at a second time before the current time (for example, 10 frames before), and A(t-3) represents the input at a third time before the current time (for example, 20 frames before).
  • Each of the intermediate layers 61 (61a to 61l) having a plurality of layers is constituted by LSTM units.
  • To the first-layer LSTM 61a, the current input data A(t) weighted by a predetermined weighting coefficient Wa1(t) (representing a matrix of weighting coefficients; the same applies hereinafter) is input, and the output data Z1(t-1) of the first-layer LSTM 61d at the first time before (for example, one frame before), weighted by a predetermined weighting coefficient Wb1(t-1), is recursively input.
  • The first-layer LSTM 61a performs predetermined arithmetic processing on these data and outputs data Z1(t) to the second-layer LSTM 61b and to the normalization unit 62.
  • Similarly, to the second-layer LSTM 61b, the current input data A(t) weighted by a predetermined weighting coefficient Wa2(t) is input, and the output data Z2(t-2) of the second-layer LSTM 61h at the second time before (for example, 10 frames before), weighted by a predetermined weighting coefficient Wb2(t-2), is recursively input.
  • The second-layer LSTM 61b performs predetermined arithmetic processing on these data and outputs data Z2(t) to the third-layer LSTM 61c and to the normalization unit 62.
  • Likewise, to the third-layer LSTM 61c, the current input data A(t) weighted by a predetermined weighting coefficient Wa3(t) is input, and the output data Z3(t-3) of the third-layer LSTM 61l at the third time before (for example, 20 frames before), weighted by a predetermined weighting coefficient Wb3(t-3), is recursively input.
  • The third-layer LSTM 61c performs predetermined arithmetic processing on these data and outputs data Z3(t) to a lower-layer LSTM 61 (not shown) and to the normalization unit 62.
  • In this way, the hierarchical LSTM lengthens the span of real time considered by a lower-layer LSTM 61 by increasing the frame interval of the past data input to it, so that time-series analysis over various frame intervals can be performed.
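  • The following sketch illustrates the idea of running parallel LSTM streams whose recurrent connections skip different numbers of frames (1, 10, and 20 here, matching the examples above). It is a simplified reading of FIG. 8, and the way past hidden states are buffered in a dictionary is an assumption made for illustration.

```python
# Sketch: three LSTM streams recurring over different frame intervals
# (1, 10, 20 frames, as in the examples above). Buffering past states in a
# dict keyed by frame index is an assumption for illustration.
import torch
import torch.nn as nn

class MultiIntervalLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, intervals=(1, 10, 20)):
        super().__init__()
        self.intervals = intervals
        self.hidden_dim = hidden_dim
        self.cells = nn.ModuleList([nn.LSTMCell(input_dim, hidden_dim) for _ in intervals])
        self.history = [dict() for _ in intervals]   # past (h, c) per stream, keyed by t

    def forward(self, a_t, t):
        """a_t: combined feature A(t) of shape (B, input_dim); t: frame index."""
        outputs = []
        for i, (cell, step) in enumerate(zip(self.cells, self.intervals)):
            past = self.history[i].get(t - step)
            if past is None:                          # no state that far back yet
                past = (a_t.new_zeros(a_t.size(0), self.hidden_dim),
                        a_t.new_zeros(a_t.size(0), self.hidden_dim))
            h, c = cell(a_t, past)                    # recursion over this stream's interval
            self.history[i][t] = (h, c)
            outputs.append(h)                         # Z1(t), Z2(t), Z3(t)
        return outputs
```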
  • Each LSTM unit 61 includes, for example, a memory cell that holds the data of the immediately preceding (past-time) LSTM unit 61 and three gates (generally referred to as an input gate, a forgetting gate, and an output gate), which are not shown.
  • the three gates are input with the current data A and the output data Z of the LSTM unit 61 immediately before (the past time).
  • Each of the three gates outputs a value from 0 to 1 based on the current data A, the output data Z of the LSTM unit 61 immediately before (the past time), and the weighting factor set separately. .
  • the outputs from the three gates are respectively integrated into the data input to the LSTM unit 61, the data output from the LSTM unit 61, and the data held in the memory cell. With such a unit structure, each LSTM unit 61 can appropriately reflect the characteristics at the past time and output the current characteristics.
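  • As a reminder of what each LSTM unit 61 computes internally, the following is a standard LSTM cell written out so that the input, forgetting, and output gates and the memory cell described above are visible. This is the textbook formulation; any deviation in the embodiment's units is not reflected here.

```python
# Sketch: a standard LSTM cell spelled out so the three gates and the memory
# cell described above are visible (textbook formulation, not the embodiment's
# exact unit).
import torch
import torch.nn as nn

class SimpleLSTMUnit(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.gates = nn.Linear(input_dim + hidden_dim, 4 * hidden_dim)

    def forward(self, a_t, h_prev, c_prev):
        z = self.gates(torch.cat([a_t, h_prev], dim=-1))
        i, f, o, g = z.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # gates output 0..1
        c_t = f * c_prev + i * torch.tanh(g)     # memory cell keeps or forgets past information
        h_t = o * torch.tanh(c_t)                # output reflects both past and current features
        return h_t, c_t
```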
  • The normalization unit 62 applies L2 normalization processing to the output data Z1(t), Z2(t), and Z3(t) of each of the plurality of intermediate layers 61 (here, the intermediate layers 61a, 61b, and 61c).
  • The all combining unit 63 performs full combining processing on all of the normalized output data Z1(t), Z2(t), and Z3(t) of the intermediate layers 61 (here, the intermediate layers 61a, 61b, and 61c). The all combining unit 63 then outputs a probability for each behavior class (for example, for each behavior such as sitting on a chair or getting up from a bed) to the behavior class determination unit 64.
  • the probability for each action class output from all the coupling units 63 is expressed by the following equation (3) using, for example, the softmax function.
  • Note that intermediate layers 61 and an all coupling unit 63 whose weighting factors have been appropriately set in advance by machine learning (by the learning unit 70 described later) for each action class are used. As a result, an appropriate probability is output from the all coupling unit 63 for each action class.
  • The action class determination unit 64 acquires the output from the all coupling unit 63, determines that the action class having the maximum probability is the action class being performed by the person shown in the image, and outputs the determination result.
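  • A hedged sketch of the steps described for the normalization unit 62, the all combining unit 63, and the action class determination unit 64 follows. Equation (3) is not reproduced in this text, so the softmax here is the standard formulation.

```python
# Sketch: L2-normalize each stream output, fully connect, take a softmax over
# action classes, and pick the most probable class (standard softmax; equation
# (3) itself is not reproduced here).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionClassHead(nn.Module):
    def __init__(self, hidden_dim, num_streams, num_classes):
        super().__init__()
        self.fc = nn.Linear(hidden_dim * num_streams, num_classes)  # all coupling unit 63

    def forward(self, stream_outputs):           # [Z1(t), Z2(t), Z3(t)]
        normed = [F.normalize(z, p=2, dim=-1) for z in stream_outputs]  # normalization unit 62
        logits = self.fc(torch.cat(normed, dim=-1))
        probs = F.softmax(logits, dim=-1)        # probability per action class
        return probs, probs.argmax(dim=-1)       # action class determination unit 64
```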
  • The configuration of the behavior determination unit 60 is not limited to the above.
  • a recursive neural network having another structure may be used instead of the hierarchical LSTM structure.
  • the human posture feature data D3 and the peripheral feature data D4a for each frame may be used without using the time series data in order to reduce the processing load.
  • the learning unit 70 performs machine learning using teacher data so that the human body feature extraction unit 30, the peripheral feature extraction unit 40, the peripheral feature filter unit 50, and the behavior determination unit 60 can execute the above-described processing.
  • For example, using teacher data in which normalized human-region images are associated with human posture features (for example, joint positions), the learning unit 70 adjusts the network parameters (for example, weighting factors and biases) of the convolution processing unit 32, the all coupling unit 33, and the all coupling unit 34 of the human body feature extraction unit 30.
  • Further, using teacher data in which normalized images of objects around a person are associated with object features (for example, the positional relationship with the human body), the learning unit 70 adjusts the network parameters (for example, weighting factors and biases) of the convolution processing unit 42 and the all coupling unit 43 of the peripheral feature extraction unit 40.
  • Further, using teacher data in which posture features of a person are associated with the importance of surrounding objects, the learning unit 70 adjusts the network parameters (for example, weighting factors and biases) of the all coupling unit 51 and the all coupling unit 54 of the peripheral feature filter unit 50.
  • Further, using teacher data in which time-series data of human posture features and peripheral features are associated with the correct behavior class, the learning unit 70 adjusts the network parameters (for example, weighting factors and biases) of the intermediate layers 61 and the all coupling unit 63 of the behavior determination unit 60.
  • the learning unit 70 may perform these learning processes using, for example, a known error back propagation method. Then, the learning unit 70 stores the network parameter adjusted by the learning process in the storage unit (for example, the external storage device 104).
  • Note that there is a possibility that, during the learning process, the peripheral feature data comes to represent only features specialized to the environment of the teacher data; the peripheral feature data should therefore only complement the posture feature data of the human body.
  • For this reason, the learning unit 70 executes the learning process so that, at least in the behavior determination unit 60, the posture feature of the human body serves as a more dominant determinant of the behavior than the peripheral feature.
  • FIG. 9 is a diagram for explaining the learning process of the learning unit 70.
  • Learning is performed using teacher data in which time-series data D6a of human posture features, time-series data D6b of peripheral features, and the correct behavior class D6c are associated with one another, so that the loss Loss, which indicates the error of the output data (here, the output of the all coupling unit 63) with respect to the correct answer, is reduced.
  • the loss function is expressed as in the following equation (4) using, for example, a softmax cross entropy function.
  • Here, the learning unit 70 prepares two losses: a loss Loss1 for the case where only the time-series data D6a of the human posture feature is input to the hierarchical LSTM, and a loss Loss2 for the case where both the time-series data D6a of the human posture feature and the time-series data D6b of the peripheral feature are input to the hierarchical LSTM.
  • The learning unit 70 then adjusts the weighting coefficients and biases of the network parameters of the hierarchical LSTM, for example by an error backpropagation method, so that the sum of the loss Loss1 and the loss Loss2 (see the following equation (5)) is minimized.
  • This makes it possible to adjust the network parameters of the hierarchical LSTM so that the posture feature of the human body is a more dominant determinant in the action discrimination than the peripheral feature.
  • Note that the loss function may be expressed as in the following equation (6) by adding a regularization term for the importance. Doing so can suppress overfitting.
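  • A sketch of the training objective as it is described verbally follows. Since equations (4) to (6) are not reproduced in this text, the cross entropy below is the standard softmax cross entropy, and the squared-norm penalty on the importance is an assumed form of the regularization term.

```python
# Sketch of the combined loss: Loss1 (posture-only input) + Loss2 (posture +
# peripheral input) + a regularization term on the importance. Standard softmax
# cross entropy; the squared-norm form of the regularizer is an assumption.
import torch
import torch.nn.functional as F

def combined_loss(logits_posture_only, logits_with_peripheral, target_class,
                  importance, reg_weight=1e-4):
    loss1 = F.cross_entropy(logits_posture_only, target_class)      # posture feature only
    loss2 = F.cross_entropy(logits_with_peripheral, target_class)   # posture + filtered peripheral
    reg = reg_weight * importance.pow(2).sum(dim=-1).mean()         # keeps importance from overfitting
    return loss1 + loss2 + reg
```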
  • Note that instead of performing the learning processing for each functional block separately, the learning unit 70 may perform the learning processing of the functional blocks collectively, using the moving image data and the correct action class as teacher data.
  • FIGS. 10 to 12 are examples of flowcharts of operations performed by the image processing apparatus 100.
  • FIG. 13 and FIG. 14 are diagrams schematically illustrating each process of image processing performed by the image processing apparatus 100.
  • In FIGS. 13 and 14, the process of determining the behavior of the person B1 shown in the image of FIG. 1 is illustrated pictorially.
  • the image captured by the imaging apparatus 200 includes a bed B2, a trash can B3, a television B4, and an illumination B5 in addition to the person B1.
  • the image processing apparatus 100 acquires image data from the imaging apparatus 200 (step S1).
  • the image processing apparatus 100 (human region detection unit 20, human body feature extraction unit 30, peripheral feature extraction unit 40, and peripheral feature filter unit 50) performs feature extraction processing (step S2).
  • In step S2, the image processing apparatus 100 (human region detection unit 20) first detects a human region from the image of the acquired image data (step S21).
  • Next, the image processing apparatus 100 (human body feature extraction unit 30) extracts the posture feature of the person B1 shown in the image of the human region, as shown in FIG. 13A (step S22).
  • Next, the image processing apparatus 100 (peripheral feature extraction unit 40) extracts the peripheral features (here, the bed B2, the trash can B3, and the lighting B5) of the person B1 shown in the image of the human region, as shown in FIG. 13B (step S23).
  • Then, the image processing apparatus 100 (peripheral feature filter unit 50) filters the peripheral features, as shown in FIG. 13C (step S24).
  • Here, the bed B2 is a peripheral object closely related to the action of the person B1, while the peripheral objects other than the bed B2 (the trash can B3 and the lighting B5) can be regarded as peripheral objects unrelated to the action of the person B1.
  • FIG. 13C shows a state in which the peripheral objects other than the bed B2 (the trash can B3 and the lighting B5) have been removed by the filtering of the peripheral feature filter unit 50.
  • the image processing apparatus 100 determines the action of the person shown in the image based on the feature data extracted in the feature extraction process in step S2 (step S3).
  • In step S3, the image processing apparatus 100 (behavior determination unit 60) first inputs the human posture feature data D3 and the filtered peripheral feature data D4a (step S31).
  • Next, the image processing apparatus 100 (behavior determination unit 60) calculates the probability for each action class using the hierarchical LSTM (step S32).
  • Then, the image processing apparatus 100 (behavior determination unit 60) determines that the action class having the maximum probability is the action of the person shown in the image, and outputs data indicating that action class (step S33).
  • In step S3, the image processing apparatus 100 (behavior determination unit 60) extracts, for example, the temporal change in which the posture of the person B1 changes from a state of lying on the bed B2 to a state of rising from it, in the order of FIG. 14A, FIG. 14B, and FIG. 14C.
  • From such a temporal change, the image processing apparatus 100 can determine that the action class of the person B1 corresponds to getting up.
  • In step S4, when the image processing apparatus 100 determines to end the series of action recognition processes (step S4: YES), the processing ends. On the other hand, when it determines to continue the action recognition processing (step S4: NO), the image processing apparatus 100 returns to step S1 and continues the processing.
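  • The flow of steps S1 to S4 can be summarized as the loop below, reusing the hypothetical pipeline object sketched earlier; stop_requested() is a placeholder assumption for the end condition of step S4.

```python
# Sketch of the overall loop of FIGS. 10 to 12 (stop_requested() is a placeholder
# assumption for the step S4 end condition; pipeline is the hypothetical
# ActionRecognitionPipeline sketched above).
def run(pipeline, stop_requested):
    while True:
        result = pipeline.process_frame()   # S1: acquire, S2: feature extraction, S3: determination
        print(result)                       # output the determined action class
        if stop_requested():                # S4: end?
            break
```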
  • As described above, in the image processing apparatus 100 according to the present embodiment, the importance of the peripheral features is set in association with the posture feature of the human body and the peripheral features are filtered, so that only the peripheral objects related to the person's behavior can be extracted. Accordingly, the image processing apparatus 100 according to the present embodiment can estimate a person's action class with high accuracy even in environments in which the types, positions, and appearances of surrounding objects differ.
  • In particular, since the image processing apparatus 100 is configured to extract, based on time-series data of the posture features and the peripheral features, temporal changes in the positional relationship between the posture of the human body and the related peripheral objects and to estimate the person's action class from them, it can estimate the action class with higher accuracy.
  • In addition, since the image processing apparatus 100 uses a recursive neural network (in particular, a hierarchical LSTM) and is configured to perform temporal analysis of the time-series data of the posture features and the peripheral features from the viewpoints of both long and short time intervals, it can estimate a person's action class with still higher accuracy.
  • In the above embodiment, as an example of the configuration of the image processing apparatus 100, the functions of the image acquisition unit 10, the human region detection unit 20, the human body feature extraction unit 30, the peripheral feature extraction unit 40, the peripheral feature filter unit 50, the behavior determination unit 60, and the learning unit 70 have been described as being realized by one computer; needless to say, they may be realized by a plurality of computers.
  • the program and data read by the computer may be distributed and stored in a plurality of computers.
  • In the above embodiment, as an example of the operation of the image processing apparatus 100, the processing of the image acquisition unit 10, the human region detection unit 20, the human body feature extraction unit 30, the peripheral feature extraction unit 40, the peripheral feature filter unit 50, and the behavior determination unit 60 has been described as being executed in a series of flows; needless to say, part or all of this processing may be executed in parallel.
  • the image processing apparatus enables more accurate action recognition without increasing the processing load.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Pathology (AREA)
  • Molecular Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Emergency Management (AREA)
  • Biophysics (AREA)
  • Business, Economics & Management (AREA)
  • Biomedical Technology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Medical Informatics (AREA)
  • Dentistry (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Veterinary Medicine (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides an image processing device provided with: an image acquisition unit (10) which acquires an image generated by an image capturing device (200); a human body feature extraction unit (30) which extracts a posture feature of a human shown in the image; a surrounding feature extraction unit (40) which extracts a surrounding feature that represents the shape, position, or category of an object surrounding the human shown in the image; a surrounding feature filtering unit (50) which filters the surrounding feature on the basis of the posture feature and the importance of the surrounding feature set in association with the posture feature; and a behavior determination unit (60) which estimates a behavior class of the human shown in the image on the basis of the posture feature and the surrounding feature filtered by the surrounding feature filtering unit.

Description

Image processing device, image processing method, and image processing program
 The present disclosure relates to an image processing device, an image processing method, and an image processing program.
 Conventionally, techniques for recognizing a human action from an acquired image are known. Persons whose behavior is to be recognized include, for example, elderly people and their caregivers, considering mechanisms for recognizing the living conditions of elderly people, and the elderly themselves, in the field of elderly-care monitoring. Specifically, for an elderly person, for example, the behaviors to be recognized include basic activities of daily living such as going to bed, getting up, leaving the bed, sitting down, squatting, walking, eating, using the toilet, going out, and picking things up, as well as behaviors that occur in accidents such as tumbles and falls.
 Many of these behaviors can be recognized by capturing changes in the posture of the person. For example, the action of going to bed may consist of a person walking up to the bed, sitting down once, and then lying down. In this case, the posture of the person changes in the order of standing, sitting, and lying. In order to recognize such behavior, it is important to recognize the posture accurately.
 One example of a technique for recognizing behavior is a technique for estimating the joint positions of a person from an acquired image. In this technique, the posture of the person is estimated from the relationships between the estimated joint positions, and the behavior of the person is recognized from changes in the estimated posture and position of the person.
 For example, Non-Patent Document 1 discloses a technique for estimating a human posture using a convolutional neural network (Convolutional Neural Network: hereinafter, “CNN”).
 Non-Patent Document 2 discloses a technique for estimating human behavior using a recurrent neural network (Recurrent Neural Network: hereinafter, “RNN”).
 Patent Document 1 discloses a technique for performing rule-based action recognition based on the positional relationship between a human posture estimated from an image and object information.
International Publication No. 2016/181837
 In an action recognition system that recognizes the behavior of a person shown in an image, there is a problem that, depending on the positional relationship between the camera and the person, the posture features of the person appearing in the image differ in size, orientation, distance, and so on even when the person performs the same action. In particular, for images captured with a wide-angle camera, it is difficult to recognize the positional relationship of each part of the person in the depth direction.
 In this regard, focusing on the fact that human behavior often occurs as an interaction with an object, methods that use information on surrounding objects in addition to the person's posture features when recognizing the behavior have been studied.
 For example, as in the prior art of Patent Document 1, a conceivable method is to specify the object to be monitored in advance and perform rule-based action recognition using the positional relationship between the monitored object and the person. Alternatively, a method that uses a convolutional neural network or the like to extract the features of surrounding objects, in the same way as human posture features, is also conceivable.
 However, while any of these methods can easily recognize actions under conditions where the type, shape, position, and appearance of the objects to be noted are fixed, in environments where these vary, the number of patterns to be recognized becomes enormous, leading to erroneous recognition and an increased processing load. The amount of data that must be prepared in advance also becomes enormous.
 The present disclosure has been made in view of the above problems, and an object thereof is to provide an image processing device, an image processing method, and an image processing program that enable action recognition with higher accuracy.
 A main aspect of the present disclosure that solves the above-described problems is an image processing device including:
 an image acquisition unit that acquires an image generated by an imaging device;
 a human body feature extraction unit that extracts a posture feature of a person shown in the image;
 a peripheral feature extraction unit that extracts a peripheral feature indicating the shape, position, or type of an object around the person shown in the image;
 a peripheral feature filter unit that filters the peripheral feature based on the posture feature and the importance of the peripheral feature set in association with the posture feature; and
 a behavior determination unit that estimates an action class of the person shown in the image based on the posture feature and the peripheral feature filtered by the peripheral feature filter unit.
In another aspect, the present disclosure is an image processing method including:
acquiring an image generated by an imaging device;
extracting posture features of a person appearing in the image;
extracting peripheral features indicating the shape, position, or type of objects around the person appearing in the image;
filtering the peripheral features based on the posture features and the importance of the peripheral features set in association with the posture features; and
estimating the action class of the person appearing in the image based on the posture features and the filtered peripheral features.
In yet another aspect, the present disclosure is an image processing program that causes a computer to execute:
processing for acquiring an image generated by an imaging device;
processing for extracting posture features of a person appearing in the image;
processing for extracting peripheral features indicating the shape, position, or type of objects around the person appearing in the image;
processing for filtering the peripheral features based on the posture features and the importance of the peripheral features set in association with the posture features; and
processing for estimating the action class of the person appearing in the image based on the posture features and the filtered peripheral features.
The image processing device according to the present disclosure enables more accurate action recognition.
FIG. 1 is a diagram illustrating an example of an action recognition system according to the embodiment.
FIG. 2 is a diagram illustrating an example of the hardware configuration of the image processing device according to the embodiment.
FIG. 3 is a diagram illustrating an example of the functional blocks of the image processing device according to the embodiment.
FIG. 4 is a diagram illustrating an example of each component of the image processing device according to the embodiment.
FIG. 5 is a diagram illustrating an example of a person region in an image detected by the person region detection unit according to the embodiment.
FIG. 6 is a diagram illustrating an example of posture feature data extracted by the human body feature extraction unit according to the embodiment.
FIG. 7 is a diagram illustrating the filtering processing of the peripheral feature filter unit according to the embodiment.
FIG. 8 is a diagram illustrating an example of the configuration of the hierarchical LSTM of the action determination unit according to the embodiment.
FIG. 9 is a diagram illustrating the learning processing of the learning unit according to the embodiment.
FIG. 10 is a flowchart of operations performed by the image processing device according to the embodiment.
FIG. 11 is a flowchart of operations performed by the image processing device according to the embodiment.
FIG. 12 is a flowchart of operations performed by the image processing device according to the embodiment.
FIGS. 13A, 13B, and 13C are diagrams schematically illustrating processes of the image processing performed by the image processing device according to the embodiment.
FIGS. 14A, 14B, and 14C are diagrams schematically illustrating processes of the image processing performed by the image processing device according to the embodiment.
(Configuration of the action recognition system)
Hereinafter, the configuration of the action recognition system according to one embodiment and an outline of the configuration of the image processing device 100 applied to the action recognition system will be described with reference to FIGS. 1 to 3.
FIG. 1 is a diagram illustrating an example of the action recognition system according to the present embodiment.
The action recognition system according to the present embodiment includes an image processing device 100, an imaging device 200, and a communication network 300.
The imaging device 200 is, for example, a general camera or a wide-angle camera, and generates image data by AD-converting the image signal generated by the camera's image sensor. The imaging device 200 according to the present embodiment is configured to continuously generate image data in units of frames so as to capture a moving image (hereinafter also referred to as "moving image data"). The imaging device 200 is installed at an appropriate position in the room so that the person B1 whose actions are to be recognized appears in the image.
The imaging device 200 transmits the moving image data to the image processing device 100 via the communication network 300.
The image processing device 100 is a device that determines the action of the person B1 appearing in the image based on the moving image data generated by the imaging device 200 and outputs the result.
FIG. 2 is a diagram illustrating an example of the hardware configuration of the image processing device 100 according to the present embodiment.
The image processing device 100 is a computer that includes, as its main components, a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, an external storage device (for example, a flash memory) 104, and a communication interface 105.
Each function of the image processing device 100 described later is realized, for example, by the CPU 101 referring to a control program (for example, the image processing program) and various data (for example, learned network parameters) stored in the ROM 102, the RAM 103, the external storage device 104, and the like. However, some or all of the functions may be realized by processing by a DSP (Digital Signal Processor) instead of, or together with, processing by the CPU. Similarly, some or all of the functions may be realized by processing by dedicated hardware circuits instead of, or together with, processing by software.
FIG. 3 is a diagram illustrating an example of the functional blocks of the image processing device 100 according to the present embodiment.
The image processing device 100 includes an image acquisition unit 10, a person region detection unit 20, a human body feature extraction unit 30, a peripheral feature extraction unit 40, a peripheral feature filter unit 50, an action determination unit 60, and a learning unit 70.
The image acquisition unit 10 acquires the image data D1 of the image (here, a moving image) generated by the imaging device 200.
The person region detection unit 20 detects a person region from the image of the image data D1.
The human body feature extraction unit 30 extracts the posture features of the person appearing in the image based on the image data D1 and the data D2 indicating the person region.
The peripheral feature extraction unit 40 extracts the peripheral features of objects around the person appearing in the image based on the image data D1 and the data D2 indicating the person region.
The peripheral feature filter unit 50 filters the peripheral feature data D4 based on the posture feature data D3 and the importance data Da of the peripheral features set in association with the posture features.
The action determination unit 60 determines the action class of the person appearing in the image based on the posture feature data D3 and the filtered peripheral feature data D4a, and outputs the resulting data D5.
The learning unit 70 performs learning processing based on the teacher data D6 so that the network parameters of the action determination unit 60 and the other units (for example, the weight coefficients and biases of the neural networks described later) are optimized.
Since the image processing device 100 according to the present embodiment acquires the moving image data D1 from the imaging device 200, the data D2 to D5 are generated continuously, for each frame or at intervals of several frames.
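The dataflow among these functional blocks can be summarized with a short sketch. The following is a minimal, hypothetical Python outline of the pipeline described above; the class and function names (units, person_detector, and so on) are illustrative assumptions, not part of the disclosure.

    # Hypothetical top-level pipeline corresponding to the functional blocks of FIG. 3.
    def recognize_actions(frames, units):
        """frames: iterable of image frames (D1); units: an object holding the trained sub-modules."""
        for frame in frames:                                        # image acquisition unit 10
            region = units.person_detector(frame)                   # person region detection unit 20 -> D2
            pose_feat = units.pose_extractor(frame, region)         # human body feature extraction unit 30 -> D3
            peri_feat = units.peripheral_extractor(frame, region)   # peripheral feature extraction unit 40 -> D4
            peri_filtered = units.peripheral_filter(pose_feat, peri_feat)  # peripheral feature filter unit 50 -> D4a
            action_class = units.action_classifier(pose_feat, peri_filtered)  # action determination unit 60 -> D5
            yield action_class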
(Configuration of the image processing device)
Hereinafter, details of each component of the image processing device 100 according to the present embodiment will be described with reference to FIGS. 4 to 9.
FIG. 4 is a diagram illustrating an example of each component of the image processing device 100 according to the present embodiment. The arrows in FIG. 4 represent the exchange of data. An example of the operation of the image processing device 100 will be described later with reference to FIGS. 10 to 14.
[Image acquisition unit]
The image acquisition unit 10 acquires the moving image data D1 generated by the imaging device 200 and outputs it to the person region detection unit 20. Of course, the image acquisition unit 10 may instead be configured to acquire image data D1 stored in the external storage device 104 or image data D1 provided via an Internet connection or the like.
[Person region detection unit]
The person region detection unit 20 acquires the image data D1 from the image acquisition unit 10, performs predetermined arithmetic processing on the image data D1, and detects a person region containing a person in the image. The person region detection unit 20 then outputs the detected person region data D2, together with the image data D1, to the human body feature extraction unit 30 and the peripheral feature extraction unit 40.
FIG. 5 is a diagram illustrating an example of a person region in an image detected by the person region detection unit 20. In FIG. 5, T1 represents the entire area of the image, T2 represents the person region in the image, and T2a represents the region surrounding the person in the image.
The method by which the person region detection unit 20 detects the person region T2 is arbitrary; for example, a difference image of the image T1 may be computed from the moving image and the person region T2 detected from that difference image. The person region detection unit 20 may also use other methods such as a trained neural network, template matching, a combination of HOG (Histograms of Oriented Gradients) features and an SVM (Support Vector Machine), or background subtraction.
Note that the person region detection unit 20 may be integrated with the human body feature extraction unit 30. In other words, the processing of the person region detection unit 20 may be executed as part of the series of processes in which the human body feature extraction unit 30 detects the posture features of the person appearing in the image.
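As one concrete possibility for the HOG + SVM option mentioned above, the built-in people detector of OpenCV could serve as a simple person region detector. The following sketch is only illustrative of that option and is not the implementation disclosed here; the choice of keeping only the single most confident detection is an assumption.

    import cv2

    def detect_person_region(frame):
        """Illustrative person region detection using OpenCV's default HOG + linear SVM people detector."""
        hog = cv2.HOGDescriptor()
        hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
        # detectMultiScale returns bounding boxes (x, y, w, h) and confidence weights
        rects, weights = hog.detectMultiScale(frame, winStride=(8, 8), padding=(8, 8), scale=1.05)
        if len(rects) == 0:
            return None
        # keep the most confident detection as the person region T2 (data D2 in the text)
        best_rect, _ = max(zip(rects, weights), key=lambda rw: float(rw[1]))
        return best_rect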
[Human body feature extraction unit]
The human body feature extraction unit 30 acquires the image data D1 and the data D2 indicating the person region from the person region detection unit 20, performs predetermined arithmetic processing on the image of the person region T2, and extracts the posture features of the person appearing in the image. The human body feature extraction unit 30 then outputs the extracted posture feature data D3 to the peripheral feature filter unit 50 and the action determination unit 60.
Here, the "posture features of a person" are features extracted from the posture of the human body, such as a walking state or a sitting state. The posture features are expressed, for example, by the joint positions of the human body, the positions of individual body parts (for example, the head or the feet), the type of posture (for example, about to stand up, or bending forward), or temporal changes in these. The data format used to express the posture features is arbitrary, such as a category format, a coordinate format, or relative positions between body parts. The posture features may also be abstracted data (for example, HOG feature values). Since the posture features may be expressed by categories and the like, the more general term "feature" is used here rather than "feature value" (the same applies to the peripheral features).
FIG. 5 shows the joint positions of the human body as an example of posture features. In FIG. 5, the joint positions of the human body are the right ankle p0, right knee p1, right hip p2, left hip p3, left knee p4, left ankle p5, right wrist p6, right elbow p7, right shoulder p8, left shoulder p9, left elbow p10, left wrist p11, neck p12, and top of the head p13.
As shown in FIG. 4, the human body feature extraction unit 30 extracts the posture features of the person from the image using, for example, a trained CNN. The CNN constituting the human body feature extraction unit 30 is, for example, one trained with teacher data indicating the correspondence between images of the human body and the coordinates (two-dimensional positions or estimated three-dimensional positions) of the joint positions of the human body in those images (such a network is also generally referred to as an R-CNN).
The human body feature extraction unit 30 includes, for example, a preprocessing unit 31, a convolution processing unit 32, a fully connected unit 33, and a fully connected unit 34.
The preprocessing unit 31 normalizes the image based on the data D2 indicating the person region, for example by cutting out the person region image T2 from the full image T1 and converting it to a predetermined size and aspect ratio. The preprocessing unit 31 may also set the region according to the viewing distance of the person region, or may perform color segmentation processing.
The convolution processing unit 32 is configured by hierarchically connecting a plurality of feature extraction layers. In each feature extraction layer, the convolution processing unit 32 performs convolution, activation, and pooling on the input data supplied from the preceding layer.
By repeating the processing in each feature extraction layer in this way, the convolution processing unit 32 extracts high-dimensional feature values of multiple aspects of the image (for example, edges, regions, and distributions) and outputs the result to the fully connected unit 33.
The fully connected unit 33 is composed of, for example, a multilayer perceptron that fully connects a plurality of feature values (the same applies to the other fully connected units 34, 43, 51, 54, and 63). The fully connected unit 33 fully connects the plurality of intermediate calculation results obtained from the convolution processing unit 32 to generate the data D3 indicating the posture features of the person. The fully connected unit 33 then outputs the data D3 indicating the posture features of the person to the fully connected unit 34 and the peripheral feature filter unit 50.
FIG. 6 is a diagram illustrating an example of the posture feature data D3 extracted by the human body feature extraction unit 30. In FIG. 6, the feature values of the joint positions of the human body are represented as a 4096-dimensional vector.
The fully connected unit 34 is an output layer that fully connects the output of the fully connected unit 33 and outputs the data D3 indicating the posture features of the person to the action determination unit 60.
By feeding the output of the fully connected unit 33 into the peripheral feature filter unit 50 (its input-side fully connected unit 51), the human body feature extraction unit 30 according to the present embodiment extracts sparse, high-dimensional features that are correlated with the learned posture features. However, the peripheral feature filter unit 50 may instead obtain this input from the fully connected unit 34.
The above method of extracting the posture features of a person is similar to a known method; see, for example, Non-Patent Document 1 for details.
The posture feature extraction processing performed by the human body feature extraction unit 30 is not limited to the above method, and any method may be used. The human body feature extraction unit 30 may use, for example, silhouette extraction, region segmentation, skin color extraction, luminance gradient extraction, motion extraction, shape model fitting, or a combination of these. A method in which extraction processing is performed for each part of the human body and the results are integrated may also be used.
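The architecture outlined above (convolutional feature extraction followed by fully connected layers producing a 4096-dimensional posture feature and a joint-coordinate output) could be sketched as follows. This is only a minimal illustration under assumed layer sizes and a 2D joint regression head, not the configuration actually disclosed.

    import torch
    import torch.nn as nn

    class PoseFeatureExtractor(nn.Module):
        """Minimal sketch of the human body feature extraction unit 30 (assumed layer sizes)."""
        def __init__(self, num_joints=14):
            super().__init__()
            # convolution processing unit 32: stacked convolution / ReLU / pooling layers
            self.conv = nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((8, 8)),
            )
            # fully connected unit 33: produces the 4096-dimensional posture feature D3
            self.fc33 = nn.Linear(128 * 8 * 8, 4096)
            # fully connected unit 34: output layer (here regressing 2D joint coordinates)
            self.fc34 = nn.Linear(4096, num_joints * 2)

        def forward(self, person_crop):
            x = self.conv(person_crop).flatten(1)
            pose_feature = torch.relu(self.fc33(x))   # D3, also passed to the peripheral feature filter unit 50
            joints = self.fc34(pose_feature)          # joint coordinates (p0 .. p13)
            return pose_feature, joints

The choice of num_joints=14 simply mirrors the fourteen joint positions p0 to p13 listed above.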
[Peripheral feature extraction unit]
The peripheral feature extraction unit 40 acquires the image data D1 and the data D2 indicating the person region from the person region detection unit 20, performs predetermined arithmetic processing on the image around the person region T2, and extracts the peripheral features of objects around the person appearing in the image. The peripheral feature extraction unit 40 then outputs the extracted peripheral feature data D4 to the peripheral feature filter unit 50.
Here, the "peripheral features" represent the shapes of objects around the person, the positions of those objects, the types of those objects, and the like. The data format used to express the peripheral features is arbitrary, such as a category format, a coordinate format, or relative positions. The peripheral features may also be abstracted data (for example, HOG feature values).
More preferably, the peripheral features include information indicating the positional relationship between the surrounding objects and each part of the human body (for example, the positional relationship with the person's hands). More preferably, the peripheral features also include information indicating the types of the surrounding objects (for example, bed, chair). This is because such information can be an important determining factor when the action determination unit 60 determines the person's action.
In the present embodiment, the peripheral features play a role in complementing the interaction between the person and objects. In other words, actions that are difficult to determine from the person's posture features alone are complemented by the peripheral features.
As shown in FIG. 4, the peripheral feature extraction unit 40 extracts the peripheral features from the image using, for example, a CNN, in the same manner as the human body feature extraction unit 30. The CNN constituting the peripheral feature extraction unit 40 is, for example, one trained with teacher data indicating the correspondence between images of objects and their shapes, types, or the positions of their parts. More preferably, it is trained with teacher data indicating the correspondence between images of the area around the human body (including the human body) and the coordinates of the shapes and positional relationships of the objects in those images.
The peripheral feature extraction unit 40 includes, for example, a preprocessing unit 41, a convolution processing unit 42, and a fully connected unit 43.
Based on the image data D1 and the data D2 indicating the person region, the preprocessing unit 41 cuts out, from the full image T1, the surrounding-region image T2a, which is obtained by enlarging the person region T2 with that region as a reference (see FIG. 5). The preprocessing unit 41 then normalizes the image, for example by converting it to a predetermined size and aspect ratio.
The processing of the convolution processing unit 42 and the fully connected unit 43 is as described above. The convolution processing unit 42 extracts high-dimensional feature values of multiple aspects of the peripheral features (for example, edges, regions, and distributions) by convolution processing and the like. The fully connected unit 43 fully connects the plurality of intermediate calculation results obtained from the convolution processing unit 42 and outputs the peripheral features as the final calculation result.
The extraction processing performed by the peripheral feature extraction unit 40 is not limited to the above method, and any method may be used. The peripheral feature extraction unit 40 may use, for example, HOG feature extraction, silhouette extraction, region segmentation, luminance gradient extraction, motion extraction, shape model fitting, or a combination of these. A method in which feature extraction is performed for each predetermined region and the results are integrated may also be used.
[Peripheral feature filter unit]
The peripheral feature filter unit 50 acquires the posture feature data D3 from the human body feature extraction unit 30 and the peripheral feature data D4 from the peripheral feature extraction unit 40. The peripheral feature filter unit 50 then filters the peripheral features based on the importance data Da of the peripheral features set in association with the posture features, and outputs the filtered peripheral feature data D4a to the action determination unit 60.
When determining the action of a person appearing in an image, as described above, in environments where the types, positions, and appearances of objects in the image vary widely, the number of patterns to be recognized becomes enormous unless the surrounding objects to be noted are narrowed down, leading to erroneous recognition and an increased processing load.
Moreover, the peripheral features to be noted are not necessarily the same for every action. In other words, which elements of the peripheral features are important and which are unnecessary changes with the posture produced by the action. For example, when distinguishing the action of "sitting on a bed (getting into bed)" from the action of "sitting on a chair", the peripheral features around the hips (for example, the rounded edge of a chair seat, the vertical edge of a bed, and the space behind the back) are important. When distinguishing the action of "picking something up" from the action of "putting something down", the peripheral features near the hands are important.
From this point of view, the image processing device 100 according to the present embodiment uses the peripheral feature filter unit 50 to filter the peripheral features extracted by the peripheral feature extraction unit 40.
Here, the "importance" refers to the position, shape, type, and the like of the peripheral features to be noted. In other words, the importance makes it possible to narrow down the actions that are likely to be relevant from the person's posture features and to specify the positions, shapes, or types of the surrounding objects to be noted. The data format used to express the importance is arbitrary, as long as it can be uniquely converted from the posture feature data D3 and used to filter the peripheral feature data D4.
More preferably, information on the temporal change of the person's posture features (for example, the difference from the posture features a predetermined number of frames earlier) is also used for the importance. This makes it easier to narrow down the peripheral features to be noted. For example, if the temporal change of the posture features indicates that a hand is reaching out in a certain direction, it can be predicted that the person is about to pick up the object in that direction, so the action determination unit 60 described later can be configured to focus only on the features of that object.
The importance is set, for example, by machine learning for each action class (by the learning unit 70 described later) using teacher data in which posture features are associated with the peripheral features to be noted. However, the importance is not limited to being set by machine learning and may, of course, be set by a user or the like.
As shown in FIG. 4, the peripheral feature filter unit 50 includes, for example, an input-side fully connected unit 51, an activation processing unit 52, a filtering unit 53, and an output-side fully connected unit 54.
The input-side fully connected unit 51 applies full connection processing to the posture features output from the fully connected unit 33 of the human body feature extraction unit 30, thereby converting the posture feature data D3 (here, a 4096-dimensional feature vector) into the importance of the peripheral features (here, a 200-dimensional feature vector). The input-side fully connected unit 51 then outputs an importance vector with the same number of dimensions as the peripheral features (here, 200 dimensions).
In the input-side fully connected unit 51, weight coefficients for each element of the posture feature vector are set, for example by machine learning for each action class, so that the posture features can be converted into the importance of the peripheral features.
The activation processing unit 52 applies activation processing to the importance output from the input-side fully connected unit 51, using, for example, a ReLU function. The ReLU function outputs 0 when a negative number is input and returns the input unchanged when a number of 0 or more is input.
The importance output from the activation processing unit 52 is expressed, for example, as in the following equation (1), that is, by applying the ReLU function to the fully connected transformation of the posture feature:

    a = ReLU(W_a · x_pose + b_a)    (1)

where x_pose is the posture feature vector (D3), W_a and b_a are the weight matrix and bias of the input-side fully connected unit 51, and a is the 200-dimensional importance vector.
The filtering unit 53 filters the peripheral features by multiplying the peripheral features output from the peripheral feature extraction unit 40, element by element, by the importance output from the activation processing unit 52. The peripheral features output from the filtering unit 53 are expressed, for example, as in the following equation (2):

    f' = a ⊙ f_peripheral    (2)

where f_peripheral is the 200-dimensional peripheral feature vector (D4), a is the importance vector of equation (1), ⊙ denotes element-wise multiplication, and f' is the filtered peripheral feature.
FIG. 7 is a diagram explaining the filtering processing of the peripheral feature filter unit 50. In the present embodiment, the importance data Da is expressed as a vector that responds only to specific dimensions. The filtered peripheral feature values (right side of the figure) therefore form a sparser feature vector than the peripheral feature values before filtering (left side of the figure).
The output-side fully connected unit 54 further abstracts the filtered peripheral feature data D4a and outputs it to the combining unit 60a of the action determination unit 60.
The calculation method of the peripheral feature filter unit 50 is not limited to the above. For example, when the posture feature data D3 indicates a posture type (for example, half-crouching), the peripheral feature filter unit 50 may associate an importance with each posture type on a one-to-one basis.
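A minimal sketch of the filtering described above (a fully connected layer mapping the 4096-dimensional posture feature to a 200-dimensional importance vector, ReLU activation, element-wise multiplication with the peripheral feature, and a further fully connected abstraction) might look as follows. The layer sizes and module names are assumptions taken from the description, not a definitive implementation.

    import torch
    import torch.nn as nn

    class PeripheralFeatureFilter(nn.Module):
        """Sketch of the peripheral feature filter unit 50 (assumed dimensions: 4096 -> 200)."""
        def __init__(self, pose_dim=4096, peri_dim=200, out_dim=200):
            super().__init__()
            self.fc51 = nn.Linear(pose_dim, peri_dim)   # input-side fully connected unit 51
            self.relu52 = nn.ReLU()                     # activation processing unit 52
            self.fc54 = nn.Linear(peri_dim, out_dim)    # output-side fully connected unit 54

        def forward(self, pose_feature, peripheral_feature):
            importance = self.relu52(self.fc51(pose_feature))   # Eq. (1): a = ReLU(W_a x_pose + b_a)
            filtered = importance * peripheral_feature          # Eq. (2): element-wise filtering -> D4a
            return self.fc54(filtered)                          # abstracted filtered feature for unit 60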
[Action determination unit]
The action determination unit 60 acquires the posture feature data D3 of the person from the human body feature extraction unit 30 and the filtered peripheral feature data D4a from the peripheral feature filter unit 50. The action determination unit 60 then determines the action class of the person appearing in the image based on the time-series data of the posture feature data D3 and the filtered peripheral feature data D4a.
Because human actions have temporal continuity and deep time-series relationships between one action and the next, it is desirable, when determining a person's action class, not to process the image data of each frame in isolation but also to take into account time-series data indicating temporal changes such as the person's posture and the positional relationships with objects. In addition, when determining a person's action class, it is desirable to consider not only the behavior across consecutive frames but also behavior from a somewhat distant past (for example, one minute earlier). For example, when determining the action of "standing up from a chair", data indicating that the past action "sitting on a chair" occurred is also a strong clue.
The action determination unit 60 according to the present embodiment therefore performs time-series analysis using a hierarchical LSTM (Long Short-Term Memory), a type of recurrent neural network. The hierarchical LSTM can recognize relationships over long time intervals (for example, one minute earlier) in addition to relationships over short time intervals (for example, the immediately preceding image frame).
FIG. 8 is a diagram illustrating an example of the configuration of the hierarchical LSTM of the action determination unit 60.
As shown in FIG. 4, the action determination unit 60 receives the posture feature data D3 acquired from the human body feature extraction unit 30 and the filtered peripheral feature data D4a from the peripheral feature filter unit 50, combined by the combining unit 60a (A(t), A(t-1), A(t-2), and A(t-3) in FIG. 8 represent the combination of the posture feature data D3 and the peripheral feature data D4a; the same applies below). Here, the combining unit 60a concatenates the two feature vectors as feature vectors of different dimensions.
More preferably, the combining unit 60a acquires the posture feature data D3 and the peripheral feature data D4a derived from the same image data D1. For that purpose, it is desirable that the combining unit 60a acquire the posture feature data D3 and the peripheral feature data D4a by performing synchronization processing, or by using an identification code attached to each frame.
As shown in FIG. 8, the hierarchical LSTM includes intermediate layers 61 in a plurality of tiers (only three tiers are shown in FIG. 8), a normalization unit 62, a fully connected unit 63, and an action class determination unit 64.
In FIG. 8, A(t) represents the input at the current time (the time to be processed; the same applies below), A(t-1) represents the input a first interval before the current time (for example, 1 frame earlier), A(t-2) represents the input a second interval before the current time (for example, 10 frames earlier), and A(t-3) represents the input a third interval before the current time (for example, 20 frames earlier).
Each of the intermediate layers 61 (61a to 61l) is composed of an LSTM unit.
The first-tier LSTM 61a receives the current input data A(t) weighted by a predetermined weight coefficient Wa1(t) (representing a matrix of weight coefficients; the same applies below), and also recursively receives the output data Z1(t-1) of the first-tier LSTM 61d at the first interval before (for example, 1 frame earlier), weighted by a predetermined weight coefficient Wb1(t-1). The first-tier LSTM 61a applies predetermined arithmetic processing to these data and outputs the data Z1(t) to the second-tier LSTM 61b and the normalization unit 62.
The second-tier LSTM 61b receives the current data A(t) weighted by a predetermined weight coefficient Wa2(t), and also recursively receives the output data Z2(t-2) of the second-tier LSTM 61h at the second interval before (for example, 10 frames earlier), weighted by a predetermined weight coefficient Wb2(t-2). The second-tier LSTM 61b applies predetermined arithmetic processing to these data and outputs the data Z2(t) to the third-tier LSTM 61c and the normalization unit 62.
The third-tier LSTM 61c receives the current data A(t) weighted by a predetermined weight coefficient Wa3(t), and also recursively receives the output data Z3(t-3) of the third-tier LSTM 61l at the third interval before (for example, 20 frames earlier), weighted by a predetermined weight coefficient Wb3(t-3). The third-tier LSTM 61c applies predetermined arithmetic processing to these data and outputs the data Z3(t) to the lower-tier LSTM 61 (not shown) and the normalization unit 62.
In this way, in the hierarchical LSTM, the lower the tier of the LSTM 61, the larger the frame interval of the past data it receives, which increases the length of real time taken into account in the lower tiers and enables time-series analysis over a variety of frame intervals.
Each LSTM unit 61 includes, for example, a memory cell that holds the data of the immediately preceding (past) LSTM unit 61 and three gates (generally referred to as the input gate, the forget gate, and the output gate) (not shown). Each of the three gates receives the current data A described above and the output data Z of the immediately preceding (past) LSTM unit 61, and outputs a value between 0 and 1 based on the current data A, the output data Z of the preceding LSTM unit 61, and individually set weight coefficients. The outputs of the three gates are multiplied, respectively, into the data input to the LSTM unit 61, the data output from the LSTM unit 61, and the data held in the memory cell. With this unit structure, each LSTM unit 61 can output the current features while appropriately reflecting features from past times.
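For reference, the gate behavior described above corresponds to the standard LSTM update. One common formulation, written in the notation of FIG. 8 (this is the usual textbook form and not necessarily the exact variant used in the disclosure), is:

    i_t = σ(W_i A(t) + U_i Z(t-1) + b_i)    (input gate)
    f_t = σ(W_f A(t) + U_f Z(t-1) + b_f)    (forget gate)
    o_t = σ(W_o A(t) + U_o Z(t-1) + b_o)    (output gate)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c A(t) + U_c Z(t-1) + b_c)
    Z(t) = o_t ⊙ tanh(c_t)

where σ is the sigmoid function, c_t is the memory cell state, and ⊙ denotes element-wise multiplication; in the hierarchical configuration, Z(t-1) and c_{t-1} are replaced by the states from the tier-specific interval (1, 10, or 20 frames earlier).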
The calculation method of the hierarchical LSTM described above is similar to a known method; see, for example, Non-Patent Document 2 for details.
The normalization unit 62 applies L2 normalization to the output data Z1(t), Z2(t), and Z3(t) of the intermediate layers 61 in the respective tiers (here, the intermediate layers 61a, 61b, and 61c).
The fully connected unit 63 applies full connection processing to the normalized output data Z1(t), Z2(t), and Z3(t) of the intermediate layers 61 in the respective tiers (here, the intermediate layers 61a, 61b, and 61c). The fully connected unit 63 then outputs the probability of each action class (for example, each action such as sitting on a chair or getting up from a bed) to the action class determination unit 64. The probability of each action class output from the fully connected unit 63 is expressed, for example, using the softmax function, as in the following equation (3):

    p_c = exp(z_c) / Σ_j exp(z_j)    (3)

where z_c is the fully connected output for action class c and p_c is the probability assigned to that class.
For the intermediate layers 61 and the fully connected unit 63, weight coefficients appropriately set in advance by machine learning for each action class (by the learning unit 70 described later) are used. As a result, the fully connected unit 63 outputs an appropriate probability for each action class.
The action class determination unit 64 acquires the output of the fully connected unit 63, determines that the action class with the highest probability is the action class being performed by the person appearing in the image, and outputs the determination result.
The calculation method of the action determination unit 60 is not limited to the above. For example, a recurrent neural network with another structure may be used instead of the hierarchical LSTM structure. In some cases, with the aim of reducing the processing load, the posture feature data D3 and the peripheral feature data D4a of each frame may be used without using time-series data.
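A minimal sketch of such a hierarchical LSTM (three tiers receiving the combined feature A(t) with recurrent inputs at strides of 1, 10, and 20 frames, followed by L2 normalization, a fully connected layer, and a softmax over action classes) could look like the following. The input dimension (assumed to be 4096 + 200 for the concatenated features), the hidden size, and the state-buffering scheme are assumptions for illustration only.

    import torch
    import torch.nn as nn

    class HierarchicalLSTM(nn.Module):
        """Sketch of the action determination unit 60: tiers recurring at different frame strides."""
        def __init__(self, in_dim=4296, hidden=256, num_classes=10, strides=(1, 10, 20)):
            super().__init__()
            self.strides = strides
            self.cells = nn.ModuleList([nn.LSTMCell(in_dim, hidden) for _ in strides])
            self.fc63 = nn.Linear(hidden * len(strides), num_classes)  # fully connected unit 63
            self.history = [[] for _ in strides]   # past (h, c) states per tier

        def forward(self, a_t):
            """a_t: combined posture + filtered peripheral feature A(t), shape (batch, in_dim)."""
            outputs = []
            for i, (cell, stride) in enumerate(zip(self.cells, self.strides)):
                hist = self.history[i]
                if len(hist) >= stride:
                    h_prev, c_prev = hist[-stride]          # recurrent state from `stride` frames earlier
                    h, c = cell(a_t, (h_prev, c_prev))
                else:
                    h, c = cell(a_t)                        # zero initial state at the start of the sequence
                hist.append((h.detach(), c.detach()))
                if len(hist) > max(self.strides):           # keep only as much history as the largest stride needs
                    hist.pop(0)
                outputs.append(nn.functional.normalize(h, p=2, dim=1))  # L2 normalization (unit 62)
            logits = self.fc63(torch.cat(outputs, dim=1))
            return torch.softmax(logits, dim=1)             # Eq. (3): probability of each action class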
[Learning unit]
The learning unit 70 performs machine learning using teacher data so that the human body feature extraction unit 30, the peripheral feature extraction unit 40, the peripheral feature filter unit 50, and the action determination unit 60 can execute the processing described above.
For example, the learning unit 70 adjusts the network parameters (for example, weight coefficients and biases) of the convolution processing unit 32, the fully connected unit 33, and the fully connected unit 34 of the human body feature extraction unit 30 using teacher data in which normalized person region images are associated with the posture features of the person (for example, joint positions).
The learning unit 70 also adjusts the network parameters (for example, weight coefficients and biases) of the convolution processing unit 42 and the fully connected unit 43 of the peripheral feature extraction unit 40 using, for example, teacher data in which normalized images of objects around a person are associated with the features of those objects (for example, their positional relationship with the human body).
The learning unit 70 also adjusts the network parameters (for example, weight coefficients and biases) of the fully connected unit 51 and the fully connected unit 54 of the peripheral feature filter unit 50 using, for example, teacher data in which the posture features of a person are associated with the importance of surrounding objects.
The learning unit 70 also adjusts the network parameters (for example, weight coefficients and biases) of the intermediate layers 61 and the fully connected unit 63 of the action determination unit 60 using, for example, teacher data in which time-series data of posture features and peripheral features are associated with the correct action class.
The learning unit 70 may perform these learning processes using, for example, the known error backpropagation method. The learning unit 70 then stores the network parameters adjusted by the learning processes in a storage unit (for example, the external storage device 104).
However, in order to perform highly generalizable action recognition, it is necessary to prevent a drop in accuracy caused by changes in the surrounding environment. In other words, while the peripheral feature data complements the posture feature data of the human body, there is a risk that it ends up representing only features specific to the environment of the teacher data used in the learning process.
The learning unit 70 according to the present embodiment therefore performs the learning process, at least for the action determination unit 60, so that the posture features of the human body, rather than the peripheral features, become the main determining factor in action determination.
FIG. 9 is a diagram explaining the learning processing of the learning unit 70.
When adjusting the network parameters (for example, weight coefficients and biases) of the intermediate layers 61 and the fully connected unit 63 of the action determination unit 60, the learning unit 70 uses teacher data in which the time-series data D6a of the person's posture features, the time-series data D6b of the peripheral features, and the correct action class D6c are associated, and performs learning so that the loss Loss, which indicates the error of the output data (here, the output of the fully connected unit 63) with respect to the correct answer, becomes small.
The loss function is expressed, for example, using a softmax cross-entropy function, as in the following equation (4):

    Loss = - Σ_c t_c log(p_c)    (4)

where t_c is 1 for the correct action class and 0 for the other classes, and p_c is the probability of action class c output by the fully connected unit 63.
At this time, the learning unit 70 according to the present embodiment prepares two losses: the loss Loss1 obtained when only the time-series data D6a of the person's posture features is input to the hierarchical LSTM, and the loss Loss2 obtained when both the time-series data D6a of the posture features and the time-series data D6b of the peripheral features are input to the hierarchical LSTM.
The learning unit 70 then adjusts the weight coefficients, biases, and other network parameters of the hierarchical LSTM, for example using the error backpropagation method, so as to minimize the sum of the losses Loss1 and Loss2, as in the following equation (5):

    Loss = Loss1 + Loss2    (5)

In this way, the network parameters of the hierarchical LSTM can be adjusted so that the posture features of the human body, rather than the peripheral features, become the main determining factor in action determination.
The loss function may further include a regularization term on the importance, as in the following equation (6); doing so suppresses overfitting:

    Loss = Loss1 + Loss2 + λ·R(a)    (6)

where R(a) is a regularization term (for example, a norm) on the importance vector a and λ is a regularization coefficient.
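Under the assumptions above, the combined training objective for the action determination unit could be sketched as follows. The posture-only branch (here emulated by zeroing the peripheral input), the hypothetical model interface (classify, classify_with_importance), and the regularization weight are illustrative and not taken from the original disclosure.

    import torch
    import torch.nn.functional as F

    def combined_loss(model, pose_seq, peri_seq, target, lam=1e-4):
        """Sketch of the learning unit 70's loss: Loss1 (posture only) + Loss2 (posture + peripheral)
        + a regularization term on the importance (cf. Eqs. (4) to (6))."""
        zeros = torch.zeros_like(peri_seq)
        logits1 = model.classify(pose_seq, zeros)                               # posture features only -> Loss1
        logits2, importance = model.classify_with_importance(pose_seq, peri_seq)  # both inputs -> Loss2
        loss1 = F.cross_entropy(logits1, target)    # softmax cross entropy, Eq. (4)
        loss2 = F.cross_entropy(logits2, target)
        reg = importance.abs().sum()                # regularization term on the importance, Eq. (6)
        return loss1 + loss2 + lam * reg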
Instead of performing the learning process for each functional block separately, the learning unit 70 may perform the learning process on all the functional blocks at once, using moving image data and the correct action classes as teacher data.
(Operation of the image processing device)
Hereinafter, an example of the operation of the image processing device 100 according to the present embodiment will be described with reference to FIGS. 10 to 14.
FIGS. 10 to 12 are examples of flowcharts of operations performed by the image processing device 100. The operation flows shown in FIGS. 10 to 12 are executed, for example, by the CPU in accordance with a computer program.
FIGS. 13 and 14 are diagrams schematically illustrating the processes of the image processing performed by the image processing device 100. FIGS. 13 and 14 depict, as images, the processing performed when determining the action of the person B1 appearing in the image shown in FIG. 1. Here, it is assumed that, in addition to the person B1, a bed B2, a trash can B3, a television B4, and a light B5 appear in the image captured by the imaging device 200.
First, the image processing device 100 (the image acquisition unit 10) acquires image data from the imaging device 200 (step S1).
Next, the image processing device 100 (the person region detection unit 20, the human body feature extraction unit 30, the peripheral feature extraction unit 40, and the peripheral feature filter unit 50) performs feature extraction processing (step S2).
In step S2, the image processing device 100 (the person region detection unit 20) detects a person region from the image of the acquired image data (step S21). Next, the image processing device 100 (the human body feature extraction unit 30) extracts the posture features of the person B1 appearing in the person region image, as shown in FIG. 13A (step S22). Next, the image processing device 100 (the peripheral feature extraction unit 40) extracts the peripheral features of the person B1 appearing in the person region image (here, the bed B2, the trash can B3, and the light B5), as shown in FIG. 13B (step S23). Next, the image processing device 100 (the peripheral feature filter unit 50) filters the peripheral features, as shown in FIG. 13C (step S24).
In FIG. 13C, because the person B1 is in a lying posture, the bed B2 is a surrounding object closely related to the action of the person B1, while the surrounding objects other than the bed B2 (the trash can B3 and the light B5) can be regarded as unrelated to the action of the person B1. From this point of view, FIG. 13C shows a state in which the image processing device 100 (the peripheral feature filter unit 50) has removed the surrounding objects other than the bed B2 (the trash can B3 and the light B5) by filtering.
Next, the image processing apparatus 100 (behavior determination unit 60) determines the behavior of the person shown in the image, based on the feature data extracted in the feature extraction processing of step S2 (step S3).
In step S3, the image processing apparatus 100 (behavior determination unit 60) receives the data D3 relating to the posture feature of the person and the data D4a relating to the peripheral features (step S31). Next, the image processing apparatus 100 (behavior determination unit 60) calculates the probability of each action class using the hierarchical LSTM (step S32). Next, the image processing apparatus 100 (behavior determination unit 60) determines that the action class with the maximum probability is the behavior of the person shown in the image, and outputs data indicating that action class (step S33).
In this step S3, the image processing apparatus 100 (behavior determination unit 60) extracts, for example, a state in which the posture of the person B1 changes over time from lying on the bed B2 to sitting up on the bed B2, in the order of FIG. 14A, FIG. 14B, and FIG. 14C. From this change over time, the image processing apparatus 100 (behavior determination unit 60) can determine that the action class of the person B1 corresponds to getting up.
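The decision in steps S31 to S33 can be pictured with the short Python (PyTorch) sketch below: the posture feature data D3 and the filtered peripheral feature data D4a are fed to a classifier, the logits are converted into a probability for each action class, and the class with the maximum probability is returned. The class list and the classifier interface are hypothetical assumptions for illustration.

```python
import torch

ACTION_CLASSES = ["getting_up", "lying_down", "sitting", "walking", "falling"]  # illustrative

def determine_action(classifier, posture_seq, peripheral_seq):
    """Steps S31-S33 (sketch): feed posture features D3 and filtered peripheral
    features D4a to the classifier, turn its logits into per-class probabilities,
    and return the class with the maximum probability."""
    logits = classifier(posture_seq, peripheral_seq)   # assumed shape: (1, num_classes)
    probs = torch.softmax(logits, dim=-1)              # probability per action class (step S32)
    best = int(torch.argmax(probs, dim=-1))            # most probable class (step S33)
    return ACTION_CLASSES[best], probs.squeeze(0).tolist()
```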
In step S4, when the image processing apparatus 100 determines to end the series of action recognition processing (step S4: YES), the processing ends. On the other hand, when the image processing apparatus 100 determines to continue the action recognition processing (step S4: NO), the processing returns to step S1 and continues.
As described above, according to the image processing apparatus 100 of the present embodiment, the importance of each peripheral feature is set in association with the posture feature of the human body, and the peripheral features are filtered, so that only the peripheral objects related to the person's behavior can be extracted. Accordingly, the image processing apparatus 100 according to the present embodiment can estimate a person's action class with high accuracy even in environments in which the types, positions, or appearances of peripheral objects vary widely.
In particular, the image processing apparatus 100 according to the present embodiment extracts temporal changes in the positional relationship between the posture of the human body and the related peripheral objects, based on the time-series data of the posture feature and the peripheral features, and estimates the person's action class from these changes, so that the action class can be estimated with higher accuracy.
Further, the image processing apparatus 100 according to the present embodiment uses a recurrent neural network (in particular, a hierarchical LSTM scheme) to temporally analyze the time-series data of the posture feature and the peripheral features from the viewpoints of both long time intervals and short time intervals, so that the person's action class can be estimated with still higher accuracy.
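The following Python (PyTorch) sketch is one possible reading of such a hierarchical LSTM: one LSTM processes the feature sequence at its original (short) time interval, a second LSTM processes a temporally subsampled (long-interval) copy, and a fully connected layer combines their final outputs into action-class logits. The layer sizes, the subsampling stride, and the per-frame concatenation of posture and peripheral features are assumptions for illustration, not the apparatus's actual configuration.

```python
import torch
import torch.nn as nn

class HierarchicalLSTMClassifier(nn.Module):
    """Hypothetical two-time-scale LSTM; sizes and stride are illustrative."""
    def __init__(self, feat_dim=256, hidden=128, num_classes=5, long_stride=4):
        super().__init__()
        self.long_stride = long_stride
        self.short_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)  # short time interval
        self.long_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)   # long time interval
        self.fc = nn.Linear(2 * hidden, num_classes)                   # fully connects both outputs

    def forward(self, posture_seq, peripheral_seq):
        # posture_seq, peripheral_seq: (batch, time, dim); their concatenated
        # per-frame dimension is assumed to equal feat_dim.
        x = torch.cat([posture_seq, peripheral_seq], dim=-1)
        _, (h_short, _) = self.short_lstm(x)                           # final state, short interval
        _, (h_long, _) = self.long_lstm(x[:, ::self.long_stride, :])   # final state, long interval
        combined = torch.cat([h_short[-1], h_long[-1]], dim=-1)
        return self.fc(combined)                                       # (batch, num_classes) logits
```

Under these assumptions, an instance of this class could serve as the classifier passed to determine_action in the earlier sketch.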
(Other embodiments)
The present invention is not limited to the above embodiment, and various modifications are conceivable.
In the above embodiment, as an example of the configuration of the image processing apparatus 100, the functions of the image acquisition unit 10, the human region detection unit 20, the human body feature extraction unit 30, the peripheral feature extraction unit 40, the peripheral feature filter unit 50, the behavior determination unit 60, and the learning unit 70 have been described as being realized by a single computer; needless to say, however, they may be realized by a plurality of computers. Likewise, the programs and data read by the computer may be distributed and stored across a plurality of computers.
Further, in the above embodiment, as an example of the operation of the image processing apparatus 100, the processing of the image acquisition unit 10, the human region detection unit 20, the human body feature extraction unit 30, the peripheral feature extraction unit 40, the peripheral feature filter unit 50, and the behavior determination unit 60 has been shown as being executed in a single sequential flow; needless to say, however, some or all of this processing may be executed in parallel.
Although specific examples of the present invention have been described in detail above, they are merely illustrative and do not limit the scope of the claims. The technology described in the claims includes various modifications and changes of the specific examples illustrated above.
The disclosure of the specification, drawings, and abstract included in Japanese Patent Application No. 2017-043072 filed on March 7, 2017 is incorporated herein by reference in its entirety.
The image processing apparatus according to the present disclosure enables more accurate action recognition without increasing the processing load.
DESCRIPTION OF SYMBOLS
10 Image acquisition unit
20 Human region detection unit
30 Human body feature extraction unit
40 Peripheral feature extraction unit
50 Peripheral feature filter unit
60 Behavior determination unit
70 Learning unit
100 Image processing apparatus
200 Imaging apparatus
300 Network
D1 Image data
D2 Human region data
D3 Posture feature data
D4 Peripheral feature data
D4a Peripheral feature data after filtering
D5 Action class result data
D6 Teacher data
Da Importance data

Claims (14)

1.  An image processing apparatus comprising:
     an image acquisition unit that acquires an image generated by an imaging device;
     a human body feature extraction unit that extracts a posture feature of a person shown in the image;
     a peripheral feature extraction unit that extracts a peripheral feature indicating a shape, a position, or a type of a peripheral object of the person shown in the image;
     a peripheral feature filter unit that filters the peripheral feature based on the posture feature and an importance of the peripheral feature set in association with the posture feature; and
     a behavior determination unit that estimates an action class of the person shown in the image based on the posture feature and the peripheral feature filtered by the peripheral feature filter unit.
2.  The image processing apparatus according to claim 1, wherein
     the behavior determination unit estimates the action class of the person shown in the image based on time-series data of the posture feature and of the peripheral feature filtered by the peripheral feature filter unit.
3.  The image processing apparatus according to claim 2, wherein
     the behavior determination unit estimates the action class of the person shown in the image using a recurrent neural network.
4.  The image processing apparatus according to claim 3, wherein
     the recurrent neural network comprises: a first LSTM that receives, as inputs, current data and data from a first time earlier; a second LSTM that receives, as inputs, current data and data from a second time earlier; and a full connection unit that fully connects the outputs of the first and second LSTMs.
5.  The image processing apparatus according to claim 3 or 4, further comprising
     a learning unit that adjusts network parameters of the recurrent neural network based on teacher data of images input in association with action classes.
6.  The image processing apparatus according to claim 5, wherein
     the learning unit calculates a first loss of the recurrent neural network, obtained when the posture feature and the peripheral feature extracted from the images of the teacher data are used, and a second loss of the recurrent neural network, obtained when only the posture feature out of the posture feature and the peripheral feature extracted from the images of the teacher data is used, and
     adjusts the network parameters of the recurrent neural network so that the sum of the first loss and the second loss becomes small.
7.  The image processing apparatus according to any one of claims 1 to 6, further comprising
     a human region detection unit that detects a human region from the image, wherein
     the human body feature extraction unit sets the human region detected by the human region detection unit as a region from which the posture feature is extracted, and
     the peripheral feature extraction unit sets a region obtained by enlarging the human region detected by the human region detection unit as a region from which the peripheral feature is extracted.
8.  The image processing apparatus according to claim 7, wherein
     the human body feature extraction unit normalizes the image of the human region detected by the human region detection unit to a predetermined shape and extracts the posture feature of the person shown in the image.
9.  The image processing apparatus according to any one of claims 1 to 8, wherein
     the posture feature includes joint positions of the person shown in the image, positions of respective parts of the person shown in the image, or a type of posture of the person shown in the image.
10.  The image processing apparatus according to any one of claims 1 to 9, wherein
     the peripheral feature includes a positional relationship between the person shown in the image and the peripheral object.
11.  The image processing apparatus according to any one of claims 1 to 10, wherein
     the human body feature extraction unit extracts the posture feature using a convolutional neural network.
12.  The image processing apparatus according to any one of claims 1 to 11, wherein
     the importance of the peripheral feature is set in association with a temporal change of the posture feature.
13.  An image processing method comprising:
     acquiring an image generated by an imaging device;
     extracting a posture feature of a person shown in the image;
     extracting a peripheral feature indicating a shape, a position, or a type of a peripheral object of the person shown in the image;
     filtering the peripheral feature based on the posture feature and an importance of the peripheral feature set in association with the posture feature; and
     estimating an action class of the person shown in the image based on the posture feature and the filtered peripheral feature.
14.  An image processing program for causing a computer to execute:
     processing of acquiring an image generated by an imaging device;
     processing of extracting a posture feature of a person shown in the image;
     processing of extracting a peripheral feature indicating a shape, a position, or a type of a peripheral object of the person shown in the image;
     processing of filtering the peripheral feature based on the posture feature and an importance of the peripheral feature set in association with the posture feature; and
     processing of estimating an action class of the person shown in the image based on the posture feature and the filtered peripheral feature.
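As a hedged illustration of the two-loss learning recited in claim 6 above, the Python (PyTorch) sketch below computes a first loss using both the posture and peripheral features and a second loss using the posture feature alone, then adjusts the parameters so that the sum of the two losses decreases. Replacing the peripheral input with zeros is only one conceivable way of "using only the posture feature"; the claim does not specify the mechanism, and the function and variable names are assumptions.

```python
import torch
import torch.nn as nn

def two_loss_training_step(model, optimizer, posture_seq, peripheral_seq, labels):
    """Hypothetical realization of the two-loss learning of claim 6:
    first loss  = loss when both the posture and peripheral features are used,
    second loss = loss when only the posture feature is used (here, the peripheral
                  input is replaced by zeros as an illustrative assumption),
    and the parameters are adjusted so that the sum of the two losses decreases."""
    criterion = nn.CrossEntropyLoss()

    logits_both = model(posture_seq, peripheral_seq)
    first_loss = criterion(logits_both, labels)

    logits_posture_only = model(posture_seq, torch.zeros_like(peripheral_seq))
    second_loss = criterion(logits_posture_only, labels)

    total = first_loss + second_loss
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return first_loss.item(), second_loss.item()
```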

PCT/JP2017/045011 2017-03-07 2017-12-15 Image processing device, image processing method, and image processing program WO2018163555A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017-043072 2017-03-07
JP2017043072 2017-03-07

Publications (1)

Publication Number Publication Date
WO2018163555A1 true WO2018163555A1 (en) 2018-09-13

Family

ID=63449074

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/045011 WO2018163555A1 (en) 2017-03-07 2017-12-15 Image processing device, image processing method, and image processing program

Country Status (1)

Country Link
WO (1) WO2018163555A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000293685A (en) * 1999-04-06 2000-10-20 Toyota Motor Corp Scene recognizing device
JP2005242759A (en) * 2004-02-27 2005-09-08 National Institute Of Information & Communication Technology Action/intention presumption system, action/intention presumption method, action/intention pesumption program and computer-readable recording medium with program recorded thereon
JP2010123019A (en) * 2008-11-21 2010-06-03 Fujitsu Ltd Device and method for recognizing motion
WO2015186436A1 (en) * 2014-06-06 2015-12-10 コニカミノルタ株式会社 Image processing device, image processing method, and image processing program

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020086819A (en) * 2018-11-22 2020-06-04 コニカミノルタ株式会社 Image processing program and image processing device
JP7271915B2 (en) 2018-11-22 2023-05-12 コニカミノルタ株式会社 Image processing program and image processing device
JP2020119507A (en) * 2019-01-25 2020-08-06 富士通株式会社 Deep learning model used for driving behavior recognition, training device and method
JP7500958B2 (en) 2019-01-25 2024-06-18 富士通株式会社 Deep learning model, training device and method for driving behavior recognition
JPWO2020175692A1 (en) * 2019-02-28 2021-10-07 旭化成株式会社 Learning device and judgment device
JP7213948B2 (en) 2019-02-28 2023-01-27 旭化成株式会社 Learning device and judgment device
JP7048540B2 (en) 2019-05-22 2022-04-05 株式会社東芝 Recognition device, recognition method and program
US11620498B2 (en) 2019-05-22 2023-04-04 Kabushiki Kaisha Toshiba Recognition apparatus, recognition method, and program product
JP2020190960A (en) * 2019-05-22 2020-11-26 株式会社東芝 Recognition device, recognition method, and program
US11847859B2 (en) 2019-09-02 2023-12-19 Nec Corporation Information processing device, method, and program recording medium
JP7452832B2 (en) 2019-09-10 2024-03-19 i-PRO株式会社 Surveillance camera and detection method
JP2021043666A (en) * 2019-09-10 2021-03-18 パナソニックi−PROセンシングソリューションズ株式会社 Monitoring camera and detection method
JP2021076903A (en) * 2019-11-05 2021-05-20 日本電信電話株式会社 Behavior recognition learning apparatus, behavior recognition learning method, behavior recognition apparatus, and program
WO2021090777A1 (en) * 2019-11-05 2021-05-14 日本電信電話株式会社 Behavior recognition learning device, behavior recognition learning method, behavior recognition device, and program
JP7188359B2 (en) 2019-11-05 2022-12-13 日本電信電話株式会社 Action recognition learning device, action recognition learning method, action recognition device, and program
CN110889335B (en) * 2019-11-07 2023-11-24 辽宁石油化工大学 Human skeleton double interaction behavior identification method based on multichannel space-time fusion network
CN110889335A (en) * 2019-11-07 2020-03-17 辽宁石油化工大学 Human skeleton double-person interaction behavior recognition method based on multi-channel space-time fusion network
JP7396517B2 (en) 2020-06-12 2023-12-12 日本電気株式会社 Intention detection device, intention detection method and program
WO2021250901A1 (en) * 2020-06-12 2021-12-16 Nec Corporation Intention detection device, intention detection method computer-readable storage medium
JP7467300B2 (en) 2020-09-17 2024-04-15 京セラ株式会社 SYSTEM, ELECTRONIC DEVICE, CONTROL METHOD FOR ELECTRONIC DEVICE, AND PROGRAM
WO2022249635A1 (en) * 2021-05-26 2022-12-01 コニカミノルタ株式会社 Action sensing system and action sensing program

Similar Documents

Publication Publication Date Title
WO2018163555A1 (en) Image processing device, image processing method, and image processing program
Qian et al. Artificial intelligence internet of things for the elderly: From assisted living to health-care monitoring
JP2018206321A (en) Image processing device, image processing method and image processing program
Planinc et al. Introducing the use of depth data for fall detection
EP3689236A1 (en) Posture estimation device, behavior estimation device, posture estimation program, and posture estimation method
Ghazal et al. Human posture classification using skeleton information
WO2019064375A1 (en) Information processing device, control method, and program
JP7185805B2 (en) Fall risk assessment system
US11334759B2 (en) Information processing apparatus, information processing method, and medium
JP2019003565A (en) Image processing apparatus, image processing method and image processing program
JP2019016268A (en) Image processing apparatus, image processing method and image processing program
JP2016170605A (en) Posture estimation device
Lu et al. Design of a multistage radar-based human fall detection system
Hafeez et al. Multi-Sensor-Based Action Monitoring and Recognition via Hybrid Descriptors and Logistic Regression
Rastogi et al. Human fall detection and activity monitoring: a comparative analysis of vision-based methods for classification and detection techniques
CN111144167A (en) Gait information identification optimization method, system and storage medium
CN117593792A (en) Abnormal gesture detection method and device based on video frame
Jain et al. Privacy-Preserving Human Activity Recognition System for Assisted Living Environments
Raja et al. Design and implementation of facial recognition system for visually impaired using image processing
Kim et al. Continuous gesture recognition using HLAC and low-dimensional space
Rege et al. Vision-based approach to senior healthcare: Depth-based activity recognition with convolutional neural networks
Singh et al. Vision based patient fall detection using deep learning in smart hospitals
Gharghabi et al. Person recognition based on face and body information for domestic service robots
Paul Ijjina Human fall detection in depth-videos using temporal templates and convolutional neural networks
KR102636549B1 (en) Apparatus and method for recognizing gait using noise reduction network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17899866

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17899866

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP