CN111259723A - Action recognition device and action recognition method - Google Patents

Action recognition device and action recognition method

Info

Publication number
CN111259723A
Authority
CN
China
Prior art keywords
action
reliability
actions
operator
recognition device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201911181304.1A
Other languages
Chinese (zh)
Inventor
关海克
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd filed Critical Ricoh Co Ltd
Publication of CN111259723A


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques

Abstract

The invention relates to an action recognition device and an action recognition method that aim to recognize a series of actions of an operator during work with high accuracy. The action recognition device recognizes a standard work predetermined as a monitoring target from a video of the operator, and comprises an image acquisition unit (101) for acquiring a plurality of frame images contained in the video; an action recognition unit (103) for recognizing a plurality of element actions included in the standard work from the characteristic changes of each frame image and determining the reliability of the element actions; and an action determination unit (104a) for comprehensively processing the reliability of each element action and determining the work action of the operator among the element actions.

Description

Action recognition device and action recognition method
Technical Field
The present invention relates to a motion recognition device and a motion recognition method.
Background
In a work place such as an office or a factory, it is important to visually analyze a work operation of an operator using an image captured by a camera to improve work efficiency.
A conventional motion recognition method is disclosed in, for example, patent document 1 (Japanese patent application laid-open No. 2011-100175), in which a person is recognized using images of a plurality of frames continuously obtained by a camera, the position trajectory of the person's center of gravity is extracted as a feature amount, and the feature amount is then compared with motion center trajectories registered in advance, thereby recognizing the motion of the person.
However, the work of an operator consists of more than one action, and many actions, such as actions performed with the hands while walking, cannot be identified by tracking the position trajectory of the center of gravity alone.
Disclosure of Invention
The present invention aims to provide a behavior recognition device and a behavior recognition method capable of recognizing the behavior of an operator at the time of work with high accuracy.
In view of the above-described object, the present invention provides a behavior recognizing device for recognizing a standard task predetermined as a monitoring target from a video of a worker, the behavior recognizing device including an image acquiring unit for acquiring a plurality of frame images included in the video; an action recognition unit configured to recognize a plurality of element actions included in the standard work from characteristic changes of the frame images and to obtain reliability of the element actions; and an action determination unit for comprehensively processing the reliability of each element action and determining the operation action of the operator in the element actions.
The present invention has an effect of recognizing a series of actions of an operator during work with high accuracy.
Drawings
Fig. 1 is a functional block diagram of a motion recognition device according to a first embodiment.
Fig. 2 is a schematic diagram of an example of the work action of the operator at the work site.
Fig. 3 is a schematic diagram illustrating an example of a commodity warehousing operation.
Fig. 4 is a schematic diagram of an example of the operation action related to the commodity packaging operation.
FIG. 5 is a schematic diagram of spatiotemporal image data composition.
FIG. 6 is a schematic illustration of spatiotemporal image data for an F frame.
FIG. 7 is a schematic diagram of a method for updating spatiotemporal image data for F frames.
Fig. 8 is a variation characteristic diagram of the reliability of each element action.
FIG. 9 is a block segmentation schematic of spatiotemporal image data.
Fig. 10 is a schematic diagram illustrating an example of the feature points of time t extracted from the spatio-temporal image data when the operator is imaged at the work place in fig. 2.
Fig. 11 is a schematic diagram illustrating an example of feature points after time t + Δ t extracted from the spatio-temporal image data when the operator is imaged at the work site in fig. 2.
Fig. 12 is a flowchart of the processing operation for creating the identification dictionary in the action recognition device according to the first embodiment.
Fig. 13 is an operation flowchart of the action recognition processing of the action recognition device according to the first embodiment.
Fig. 14 is a flowchart of the integrated processing operation executed by the integrated processing unit of the action recognition device according to the first embodiment.
Fig. 15 is a flowchart of the integrated processing operation executed by the integrated processing unit of the action recognition device according to the second embodiment.
Fig. 16 is a flowchart of the integrated processing operation executed by the integrated processing unit of the action recognition device according to the third embodiment.
Fig. 17 is a flowchart of the integrated processing operation executed by the integrated processing unit of the action recognition device according to the fourth embodiment.
Fig. 18 is a flowchart of the integrated processing operation executed by the integrated processing unit of the action recognition device according to the fifth embodiment.
Fig. 19 is a flowchart of the integrated processing operation executed by the integrated processing unit of the action recognition device according to the sixth embodiment.
FIG. 20 is a diagram of an example of an action recognition system.
Fig. 21 is a diagram illustrating an example of a hardware configuration of the camera.
Fig. 22 is a schematic diagram showing an exemplary hardware configuration of the action recognition device.
Detailed Description
Embodiments of the present invention are described below with reference to the drawings.
The action recognition device of the present invention recognizes a work action (referred to as a standard work) that is designated in advance as a monitoring target from the video of a camera that captures an operator. The work place includes, for example, an office, a factory, or the like, although the present invention is not limited thereto.
First embodiment
Fig. 1 is a functional block diagram of a motion recognition device according to a first embodiment.
The action recognition device 100 in the present embodiment includes an image acquisition unit 101, a spatio-temporal feature extraction unit 102, an action recognition unit 103, an integration processing unit 104, a dictionary creation unit 105, and a recognition result output unit 106.
The image acquisition unit 101 acquires images from a camera 203 installed in the work place 201 as shown in fig. 2, for example, in real time or offline. The installation location of the camera 203 is arbitrary, and any location may be used as long as it can capture an image of the action of the worker 202 when working on the work location 201. The image of the camera 203 may be sent directly from the camera 203 to the motion recognition apparatus 100 by, for example, wired or wireless, or may be transmitted to the motion recognition apparatus 100 via a recording medium.
The spatio-temporal feature extraction unit 102 extracts spatio-temporal feature points from each frame image included in the image obtained by the image acquisition unit 101. The "spatio-temporal feature point" is a feature point indicating a spatial change and a temporal change of an image, and indicates a change in the motion of a person. The method of extracting the spatio-temporal feature points will be described in detail below.
The action recognition unit 103 searches the action recognition dictionary 105a based on the spatio-temporal feature points extracted by the spatio-temporal feature extraction unit 102, recognizes a plurality of element actions included in the standard work, and calculates the reliability of the element actions. The "element action" refers to a characteristic action included in a standard job, and a plurality of types exist according to the contents of the job. The reliability is an index indicating the accuracy of the recognition result, and is a value in the range of 0.0 to 1.0, and a larger value means a higher reliability.
The action recognition dictionary 105a is created by the dictionary creating unit 105. Information for identifying various actions (element actions) included in the standard work is registered in advance in the action identification dictionary 105 a. The method for creating the action recognition dictionary 105a will be described in detail below.
The integrated processing unit 104 includes an action determination unit 104a and an action time calculation unit 104b. The action determination unit 104a comprehensively processes the reliability of each element action, and determines the work action of the operator 202 in chronological order from the element actions. The "integration processing" is processing for comparing the reliability of each element action in a predetermined time unit and deriving a final recognition result from the element actions. The action time calculation unit 104b obtains the start time and the duration of the element action determined as the work action of the worker 202 by the action determination unit 104a, based on the frame rate.
The recognition result output unit 106 executes an output process of the recognition result finally obtained by the integrated processing unit 104.
Fig. 2 is a schematic diagram of an example of work actions of an operator at a work site, and shows a state where the operator 202 performs a commodity stocking work at a work site 201 such as a factory.
The work actions of the worker 202 during the warehousing work of a product include, for example, "an action of placing the box 204 on the floor (temporary placing action)", "an action of verifying the product in the box 204 (product verifying action)", "an action of taking the product out of the box 204 (product taking action)", "an action of putting the product onto the shelf (product warehousing action)", and the like. The action recognition device 100 in the present embodiment recognizes these actions as element actions using the action recognition dictionary 105a.
Fig. 3 is a schematic diagram showing an example of the work action of the article warehousing operation, and shows a state where the operator 202 temporarily places the box 204 on the floor. Fig. 4 is a diagram illustrating an example of the walking action. Since the worker 202 often performs the commodity warehousing operation while walking, recognition including the walking action is required.
The camera 203 provided in the work place 201 has a function of continuous shooting (video shooting function). The motion recognition device 100 reads a frame image included in the image of the camera 203 at a rate of 30 frames per second, for example. In this case, the time of each frame is 1/30 seconds.
FIG. 5 is a schematic diagram schematically illustrating the composition of spatiotemporal image data. The x-axis is the width of the image and the y-axis is the height of the image. The depth is the length of time t and is determined by the number of frames. The spatial coordinates of the frame image are (x, y). If the time of one frame is set to t, the spatiotemporal image data may be represented by a three-dimensional cube as shown in FIG. 5. That is, the coordinates of the spatio-temporal image data are (x, y, t), and one pixel value I of the spatio-temporal image data becomes a function of the spatial coordinates (x, y) and time t.
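To make the data layout concrete, the following Python sketch (not part of the patent text) stacks F grayscale frames into an array indexed by spatial coordinates and time; the array shape, frame count, and frame source are illustrative assumptions.

```python
import numpy as np

def build_spatiotemporal_data(frames):
    """Stack F grayscale frames (each H x W) into a cube indexed as I[y, x, t]."""
    return np.stack(frames, axis=-1).astype(np.float32)  # shape (H, W, F)

# Illustrative example: 60 frames (about 2 seconds at 30 fps) of a 480 x 640 video
frames = [np.zeros((480, 640), dtype=np.uint8) for _ in range(60)]
I = build_spatiotemporal_data(frames)
print(I.shape)  # (480, 640, 60): one pixel value I(x, y, t) per spatio-temporal coordinate
```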
FIG. 6 is a schematic illustration of the spatio-temporal image data of F frames. The vertical axis represents the reliability P, and the horizontal axis represents the time t, which is proportional to the frame number. Q represents the center frame position, on the time axis, of the F frames of spatio-temporal image data.
In the present embodiment, spatio-temporal image data of F frames is set as one identification target unit. F is the number of frames and can be arbitrarily set. For example, if it is necessary to recognize the work action of the worker 202 within 2 seconds, F may be set to 60 frames.
As described above, the work action of the worker 202 includes a plurality of element actions, and a reliability Pi(Q) is calculated when the i-th element action is recognized. That is, when the commodity warehousing action shown in fig. 3 is recognized, the reliability P0(Q) is obtained. When the walking action of the worker 202 shown in fig. 4 is recognized, the reliability P1(Q) is obtained. When there are I element actions, I reliability values P0(Q) to P(I-1)(Q) can be obtained.
Here, the reliability Pi(Q) of an element action obtained from the F-frame spatio-temporal image data is used as the result for the center frame of that spatio-temporal image data. That is, as shown in fig. 6, when the center position is Q, the reliability Pi(Q) is the reliability of the i-th element action at the Q-th frame position.
Fig. 7 is a schematic diagram for explaining an update method of spatiotemporal image data of F frames. The vertical axis represents the reliability P, and the horizontal axis represents the time t.
The example of fig. 7 shows a state in which the F-frame spatio-temporal image data is moved frame by frame in the time axis direction. As described above, the F-frame spatio-temporal image data is one recognition target unit, and the reliability Pi(Q) of an element action is used as the result for the center frame of that spatio-temporal image data. Q, Q+1, … Q+n (n is an integer of 0 or more) each indicate the center position as the F-frame spatio-temporal image data is moved. The interval by which the spatio-temporal image data is moved is not limited to one frame, and may be a plurality of frames.
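A minimal sketch of this sliding-window scheme is shown below (a hypothetical helper, not from the patent): the F-frame block is advanced by a configurable step, and each block is associated with its center frame index Q, to which the reliability Pi(Q) is attributed.

```python
import numpy as np

def sliding_windows(all_frames, F=60, step=1):
    """Yield (Q, window) pairs: an F-frame spatio-temporal block and its center frame index Q."""
    for start in range(0, len(all_frames) - F + 1, step):
        Q = start + F // 2  # the reliability Pi(Q) is attributed to the center frame
        window = np.stack(all_frames[start:start + F], axis=-1)
        yield Q, window
```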
Fig. 8 is a reliability variation characteristic diagram of each element action. The example of fig. 8 shows the change in reliability when two element actions a and b are recognized while the F-frame spatio-temporal image data is moved frame by frame in the time axis direction. The horizontal axis is time, in units of the number of frames. The vertical axis represents the reliability P.
Tw in the figure indicates the time interval (determination time) over which element actions are monitored, and Thre indicates a threshold value of reliability. The determination time Tw and the threshold value Thre will be described in detail later in other embodiments.
In general, in the commodity warehousing work, the worker 202 often works while walking, and therefore recognition including walking action is necessary. The solid line indicates a variation characteristic of the reliability of the element action a (e.g., warehousing action of the product). The broken line represents a variation characteristic of the reliability of the element action b (e.g., walking action). In the example of fig. 8, only two element actions a and b are shown for the sake of simplicity of explanation, but actually, there are a plurality of other element actions in the product warehousing work, and the fluctuation characteristics of the reliability of each of these element actions can be obtained.
Next, a method of recognizing a motion using spatio-temporal image data will be described with reference to fig. 9 to 11.
FIG. 9 is a block segmentation diagram of spatio-temporal image data. The horizontal axis is the spatial coordinate x and the vertical axis is the spatial coordinate y. The time axis is denoted by t. That is, the time axis t is the time-series axis of images input at a frame rate of, for example, 30 frames per second, and the actual time can be obtained by converting the number of frames. When the worker 202 performs a certain action, change points occur in the spatio-temporal image data. By extracting the spatio-temporal feature points at these change points, the element action can be identified.
As shown in fig. 9, the spatio-temporal image data is divided into blocks of a predetermined size (Mp, Np, T). Mp and Np are numbers of pixels, and T is the time width over which feature points are extracted. That is, one block is Mp pixels horizontally, Np pixels vertically, and T frames in the time direction. When a worker performs a certain motion, the feature amount of the spatio-temporal image data blocks corresponding to the motion becomes large; that is, a large amount of change occurs in time and space. Blocks with a large amount of change are extracted as feature points.
First, in order to remove noise in the spatial (x, y) direction, the smoothing process of equation (1) is performed.
L(x, y, t) = I(x, y, t) * g(x, y)    (1)
Here, L(x, y, t) is the smoothed pixel value at coordinates (x, y) of the frame image at time t, g(x, y) is the smoothing kernel, and * denotes convolution. The smoothing process may be pixel averaging or a conventional Gaussian smoothing filter.
Then, a filtering process is performed along the time axis. Here, the Gabor filtering shown in formula (2) is performed.
R(x, y, t) = (L(x, y, t) * g_ev)^2 + (L(x, y, t) * g_od)^2    (2)
Here, g_ev and g_od are the Gabor filter kernels given by formulas (3) and (4), * denotes convolution, and τ and ω are parameters of the Gabor filter kernels.
[Formulas (3) and (4), reproduced as images in the original publication, define the even and odd Gabor filter kernels g_ev(t; τ, ω) and g_od(t; τ, ω).]
After the filtering process of equation (2) has been applied to all pixels of the spatio-temporal image data shown in fig. 9, the average value M of R over each block at spatio-temporal coordinates (x, y, t) is obtained by equation (5).
[Formula (5), reproduced as an image in the original publication, defines the block average M(x, y, t) of R over the Mp × Np × T pixels of a block.]
Then, as shown in equation (6), when the average value M(x, y, t) of a block R is higher than a threshold value Thre_M, the block is extracted as a feature point.
M(x, y, t) > Thre_M    (6)
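The feature-point extraction of equations (1) to (6) can be sketched as follows in Python (scipy assumed). The spatial smoothing, squared quadrature response, block averaging, and threshold test follow the text above; the exact Gabor kernels of formulas (3) and (4) are not reproduced in the text, so a Gaussian-windowed cosine/sine pair is assumed here, and all parameter values are placeholders.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def extract_feature_blocks(I, Mp=8, Np=8, T=5, tau=2.0, omega=0.25, thre_M=1.0):
    """Extract blocks of the spatio-temporal data I[y, x, t] whose response exceeds Thre_M."""
    # (1) spatial smoothing of every frame
    L = np.empty(I.shape, dtype=np.float32)
    for f in range(I.shape[2]):
        L[:, :, f] = gaussian_filter(I[:, :, f].astype(np.float32), sigma=1.0)

    # assumed 1-D quadrature (even/odd) kernels along the time axis, parameters tau and omega
    t = np.arange(-2 * int(tau), 2 * int(tau) + 1)
    g_ev = -np.cos(2 * np.pi * omega * t) * np.exp(-t ** 2 / tau ** 2)
    g_od = -np.sin(2 * np.pi * omega * t) * np.exp(-t ** 2 / tau ** 2)

    # (2) R = (L * g_ev)^2 + (L * g_od)^2, convolving along the time axis
    R = convolve1d(L, g_ev, axis=2) ** 2 + convolve1d(L, g_od, axis=2) ** 2

    # (5)/(6) average R over each Np x Mp x T block and keep blocks above Thre_M
    H, W, F = R.shape
    feature_blocks = []
    for y in range(0, H - Np + 1, Np):
        for x in range(0, W - Mp + 1, Mp):
            for f in range(0, F - T + 1, T):
                M = R[y:y + Np, x:x + Mp, f:f + T].mean()
                if M > thre_M:
                    feature_blocks.append((x, y, f))  # block origin serves as the feature point
    return feature_blocks
```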
Fig. 10 shows an example of the feature points at time t extracted from the spatio-temporal image data when the worker 202 is imaged at the work site 201 in fig. 2. For example, when the worker 202 squats, the portions of the spatio-temporal image data containing motion are extracted as feature points. Fig. 11 shows an example of the feature points extracted from the spatio-temporal image data after time t + Δt.
Next, a description method of feature points extracted from spatio-temporal image data will be explained.
When a block is extracted as a feature point from the spatio-temporal image data, spatio-temporal edge information E(x, y, t) is obtained for each pixel I(x, y, t) within the block. Specifically, the differential operation of formula (7) is performed.
[Formula (7), reproduced as an image in the original publication, computes the partial derivatives of I(x, y, t) in the x, y, and t directions, giving three differential values per pixel.]
Since there are Mp × Np × T pixels in 1 block, Mp × Np × T × 3 differential values can be obtained. The block may be described by an Mp × Np × T × 3 dimensional vector. That is, the feature point may be described by an Mp × Np × T × 3-dimensional vector.
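The following sketch (an assumed layout, not the patent's code) computes that descriptor: the three partial derivatives of each pixel in the block are stacked and flattened into one Mp × Np × T × 3-dimensional vector.

```python
import numpy as np

def describe_block(I, x, y, f, Mp=8, Np=8, T=5):
    """Describe one feature-point block by its Mp x Np x T x 3 spatio-temporal gradients (eq. (7))."""
    block = I[y:y + Np, x:x + Mp, f:f + T].astype(np.float32)
    dy, dx, dt = np.gradient(block)      # partial derivatives along y, x and t
    E = np.stack([dx, dy, dt], axis=-1)  # edge information E(x, y, t): 3 values per pixel
    return E.reshape(-1)                 # Mp * Np * T * 3 dimensional descriptor vector
```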
Next, the processing operation of the action recognition device 100 according to the present embodiment will be described in two parts, namely, (a) identification dictionary creation processing and (b) action recognition processing. The processing shown in the flowcharts below is executed by the action recognition device 100 as a computer reading a designated program.
(a) Identification dictionary creation process
Fig. 12 is a flowchart for explaining the operation of the identification dictionary creation process in the action recognition device 100 according to the first embodiment.
When the work place 201 shown in fig. 2 recognizes the work action of the worker 202, the action recognition dictionary 105a needs to be created in advance. The action recognition device 100 creates an action recognition dictionary 105a in the following manner.
First, the action recognition device 100 acquires learning video data via the image acquisition unit 101 (step S11). That is, video samples of the standard work are collected and used as learning video data. In this case, the F-frame spatio-temporal image data in a video sample is taken as one piece of learning video data. The spatio-temporal feature extraction unit 102 of the action recognition device 100 extracts feature points corresponding to the element actions from the learning video data (F-frame spatio-temporal image data) by the above-described method (step S12).
In this way, a plurality of learning video data are collected, and feature points corresponding to the motion changes are extracted from each piece of learning video data. In this case, even the same action may be performed in somewhat different ways; it is sufficient that the collected data covers the K classes of element actions described later. The action recognition unit 103 applies the differential operation of formula (7) to each feature point extracted from each piece of learning video data to obtain an Mp × Np × T × 3-dimensional vector, and classifies the feature points by the K-means method (step S13).
Here, if the number of categories after classification is K, the feature points extracted from the plurality of learning video data are classified into K types of element actions. Feature points of the same class have similar features. For each of the K classes, the action recognition unit 103 averages the Mp × Np × T × 3-dimensional vectors of the feature points belonging to that class, and creates an average vector Vk (step S14). The average vector Vk is a vector representing the feature points of each element action.
The action recognition unit 103 calculates the total number of blocks corresponding to each of the K classes of feature points, and obtains a histogram H(k) for learning (step S15). The histogram H(k) represents the frequency of class-k feature points.
In this way, the average vector Vk and the histogram H(k) of the feature points are obtained. The dictionary creating unit 105 registers these pieces of information as learning information in the action recognition dictionary 105a (step S16).
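Steps S13 to S16 can be sketched as below; scikit-learn's KMeans is used as one possible clustering implementation, and the normalisation of the histogram is an assumption made for the later comparison.

```python
import numpy as np
from sklearn.cluster import KMeans

def create_recognition_dictionary(descriptor_sets, K=50):
    """Build the learning information (Vk, H(k)) from descriptors of many learning videos.

    descriptor_sets: list of arrays, each (num_feature_points, Mp*Np*T*3), one per video.
    """
    X = np.vstack(descriptor_sets)
    kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)  # step S13: K-means classification
    Vk = kmeans.cluster_centers_                                     # step S14: average vector of each class
    H = np.bincount(kmeans.labels_, minlength=K).astype(np.float32)  # step S15: histogram H(k)
    H /= H.sum()                                                     # assumed normalisation
    return Vk, H                                                     # step S16: register in the dictionary
```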
(b) Action recognition processing
Fig. 13 is a flowchart for explaining the operation of the action recognition processing of the action recognition device 100 according to the first embodiment.
First, the action recognition device 100 acquires, in chronological order, a plurality of frame images included in a video captured by the camera 203 installed in the work place 201 (step S21). The spatio-temporal feature extraction unit 102 of the action recognition device 100 generates spatio-temporal image data in units of F frames from these frame images, and extracts a plurality of feature points corresponding to the action change from the spatio-temporal image data as the element action by the above-described method (step S22).
Here, the action recognition unit 103 obtains the Mp × Np × T × 3-dimensional vector of each feature point extracted from the F-frame spatio-temporal image data (step S23). Each feature point is assigned to the element action whose average vector Vk, among the K types registered in the action recognition dictionary 105a, is closest to this vector.
Next, the action recognition unit 103 classifies each feature point extracted from the F-frame spatio-temporal image data into the K types of element actions, and creates a histogram Ts(k) (step S24). The action recognition unit 103 compares this histogram Ts(k) with the histogram H(k) of the feature points registered in the action recognition dictionary 105a, and obtains the similarity S(Ts, H) between them by formula (8). The action recognition unit 103 uses the similarity S(Ts, H) as the reliability P (step S25).
[Formula (8), reproduced as an image in the original publication, defines the similarity S(Ts, H) between the histogram Ts(k) and the learned histogram H(k).]
The action recognition unit 103 repeats the above-described processing for each element action included in the F-frame spatio-temporal image data to obtain the reliability P of each element action (step S26). The integrated processing unit 104 determines the work action of the operator 202 from the element actions based on the reliability P of each element action obtained by the action recognition unit 103 (step S27). This integration processing will be explained later with reference to fig. 14. The recognition result output unit 106 executes output processing of the recognition result finally obtained by the integration processing unit 104 (step S28). The output processing includes, for example, visually displaying the element action type and time in chronological order on a terminal device (not shown) of a supervisor, and transmitting the result to an external monitoring center or the like through a communication network.
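A rough sketch of the per-block recognition (steps S23 to S25) is given below. Since formula (8) is not reproduced in the text, histogram intersection is assumed as the similarity S(Ts, H); this choice and the data layout are illustrative only.

```python
import numpy as np

def recognize_element_action(descriptors, Vk, H):
    """Return the reliability P of one element action for an F-frame spatio-temporal block.

    descriptors: (num_feature_points, Mp*Np*T*3) array extracted from this F-frame block.
    """
    K = Vk.shape[0]
    # step S23: assign each feature point to the class with the closest average vector Vk
    dists = np.linalg.norm(descriptors[:, None, :] - Vk[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # step S24: histogram Ts(k) of the F-frame block
    Ts = np.bincount(labels, minlength=K).astype(np.float32)
    Ts /= max(Ts.sum(), 1.0)
    # step S25: assumed similarity (histogram intersection) used as the reliability P
    return float(np.minimum(Ts, H).sum())
```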
For convenience of explanation, the flowchart of fig. 13 shows only the processing of the F frame. Actually, as shown in fig. 7, the same processing is repeated while moving the spatiotemporal image data of F frames frame by frame in the time axis direction, a series of actions in the standard operation period is recognized on a frame-by-frame basis, and the recognition result of the action is output.
The process shown in the flow chart of fig. 13 may be performed off-line or in real time. For convenience of explanation, the following assumes offline processing.
Comprehensive treatment
Fig. 14 is a flowchart of the integrated processing operation executed by the integrated processing unit 104 of the action recognition device 100 according to the first embodiment. The integration processing shown in this flowchart is executed in step S27 of fig. 13. In the work place 201, the worker 202 performs not just one work action but various actions (element actions), including walking. These actions sometimes overlap in time, and it is important to accurately determine what action is primarily being performed at each moment. The integrated processing unit 104 of the action recognition device 100 performs integrated processing on the reliability P of each element action recognized by the action recognition unit 103, and discriminates and judges a series of actions of the operator 202 in chronological order.
That is, first, the integrated processing unit 104 acquires the element actions of the F frame and the reliability P of the element actions from the action recognition unit 103 at the same time, and stores the element actions in a working memory (not shown) (step S31). The integrated processing unit 104 searches the work memory and selects an element action having the maximum reliability P in the F frame (step S32).
The integrated processing unit 104 determines the selected element action as the target work action (step S33), and determines the start time and duration of the target work action (step S34). The "target work action" refers to the work action of the worker 202 to be monitored.
Specifically, suppose that N element actions are obtained from the F-frame spatio-temporal image data shown in fig. 6. The integrated processing unit 104 integrates the reliability P of these element actions and determines the target work action among them. Here, the first embodiment takes, as the target work action, the element action with the maximum reliability P among the N element actions obtained from the F-frame spatio-temporal image data.
The element action selected as the target work action is used as the recognition result for the center position Q of the F frames shown in fig. 6. The start time of the element action is calculated from the frame rate: since the time of each acquired video frame is known, the start time can be determined from the frame at which the element action is first judged. The duration over which the element action continues to be judged at the center position Q of the F frames is the duration of the element action.
Specifically, when the frame rate is F_rat, there are F_rat frames per second, so the time per frame is 1/F_rat seconds. For example, when the frame rate is 30, the time per frame is 1/30 seconds. If the frame at which the element action is first detected is Q, the start time is Q/F_rat seconds. When the element action is continuously detected over Q_act consecutive frames, the duration is Q_act/F_rat seconds. Visualizing the start time and duration of each element action is helpful, for example, for analyzing which jobs place a burden on the operator.
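As a small worked example of this arithmetic (the numbers are illustrative, not taken from the patent):

```python
F_rat = 30                    # frame rate: 30 frames per second
Q_start = 120                 # frame at which the element action is first detected
Q_act = 45                    # number of consecutive frames in which it is detected

start_time = Q_start / F_rat  # 4.0 seconds
duration = Q_act / F_rat      # 1.5 seconds
```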
In this way, the integrated processing unit 104 determines the element action with the highest reliability P in the F frame as the target job action, and obtains the start time and duration of the element action with reference to the center position Q of the F frame (step S35).
Thereafter, as shown in fig. 7, the same processing as described above is repeatedly executed every time the F frame is updated by one frame. Thus, in the example of fig. 8, the following recognition results are output. The element action a is, for example, a warehousing action of a commodity. The element action b is, for example, a walking action.
t0 to t4: element action b
t4 to t5: element action a
t5 to t8: element action b
t8 to t12: element action a
As described above, according to the first embodiment, the reliability P of each element action obtained from each frame image is comprehensively processed, and the element action with the highest reliability P is selected, whereby a series of actions of the operator can be distinguished and accurately recognized in chronological order.
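The first embodiment's integration rule can be sketched as follows (hypothetical data structures; the per-frame reliabilities would come from the action recognition unit 103):

```python
def integrate_first_embodiment(reliabilities_per_Q):
    """For each center frame Q, pick the element action with the maximum reliability P.

    reliabilities_per_Q: list of (Q, {action_name: reliability}) in chronological order.
    Returns (Q, action) decisions from which start time and duration follow.
    """
    return [(Q, max(rel, key=rel.get)) for Q, rel in reliabilities_per_Q]

# Illustrative values for the two element actions of fig. 8
decisions = integrate_first_embodiment([(0, {"a": 0.2, "b": 0.7}),
                                         (1, {"a": 0.6, "b": 0.3})])
print(decisions)  # [(0, 'b'), (1, 'a')]
```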
Second embodiment
The second embodiment will be explained next.
Second embodiment in the first embodiment, the integration process is performed under the condition that the reliability P of the element action is higher than the threshold value Thre set as the criterion in advance.
The basic configuration of the behavior recognizing device 100 in the present embodiment is the same as that of fig. 1 of the above-described embodiment 1. In the second embodiment, the action determination unit 104a of the integrated processing unit 104 has a function of comparing the reliability P of each element action on a frame basis, and determining that the element action having the reliability P higher than the threshold value Thre and the highest reliability P is the target work action. The processing operation of the second embodiment is described in detail below.
Fig. 15 is a flowchart for explaining the integrated processing operation performed by the integrated processing unit 104 of the motion recognition device 100 according to the second embodiment. The integration processing shown in this flowchart is executed in step S27 of fig. 13.
As in the first embodiment, first, the integrated processing unit 104 acquires the element actions of the F frame and the reliability P of the element actions from the action recognition unit 103, and stores them in a work memory (not shown) (step S41). The integrated processing unit 104 searches the work memory and selects an element action having the maximum reliability P in the F frame (step S42).
In the second embodiment, the integrated processing unit 104 has a threshold value Thre set in advance as a criterion for determining an element action. The threshold value Thre is arbitrarily set according to the environment of the work place 401, the work content, and the like.
The integrated processing unit 104 determines whether the reliability P of the element action selected in step S42 is higher than the threshold value Thre (step S43). When the reliability P of the element action is higher than the threshold value Thre (yes in step S43), the integrated processing unit 104 determines that the element action is the target work action (step S44), and determines the start time and duration of the target work action (step S46).
On the other hand, if the reliability P of the element action selected in step S42 is not more than the threshold value Thre (no in step S43), the integrated processing unit 104 determines that the element action is possibly the target work action (step S45).
When the reliability P of the element action is equal to or less than the threshold value Thre, it may be determined that the element action is not the target work action. However, if the reliability P is at a certain level, the element action may still be the target work action, and therefore it is preferable to judge it as a possibility. When it is determined that there is a possibility of the target work action, the start time and the duration of the target work action are also determined. When it is determined that the target work action is only a possibility, it is preferable to output the recognition result in a specific color, for example, so as to distinguish it from the target work action determined in step S44.
In this way, the integrated processing unit 104 determines the target job action from the F frame on the condition that the reliability P of the job action is higher than the threshold value Thre.
Thereafter, as shown in fig. 7, the same processing as described above is repeatedly executed every time the F frame is updated by one frame (step S47). Thus, in the example of fig. 8, the following recognition results are output. The element action a is, for example, a warehousing action of a commodity. The element action b is, for example, a walking action.
t1 to t4: element action b
t4 to t5: element action a
t5 to t7: element action b
t9 to t10: element action a
The following intervals are judged as possibly containing a target work action:
t0 to t1: possibly element action b
t7 to t8: possibly element action b
t8 to t9: possibly element action a
t10 to t12: possibly element action a
As described above, according to the second embodiment, the threshold value Thre of the element action reliability P is set, and the element action whose reliability P is higher than the threshold value Thre is determined as the target work action, whereby the series of actions of the operator can be recognized more accurately.
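A minimal sketch of the second embodiment's rule, assuming the same hypothetical data structure as in the first-embodiment sketch, is:

```python
def integrate_second_embodiment(reliabilities_per_Q, thre=0.5):
    """Like the first embodiment, but actions whose maximum reliability does not exceed
    the threshold Thre are only reported as possible target work actions."""
    decisions = []
    for Q, rel in reliabilities_per_Q:
        target = max(rel, key=rel.get)
        status = "target action" if rel[target] > thre else "possible target action"
        decisions.append((Q, target, status))
    return decisions
```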
Third embodiment
The third embodiment will be explained next.
In the third embodiment, a determination time Tw for determining an element action is provided, and the element action is determined at intervals of the determination time Tw.
The basic configuration of the action recognition device 100 is the same as that of fig. 1 of the first embodiment. In the third embodiment, the action determining unit 104a of the integrated processing unit 104 has a function of comparing the reliability P of each element action at the determination time Tw, and determining the element action with the highest reliability P as the target job action. The processing operation of the third embodiment is described in detail below.
Fig. 16 is a flowchart for explaining the integrated processing operation performed by the integrated processing unit 104 of the action recognition device 100 according to the third embodiment. The integration processing shown in this flowchart is executed in step S27 of fig. 13.
First, the integrated processing unit 104 sets a determination time Tw for determining the element action (step S51). The determination time Tw is equal to or longer than the time of F frames, and is arbitrarily set according to the environment of the work field 401.
The integrated processing unit 104 obtains the element actions of all the frames in the determination time Tw and the reliability P of the element actions from the action recognition unit 103 at the same time, and stores the element actions in a working memory (not shown) (step S52). The integrated processing unit 104 searches the work memory, and selects an element action having the maximum reliability P from all frames of the determination time Tw (step S53).
The integrated processing unit 104 determines the selected element action as the target work action (step S54), and determines the start time and duration of the target work action (step S55). The start time of the element action is calculated with reference to a frame including the element action.
The start time of the element action is measured in units of the determination time Tw in which the element action is first detected. The duration of the element action is the duration of the frames including the element action, i.e., a multiple of the determination time Tw corresponding to the number of consecutive intervals in which the element action is detected. For example, if the element action is detected in three consecutive determination intervals, the duration is 3 × Tw.
In this way, the integrated processing unit 104 repeats the same processing for all frames of the determination time Tw, determines the element action with the highest reliability P as the target job action, and obtains the start time and duration of the element action with reference to the frame including the element action (step S56).
Thus, in the example of fig. 8, element actions within the intervals (t0 to t3, t3 to t6, t6 to t11) of the determination time Tw are determined, and the following recognition results are output. The element action a is, for example, a warehousing action of a commodity. The element action b is, for example, a walking action.
t0 to t3: element action b
t3 to t6: element action a
t6 to t11: element action b
As described above, according to the third embodiment, a certain width is given to the time over which the element action is determined, and the element action is determined within that time interval. This prevents erroneous judgments due to noise when, for example, the video contains noise, so the action of the worker 202 can be recognized accurately.
In addition, the operator may move irregularly during the operation, for example, turn in another direction. In this case, by setting the determination time to a certain range, erroneous determination due to irregular action can be prevented.
Fourth embodiment
The fourth embodiment will be explained next.
The fourth embodiment adds to the third embodiment the condition that the reliability P must be higher than the threshold value Thre in the integration processing.
The basic configuration of the action recognition device 100 is the same as that of fig. 1 of the first embodiment. In the fourth embodiment, the action determination unit 104a of the integrated processing unit 104 has a function of comparing the reliability P of each element action at the determination time Tw, and determining the element action having the reliability P higher than the threshold value Thre and the highest reliability P as the target work action. The processing operation of the fourth embodiment is described in detail below.
Fig. 17 is a flowchart for explaining the operation of the integration processing performed by the integration processing unit 104 of the action recognition device 100 according to the fourth embodiment. The integration processing shown in this flowchart is executed in step S27 of fig. 13.
As in the third embodiment, first, the integrated processing unit 104 sets a determination time Tw for determining the element action (step S61). This determination time Tw is equal to or longer than the time of F frames, and is arbitrarily set according to the environment of the work place 401, the work content, and the like.
The integrated processing unit 104 obtains the element actions of all the frames of the determination time Tw and the reliability P of the element actions from the action recognition unit 103 at the same time, and stores the element actions in a working memory (not shown) (step S62). The integrated processing unit 104 searches the work memory, and selects an element action having the maximum reliability P from all frames of the determination time Tw (step S63).
In the fourth embodiment, the integrated processing unit 104 has a threshold value Thre set in advance as a criterion for determining an element action. The threshold value Thre is arbitrarily set according to the environment of the work place 401, the work content, and the like.
The integrated processing unit 104 determines whether the reliability P of the element action selected in step S63 is higher than the threshold value Thre (step S64). When the reliability P of the element action is higher than the threshold value Thre (yes in step S64), the integrated processing unit 104 determines that the element action is the target work action (step S65), and determines the start time and duration of the target work action (step S67).
On the other hand, if the reliability P of the element action selected in step S63 is equal to or less than the threshold value Thre (No in step S64), the integrated processing unit 104 determines that the element action is likely to be the target job action (step S66).
When the reliability P of the element action is equal to or less than the threshold value Thre, it may be determined that the element action is not the target work action. However, if the reliability P is at a certain level, the element action may still be the target work action, and therefore it is preferable to judge it as a possibility. When it is determined that there is a possibility of the target work action, the start time and the duration of the target work action are also determined. When it is determined that the target work action is only a possibility, it is preferable to output the recognition result in a specific color, for example, so as to distinguish it from the target work action determined in step S65.
In this way, the integrated processing unit 104 repeatedly performs the same processing for all frames of the determination time Tw, determines the element action with the highest reliability P as the target job action with reference to the threshold value Thre, and obtains the start time and the duration of the element action with reference to the frame including the element action (step S68).
Thus, in the example of fig. 8, the element action in the interval (t0 to t3, t3 to t6, t6 to t11) of the determination time Tw is determined, and the following recognition result is output. The element action a is, for example, a warehousing action of a commodity, and the element action b is, for example, a walking action.
t0 to t3: element action b
t3 to t6: element action a
t6 to t11: element action b
When, within the determination time Tw, the reliability P of the selected element action (a or b) is equal to or less than the threshold value Thre, that element action is output as a possible target work action.
As described above, according to the fourth embodiment, by adding the threshold value Thre of the element action reliability P as a condition, a series of actions of the operator can be recognized more accurately than in the third embodiment.
Fifth embodiment
The fifth embodiment will be explained next.
In the fifth embodiment, as in the third embodiment, a determination time Tw for determining an element action is provided, and the element action is determined at intervals of the determination time Tw. However, the fifth embodiment differs from the third embodiment in that the third embodiment selects the element action having the highest reliability P in the judgment time Tw, and the fifth embodiment counts the number of element actions having the highest reliability P in chronological order in the judgment time Tw, and selects the element action having the largest count value.
The basic configuration of the action recognition device 100 is the same as that of fig. 1 of the first embodiment. In the fifth embodiment, the action determining unit 104a of the integrated processing unit 104 has a function of comparing the reliability P of each element action within the interval of the determination time Tw, and determining, as the target work action, the element action that most frequently has the maximum reliability P. The processing operation of the fifth embodiment is described in detail below.
Fig. 18 is a flowchart for explaining the operation of the integration processing performed by the integration processing unit 104 of the motion recognition device 100 according to the fifth embodiment. The integration process shown in this flowchart is executed in step S27 of fig. 13.
First, the integrated processing unit 104 sets a determination time Tw for determining the element action (step S71). The determination time Tw is equal to or longer than the time of F frames, and is arbitrarily set according to the environment of the work place 401, the work content, and the like.
The integrated processing unit 104 obtains the element actions of all the frames in the determination time Tw and the reliability P of the element actions from the action recognition unit 103 at the same time, and stores the element actions in a working memory (not shown) (step S72).
In the fifth embodiment, the integrated processing unit 104 searches the work memory and counts, for each element action, the number of times it has the maximum reliability P within the determination time Tw (step S73). The count value indicates how frequently each element action has the maximum reliability P within the determination time Tw.
The integrated processing unit 104 selects the element action with the largest count value, that is, the element action that most frequently has the maximum reliability P within the determination time Tw (step S74). The integrated processing unit 104 determines the selected element action as the target work action (step S75), and determines the start time and duration of the target work action (step S76).
The start time of the element action is calculated with reference to the frames including the element action. The start time is measured in units of the determination time Tw in which the element action is first detected. The duration of the element action is the duration of the frames including the element action, i.e., a multiple of the determination time Tw corresponding to the number of consecutive intervals in which the element action is detected. For example, when the element action is detected within three consecutive determination times Tw, the duration is 3 × Tw. In this way, the integrated processing unit 104 repeatedly performs the same processing for all frames of the determination time Tw, determines the element action that most frequently has the maximum reliability P as the target work action, and obtains the start time and duration of the element action with reference to the frames including the element action (step S77).
Thus, in the example of fig. 8, the element action is determined at the intervals (t0 to t3, t3 to t6, t6 to t11) of the determination time Tw, and the following recognition results are output. The element action a is, for example, a warehousing action of a commodity, and the element action b is, for example, a walking action.
t0 to t3: element action b
t3 to t6: element action a
t6 to t11: element action a
As described above, according to the fifth embodiment, the determination time Tw for determining the element action is set, and the element action that most frequently has the maximum reliability P within the determination time Tw interval is determined as the target work action. This prevents determination errors due to noise in a working environment in which noise is likely to appear in the video, so the action of the worker 202 can be recognized accurately. In addition, the device can also cope with momentary irregular actions of the operator. In particular, since the fifth embodiment determines the element action based on the frequency with which the reliability P is the maximum, it can prevent determination errors due to noise or irregular behavior and obtain a more accurate recognition result than the third embodiment.
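The fifth embodiment's frequency-based rule can be sketched as follows; the number of center positions Q per determination interval Tw (frames_per_Tw) and the data structure are assumptions.

```python
from collections import Counter

def integrate_fifth_embodiment(reliabilities_per_Q, frames_per_Tw=90):
    """Within each determination interval Tw, count how often each element action has the
    maximum reliability P and select the most frequent one as the target work action."""
    decisions = []
    for start in range(0, len(reliabilities_per_Q), frames_per_Tw):
        window = reliabilities_per_Q[start:start + frames_per_Tw]
        counts = Counter(max(rel, key=rel.get) for _, rel in window)  # frequency of max-P actions
        if counts:
            decisions.append((start, counts.most_common(1)[0][0]))
    return decisions
```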
Sixth embodiment
The sixth embodiment will be explained next.
The sixth embodiment adds to the fifth embodiment the condition that the reliability P must be higher than the threshold value Thre in the integration processing. The basic configuration of the action recognition device 100 is the same as that of fig. 1 of the first embodiment. In the sixth embodiment, the action determining unit 104a of the integrated processing unit 104 has a function of comparing the reliability P of each element action within the interval of the determination time Tw, and determining, as the target work action, the element action that most frequently has the maximum reliability P with that reliability also exceeding the threshold value Thre. The processing operation of the sixth embodiment is described in detail below.
Fig. 19 is a flowchart for explaining the integrated processing operation performed by the integrated processing unit 104 of the action recognition device 100 according to the sixth embodiment. The integration processing shown in this flowchart is executed in step S27 of fig. 13.
As in the fifth embodiment, first, the integrated processing unit 104 sets a determination time Tw for determining the element action (step S81). The determination time Tw is equal to or longer than the time of F frames, and is arbitrarily set according to the environment of the work place 401, the work content, and the like.
The integrated processing unit 104 collectively acquires the element actions of all frames within the determination time Tw and the reliability P of the element actions from the action recognition unit 103, and stores them in a working memory (not shown) (step S82). In the sixth embodiment, the integrated processing unit 104 has a threshold value Thre set in advance as a criterion for determining an element action. The threshold value Thre is arbitrarily set according to the environment of the work place 401, the work content, and the like.
The integrated processing unit 104 searches the work memory and counts, for each element action, the number of times within the determination time Tw that it has the maximum reliability P and that reliability exceeds the threshold value Thre (step S83). The count value indicates how frequently the element action has the maximum, above-threshold reliability P within the determination time Tw.
The integrated processing unit 104 selects the element action with the largest count value, that is, the element action that most frequently has the maximum reliability P exceeding the threshold value Thre within the determination time Tw (step S84). The integrated processing unit 104 determines the selected element action as the target work action (step S85), and obtains the start time and duration of the target work action (step S86).
In this way, the integrated processing unit 104 repeats the same processing for all frames of the determination time Tw, determines, as the target work action, the element action that most frequently has the maximum reliability P exceeding the threshold value Thre, and obtains the start time and duration of the element action with reference to the frames including the element action (step S87).
Thus, in the example of fig. 8, the element action of the interval between the determination times Tw (t0 to t3, t3 to t6, t6 to t11) is determined, and the following recognition result is output. The element action a is, for example, a warehousing action of a commodity. The element action b is, for example, a walking action.
t0 to t3: element action b
t3 to t6: element action a
t6 to t11: element action a
As described above, according to the sixth embodiment, by adding the threshold value Thre of the element action reliability P as a condition, a series of actions of the operator can be recognized more accurately than in the fifth embodiment, for example in a work environment with a lot of noise.
Combination of various embodiments
The methods described in the above embodiments can be switched and used as appropriate depending on the work environment and the like. In this case, each function corresponding to all the embodiments may be embedded in the action recognition device 100 and selected according to the situation. "according to circumstances" includes, for example, a change in an article of manufacture of a production line such as a factory.
As a method of switching the respective functions, for example, a monitor can operate a mode switch, not shown, provided in the motion recognition device 100 to switch the respective functions. Further, for example, the functions may be switched based on a signal from a sensor for detecting an environment in which noise is likely to enter a video, such as an influence of illumination light or an influence of crowding.
The above embodiments can be used in the following different ways.
In the case where the recognition result needs to be output even if the reliability of the action recognition is low, that is, if it is desired to output the start time and the duration of the element action in a smaller time unit (for example, 1/30 seconds), the first embodiment is adopted.
The second embodiment is employed when it is desired to accurately output the recognition result in a detailed time unit.
The third embodiment is employed when it is desired to output the recognition result with importance placed on the maximum value of the reliability P in the time unit of the determination time Tw without outputting the recognition result in a detailed time unit.
The fourth embodiment is adopted when it is desired to output a recognition result more accurately than the third embodiment while emphasizing the maximum value of the reliability P in the time unit of the determination time Tw, without outputting the recognition result in a detailed time unit.
The fifth embodiment is employed when it is desired to output the recognition result with importance placed on the frequency of occurrence of the maximum value of the reliability P within the determination time Tw, without outputting the recognition result in a detailed time unit.
The sixth embodiment is adopted when it is desired to place importance on the frequency of occurrence of the maximum value of the reliability P within the determination time Tw without outputting the recognition result in a detailed time unit, and to output a more accurate recognition result than the fifth embodiment.
Composition of System
Fig. 20 is a schematic diagram illustrating an example of a motion recognition system using the motion recognition device 100.
The action recognition device 100 is incorporated in the information processing device 301. The information processing device 301 may be installed inside or outside the work place 201. A camera 203 capable of capturing video is provided in the work place 201 and captures the worker 202 there. The image (video) captured by the camera 203 is transmitted to the information processing device 301 by wire or wirelessly, and is passed to the action recognition device 100 via an I/F (interface) 302.
The action recognition device 100 recognizes the operation action of the operator 202 by the method described in the first to sixth embodiments. The information processing device 301 displays the recognition result of the action recognition device 100 in a predetermined format on a display device not shown. The information processing device 301 may also transmit the recognition result of the action recognition device 100 to the external monitoring device 304 via a communication network 303 such as the internet.
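The data flow of this system can be pictured with the following sketch; the transport, message format, frame_source, recognizer, and monitor_url are placeholders introduced here for illustration, since the patent does not prescribe them.

```python
import json
import urllib.request

def run_pipeline(frame_source, recognizer, monitor_url=None):
    """Illustrative data flow for the system of Fig. 20.

    frame_source yields chunks of frames from the camera 203 (via the
    I/F 302), recognizer stands in for the action recognition device
    100, and monitor_url stands in for the monitoring device 304.
    All three are placeholders; no transport or message format is
    specified by the patent.
    """
    for frames in frame_source:
        result = recognizer(frames)      # e.g. {"action": "a", "start": ..., "duration": ...}
        print("recognized:", result)     # stands in for the local display device
        if monitor_url is not None:      # optional forwarding over the network 303
            body = json.dumps(result).encode("utf-8")
            req = urllib.request.Request(
                monitor_url, data=body,
                headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req)
```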
Camera hardware structure
Fig. 21 is a schematic diagram showing an example of the hardware configuration of the camera 203.
Subject light is incident on a CCD (Charge Coupled Device) 3 through the photographing optical system 1. A mechanical shutter 2 is disposed between the photographing optical system 1 and the CCD3, and light incident on the CCD3 can be blocked by the mechanical shutter 2. The motor driver 6 drives the photographing optical system 1 and the mechanical shutter 2.
The CCD3 converts the optical image formed on the imaging plane into an electric signal and outputs it as analog image data. The image information output from the CCD3 has its noise components removed by a CDS (Correlated Double Sampling) circuit 4, is converted into a digital value by an A/D converter 5, and is output to an image processing circuit 8.
The image processing circuit 8 performs various image processes such as YCrCb conversion processing, white balance control processing, contrast compensation processing, edge emphasis processing, and color conversion processing, using an SDRAM (Synchronous Dynamic Random Access Memory) 12 that temporarily stores image data. The white balance process adjusts the color density of the image information, and the contrast compensation process adjusts its contrast. The edge emphasis process adjusts the sharpness of the image information, and the color conversion process adjusts its color tone. In addition, the image processing circuit 8 displays the image information subjected to the signal processing and the image processing on an LCD (Liquid Crystal Display) 16.
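As a rough illustration of the order of these stages, the sketch below strings together common software stand-ins (gray-world white balance, linear contrast stretch, unsharp-mask edge emphasis, gamma-based tone adjustment). The actual circuit 8 is hardware and the patent does not disclose its algorithms, so every step here is an assumption.

```python
import cv2
import numpy as np

def camera_pipeline(raw_bgr):
    """Illustrative ordering of the stages named for image processing circuit 8.

    raw_bgr: an 8-bit BGR frame; all algorithms below are common
    stand-ins, not the circuit's actual (undisclosed) processing.
    """
    # YCrCb conversion
    ycrcb = cv2.cvtColor(raw_bgr, cv2.COLOR_BGR2YCrCb)

    # white balance control: gray-world gain per channel (stand-in)
    img = raw_bgr.astype(np.float32)
    img *= img.mean() / (img.mean(axis=(0, 1)) + 1e-6)

    # contrast compensation: simple linear stretch (stand-in)
    img = cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX)

    # edge emphasis: unsharp mask (stand-in)
    blur = cv2.GaussianBlur(img, (0, 0), sigmaX=2.0)
    img = cv2.addWeighted(img, 1.5, blur, -0.5, 0)

    # color conversion / tone adjustment: gamma correction (stand-in)
    img = 255.0 * np.power(np.clip(img, 0, 255) / 255.0, 0.9)
    return ycrcb, img.astype(np.uint8)
```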
The image information subjected to the signal processing and the image processing is stored in the memory card 14 via the compression/expansion circuit 13. The compression/expansion circuit 13 is a circuit that compresses the image information output from the image processing circuit 8 in accordance with an instruction obtained from the operation unit 15 and outputs the compressed image information to the memory card 14, and expands the image information read from the memory card 14 and outputs the expanded image information to the image processing circuit 8.
The CCD3, CDS circuit 4, and A/D converter 5 are controlled in timing by the CPU9 via a timing signal generator 7 that generates timing signals. The image processing circuit 8, the compression/expansion circuit 13, and the memory card 14 are also controlled by the CPU9.
The camera 203 is provided with a CPU9 that performs various arithmetic processes according to a program. In addition, the camera 203 includes a ROM11 that stores programs and the like, and a RAM10 that serves as a work area for various processes and stores various data; these components are connected to each other via a bus.
Hardware configuration of action recognition device
Fig. 22 is a schematic diagram of an exemplary hardware configuration of the action recognition device 100.
The action recognition device 100 includes a CPU21, a nonvolatile memory 22, a main memory 23, a communication device 24, and the like.
The CPU21 is a hardware processor that controls the operation of the various components in the action recognition device 100. The CPU21 executes various programs loaded from the nonvolatile memory 22, which serves as storage, into the main memory 23.
The programs executed by the CPU21 include, in addition to an operating system (OS) and the like, a program for executing the various processes shown in the flowcharts of Figs. 12 to 19 (hereinafter referred to as the action recognition processing program). The CPU21 also executes programs for hardware control, such as the Basic Input/Output System (BIOS).
The image acquisition unit 101, the spatio-temporal feature extraction unit 102, the action recognition unit 103, the dictionary creation unit 105, the integrated processing unit 104, and the recognition result output unit 106 shown in Fig. 1 are realized, in whole or in part, by the CPU21 (computer) executing the action recognition processing program.
The action recognition processing program may be stored in a computer-readable storage medium or may be downloaded to the action recognition apparatus 100 via a network.
The CPU21 reads the action recognition processing program and executes the various processes corresponding to the above embodiments. For example, the CPU21 classifies each feature point extracted from the spatio-temporal image data of F frames into one of K types of element actions and creates a histogram T(k) for each. The CPU21 then obtains the similarity S(Ts, H) between the histogram T(k) of each feature point and the histogram H(k) of each feature point registered in the action recognition dictionary 105a, and uses this similarity as the reliability P of each element action. The CPU21 integrates the reliability P of each element action, discriminates the series of actions of the operator 202 in time-series order, and outputs the start time and duration of each action as the recognition result.
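A simplified sketch of this reliability computation follows, assuming a k-means-style codebook for classifying feature points and using a normalized histogram intersection in place of the unspecified similarity measure S; the function and variable names are introduced here for illustration and do not come from the patent.

```python
import numpy as np

def element_action_reliabilities(frame_features, codebook, dictionary):
    """Sketch of the histogram-based reliability computation.

    frame_features: (N, D) feature vectors pooled from F frames of
        spatio-temporal image data.
    codebook: (K, D) cluster centers used to classify each feature
        point into one of K element-action-related clusters (assumed
        to come from k-means-style learning).
    dictionary: mapping of element-action label -> reference histogram
        H(k) of length K from the action recognition dictionary 105a.
    The similarity S is not specified in this text, so a normalized
    histogram intersection is used as a stand-in.
    """
    # classify every feature point to its nearest of the K clusters
    d = np.linalg.norm(frame_features[:, None, :] - codebook[None, :, :], axis=2)
    labels = d.argmin(axis=1)

    # histogram T(k): how often each cluster occurs in these frames
    t = np.bincount(labels, minlength=codebook.shape[0]).astype(float)
    t /= t.sum() + 1e-9

    # reliability P of each element action = similarity between T(k) and H(k)
    reliabilities = {}
    for action, h in dictionary.items():
        h = np.asarray(h, dtype=float)
        h = h / (h.sum() + 1e-9)
        reliabilities[action] = float(np.minimum(t, h).sum())  # histogram intersection
    return reliabilities
```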
Part or all of the image acquisition unit 101, the spatio-temporal feature extraction unit 102, the action recognition unit 103, the dictionary creation unit 105, the integrated processing unit 104, and the recognition result output unit 106 shown in Fig. 1 may be realized by hardware such as an IC (Integrated Circuit), or by a combination of software and hardware. The communication device 24 communicates with external devices by wire or wirelessly.
As described above, according to at least one embodiment, a series of actions of the operator during work can be recognized with high accuracy. In particular, by comparing, in a predetermined time unit, the reliability of each element action extracted from the feature changes of the frame images to determine the operation action of the operator, the most accurate element action can be output as the recognition result even when a plurality of element actions overlap in time. In contrast, in a method that, for example, tracks the trajectory of a person's center of gravity in each frame image, the operation action of the operator cannot be recognized accurately when a plurality of element actions overlap in time.
While certain embodiments of the invention have been described above, these embodiments have been presented by way of example, and are not intended to limit the scope of the invention. These new embodiments may be implemented in other various forms, and various omissions, substitutions, and changes may be made without departing from the spirit of the invention. These embodiments and modifications thereof are included in the scope and gist of the invention, and are also included in the invention within the scope of the claims and the equivalent scope thereof.
100 … action recognition device, 101 … image acquisition unit, 102 … spatio-temporal feature extraction unit, 103 … action recognition unit, 104 … integrated processing unit, 104a … action determination unit, 104b … action time calculation unit, 105 … dictionary creation unit, 105a … action recognition dictionary, 106 … recognition result output unit, 201 … work place, 202 … worker, 203 … camera, 204 … box, 301 … information processing device, 302 … I/F, 303 … communication network, 304 … monitoring device, 21 … CPU, 22 … nonvolatile memory, 23 … main memory, 24 … communication device.

Claims (8)

1. An action recognition device for recognizing a standard operation predetermined as a monitoring target from an image of an operator, comprising:
an image acquisition unit configured to acquire a plurality of frame images included in the video;
an action recognition unit configured to recognize a plurality of element actions included in the standard operation from characteristic changes of the frame images and to obtain the reliability of each element action; and an action determination unit configured to comprehensively process the reliability of each element action and determine the operation action of the operator among the element actions.
2. The action recognition device according to claim 1, wherein the action determination unit determines the operation actions of the operator among the element actions in chronological order.
3. The action recognition device according to claim 1 or 2, further comprising an action time calculation unit configured to calculate a start time and a duration of the element action determined to be the operation action of the operator.
4. The action recognition device according to claim 1 or 2, wherein the action determination unit compares the reliability of each of the element actions in units of frames, and determines an element action with high reliability as the operation action of the operator.
5. The action recognition device according to claim 1 or 2, wherein the action determination unit compares the reliability of each of the element actions at regular time intervals, and determines an element action with high reliability as the operation action of the operator.
6. The action recognition device according to claim 1 or 2, wherein the action determination unit compares the reliability of each of the element actions at regular time intervals, and determines, as the operation action of the operator, an element action that occurs with high frequency among the element actions with high reliability.
7. The action recognition device according to any one of claims 4 to 6, wherein the action determination unit determines the operation action of the operator on the condition that the element action has a reliability higher than a threshold value preset as a determination reference.
8. An action recognition method for recognizing a standard operation predetermined as a monitoring target from an image of an operator, comprising:
an image acquisition step of acquiring a plurality of frame images included in the video;
an action recognition step of recognizing a plurality of element actions included in the standard operation from the characteristic changes of each frame image and obtaining the reliability of each element action; and an action determination step of comprehensively processing the reliability of each element action and determining the operation action of the operator among the element actions.
CN201911181304.1A 2018-11-30 2019-11-27 Action recognition device and action recognition method Withdrawn CN111259723A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018225439A JP7222231B2 (en) 2018-11-30 2018-11-30 Action recognition device, action recognition method and program
JP2018-225439 2018-11-30

Publications (1)

Publication Number Publication Date
CN111259723A true CN111259723A (en) 2020-06-09

Family

ID=70908478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911181304.1A Withdrawn CN111259723A (en) 2018-11-30 2019-11-27 Action recognition device and action recognition method

Country Status (2)

Country Link
JP (1) JP7222231B2 (en)
CN (1) CN111259723A (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183363A (en) * 2020-09-29 2021-01-05 北京市政建设集团有限责任公司 Intelligent analysis method and system for safety behaviors of subway engineering constructors
CN112580543B (en) * 2020-12-24 2024-04-16 四川云从天府人工智能科技有限公司 Behavior recognition method, system and device
JP2022109646A (en) * 2021-01-15 2022-07-28 オムロン株式会社 Operation level conversion device, operation level conversion method, and operation level conversion program
JP2022176819A (en) * 2021-05-17 2022-11-30 株式会社日立製作所 Work recognition device and work recognition method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101281595A (en) * 2007-04-04 2008-10-08 索尼株式会社 Apparatus and method for face recognition and computer program
US20080310734A1 (en) * 2007-06-18 2008-12-18 The Regents Of The University Of California High speed video action recognition and localization
JP2011034234A (en) * 2009-07-30 2011-02-17 Kozo Keikaku Engineering Inc Movement analysis device, movement analysis method and movement analysis program
CN102737249A (en) * 2011-04-14 2012-10-17 株式会社日立制作所 Object identification device and object identification method

Also Published As

Publication number Publication date
JP2020087312A (en) 2020-06-04
JP7222231B2 (en) 2023-02-15

Similar Documents

Publication Publication Date Title
CN111259723A (en) Action recognition device and action recognition method
US9747523B2 (en) Information processing apparatus, information processing method, and recording medium
JP5701005B2 (en) Object detection apparatus, object detection method, surveillance camera system, and program
KR101337060B1 (en) Imaging processing device and imaging processing method
EP2549759B1 (en) Method and system for facilitating color balance synchronization between a plurality of video cameras as well as method and system for obtaining object tracking between two or more video cameras
CN109727275B (en) Object detection method, device, system and computer readable storage medium
CN110782433B (en) Dynamic information violent parabolic detection method and device based on time sequence and storage medium
KR101750094B1 (en) Method for classification of group behavior by real-time video monitoring
US8923552B2 (en) Object detection apparatus and object detection method
CN105100596A (en) Camera system and method of tracking object using the same
KR102282470B1 (en) Camera apparatus and method of object tracking using the same
KR102391853B1 (en) System and Method for Processing Image Informaion
US20230394792A1 (en) Information processing device, information processing method, and program recording medium
US11019251B2 (en) Information processing apparatus, image capturing apparatus, information processing method, and recording medium storing program
JP7210890B2 (en) Behavior recognition device, behavior recognition method, its program, and computer-readable recording medium recording the program
CN108184098B (en) Method and system for monitoring safety area
KR20140000821A (en) Method and system for motion detection using elimination of shadow by heat
CN112884805A (en) Cross-scale self-adaptive mapping light field imaging method
JP6939065B2 (en) Image recognition computer program, image recognition device and image recognition method
JP3625442B2 (en) Object detection method, object detection apparatus, and object detection program
JPH09322153A (en) Automatic monitor
JP2007336431A (en) Video monitoring apparatus and method
KR102107137B1 (en) Method and Apparatus for Detecting Event by Using Pan-Tilt-Zoom Camera
JP7243372B2 (en) Object tracking device and object tracking method
JP3736836B2 (en) Object detection method, object detection apparatus, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (Application publication date: 20200609)