WO2023188264A1 - Information processing system - Google Patents

Information processing system

Info

Publication number
WO2023188264A1
WO2023188264A1 (PCT/JP2022/016510)
Authority
WO
WIPO (PCT)
Prior art keywords
section
feature amount
video data
feature
information processing
Prior art date
Application number
PCT/JP2022/016510
Other languages
French (fr)
Japanese (ja)
Inventor
隆平 安藤
Original Assignee
NEC Corporation (日本電気株式会社)
Priority date
Filing date
Publication date
Application filed by NEC Corporation (日本電気株式会社)
Priority to PCT/JP2022/016510 priority Critical patent/WO2023188264A1/en
Publication of WO2023188264A1 publication Critical patent/WO2023188264A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis

Definitions

  • the present invention relates to an information processing system, an information processing method, and a program.
  • Patent Document 1 describes a method for extracting feature amounts of the behavior of a moving object from a video consisting of spatiotemporal information and estimating the behavior. Specifically, Patent Document 1 describes that a motion section from the start to the end of an action in a video is estimated, and the behavior is estimated based on the feature amount of the video of this motion section.
  • Patent Document 1 has a problem in that erroneous recognition may occur if the estimation of the motion section is not accurate. For example, if the estimated motion section includes a transition from one behavior to another, data not expected by the behavior estimation model is mixed in, making behavior estimation difficult and causing erroneous recognition.
  • an object of the present invention is to provide an information processing system that can solve the above-mentioned problem that erroneous recognition may occur when recognizing an action from a video.
  • An information processing system that is one form of the present invention includes: a feature amount generation unit that generates, from section video data that is video data divided into predetermined time sections, a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is a section other than the first section; and a learning unit that, when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, learns the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount.
  • An information processing method that is one form of the present invention includes: generating, from section video data that is video data divided into predetermined time sections, a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is a section other than the first section; and, when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, learning the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount.
  • A program that is one form of the present invention causes an information processing device to execute processing to: generate, from section video data that is video data divided into predetermined time sections, a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is a section other than the first section; and, when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, learn the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount.
  • the present invention can suppress erroneous recognition when recognizing an action from a video.
  • FIG. 1 is a block diagram showing the overall configuration of an action recognition system in Embodiment 1 of the present invention.
  • FIG. 2 is a block diagram showing the configuration of the learning device disclosed in FIG. 1.
  • FIG. 3 is a block diagram showing the configuration of the estimation device disclosed in FIG. 1.
  • FIG. 4 is a diagram showing a state of processing by the learning device disclosed in FIG. 1.
  • FIG. 5 is a diagram showing a state of processing by the learning device disclosed in FIG. 1.
  • FIG. 6 is a diagram showing a state of processing by the learning device disclosed in FIG. 1.
  • FIG. 7 is a flowchart showing the operation of the learning device disclosed in FIG. 1.
  • FIG. 8 is a flowchart showing the estimation operation disclosed in FIG. 1.
  • FIG. 9 is a block diagram showing the configuration of the estimation device in Embodiment 2 of the present invention.
  • FIG. 10 is a flowchart showing the operation of the estimation device in Embodiment 2 of the present invention.
  • FIG. 11 is a block diagram showing the hardware configuration of an information processing system in Embodiment 3 of the present invention.
  • FIG. 12 is a block diagram showing the configuration of an information processing system in Embodiment 3 of the present invention.
  • FIG. 13 is a flowchart showing the operation of the information processing system in Embodiment 3 of the present invention.
  • A first embodiment of the present invention will be described with reference to FIGS. 1 to 8. FIGS. 1 to 3 are diagrams for explaining the configuration of the behavior recognition system, and FIGS. 4 to 8 are diagrams for explaining its processing operation.
  • the behavior recognition system 1 of the present invention generates a behavior estimation model by machine learning in order to recognize the behavior of a person from video data, and uses the generated behavior estimation model to recognize the behavior of a person in new video data.
  • the behavior recognition system 1 can be used for safety management at a construction site, and can be used to recognize whether or not a worker has performed a safety confirmation behavior such as a pointing gesture.
  • the behavior recognition system 1 can record in a database, from video captured by surveillance cameras installed at construction sites, when, where, and how many times workers performed safety confirmation tasks such as pointing checks, and can send alerts to sites where safety confirmation work is not being performed.
  • the behavior recognition system 1 can also be used for man-hour management at construction sites; by recording which tasks the worker in the video performed, how many times, and for how long, it is possible to check whether the work is proceeding as expected.
  • actions recognized by the action recognition system 1 will be described using a person's pointing action, walking action, and crouching action as examples; however, any action may be recognized, and the system may be used for behavior recognition of any target, not only persons.
  • the behavior recognition system 1 includes a learning device 10, a storage device 20, and an estimation device 30, as shown in FIG.
  • the learning device 10 is a device that performs learning of a learning model used for estimating behavioral information from time-series data based on time-series data (also referred to as learning data) used for learning the learning model.
  • the storage device 20 is a device that can refer to and write data, and is a device that stores learning data, deep learning model parameters, and the like.
  • the estimation device 30 is a device that configures an estimator by referring to the learned parameters stored in the storage device 20 and generates (outputs) information regarding the behavior of the estimation target. Each device will be explained in detail below.
  • the storage device 20 is composed of one or more information processing devices including an arithmetic device and a storage device.
  • the storage device 20 includes a learning data storage section 21 and a parameter storage section 22, as shown in FIG.
  • the learning data storage unit 21 is a device that stores learning data for performing learning processing of the learning device 10.
  • the learning data, as shown by reference numeral D1 in FIG. 4, is video data consisting of a plurality of frames that are consecutive in time series, and is generated as time-series clips, that is, section video data divided into predetermined time sections, by cutting out windows of a predetermined width Sw.
  • as shown by reference numeral D2 in FIG. 4, frames are cut out while sliding the window at a sliding interval St, and time-series clips are sequentially generated and used as learning data.
  • this method is called a sliding window method.
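  • The following is an illustrative sketch of the sliding window segmentation described above (not part of the patent disclosure); the function `make_clips` and the parameter names `window_width_sw` and `slide_interval_st` are hypothetical stand-ins for the window width Sw and the sliding interval St.

```python
import numpy as np

def make_clips(frames: np.ndarray, window_width_sw: int, slide_interval_st: int):
    """Cut a frame sequence of shape (T, H, W, C) into fixed-width time-series clips.

    Each clip corresponds to one piece of section video data of width Sw, and
    consecutive clips are offset by the sliding interval St.
    """
    clips = []
    for start in range(0, frames.shape[0] - window_width_sw + 1, slide_interval_st):
        clips.append(frames[start:start + window_width_sw])
    return clips

# Example: 300 frames of 64x64 RGB video, window width Sw=32, sliding interval St=8.
video = np.zeros((300, 64, 64, 3), dtype=np.uint8)
clips = make_clips(video, window_width_sw=32, slide_interval_st=8)
print(len(clips), clips[0].shape)  # 34 (32, 64, 64, 3)
```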
  • inference data, which is video data input to the estimation device 30 from an external device, also has a similar configuration.
  • each piece of learning data is associated with correct answer information, which is the behavior information (correct behavior) to be estimated for the corresponding learning data.
  • the correct answer information includes identification information of an action that is a correct answer.
  • for example, the correct answer information associated with the target learning data includes identification information indicating that the person is walking.
  • the parameter storage unit 22 is a device that stores parameters obtained by learning a learning model.
  • the learning model may be a learning model based on a neural network, another type of learning model such as a support vector machine, or a combination thereof.
  • the parameters include the layer structure, the neuron structure of each layer, the number and size of filters in each layer, and the weight of each element of each filter. Note that before learning is executed, initial values of parameters to be applied to the learning model are stored in the parameter storage unit 22, and the parameters are updated each time learning is performed by the learning device 10, as will be described later.
  • the learning device 10 is composed of one or more information processing devices including an arithmetic device and a storage device. As shown in FIG. 2, the learning device 10 includes a feature extraction unit 11, an action section detection unit 12, an intra-section feature extraction unit 13, an out-of-section feature extraction unit 14, an identification unit 15, and a learning unit 16. Each function of the feature extraction unit 11, the action section detection unit 12, the intra-section feature extraction unit 13, the out-of-section feature extraction unit 14, the identification unit 15, and the learning unit 16 can be realized by the arithmetic device executing a program, stored in the storage device, for realizing the respective function. Each configuration will be explained in detail below.
  • the feature extraction unit 11 acquires the learning data of the window width as described above from the learning data storage unit 21, and converts the acquired learning data of the window width into the feature quantity F.
  • the feature amount F has a width in the time direction, as in the example shown in the figure.
  • the feature amount F is, for example, three-dimensional data in the time direction, the direction of the skeleton points of the person, and the dimensional direction of the vector amount calculated at each time and position, as calculated by a neural network as described in Non-Patent Document 1.
  • alternatively, it may be two-dimensional data in the time direction and the dimensional direction of the feature amount, collapsed by taking the maximum value or the average value along the direction of the skeleton points.
  • the time direction may be for each frame, or may be compressed by convolution processing of a neural network.
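  • As a hedged illustration of the feature amount F described above, the snippet below assumes hypothetical sizes (T time steps, K skeleton points, D feature dimensions) and shows how the skeleton-point axis can be collapsed by a maximum or an average to obtain two-dimensional data.

```python
import numpy as np

# Hypothetical sizes: T time steps, K skeleton points, D feature dimensions per point.
T, K, D = 32, 17, 64
feature_f = np.random.rand(T, K, D)  # three-dimensional feature amount F

# Collapse the skeleton-point axis by taking the maximum or the average, giving
# two-dimensional data in the time direction and the dimensional direction.
feature_f_2d_max = feature_f.max(axis=1)    # shape (T, D)
feature_f_2d_mean = feature_f.mean(axis=1)  # shape (T, D)
print(feature_f_2d_max.shape, feature_f_2d_mean.shape)
```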
  • the feature extraction unit 11 configures a feature extractor by applying the parameters stored in the parameter storage unit 22 to a learning model trained to output the feature amount F from the input learning data. Then, the feature extraction unit 11 supplies the feature amount F, obtained by inputting the learning data to the feature extractor, to the intra-section feature extraction unit 13 and the out-of-section feature extraction unit 14.
  • the action section detection unit 12 acquires learning data from the learning data storage unit 21, detects a section important for estimating behavior information (first section) from the acquired learning data, and outputs section information S.
  • the section important for estimating behavior information here is a section that is useful as a basis for judgment when estimating behavior, that is, a section for which the error is reduced when the estimate is later compared with the correct answer information linked to the learning data.
  • the action section detection unit 12 outputs the action section corresponding to the correct information of the learning data as the section information S.
  • the action section detection unit 12 configures an action section detector by applying the parameters stored in the parameter storage unit 22 to a learning model trained to output the section information S from the input learning data.
  • the action section detector estimates the action section to be recognized along the time direction of the feature amount F using, for example, a neural network that estimates deformation parameters of the feature amount as described in Non-Patent Document 2. Alternatively, a section detector that directly learns section detection by preparing correct values for the sections may be used.
  • for example, when the action section detection unit 12 sets the pointing action as the action to be recognized, the above-described learning causes the section in which the pointing action is performed to be output as the section information S, and as a result, sections in which other actions are performed are detected as being outside the action section (second section).
  • the frame section shown in gray is detected as the action section to be recognized, and the frame section indicated by reference numeral Da is detected as being outside the action section.
  • the action section detection section 12 supplies the obtained section information S to the intra-section feature extraction section 13 and the out-of-section feature extraction section 14, respectively.
  • the intra-section feature extraction unit 13 (feature amount generation unit) and the out-of-section feature extraction unit 14 (feature amount generation unit) respectively generate an intra-section feature amount F1 and an out-of-section feature amount F2 from the feature amount F supplied from the feature extraction unit 11 and the section information S supplied from the action section detection unit 12.
  • the intra-section feature extraction unit 13 extracts from the feature amount F the portion corresponding to the section information S, as in Non-Patent Document 2, and applies resizing processing, warping processing, or the like so that the size in the time direction is always constant regardless of the size of the section information S, thereby generating an intra-section feature amount F1 (first feature amount).
  • the intra-section feature amount F1 may be three-dimensional data in the time direction, the direction of the skeleton positions, and the dimensional direction of the vector amount calculated at each time and each position, or it may be two-dimensional data in the time direction and the dimensional direction.
  • the intra-section feature extraction unit 13 supplies the generated intra-section feature amount F1 to the identification unit 15.
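  • A minimal sketch of generating the intra-section feature amount F1, assuming a two-dimensional feature amount F of shape (time, dimension) and using nearest-neighbour resampling as one possible stand-in for the resizing or warping processing; the helper name `intra_section_feature` is hypothetical.

```python
import numpy as np

def intra_section_feature(feature_f: np.ndarray, start: int, end: int, target_len: int = 16):
    """Cut the section [start, end) out of the feature amount F (shape (T, D)) and
    resize its time axis to a fixed length, yielding an intra-section feature amount F1.

    Nearest-neighbour resampling stands in for the resizing/warping processing; any
    method that makes the temporal size constant would do.
    """
    section = feature_f[start:end]
    src_idx = np.round(np.linspace(0, section.shape[0] - 1, target_len)).astype(int)
    return section[src_idx]

f = np.random.rand(32, 64)                      # feature amount F
f1 = intra_section_feature(f, start=10, end=22)
print(f1.shape)  # (16, 64): constant temporal size regardless of the section length
```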
  • the out-of-section feature extraction unit 14 extracts from the feature amount F, in the same manner as described above, the feature amounts of the sections other than the section corresponding to the section information S, that is, the sections detected as being outside the action section, and generates an out-of-section feature amount F2 (second feature amount) from them. When there are a plurality of such sections, the out-of-section feature extraction unit 14 connects them in the time direction to generate a single out-of-section feature amount F2. Then, the out-of-section feature extraction unit 14 supplies the generated out-of-section feature amount F2 to the identification unit 15.
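  • A corresponding sketch for the out-of-section feature amount F2, assuming a single detected section [start, end) whose surroundings are connected in the time direction; the helper name `out_of_section_feature` is hypothetical.

```python
import numpy as np

def out_of_section_feature(feature_f: np.ndarray, start: int, end: int):
    """Concatenate, along the time axis, the parts of F before and after the detected
    action section [start, end), yielding a single out-of-section feature amount F2."""
    return np.concatenate([feature_f[:start], feature_f[end:]], axis=0)

f = np.random.rand(32, 64)                       # feature amount F
f2 = out_of_section_feature(f, start=10, end=22)
print(f2.shape)  # (20, 64): 10 frames before the section plus 10 frames after it
```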
  • in the above description, the feature amount F is generated from the learning data of the window width, and then the intra-section feature amount F1 and the out-of-section feature amount F2 are generated from it; however, the method for generating the intra-section feature amount F1 and the out-of-section feature amount F2 is not limited to this. For example, the intra-section feature extraction unit 13 may generate the intra-section feature amount F1 directly from the portion of the learning data that corresponds to the section information S, and the out-of-section feature extraction unit 14 may generate the out-of-section feature amount F2 directly from the portion of the learning data that does not correspond to the section information S.
  • the identification unit 15 generates information regarding the target behavior based on the intra-section feature amount F1 supplied from the intra-section feature extraction unit 13 and the out-of-section feature amount F2 supplied from the out-of-section feature extraction unit 14. At this time, the identification unit 15 configures a behavior information output device by applying the parameters stored in the parameter storage unit 22 to a learning model trained to output intra-section behavior information If1 and out-of-section behavior information If2 from the input intra-section feature amount F1 and out-of-section feature amount F2, respectively.
  • the intra-section behavior information If1 and the out-of-section behavior information If2 are, for example, the identification information of the behavior corresponding to the target learning data or the score values of the estimated behavior, and are vectors having the same number of dimensions as the number of defined behavior categories.
  • when the intra-section feature amount F1 and the out-of-section feature amount F2 are three-dimensional data, the maximum value or average value is taken along the direction of the skeleton points to collapse them into two dimensions, identification processing is performed for each dimension in the time direction, and then averaging processing is performed in the time direction. Therefore, the output of the identification unit 15 is a vector quantity having the same number of dimensions as the number of behavior categories to be recognized.
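  • The identification step described above could look roughly as follows; the linear scoring layer is an assumption standing in for the learned behavior information output device, and all names are hypothetical.

```python
import numpy as np

def identify(feature: np.ndarray, weights: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Collapse the skeleton axis if the input is three-dimensional, score each time
    step with a linear layer (a stand-in for the learned identifier), and average over
    time so that the output has as many dimensions as there are behaviour categories."""
    if feature.ndim == 3:                        # (T, K, D) -> (T, D)
        feature = feature.mean(axis=1)
    per_time_scores = feature @ weights + bias   # (T, num_classes)
    return per_time_scores.mean(axis=0)          # (num_classes,)

T, D, num_classes = 16, 64, 5
w, b = np.random.randn(D, num_classes) * 0.01, np.zeros(num_classes)
if1 = identify(np.random.rand(T, D), w, b)       # intra-section behaviour information If1
print(if1.shape)  # (5,)
```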
  • the identification unit 15 supplies the learning unit 16 with the in-section behavior information If1 and the out-of-section behavior information If2 obtained by inputting the in-section feature amount F1 and the out-of-section feature amount F2 to the behavior information output device, respectively.
  • the learning unit 16 acquires the correct answer information corresponding to the learning data input to the feature extraction unit 11 from the learning data storage unit 21. Then, the learning unit 16 trains the feature extraction unit 11, the action section detection unit 12, and the identification unit 15 based on the acquired correct answer information and the intra-section behavior information If1 and out-of-section behavior information If2 supplied from the identification unit 15. At this time, the learning unit 16 calculates a loss value L1 from the error between the behavior indicated by the intra-section behavior information If1 and the correct answer information, and a loss value L2 from the out-of-section behavior information If2, calculates a loss from these values, and updates the parameters of the feature extraction unit 11, the action section detection unit 12, and the identification unit 15 based on this loss.
  • the loss value L1 may be calculated using any loss function used in machine learning, such as softmax cross entropy error and mean square error.
  • the loss value L2 is calculated so that the output values become equal across all behavior categories. For example, the loss value L2 may be the average of the negative logarithm of the per-category average of the softmax values of the out-of-section behavior information If2, or a constraint may be imposed so that 0 is output for all behavior categories. For example, when the correct behavior is pointing, the loss L1 becomes smaller as the value of the dimension corresponding to the pointing behavior in the vector of the intra-section behavior information If1 becomes the largest, and the loss L2 becomes smaller as the values of all dimensions of the vector of the out-of-section behavior information If2 become more uniform.
  • the learning unit 16 determines each parameter so as to minimize these losses.
  • for example, the threshold value is set to 0.7; for an inference result exceeding 0.7, the threshold is subtracted from the result and the squared value is added to the loss, and for a result that is too small, similar processing is performed when the result falls below the threshold.
  • the algorithm for determining the parameters described above so as to minimize the loss may be any learning algorithm used in machine learning, such as gradient descent or error backpropagation.
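  • A minimal sketch of the loss computation, assuming per-time score matrices for If1 and If2; the softmax cross-entropy for L1 and the uniformity term for L2 follow the options named above, while the function names are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def loss_l1(if1_scores: np.ndarray, correct_class: int) -> float:
    """Softmax cross-entropy between the time-averaged intra-section scores If1 and
    the correct behaviour class (one of the loss functions mentioned above)."""
    p = softmax(if1_scores.mean(axis=0))
    return float(-np.log(p[correct_class] + 1e-12))

def loss_l2(if2_scores: np.ndarray) -> float:
    """Uniformity loss for the per-time out-of-section scores If2: the mean negative
    logarithm of the per-class softmax values averaged over time, which is smallest
    when every behaviour class is equally likely."""
    p = softmax(if2_scores, axis=-1).mean(axis=0)  # (num_classes,)
    return float(-np.log(p + 1e-12).mean())

if1_scores = np.random.randn(16, 5)  # (time, classes) scores for the intra-section feature
if2_scores = np.random.randn(12, 5)  # (time, classes) scores for the out-of-section feature
print(loss_l1(if1_scores, correct_class=2) + loss_l2(if2_scores))
```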
  • the learning section 16 stores the determined parameters of the feature extraction section 11, action section detection section 12, and identification section 15 in the parameter storage section 22.
  • the learning device 10 has the function of detecting an interval useful for behavior estimation in time-series data using the action interval detection unit 12.
  • the intra-section feature extraction unit 13 and the out-of-section feature extraction unit 14 cut out the intra-section feature amount F1 and the out-of-section feature amount F2 from the feature amount F output by the feature extraction unit 11.
  • when the intra-section feature amount F1 is passed through the identification unit 15, learning proceeds so that the score value corresponding to the correct behavior class becomes the largest, and when the out-of-section feature amount F2 is passed through the identification unit 15, learning proceeds so that the score values of all behavior classes become uniform and no class stands out.
  • the learning device 10 handles not only information within the section but also information outside the section at the same time; therefore, given sufficient data, the behavior estimation model can be made to operate so that, when data from a section of low importance is input during behavior estimation, the score value of no behavior class stands out significantly.
  • for example, when the recognition target is a person's pointing, walking, or crouching behavior and the behavior candidate section approaches a section where one behavior is transitioning to another, a conventional model does not take such a section into consideration, so a result in which the score value of a certain behavior class stands out may be output, and false detections occur.
  • in contrast, erroneous detection can be suppressed by learning to detect sections useful for behavior estimation and by proceeding with learning so that all score values become uniform outside such sections.
  • note that the action section detection unit 12 performs section detection based on the learning data received from the learning data storage unit 21, but it may also perform section detection using the feature amount F output by the feature extraction unit 11 as input.
  • the estimation device 30 is configured with one or more information processing devices including an arithmetic device and a storage device.
  • the estimation device 30 includes a feature extraction section 31, an identification section 35, and an output section 36, as shown in FIG.
  • the functions of the feature extraction section 31, the identification section 35, and the output section 36 can be realized by the arithmetic unit executing a program stored in the storage device for realizing each function.
  • the feature extraction unit 31 acquires time series data input from an external device, and converts the acquired time series data into a feature F (target feature).
  • the time-series data input from the external device is inference data to be targeted for behavior identification, and is video data (target video data) similar to the learning data described above. That is, as shown in FIG. 4, the inference data is video data consisting of a plurality of consecutive frames (image sequences) along the time series, and is a time series clip cut out with a window of a predetermined width Sw.
  • the inference data input to the estimation device 30 may be data obtained by further extracting information from the image sequence, such as skeletal information.
  • the external device that inputs the inference data may be a camera if the input is an image sequence, or it may be a storage device that stores a generated image sequence or information extracted from the image sequence.
  • the feature extraction unit 31 configures a feature extractor based on the parameters by referring to the parameters stored in the parameter storage unit 22 and obtained through the learning process by the learning device 10. Then, the feature extraction unit 31 supplies the feature amount F obtained by inputting the inference data to the feature extractor to the identification unit 35.
  • the identification unit 35 generates behavior information Ifa from the feature amount F supplied from the feature extraction unit 31.
  • the identification unit 35 performs estimation for each dimension of the feature amount F in the time direction.
  • the identification unit 35 configures the behavior information output device by referring to the parameters stored in the parameter storage unit 22. Then, the identification unit 35 supplies the behavior information Ifa obtained by inputting the feature amount F to the behavior information output device to the output unit 36.
  • the output unit 36 outputs identification information of the behavior to be extracted to an external device based on the behavior information Ifa. Since the behavior information Ifa is output for each dimension in the time direction, it is compressed in the time direction by averaging or summing the score values and output as a vector quantity whose number of dimensions equals the number of behavior classes to be recognized. This vector quantity becomes the identification information of the action at the time at the center of this window.
  • when the input time-series data is divided at fixed window intervals by the sliding window method, the behavior identification information based on the behavior information Ifa of each window is arranged in chronological order; for the score values arranged in this way, the time when a certain threshold is exceeded is set as the start point and the time when the score falls below a certain threshold is set as the end point, and the identification information of the action and its start and end points are output.
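  • The start-point and end-point determination by thresholding could be sketched as follows; the threshold values and the helper name `detect_action_spans` are hypothetical.

```python
import numpy as np

def detect_action_spans(scores: np.ndarray, start_thr: float = 0.6, end_thr: float = 0.4):
    """Turn a chronological series of per-window score values for one behaviour class
    into (start, end) index pairs: a span starts when the score exceeds start_thr and
    ends when it falls below end_thr, as described above."""
    spans, start = [], None
    for t, s in enumerate(scores):
        if start is None and s > start_thr:
            start = t
        elif start is not None and s < end_thr:
            spans.append((start, t))
            start = None
    if start is not None:
        spans.append((start, len(scores)))
    return spans

window_scores = np.array([0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.2, 0.1])
print(detect_action_spans(window_scores))  # [(2, 5)]
```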
  • note that the estimation processing performed by the estimation device 30 described above does not perform the section detection that is performed in the learning processing by the learning device 10. This is because, through the learning process, the behavior estimation model reacts and its score value increases only in sections characteristic of the behavior; therefore, when the time-series data is scanned in a sliding window manner, the score values are arranged in time series, and the start and end points are determined by thresholding, there is no need to perform section detection within the model.
  • the feature extraction unit 11 acquires learning data from the learning data storage unit 21 (step S1). At this time, the feature extraction unit 11 acquires learning data that has not yet been used for learning (that is, not acquired in step S1) from among the learning data stored in the learning data storage unit 21. Then, the feature extraction unit 11 generates a feature amount F from the learning data acquired in step S1 by configuring a feature extractor with reference to the parameters stored in the parameter storage unit 22 (step S2).
  • the action section detection section 12 generates section information S from the learning data by configuring an action section detector with reference to the parameters stored in the parameter storage section 22 (step S3).
  • the in-section feature extraction section 13 and the out-of-section feature extraction section 14 generate an in-section feature amount F1 and an outside-section feature amount F2, respectively (step S4).
  • the identification unit 15 refers to the parameters stored in the parameter storage unit 22 to configure a behavior information output device, and generates intra-section behavior information If1 from the intra-section feature amount F1 generated by the intra-section feature extraction unit 13 and out-of-section behavior information If2 from the out-of-section feature amount F2 generated by the out-of-section feature extraction unit 14 (step S5).
  • the learning unit 16 calculates a loss based on the intra-section behavior information If1 and out-of-section behavior information If2 generated by the identification unit 15 and the correct answer information stored in the learning data storage unit 21 in association with the target learning data (step S6). Further, the learning unit 16 updates the parameters used by each of the feature extraction unit 11, the action section detection unit 12, and the identification unit 15 based on the loss calculated in step S6 (step S7). At this time, the learning unit 16 stores the respective parameters used by the feature extraction unit 11, the action section detection unit 12, and the identification unit 15 in the parameter storage unit 22.
  • the learning device 10 determines whether the learning end condition is satisfied (step S8).
  • the learning device 10 may determine the learning end condition by determining whether a predetermined number of loops has been reached, whether learning has been performed on a preset number of pieces of learning data, whether the loss has fallen below a preset threshold, or whether the change in the loss has fallen below a preset threshold.
  • the determination in step S8 may be a combination of the above examples, or any other determination method may be used.
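  • The end-condition check of step S8 might be combined as in the following sketch, where every threshold value is an arbitrary placeholder and the function name is hypothetical.

```python
def learning_should_end(loop_count: int, samples_seen: int, loss: float, prev_loss: float,
                        max_loops: int = 1000, max_samples: int = 50000,
                        loss_thr: float = 1e-3, loss_change_thr: float = 1e-5) -> bool:
    """Combine the end conditions listed above: loop count, number of learning samples,
    loss below a threshold, or change in loss below a threshold."""
    return (loop_count >= max_loops
            or samples_seen >= max_samples
            or loss < loss_thr
            or abs(prev_loss - loss) < loss_change_thr)

print(learning_should_end(loop_count=10, samples_seen=320, loss=0.002, prev_loss=0.0021))
```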
  • if the learning end condition is satisfied in step S8 (Yes in step S8), the learning device 10 ends the processing of this flowchart. On the other hand, if the learning end condition is not satisfied (No in step S8), the learning device 10 returns the process to step S1. At this time, in step S1 the learning device 10 retrieves unused learning data from the learning data storage unit 21 and performs the processing from step S2 onward.
  • the learning device 10 learns the learning model used for behavior estimation from the learning data, and records the parameters of the learned learning model in the storage device 20.
  • the estimating device 30 repeatedly executes the process shown in the flowchart shown in FIG. 8 every time input data is input to the estimating device 30.
  • the input data is obtained by scanning video data, which is time-series data, in a sliding window manner.
  • the feature extraction unit 31 acquires input data supplied from an external device (step S11). Then, the feature extraction unit 31 generates the feature amount F from the input data acquired in step S11 by configuring a feature extractor with reference to the parameters stored in the parameter storage unit 22 (step S12). Next, the identification unit 35 generates behavior information Ifa from the feature amount F by configuring a behavior information output device with reference to the parameters stored in the parameter storage unit 22 (step S13). Then, the output unit 36 outputs the identification information of the action and its start and end points to the external device based on the behavior information Ifa generated by the identification unit 35 (step S14).
  • in this manner, the estimation device 30 refers to the stored learned parameters, constructs an inference model, uses this model to infer the behavior in the video data to be inferred, and outputs the inference result.
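  • A hedged sketch of the inference flow of steps S11 to S14, scanning the input video in a sliding window manner; the `model` callable and the toy example are stand-ins, not the patent's actual feature extractor or identifier.

```python
import numpy as np

def infer_scores(video: np.ndarray, window_sw: int, slide_st: int, model) -> np.ndarray:
    """Scan the video with a sliding window and return one score vector per window,
    mirroring steps S11 to S14; `model` stands in for the feature extractor plus the
    behaviour-information output device configured from the stored parameters."""
    scores = []
    for start in range(0, video.shape[0] - window_sw + 1, slide_st):
        clip = video[start:start + window_sw]   # input data for one window
        scores.append(model(clip))              # behaviour information Ifa for the window
    return np.stack(scores)                     # (num_windows, num_classes)

toy_model = lambda clip: np.array([clip.mean(), 1.0 - clip.mean(), 0.5])  # dummy 3-class model
video = np.random.rand(120, 8, 8)
print(infer_scores(video, window_sw=32, slide_st=8, model=toy_model).shape)  # (12, 3)
```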
  • as described above, the action recognition system in this embodiment simultaneously learns action section detection in video data when learning the action recognition model, and distinguishes between sections that are effective for action recognition and sections that are not. In effective sections, behavior recognition is performed with a high score, and in ineffective sections it is performed with a low score. Therefore, even when data that is difficult to judge, such as a transition between behaviors, is input within a video section that is a candidate for behavior recognition, behavior information with low reliability is output for such data, which suppresses false detections and allows more accurate recognition of behavior.
  • FIG. 9 is a diagram for explaining the configuration of the estimation device in the second embodiment, and FIG. 10 is a diagram for explaining the operation of the estimation device.
  • the behavior recognition system 1 according to this embodiment differs from the first embodiment described above in the configuration of the estimation device 30.
  • configurations that are different from Embodiment 1 will be mainly described.
  • the estimation device 30 in this embodiment includes a feature extraction section 31, an action section detection section 32, an intra-section feature extraction section 33, an identification section 35, and an output section 36.
  • each function of the feature extraction unit 31, the action section detection unit 32, the intra-section feature extraction unit 33, the identification unit 35, and the output unit 36 can be realized by the arithmetic device executing a program, stored in the storage device, for realizing the respective function.
  • the action section detection unit 32 (target section detection unit) and the intra-section feature extraction unit 33 (target feature amount generation unit) in this embodiment are configurations added to the estimation device 30 of the first embodiment, and have the same functions as the action section detection unit 12 and the intra-section feature extraction unit 13 included in the learning device 10 of the first embodiment. That is, the action section detection unit 32 generates section information S for the inference data in the same manner as described above, and the intra-section feature extraction unit 33 generates the intra-section feature amount F1 from the feature amount F generated from the inference data and the section information S.
  • the identification unit 35 in this embodiment generates intra-section behavior information Ifa based on the above-mentioned intra-section feature amount F1, and the output unit 36 outputs identification information of the behavior to be extracted to an external device based on the intra-section behavior information Ifa.
  • in this embodiment, since the action section is detected in the inference data, the start and end of the action section can be detected without scanning the input time-series data in a sliding window manner, and the output of the identification unit 35 can be used to determine the identification information of the action in that section. For example, among the dimensions of the intra-section behavior information Ifa, the behavior class corresponding to the dimension with the largest score value is output.
  • when the input time-series data is scanned in a sliding window manner, the action section detection unit 32 detects the action section and confirms whether the target action is included in the data. If the length of the detected section exceeds a threshold, identification information of the behavior to be extracted is output based on the intra-section behavior information Ifa. Since the intra-section behavior information Ifa is output for each dimension in the time direction within the window, it is compressed in the time direction by averaging or summing the score values and output as a vector quantity whose number of dimensions equals the number of behavior classes to be recognized. This vector quantity becomes the identification information of the action at the time at the center of the window.
  • otherwise, if the length of the detected section does not exceed the threshold, a vector of fixed values prepared for the number of behavior classes to be recognized (for example, all zeros) is used as the behavior identification information.
  • then, the behavior identification information based on the behavior information Ifa of each window is arranged in chronological order; the time when a certain threshold is exceeded is set as the start point and the time when the score falls below a certain threshold is set as the end point, and the identification information of the action and its start and end points are output.
  • note that the action section detection unit 32 performs section detection based on the input data input from the external device, but it may also receive the feature amount F output from the feature extraction unit 31 and perform section detection based on it.
  • the estimation device 30 repeatedly executes the process of the flowchart shown in FIG. 10 every time video data serving as input data is input to the estimation device 30.
  • the input data may be time series data input as is, or may be input data scanned in a sliding window manner.
  • the feature extraction unit 31 acquires input data supplied from an external device (step S21). Then, the feature extraction unit 31 generates the feature amount F from the input data acquired in step S21 by configuring a feature extractor with reference to the parameters stored in the parameter storage unit 22 (step S22).
  • the action section detection unit 32 generates section information S from the input data by configuring an action section detector with reference to the parameters stored in the parameter storage unit 22 (step S23).
  • the intra-section feature extraction unit 33 generates the intra-section feature amount F1 from the feature amount F and the section information S (step S24).
  • the identification unit 35 refers to the parameters stored in the parameter storage unit 22 and configures a behavior information output device to extract behavior information Ifa from the intra-section feature quantity F1 generated by the intra-section feature extraction unit 33. generated (step S25). Then, the output unit 36 outputs the identification information, the start point, and the end point of the action to the external device based on the action information Ifa output by the identification unit 35 (step S26).
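  • A hedged sketch of the Embodiment 2 style identification described above, combining the section-length check with classification of the intra-section feature; the threshold `min_len` and the `classifier` stand-in are hypothetical.

```python
import numpy as np

def identify_with_section(feature_f: np.ndarray, section, classifier,
                          num_classes: int, min_len: int = 4) -> np.ndarray:
    """Classify only the detected action section; if the section is shorter than the
    threshold (the target action is judged absent), return a fixed all-zero vector as
    the behaviour identification information."""
    start, end = section
    if end - start < min_len:
        return np.zeros(num_classes)
    return classifier(feature_f[start:end])      # intra-section feature amount F1 -> Ifa

toy_classifier = lambda f1: f1.mean(axis=0)[:3]  # stand-in for the learned identifier
f = np.random.rand(32, 8)
print(identify_with_section(f, section=(10, 22), classifier=toy_classifier, num_classes=3))
print(identify_with_section(f, section=(10, 12), classifier=toy_classifier, num_classes=3))
```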
  • in this way, since the estimation device 30 includes the action section detection unit 32, when the input data is the entire time-series data, section detection can be performed without threshold processing. When the input time-series data is scanned using a sliding window method, it is possible to determine whether the target action is included within the window based on the length of the detected section. At this time, even if the section detection is inaccurate and data such as a transition between behaviors is mixed into the section, false detections are unlikely to occur because the scores are trained to become uniform for such data during learning.
  • FIGS. 11 and 12 are block diagrams showing the configuration of an information processing system according to the third embodiment, and FIG. 13 is a flowchart showing the operation of the information processing system. Note that this embodiment shows an outline of the configuration of the information processing system and the information processing method described in the above embodiments.
  • the information processing system 100 is configured with a general information processing device, and is equipped with the following hardware configuration as an example.
  • CPU (Central Processing Unit) 101
  • ROM (Read Only Memory) 102
  • RAM (Random Access Memory) 103
  • Program group 104 loaded into the RAM 103
  • Storage device 105 that stores the program group 104
  • Drive device 106 that reads from and writes to a storage medium 110 external to the information processing device
  • Communication interface 107 that connects to a communication network 111 external to the information processing device
  • Input/output interface 108 that inputs and outputs data
  • Bus 109 that connects the components
  • the information processing system 100 can construct and be equipped with the feature amount generation unit 121 and the learning unit 122 shown in FIG. 12 by having the CPU 101 acquire and execute the program group 104.
  • the program group 104 is stored in advance in the storage device 105 or ROM 102, for example, and is loaded into the RAM 103 and executed by the CPU 101 as needed.
  • the program group 104 may be supplied to the CPU 101 via the communication network 111, or may be stored in the storage medium 110 in advance, and the drive device 106 may read the program and supply it to the CPU 101.
  • the above-mentioned feature amount generation section 121 and learning section 122 may be constructed of a dedicated electronic circuit for realizing such means.
  • FIG. 11 shows an example of the hardware configuration of an information processing device that is the information processing system 100, and the hardware configuration of the information processing device is not limited to the above-described case.
  • the information processing device may be configured from part of the configuration described above, such as not having the drive device 106.
  • the information processing system 100 executes the information processing method shown in the flowchart of FIG. 13 by the functions of the feature value generation section 121 and the learning section 122 constructed by the program as described above.
  • that is, the information processing system 100 executes processing to: generate, from section video data that is video data divided into predetermined time sections, a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is a section other than the first section (step S101); and, when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, learn the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount (step S102).
  • by being configured as described above, the present invention generates a feature amount for a section of the video data that is effective for action recognition and a feature amount for a section that is not effective, and learns the action corresponding to the feature amount of the effective section and the action corresponding to the feature amount of the ineffective section. At this time, for example, learning is performed so that the correct action has a high score in an effective section and so that a plurality of actions all have low scores in an ineffective section. As a result, even when difficult-to-judge data such as a transition between behaviors is input within a video section that is a candidate for behavior recognition, behavior information with low reliability is output for such data, which suppresses false detections and allows more accurate recognition of actions.
  • Non-transitory computer-readable media include various types of tangible storage media.
  • Examples of non-transitory computer-readable media include magnetic recording media (e.g., flexible disks, magnetic tapes, hard disk drives), magneto-optical recording media (e.g., magneto-optical disks), CD-ROMs (Read Only Memory), CD-Rs, CD-R/W, semiconductor memory (eg, mask ROM, PROM (Programmable ROM), EPROM (Erasable PROM), flash ROM, RAM (Random Access Memory)).
  • the program may also be supplied to the computer via various types of transitory computer readable media. Examples of transitory computer-readable media include electrical signals, optical signals, and electromagnetic waves.
  • the transitory computer-readable medium can provide the program to the computer via a wired communication channel such as an electrical wire or optical fiber, or via a wireless communication channel.
  • although the present invention has been described above with reference to the above-described embodiments, the present invention is not limited to the above-described embodiments.
  • the configuration and details of the present invention can be modified in various ways within the scope of the present invention by those skilled in the art.
  • at least one or more of the functions of the feature amount generation unit 121 and the learning unit 122 described above may be executed by an information processing device installed at and connected from any location on the network, that is, may be performed by so-called cloud computing.
  • the feature amount generation unit generates the first feature amount so that the size in the time direction becomes a preset size.
  • (Appendix 7) The information processing system according to appendix 5 or 6, wherein, when the second section is a plurality of sections divided in the time direction, the feature amount generation unit generates the second feature amount based on video data of the second section that is a combination of the plurality of sections.
  • (Appendix 8) The information processing system according to any one of appendices 1 to 7, wherein the feature amount generation unit generates a feature amount of the section video data based on the section video data, and generates the first feature amount and the second feature amount based on that feature amount and the first section and the second section.
  • (Appendix 9) The information processing system according to any one of appendices 1 to 8, comprising a section detection unit that detects the first section and the second section of the section video data based on the section video data.
  • (Appendix 10) An information processing system comprising: a target feature amount generation unit that generates a target feature amount, which is a feature amount of target video data that is the target of behavior identification, based on the target video data; and an identification unit that identifies the behavior of the target video data based on the behavior output from the learning model in response to input of the target feature amount to the learning model.
  • (Appendix 11) The information processing system according to appendix 10, comprising a target section detection unit that detects the first section of the target video data based on the target video data, wherein the target feature amount generation unit generates a first target feature amount, which is a feature amount corresponding to the first section, based on the target feature amount and the first section of the target video data, and the identification unit identifies the behavior of the target video data based on the behavior output from the learning model in response to input of the first target feature amount to the learning model.
  • A computer-readable storage medium storing a program for causing an information processing device to execute processing to: generate, from section video data that is video data divided into predetermined time sections, a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is a section other than the first section; and, when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, learn the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount.
  • 1 Behavior recognition system, 10 Learning device, 11 Feature extraction unit, 12 Action section detection unit, 13 Intra-section feature extraction unit, 14 Out-of-section feature extraction unit, 15 Identification unit, 16 Learning unit, 20 Storage device, 21 Learning data storage unit, 22 Parameter storage unit, 30 Estimation device, 31 Feature extraction unit, 32 Action section detection unit, 33 Intra-section feature extraction unit, 35 Identification unit, 36 Output unit, 100 Information processing system, 101 CPU, 102 ROM, 103 RAM, 104 Program group, 105 Storage device, 106 Drive device, 107 Communication interface, 108 Input/output interface, 109 Bus, 110 Storage medium, 111 Communication network, 121 Feature amount generation unit, 122 Learning unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

This information processing device 100 comprises: a feature quantity generation unit for generating a first feature quantity based on video data of a first section, which is a section in one part, and a second feature quantity based on video data of a second section, which is a section other than the first section, from section video data divided into prescribed time sections; and a training unit that, when generating a learning model for outputting an action that corresponds to a feature quantity based on video data in response to input of the feature quantity, trains the learning model for an action that corresponds to the first feature quantity and for an action that corresponds to the second feature quantity.

Description

Information processing system
The present invention relates to an information processing system, an information processing method, and a program.
Patent Document 1 describes a method for extracting feature amounts of the behavior of a moving object from a video consisting of spatiotemporal information and estimating the behavior. Specifically, Patent Document 1 describes that a motion section from the start to the end of an action in a video is estimated, and the behavior is estimated based on the feature amount of the video of this motion section.
JP 2021-179728 A
However, the technique disclosed in Patent Document 1 described above has a problem in that erroneous recognition may occur if the estimation of the motion section is not accurate. For example, if the estimated motion section includes a transition from one behavior to another, data not expected by the behavior estimation model is mixed in, making behavior estimation difficult and causing erroneous recognition.
Therefore, an object of the present invention is to provide an information processing system that can solve the above-mentioned problem that erroneous recognition may occur when recognizing an action from a video.
An information processing system that is one form of the present invention includes: a feature amount generation unit that generates, from section video data that is video data divided into predetermined time sections, a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is a section other than the first section; and a learning unit that, when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, learns the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount.
Further, an information processing method that is one form of the present invention includes: generating, from section video data that is video data divided into predetermined time sections, a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is a section other than the first section; and, when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, learning the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount.
Further, a program that is one form of the present invention causes an information processing device to execute processing to: generate, from section video data that is video data divided into predetermined time sections, a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is a section other than the first section; and, when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, learn the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount.
By being configured as described above, the present invention can suppress erroneous recognition when recognizing an action from a video.
FIG. 1 is a block diagram showing the overall configuration of the action recognition system in Embodiment 1 of the present invention.
FIG. 2 is a block diagram showing the configuration of the learning device disclosed in FIG. 1.
FIG. 3 is a block diagram showing the configuration of the estimation device disclosed in FIG. 1.
FIG. 4 is a diagram showing processing by the learning device disclosed in FIG. 1.
FIG. 5 is a diagram showing processing by the learning device disclosed in FIG. 1.
FIG. 6 is a diagram showing processing by the learning device disclosed in FIG. 1.
FIG. 7 is a flowchart showing the operation of the learning device disclosed in FIG. 1.
FIG. 8 is a flowchart showing the estimation operation disclosed in FIG. 1.
FIG. 9 is a block diagram showing the configuration of the estimation device in Embodiment 2 of the present invention.
FIG. 10 is a flowchart showing the operation of the estimation device in Embodiment 2 of the present invention.
FIG. 11 is a block diagram showing the hardware configuration of the information processing system in Embodiment 3 of the present invention.
FIG. 12 is a block diagram showing the configuration of the information processing system in Embodiment 3 of the present invention.
FIG. 13 is a flowchart showing the operation of the information processing system in Embodiment 3 of the present invention.
<Embodiment 1>
A first embodiment of the present invention will be described with reference to FIGS. 1 to 8. FIGS. 1 to 3 are diagrams for explaining the configuration of the action recognition system, and FIGS. 4 to 8 are diagrams for explaining the processing operation of the action recognition system.
[Configuration]
The action recognition system 1 of the present invention generates a behavior estimation model by machine learning in order to recognize the behavior of a person from video data, and uses the generated behavior estimation model to recognize the behavior of a person in new video data. For example, the action recognition system 1 can be used for safety management at a construction site, to recognize whether or not a worker has performed a safety confirmation action such as a pointing-and-checking gesture. Specifically, the action recognition system 1 can record in a database, from video captured by surveillance cameras or the like installed at a construction site, when, where, and how many times workers performed safety confirmation work such as pointing and checking, and can issue alerts to sites where safety confirmation work is not being performed. The action recognition system 1 can also be used for man-hour management at a construction site: by recording which work the workers shown in the video performed, how many times, and for how long, it can be confirmed whether the work is being performed as expected. Note that in this embodiment, the actions recognized by the action recognition system 1 are described using a person's pointing action, walking action, and crouching action as examples; however, any action may be a recognition target, and the system may be used for action recognition of any target, not only persons.
As shown in FIG. 1, the action recognition system 1 includes a learning device 10, a storage device 20, and an estimation device 30. The learning device 10 is a device that learns, based on time-series data used for learning (also called learning data), a learning model used for estimating behavior information from time-series data. The storage device 20 is a device from which data can be read and to which data can be written, and stores the learning data, the parameters of the deep learning model, and so on. The estimation device 30 is a device that, when input data is input from an external device, configures an output (estimation) unit by referring to the learned parameters stored in the storage device 20 and generates information regarding the behavior of the estimation target. Each device is described in detail below.
The storage device 20 is composed of one or more information processing devices each including an arithmetic device and a storage device. As shown in FIG. 1, the storage device 20 includes a learning data storage unit 21 and a parameter storage unit 22.
The learning data storage unit 21 is a device that stores learning data used for the learning processing of the learning device 10. The learning data stored in the learning data storage unit 21 will be explained with reference to FIG. 4. As indicated by reference numeral D1 in FIG. 4, the learning data is video data consisting of a plurality of frames that are consecutive in time series, and is generated as time-series clips, which are section video data divided into predetermined time sections, by cutting the data out with a window of a predetermined width Sw. Then, as indicated by reference numeral D2 in FIG. 4, frames are cut out while the window is slid by a slide interval St, and time-series clips are sequentially generated and used as learning data. This scheme is referred to here as the sliding window method. As an example, the video data has a frame rate of 60 FPS, and time-series clips are sequentially created by sliding the window with a width Sw = 120 and a slide interval St = 1. Note that, as described later, the inference data, which is the video data input to the estimation device 30 from an external device, has the same structure.
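As an illustration only, the sliding window method described above can be sketched as follows in Python; the function name, the array layout, and the use of NumPy are assumptions made for this sketch and are not part of the described system.

```python
import numpy as np

def make_clips(frames: np.ndarray, sw: int = 120, st: int = 1) -> np.ndarray:
    """Cut a frame sequence of shape (T, ...) into time-series clips of width Sw,
    sliding the window by St frames at a time (the sliding window method)."""
    clips = [frames[start:start + sw] for start in range(0, len(frames) - sw + 1, st)]
    return np.stack(clips)   # (number of clips, Sw, ...)

# Example: 10 seconds of 60 FPS video with 18 skeleton points (x, y) per frame.
video = np.zeros((600, 18, 2))
clips = make_clips(video, sw=120, st=1)
print(clips.shape)           # (481, 120, 18, 2)
```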
Further, correct answer information, which is the behavior information (correct behavior) to be estimated for the corresponding learning data, is stored in association with the learning data. The correct answer information includes identification information of the behavior that is the correct answer. For example, in the case of time-series data of skeleton information extracted from an image sequence showing a walking person, the correct answer information associated with that learning data includes identification information indicating walking.
The parameter storage unit 22 is a device that stores parameters obtained by learning the learning model. The learning model may be a learning model based on a neural network, another type of learning model such as a support vector machine, or a combination of these. For example, when the learning model is a neural network such as a convolutional neural network, the parameters include the layer structure, the neuron structure of each layer, the number and size of filters in each layer, and the weight of each element of each filter. Note that, before learning is executed, initial values of the parameters to be applied to the learning model are stored in the parameter storage unit 22, and the parameters are updated every time learning is performed by the learning device 10, as described later.
The learning device 10 is composed of one or more information processing devices each including an arithmetic device and a storage device. As shown in FIG. 2, the learning device 10 includes a feature extraction unit 11, an action section detection unit 12, an in-section feature extraction unit 13, an out-of-section feature extraction unit 14, an identification unit 15, and a learning unit 16. The functions of these units can be realized by the arithmetic device executing a program, stored in the storage device, for realizing each function. Each component is described in detail below.
The feature extraction unit 11 (feature amount generation unit) acquires learning data of the above-described window width from the learning data storage unit 21 and converts the acquired learning data into a feature amount F. As illustrated in FIG. 6, the feature amount F has a width in the time direction. The feature amount F may be, for example, three-dimensional data over the time direction, the direction of the skeleton points of the person, and the dimension direction of the vector calculated at each time and each position, as computed by a neural network such as that of Non-Patent Document 1, or it may be two-dimensional data over the time direction and the dimension direction, obtained by collapsing the skeleton-point direction by taking the maximum or the average. The time direction may be kept per frame, or may be compressed by the convolution processing of the neural network.
For example, when the sliding window width is Sw = 120, there are 18 skeleton points, the vector calculated at each time and each position has 256 dimensions, and the time direction is compressed by half by the convolution processing, the feature amount F becomes three-dimensional data of 60 × 18 × 256. Here, the feature extraction unit 11 configures a feature extractor by applying the parameters stored in the parameter storage unit 22 to a learning model that is trained to output the feature amount F from the input learning data. The feature extraction unit 11 then supplies the feature amount F, obtained by inputting the learning data to the feature extractor, to the in-section feature extraction unit 13 and the out-of-section feature extraction unit 14.
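The tensor shape given in this example (60 × 18 × 256 after halving the time direction) can be reproduced with the following minimal PyTorch sketch; the single strided 1D convolution is an assumed stand-in for the actual network and is not the architecture of Non-Patent Document 1.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Maps a skeleton clip of shape (B, Sw, J, C_in) to a feature F of shape (B, Sw/2, J, 256)."""
    def __init__(self, c_in: int = 2, c_out: int = 256):
        super().__init__()
        # A 1D temporal convolution with stride 2 halves the time axis (Sw -> Sw/2).
        self.conv = nn.Conv1d(c_in, c_out, kernel_size=3, stride=2, padding=1)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        b, t, j, c = clip.shape
        x = clip.permute(0, 2, 3, 1).reshape(b * j, c, t)   # treat each joint as a sequence
        x = self.conv(x)                                    # (B*J, 256, T/2)
        x = x.reshape(b, j, -1, x.shape[-1]).permute(0, 3, 1, 2)
        return x                                            # (B, T/2, J, 256)

extractor = FeatureExtractor()
F = extractor(torch.zeros(1, 120, 18, 2))
print(F.shape)   # torch.Size([1, 60, 18, 256])
```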
The action section detection unit 12 (section detection unit) acquires learning data from the learning data storage unit 21, detects from the acquired learning data a section that is important for estimating behavior information (first section), and outputs section information S. A section that is important for estimating behavior information here is a section that is useful as a basis for judgment when estimating the behavior, that is, a section for which the error becomes small when the result is later compared with the correct answer information associated with the learning data. In other words, the action section detection unit 12 outputs, as the section information S, the section of the behavior corresponding to the correct answer information of the learning data. At this time, the action section detection unit 12 configures an action section detector by applying the parameters stored in the parameter storage unit 22 to a learning model trained to output the section information S from the input learning data. The action section detector may estimate the action section of the recognition target with respect to the time direction of the feature amount F using a neural network that estimates deformation parameters of a feature amount as in Non-Patent Document 2, or a section detector that directly learns section detection from prepared ground-truth sections may be used.
As an example, when the pointing action is the action section to be recognized, the action section detection unit 12 outputs, through the learning described above, the section in which that action is performed as the section information S, and as a result the sections in which other actions are performed are detected as being outside the action section (second section). For example, in the window W indicated by reference numeral D4 in FIG. 5, the frame section shown in gray is detected as the action section of the recognition target, and the frame section indicated by reference numeral Da is detected as being outside the action section. The action section detection unit 12 then supplies the obtained section information S to the in-section feature extraction unit 13 and the out-of-section feature extraction unit 14.
The in-section feature extraction unit 13 (feature amount generation unit) and the out-of-section feature extraction unit 14 (feature amount generation unit) generate an in-section feature amount F1 and an out-of-section feature amount F2, respectively, using the feature amount F supplied from the feature extraction unit 11 and the section information S supplied from the action section detection unit 12. Specifically, as shown in FIG. 6, the in-section feature extraction unit 13 cuts out the section corresponding to the section information S from the feature amount F as in Non-Patent Document 2 and applies resizing, warping, or similar processing so that the time direction is always adjusted to a constant size regardless of the size of the section information S, thereby generating the in-section feature amount F1 (first feature amount). Like the feature amount F, the in-section feature amount F1 may be three-dimensional data over the time direction, the skeleton-position direction, and the dimension direction of the vector calculated at each time and each position, or two-dimensional data over the time and dimension directions. The in-section feature extraction unit 13 supplies the generated in-section feature amount F1 to the identification unit 15.
Further, as shown in FIG. 6, after the feature amount corresponding to the section indicated by the section information S has been cut out of the feature amount F as described above, the out-of-section feature extraction unit 14 generates the out-of-section feature amount F2 (second feature amount) from the feature amounts of the sections that were not cut out, that is, the sections detected as being outside the action section. At this time, as shown in FIG. 5, when there are multiple sections that were not cut out and are separated in the time direction, the out-of-section feature extraction unit 14 connects them in the time direction to generate a single out-of-section feature amount F2. The out-of-section feature extraction unit 14 then supplies the generated out-of-section feature amount F2 to the identification unit 15.
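A minimal sketch of how the in-section feature amount F1 and the out-of-section feature amount F2 might be produced from the feature amount F and the section information S is shown below; the fixed output length of 32 and the use of linear interpolation for the resize step are assumptions, and warping or other adjustments mentioned above are omitted.

```python
import torch
import torch.nn.functional as Fn

def split_features(F: torch.Tensor, start: int, end: int, fixed_len: int = 32):
    """F: (T, J, C) feature of one clip; [start, end) is the detected action section S.
    Returns the resized in-section feature F1 and the concatenated out-of-section feature F2."""
    t, j, c = F.shape
    inside = F[start:end]                                  # (Ts, J, C)
    # Resize the in-section part to a fixed temporal size regardless of the section length.
    x = inside.permute(1, 2, 0).reshape(1, j * c, -1)      # (1, J*C, Ts)
    x = Fn.interpolate(x, size=fixed_len, mode='linear', align_corners=False)
    F1 = x.reshape(j, c, fixed_len).permute(2, 0, 1)       # (fixed_len, J, C)
    # Connect the parts before and after the section along the time axis.
    F2 = torch.cat([F[:start], F[end:]], dim=0)            # (T - Ts, J, C)
    return F1, F2

F1, F2 = split_features(torch.zeros(60, 18, 256), start=20, end=45)
print(F1.shape, F2.shape)   # torch.Size([32, 18, 256]) torch.Size([35, 18, 256])
```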
Here, the case where the feature amount F is generated from the window-width learning data and the in-section feature amount F1 and the out-of-section feature amount F2 are then generated from it has been given as an example, but the method of generating F1 and F2 is not limited to this. For example, the in-section feature extraction unit 13 may generate the in-section feature amount F1 from the portion of the learning data corresponding to the section information S, and the out-of-section feature extraction unit 14 may generate the out-of-section feature amount F2 from the portion of the learning data not corresponding to the section information S.
The identification unit 15 generates information about the behavior of the target based on the in-section feature amount F1 supplied from the in-section feature extraction unit 13 and the out-of-section feature amount F2 supplied from the out-of-section feature extraction unit 14. At this time, the identification unit 15 configures a behavior information output unit by applying the parameters stored in the parameter storage unit 22 to a learning model trained to output in-section behavior information If1 and out-of-section behavior information If2 from the input in-section feature amount F1 and out-of-section feature amount F2, respectively. The in-section behavior information If1 and the out-of-section behavior information If2 are, for example, identification information of the behavior corresponding to the target learning data or score values of the estimated behavior, and are vectors whose number of dimensions equals the number of defined behavior categories. When the in-section feature amount F1 and the out-of-section feature amount F2 are three-dimensional data, the maximum or average is taken in the skeleton-point direction to collapse them into two dimensions, identification processing is performed for each step in the time direction, and averaging is then performed over the time direction. Accordingly, the output of the identification unit 15 is a vector whose number of dimensions equals the number of behavior categories to be recognized. The identification unit 15 supplies the in-section behavior information If1 and the out-of-section behavior information If2, obtained by inputting the in-section feature amount F1 and the out-of-section feature amount F2 to the behavior information output unit, to the learning unit 16.
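The identification processing described above (collapsing the skeleton-point direction by taking the maximum, classifying each time step, and averaging over time) could look roughly like the following sketch; the single linear classifier and the three example classes are assumptions.

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Turns an interval feature (T, J, C) into a score vector with one value per action class."""
    def __init__(self, c: int = 256, num_classes: int = 3):   # e.g. pointing / walking / crouching
        super().__init__()
        self.fc = nn.Linear(c, num_classes)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = feat.max(dim=1).values        # collapse the skeleton-point axis -> (T, C)
        scores = self.fc(x)               # per-time-step class scores -> (T, num_classes)
        return scores.mean(dim=0)         # average over time -> (num_classes,)

classifier = Classifier()
If1 = classifier(torch.zeros(32, 18, 256))   # in-section behavior information
If2 = classifier(torch.zeros(35, 18, 256))   # out-of-section behavior information
print(If1.shape)                             # torch.Size([3])
```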
The learning unit 16 acquires the correct answer information corresponding to the learning data input to the feature extraction unit 11 from the learning data storage unit 21. Based on the acquired correct answer information and the in-section behavior information If1 and out-of-section behavior information If2 supplied from the identification unit 15, the learning unit 16 performs learning of the feature extraction unit 11, the action section detection unit 12, and the identification unit 15. At this time, the learning unit 16 obtains a loss value L1, calculated from the error between the behavior information indicated by the in-section behavior information If1 and the correct answer information, and a loss value L2, calculated from the out-of-section behavior information If2, computes a loss from these values, and updates the parameters of the feature extraction unit 11, the action section detection unit 12, and the identification unit 15 based on this loss. The loss value L1 may be calculated using any loss function used in machine learning, such as the softmax cross-entropy error or the mean squared error. The loss value L2 is calculated so that the values become equal across all behavior categories: for example, the average of the negative logarithms of the per-category averages of the softmax values of the out-of-section behavior information If2 may be used as the loss value L2, or a constraint may be imposed so that 0 is output for all behavior categories. For example, when the pointing action, walking action, and crouching action are the actions to be recognized and the correct behavior of the input learning data is the pointing action, the loss L1 becomes smaller as the value of the dimension of the in-section behavior information If1 corresponding to the pointing action becomes the largest, and the loss L2 becomes smaller as the values of all dimensions of the out-of-section behavior information If2 become more uniform. The learning unit 16 determines each parameter so as to minimize these losses.
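Assuming softmax cross entropy for the loss value L1 and, for the loss value L2, the average negative logarithm of the softmax values of the out-of-section behavior information (one of the possibilities permitted by the description above), the combined loss could be sketched as follows.

```python
import torch
import torch.nn.functional as Fn

def training_loss(if1: torch.Tensor, if2: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """if1, if2: class-score vectors of shape (num_classes,); label: index of the correct action."""
    # L1: the in-section scores should single out the correct action class.
    l1 = Fn.cross_entropy(if1.unsqueeze(0), label.unsqueeze(0))
    # L2: the out-of-section softmax values should be uniform over all action classes;
    # the mean negative log-probability is smallest when every class has equal probability.
    p2 = torch.softmax(if2, dim=0)
    l2 = (-torch.log(p2 + 1e-8)).mean()
    return l1 + l2

loss = training_loss(torch.randn(3), torch.randn(3), torch.tensor(1))
print(loss)
```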
Note that, using the section information S obtained so far, a penalty may be imposed on detection results exceeding a certain threshold and added to the loss, in order to prevent the section information S from always becoming very large or, conversely, always very small. For example, when the size of the section information S with respect to the time direction of the data is 0.9 (where 1 is the entire data), the threshold is set to 0.7, and for an inference result exceeding 0.7, the threshold is subtracted from the inference result and the squared value is added to the loss. For small results, the same processing is performed when the result falls below a threshold. The algorithm for determining the above parameters so as to minimize the loss may be any learning algorithm used in machine learning, such as gradient descent or error backpropagation. The learning unit 16 stores the determined parameters of the feature extraction unit 11, the action section detection unit 12, and the identification unit 15 in the parameter storage unit 22.
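Under the numerical example above (section size expressed as a fraction of the whole clip, upper threshold 0.7), the penalty could be sketched as follows; the lower threshold of 0.1 is an assumed value, since the description only says that small results are treated in the same way.

```python
import torch

def section_size_penalty(s_ratio: torch.Tensor,
                         upper: float = 0.7, lower: float = 0.1) -> torch.Tensor:
    """s_ratio: detected section length as a fraction of the clip (1.0 = the whole clip).
    Adds a squared penalty when the detected section is too large or too small."""
    over = torch.clamp(s_ratio - upper, min=0.0)    # e.g. 0.9 -> (0.9 - 0.7)^2
    under = torch.clamp(lower - s_ratio, min=0.0)
    return over ** 2 + under ** 2

print(section_size_penalty(torch.tensor(0.9)))      # tensor(0.0400)
```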
As described above, the learning device 10 has, in the action section detection unit 12, a function of detecting a section of the time-series data that is useful for behavior estimation. In the learning processing executed by the learning device 10, the in-section feature extraction unit 13 and the out-of-section feature extraction unit 14 cut the in-section feature amount F1 and the out-of-section feature amount F2 out of the feature amount F output by the feature extraction unit 11. Learning then proceeds so that, when the in-section feature amount F1 is passed through the identification unit 15, the score value corresponding to the correct behavior class becomes the largest, and so that, when the out-of-section feature amount F2 is passed through the identification unit 15, the score values of all behavior classes become uniform and no class stands out. By handling not only the information inside the section but also the information outside the section in this way, the learning device 10 can, given sufficient data, guarantee that the behavior estimation model behaves so that no behavior class produces a prominently large score value when data from a section of low importance for behavior estimation is input. For example, when a person's pointing action, walking action, and crouching action are the recognition targets, a transition from walking to pointing requires the person to slow down in order to stop and prepare for the pointing action. When a behavior candidate section overlaps such a transition section, conventional models, which do not take this section into account, may output a result in which the score value of some behavior class stands out, causing false detection. In contrast, in the present invention, false detection can be suppressed by learning to detect sections useful for behavior estimation and by training so that all score values become uniform outside those sections.
Here, the action section detection unit 12 performs section detection based on the learning data received from the learning data storage unit 21, but section detection may instead be performed with the feature amount F output by the feature extraction unit 11 as input.
Next, the configuration of the estimation device 30 will be described. The estimation device 30 is composed of one or more information processing devices each including an arithmetic device and a storage device. As shown in FIG. 3, the estimation device 30 includes a feature extraction unit 31, an identification unit 35, and an output unit 36. The functions of the feature extraction unit 31, the identification unit 35, and the output unit 36 can be realized by the arithmetic device executing a program, stored in the storage device, for realizing each function. Each component is described in detail below.
The feature extraction unit 31 (target feature amount generation unit) acquires time-series data input from an external device and converts the acquired time-series data into a feature amount F (target feature amount). The time-series data input from the external device is inference data to be subjected to behavior identification, and is video data (target video data) of the same form as the learning data described above. That is, as shown in FIG. 4, the inference data is video data consisting of a plurality of frames (an image sequence) that are consecutive in time series, cut out as time-series clips with a window of the predetermined width Sw.
However, the inference data input to the estimation device 30 may be data obtained by further extracting information from the image sequence, such as skeleton information. The external device that inputs the inference data may be a camera when an image sequence is input, or a device that stores generated image sequences or information extracted from image sequences when such data is input.
The feature extraction unit 31 then refers to the parameters obtained by the learning processing of the learning device 10 and stored in the parameter storage unit 22, and configures a feature extractor based on those parameters. The feature extraction unit 31 supplies the feature amount F, obtained by inputting the inference data to the feature extractor, to the identification unit 35.
The identification unit 35 generates behavior information Ifa from the feature amount F supplied from the feature extraction unit 31. In the identification unit 35, estimation is performed for each step of the feature amount F in the time direction. At this time, the identification unit 35 configures a behavior information output unit by referring to the parameters stored in the parameter storage unit 22. The identification unit 35 then supplies the behavior information Ifa, obtained by inputting the feature amount F to the behavior information output unit, to the output unit 36.
The output unit 36 outputs identification information of the behavior to be extracted to an external device based on the behavior information Ifa. Since the behavior information Ifa is output for each step in the time direction, it is compressed in the time direction, for example by averaging or summing the score values, into a vector whose number of dimensions equals the number of behavior classes to be recognized. This vector serves as the identification information of the behavior at the time of the center of the window. Using this behavior identification information, when the input time-series data divided at fixed window intervals in a sliding window manner is used as the input data, the score values of the behavior identification information based on the behavior information Ifa of each window are arranged in time series; the time at which a certain threshold is exceeded is defined as the start point and the time at which the score falls below a certain threshold is defined as the end point, and the behavior identification information together with its start point and end point is output.
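A minimal sketch of this post-processing, in which the per-window scores of one action class are arranged in time order and a start point and an end point are determined by threshold crossings, is given below; the threshold value and the function name are assumptions.

```python
import numpy as np

def detect_intervals(scores: np.ndarray, threshold: float = 0.5):
    """scores: time-ordered score values of one action class, one value per window centre.
    Returns (start, end) index pairs where the score rises above / falls back below the threshold."""
    intervals, start = [], None
    for t, s in enumerate(scores):
        if s > threshold and start is None:
            start = t                      # score exceeded the threshold -> start point
        elif s <= threshold and start is not None:
            intervals.append((start, t))   # score dropped below the threshold -> end point
            start = None
    if start is not None:
        intervals.append((start, len(scores)))
    return intervals

scores = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.1])
print(detect_intervals(scores))            # [(2, 5)]
```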
Note that the estimation processing executed by the estimation device 30 described above does not perform the section detection performed in the learning processing executed by the learning device 10. This is because, through the learning processing, the behavior estimation model reacts and its score value rises only in the characteristic sections of a behavior; therefore, when the time-series data is scanned in a sliding window manner, the scores are arranged in time series, and the start point and end point are determined by threshold judgment, section detection within the model is not required.
[Operation]
Next, the operation of the action recognition system 1 described above will be explained mainly with reference to the flowcharts of FIGS. 7 and 8. First, the operation of the learning device 10 in the learning mode will be explained with reference to the flowchart of FIG. 7.
The feature extraction unit 11 acquires learning data from the learning data storage unit 21 (step S1). At this time, the feature extraction unit 11 acquires, from among the learning data stored in the learning data storage unit 21, learning data that has not yet been used for learning (that is, not yet acquired in step S1). The feature extraction unit 11 then configures a feature extractor by referring to the parameters stored in the parameter storage unit 22, and thereby generates the feature amount F from the learning data acquired in step S1 (step S2).
Subsequently, the action section detection unit 12 configures an action section detector by referring to the parameters stored in the parameter storage unit 22, and thereby generates the section information S from the learning data (step S3). Then, from the feature amount F and the section information S, the in-section feature extraction unit 13 and the out-of-section feature extraction unit 14 generate the in-section feature amount F1 and the out-of-section feature amount F2, respectively (step S4). Subsequently, the identification unit 15 configures a behavior information output unit by referring to the parameters stored in the parameter storage unit 22, and thereby generates the behavior information If1 from the in-section feature amount F1 generated by the in-section feature extraction unit 13 and the behavior information If2 from the out-of-section feature amount F2 generated by the out-of-section feature extraction unit 14 (step S5).
Then, the learning unit 16 calculates the loss based on the in-section behavior information If1 and out-of-section behavior information If2 generated by the identification unit 15 and the correct answer information stored in the learning data storage unit 21 in association with the target learning data (step S6). The learning unit 16 further updates the parameters used by the feature extraction unit 11, the action section detection unit 12, and the identification unit 15 based on the loss calculated in step S6 (step S7). At this time, the learning unit 16 stores the parameters used by the feature extraction unit 11, the action section detection unit 12, and the identification unit 15 in the parameter storage unit 22.
Subsequently, the learning device 10 determines whether a learning termination condition is satisfied (step S8). The termination condition may be judged, for example, by whether a preset number of loops has been reached, by whether learning has been executed on a preset number of learning data, by whether the loss has fallen below a preset threshold, or by whether the change in the loss has fallen below a preset threshold. Step S8 may also use a combination of these examples or any other judgment method. When the learning termination condition is satisfied (Yes in step S8), the learning device 10 ends the flowchart. On the other hand, when the termination condition is not satisfied (No in step S8), the learning device 10 returns the processing to step S1. At this time, the learning device 10 retrieves unused learning data from the learning data storage unit 21 in step S1 and performs the processing from step S2 onward.
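Putting steps S1 to S8 together, the learning loop could look roughly like the sketch below; it reuses the hypothetical helpers from the earlier sketches (split_features, training_loss), uses a simple loop-count termination condition, and is not the actual implementation.

```python
import torch

def train(loader, extractor, detector, classifier, optimizer, max_epochs: int = 10):
    """loader yields (clip, label); extractor, detector and classifier correspond to the
    feature extraction, action-section detection and identification units, respectively."""
    for epoch in range(max_epochs):                         # step S8: simple loop-count condition
        for clip, label in loader:                          # step S1: fetch learning data
            F = extractor(clip)                             # step S2: feature amount F
            start, end = detector(clip)                     # step S3: section information S
            F1, F2 = split_features(F[0], start, end)       # step S4: in-/out-of-section features
            if1, if2 = classifier(F1), classifier(F2)       # step S5: behavior information If1, If2
            loss = training_loss(if1, if2, label)           # step S6: loss L1 + L2
            optimizer.zero_grad()
            loss.backward()                                 # step S7: update the parameters
            optimizer.step()
```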
In this way, the learning device 10 learns the learning model used for behavior estimation from the learning data and records the parameters of the learned model in the storage device 20.
Next, the operation of the estimation device 30 in the inference mode will be explained with reference to the flowchart of FIG. 8. The estimation device 30 repeatedly executes the processing of the flowchart shown in FIG. 8 every time input data is input to the estimation device 30. As described above, it is assumed that the input data are obtained by scanning video data, which is time-series data, in a sliding window manner.
The feature extraction unit 31 acquires input data supplied from an external device (step S11). The feature extraction unit 31 then configures a feature extractor by referring to the parameters stored in the parameter storage unit 22, and thereby generates the feature amount F from the input data acquired in step S11 (step S12). Next, the identification unit 35 configures a behavior information output unit by referring to the parameters stored in the parameter storage unit 22, and thereby generates the behavior information Ifa from the feature amount F (step S13). Then, based on the behavior information Ifa generated by the identification unit 35, the output unit 36 outputs the identification information of the behavior and its start point and end point to the external device (step S14).
In this way, the estimation device 30 refers to the stored learned parameters, configures an inference model, uses this model to infer the behavior for the video data to be inferred, and outputs the inference result.
As described above, in the action recognition system of this embodiment, detection of the action section in the video data is learned at the same time as the action recognition model, so that sections of the video data that are effective for action recognition are distinguished from sections that are not, with action recognition performed with high scores in effective sections and with low scores in ineffective sections. Therefore, even when data that is difficult to judge, such as a transition between actions, is input within a video section that is a candidate for action recognition, behavior information is output with low confidence for such data, so that false detection can be suppressed and actions can be recognized more correctly.
<Embodiment 2>
Next, a second embodiment of the present invention will be described with reference to FIGS. 9 and 10. FIG. 9 is a diagram for explaining the configuration of the estimation device, and FIG. 10 is a diagram for explaining the operation of the estimation device.
[Configuration]
The action recognition system 1 in this embodiment differs from Embodiment 1 described above in the configuration of the estimation device 30. The configurations that differ from Embodiment 1 are mainly described below.
As shown in FIG. 9, the estimation device 30 in this embodiment includes a feature extraction unit 31, an action section detection unit 32, an in-section feature extraction unit 33, an identification unit 35, and an output unit 36. The functions of these units can be realized by the arithmetic device executing a program, stored in the storage device, for realizing each function.
The action section detection unit 32 (target section detection unit) and the in-section feature extraction unit 33 (target feature amount generation unit) in this embodiment are components added to the estimation device 30 of Embodiment 1, and have the same functions as the action section detection unit 12 and the in-section feature extraction unit 13 of the learning device 10 of Embodiment 1. That is, the action section detection unit 32 generates the section information S for the inference data in the same manner as described above, and the in-section feature extraction unit 33 generates the in-section feature amount F1 from the feature amount F generated from the inference data and the section information S.
The identification unit 35 in this embodiment then generates in-section behavior information Ifa based on the in-section feature amount F1 described above, and the output unit 36 outputs identification information of the behavior to be extracted to an external device based on this in-section behavior information Ifa. In this way, compared with the first embodiment, this embodiment detects the action section within the inference data, so that the start and end of an action can be detected without scanning the input time-series data in a sliding window manner, and the identification information of the behavior in that section can be obtained using the output of the identification unit 35. For example, the behavior class corresponding to the dimension of the in-section behavior information Ifa with the largest score value is output.
When the input time-series data is scanned in a sliding window manner, the action section detection unit 32 performs action section detection to confirm whether the target action is contained in the data. If the length of the detected section exceeds a threshold, identification information of the behavior to be extracted is output based on the in-section behavior information Ifa. Since the in-section behavior information Ifa is output for each step in the time direction within the window, it is compressed in the time direction, for example by averaging or summing the score values, into a vector whose number of dimensions equals the number of behavior classes to be recognized; this vector serves as the identification information of the behavior at the time of the center of the window. If the length of the section does not exceed the threshold, fixed values, one per behavior class to be recognized (for example, all zeros), are used as the behavior identification information. Using this behavior identification information, when the input time-series data divided at fixed window intervals in a sliding window manner is used as the input data, the score values of the behavior identification information based on the behavior information Ifa of each window are arranged in time series; the time at which a certain threshold is exceeded is defined as the start point and the time at which the score falls below a certain threshold is defined as the end point, and the behavior identification information together with its start point and end point is output.
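For this window-by-window case, the choice between outputting class scores and outputting fixed values depending on the detected section length could be sketched as follows; the minimum length of 30 time steps and the helper names (reused from the earlier sketches) are assumptions.

```python
import torch

def window_behavior_info(clip, extractor, detector, classifier,
                         min_len: int = 30, num_classes: int = 3) -> torch.Tensor:
    """Returns the behavior information Ifa for one window in the Embodiment 2 flow."""
    F = extractor(clip)                         # feature amount F for this window
    start, end = detector(clip)                 # section information S
    if end - start <= min_len:                  # target action not (sufficiently) contained
        return torch.zeros(num_classes)         # fixed values for all action classes
    F1, _ = split_features(F[0], start, end)    # in-section feature only
    return classifier(F1)                       # class-score vector for the window centre
```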
Here, the action section detection unit 32 performs section detection based on the input data input from the external device, but it may instead receive the feature amount F output by the feature extraction unit 31 and perform section detection based on it.
[Operation]
Next, the operation of the estimation device 30 in this embodiment will be explained with reference to the flowchart of FIG. 10. The estimation device 30 repeatedly executes the processing of the flowchart shown in FIG. 10 every time video data to be estimated is input to the estimation device 30 as input data. The input data may be the time-series data input as is, or data obtained by scanning it in a sliding window manner.
The feature extraction unit 31 acquires input data supplied from an external device (step S21). The feature extraction unit 31 then configures a feature extractor by referring to the parameters stored in the parameter storage unit 22, and thereby generates the feature amount F from the input data acquired in step S21 (step S22).
Subsequently, the action section detection unit 32 configures an action section detector by referring to the parameters stored in the parameter storage unit 22, and thereby generates the section information S from the input data (step S23). Then, from the feature amount F and the section information S, the in-section feature extraction unit 33 generates the in-section feature amount F1 (step S24).
Subsequently, the identification unit 35 configures a behavior information output unit by referring to the parameters stored in the parameter storage unit 22, and thereby generates the behavior information Ifa from the in-section feature amount F1 generated by the in-section feature extraction unit 33 (step S25). Then, based on the behavior information Ifa output by the identification unit 35, the output unit 36 outputs the identification information of the behavior and its start point and end point to the external device (step S26).
As described above, in this embodiment the estimation device 30 includes the action section detection unit 32, so that when the input data is the entire time-series data, section detection can be performed without threshold processing. When the input time-series data is scanned in a sliding window manner, whether the target action is contained in the window can be judged from the length of the detected section. Even if the section detection is slightly off and data such as a transition between actions is mixed into the section, false detection is unlikely to occur because the model has been trained so that the scores become uniform for such data.
<Embodiment 3>
Next, a third embodiment of the present invention will be described with reference to FIGS. 11 to 13. FIGS. 11 and 12 are block diagrams showing the configuration of the information processing system in Embodiment 3, and FIG. 13 is a flowchart showing the operation of the information processing system. This embodiment shows an outline of the configurations of the information processing system and the information processing method described in the above embodiments.
First, with reference to FIG. 11, the hardware configuration of the information processing system 100 in this embodiment will be described. The information processing system 100 is configured as a general information processing device and is equipped with, as an example, the following hardware configuration:
・CPU (Central Processing Unit) 101 (arithmetic device)
・ROM (Read Only Memory) 102 (storage device)
・RAM (Random Access Memory) 103 (storage device)
・Program group 104 loaded into the RAM 103
・Storage device 105 that stores the program group 104
・Drive device 106 that reads from and writes to a storage medium 110 external to the information processing device
・Communication interface 107 that connects to a communication network 111 external to the information processing device
・Input/output interface 108 that inputs and outputs data
・Bus 109 connecting the components
The information processing system 100 can construct and be equipped with the feature amount generation unit 121 and the learning unit 122 shown in FIG. 12 by the CPU 101 acquiring and executing the program group 104. The program group 104 is, for example, stored in advance in the storage device 105 or the ROM 102 and is loaded into the RAM 103 and executed by the CPU 101 as needed. The program group 104 may also be supplied to the CPU 101 via the communication network 111, or may be stored in advance in the storage medium 110 and read out by the drive device 106 and supplied to the CPU 101. However, the feature amount generation unit 121 and the learning unit 122 described above may instead be constructed with dedicated electronic circuits for realizing these means.
 Note that FIG. 11 shows an example of the hardware configuration of an information processing device serving as the information processing system 100, and the hardware configuration of the information processing device is not limited to this example. For example, the information processing device may be configured with only part of the configuration described above, such as omitting the drive device 106.
 The information processing system 100 then executes the information processing method shown in the flowchart of FIG. 13 by the functions of the feature amount generation unit 121 and the learning unit 122 constructed by the program as described above.
 As shown in FIG. 13, the information processing system 100 executes the following processing:
 it generates, from section video data, which is video data divided into predetermined time sections, a first feature amount based on the video data of a first section that is a part of the section, and a second feature amount based on the video data of a second section that is the section other than the first section (step S101); and
 when generating a learning model that recognizes, in response to the input of a feature amount based on video data, the action corresponding to that feature amount, it learns the action corresponding to the first feature amount and the action corresponding to the second feature amount (step S102).
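 Step S101 can be illustrated with a short sketch. It assumes that per-frame features of shape (T, D) have already been extracted for the section video data and that the first section is given as a frame index range; the function name and the simple slice-and-concatenate scheme are illustrative assumptions. The case where the second section falls into two pieces (before and after the first section) is handled by concatenating them along the time axis, which is one way to realize the combination described later in the supplementary notes.

```python
import numpy as np

def split_section_features(section_features, first_start, first_end):
    """Split a time-ordered feature sequence for one section of video into the
    first-section (action) part and the second-section (remainder) part.

    section_features : array of shape (T, D), one D-dimensional feature per frame.
    first_start, first_end : frame indices [first_start, first_end) of the first section.
    """
    first = section_features[first_start:first_end]
    # The remainder may consist of two pieces; concatenate them along the time axis.
    second = np.concatenate(
        [section_features[:first_start], section_features[first_end:]], axis=0
    )
    return first, second
```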
 With the configuration described above, the present invention generates a feature amount for a section of the video data that is effective for action recognition and a feature amount for a section that is not effective, and learns the action corresponding to the effective section's feature amount and the action corresponding to the ineffective section's feature amount. At this time, for example, the model is trained so that the correct action receives a high score in the effective section and so that a plurality of actions each receive a low score in the ineffective section. As a result, even when data that is difficult to judge, such as a transition between actions, is input within a video section that is a candidate for action recognition, the model outputs action information with low confidence for such data, so that false detections are suppressed and actions can be recognized more correctly.
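 One way to realize this training behavior is sketched below, assuming a PyTorch-style classifier head. Interpreting "a plurality of actions each receive a low score" as matching a uniform distribution over the action classes is an assumption of this sketch, not a statement of the disclosed implementation, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def training_loss(classifier, first_feat, second_feat, correct_label, num_classes):
    """Illustrative loss for the scheme described above.

    classifier    : maps features of shape (B, D) to logits of shape (B, C).
    first_feat    : features from the first (effective) section, shape (B, D).
    second_feat   : features from the second (ineffective) section, shape (B, D).
    correct_label : ground-truth action indices, shape (B,).
    """
    # Effective section: the correct action should receive a high score.
    logits_in = classifier(first_feat)
    loss_in = F.cross_entropy(logits_in, correct_label)

    # Ineffective section: all actions should receive uniformly low scores,
    # expressed here as matching a uniform distribution over the classes.
    logits_out = classifier(second_feat)
    uniform = torch.full_like(logits_out, 1.0 / num_classes)
    loss_out = F.kl_div(F.log_softmax(logits_out, dim=-1), uniform, reduction="batchmean")

    return loss_in + loss_out
```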
 The above-described program can be stored using various types of non-transitory computer readable media and supplied to a computer. Non-transitory computer readable media include various types of tangible storage media. Examples of non-transitory computer readable media include magnetic recording media (e.g., flexible disks, magnetic tapes, hard disk drives), magneto-optical recording media (e.g., magneto-optical disks), CD-ROM (Read Only Memory), CD-R, CD-R/W, and semiconductor memory (e.g., mask ROM, PROM (Programmable ROM), EPROM (Erasable PROM), flash ROM, RAM (Random Access Memory)). The program may also be supplied to the computer via various types of transitory computer readable media. Examples of transitory computer readable media include electrical signals, optical signals, and electromagnetic waves. A transitory computer readable medium can supply the program to the computer via a wired communication path such as an electric wire or an optical fiber, or via a wireless communication path.
 Although the present invention has been described above with reference to the above embodiments, the present invention is not limited to those embodiments. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention. In addition, at least one of the functions of the feature amount generation unit 121 and the learning unit 122 described above may be executed by an information processing device installed and connected at any location on a network, that is, by so-called cloud computing.
 <Additional Notes>
 Part or all of the above embodiments may also be described as in the following additional notes. The outline of the configurations of the information processing system, the information processing method, and the program according to the present invention is given below. However, the present invention is not limited to the following configurations.
(Additional note 1)
 An information processing system comprising:
 a feature amount generation unit that generates, from section video data, which is video data divided into predetermined time sections, a first feature amount based on video data of a first section that is a part of the section, and a second feature amount based on video data of a second section that is the section other than the first section; and
 a learning unit that, when generating a learning model that outputs, in response to the input of a feature amount based on video data, the action corresponding to that feature amount, learns the action corresponding to the first feature amount and the action corresponding to the second feature amount.
(Additional note 2)
 The information processing system according to additional note 1, wherein
 the learning unit learns so that the correct action set for the section video data corresponds to the first feature amount, and learns so that a plurality of actions correspond to the second feature amount.
(Additional note 3)
 The information processing system according to additional note 2, wherein
 the learning unit learns so that the plurality of actions correspond equally to the second feature amount.
(Additional note 4)
 The information processing system according to additional note 2 or 3, wherein,
 when generating a learning model that outputs, in response to the input of a feature amount based on video data, the degree of correspondence of each action to that feature amount, the learning unit learns so that the degree of correspondence of the correct action to the first feature amount becomes high and so that the degrees of correspondence of the plurality of actions to the second feature amount each become low.
(Additional note 5)
 The information processing system according to any one of additional notes 1 to 4, wherein
 the feature amount generation unit generates one first feature amount and one second feature amount, each having a component in the time direction.
(Additional note 6)
 The information processing system according to additional note 5, wherein
 the feature amount generation unit generates the first feature amount so that its size in the time direction becomes a preset size.
(Additional note 7)
 The information processing system according to additional note 5 or 6, wherein,
 when the second section consists of a plurality of sections separated in the time direction, the feature amount generation unit generates the second feature amount based on video data of the second section obtained by concatenating the plurality of sections into one.
(Additional note 8)
 The information processing system according to any one of additional notes 1 to 7, wherein
 the feature amount generation unit generates a feature amount of the section video data based on the section video data, and generates the first feature amount and the second feature amount based on that feature amount and on the first section and the second section.
(Additional note 9)
 The information processing system according to any one of additional notes 1 to 8, further comprising
 a section detection unit that detects, based on the section video data, the first section and the second section in the section video data.
(Additional note 10)
 The information processing system according to any one of additional notes 1 to 9, further comprising:
 a target feature amount generation unit that generates, based on target video data subject to action identification, a target feature amount that is a feature amount of the target video data; and
 an identification unit that identifies the action in the target video data based on the action output from the learning model when the target feature amount is input to the learning model.
(Additional note 11)
 The information processing system according to additional note 10, further comprising
 a target section detection unit that detects, based on the target video data, the first section in the target video data, wherein
 the target feature amount generation unit generates, based on the target feature amount and the first section of the target video data, a first target feature amount that is a feature amount corresponding to the first section, and
 the identification unit identifies the action in the target video data based on the action output from the learning model when the first target feature amount is input to the learning model.
(Additional note 12)
 An information processing method comprising:
 generating, from section video data, which is video data divided into predetermined time sections, a first feature amount based on video data of a first section that is a part of the section, and a second feature amount based on video data of a second section that is the section other than the first section; and
 when generating a learning model that outputs, in response to the input of a feature amount based on video data, the action corresponding to that feature amount, learning the action corresponding to the first feature amount and the action corresponding to the second feature amount.
(Additional note 13)
 The information processing method according to additional note 12, comprising
 learning so that the correct action set for the section video data corresponds to the first feature amount, and learning so that a plurality of actions correspond to the second feature amount.
(Additional note 14)
 The information processing method according to additional note 12 or 13, comprising:
 generating, based on target video data subject to action identification, a target feature amount that is a feature amount of the target video data; and
 identifying the action in the target video data based on the action output from the learning model when the target feature amount is input to the learning model.
(Additional note 15)
 A computer-readable storage medium storing a program for causing an information processing device to execute processing comprising:
 generating, from section video data, which is video data divided into predetermined time sections, a first feature amount based on video data of a first section that is a part of the section, and a second feature amount based on video data of a second section that is the section other than the first section; and
 when generating a learning model that outputs, in response to the input of a feature amount based on video data, the action corresponding to that feature amount, learning the action corresponding to the first feature amount and the action corresponding to the second feature amount.
1 Action recognition system
10 Learning device
11 Feature extraction unit
12 Action section detection unit
13 In-section feature extraction unit
14 Out-of-section feature extraction unit
15 Identification unit
16 Learning unit
20 Storage device
21 Learning data storage unit
22 Parameter storage unit
30 Estimation device
31 Feature extraction unit
32 Action section detection unit
33 In-section feature extraction unit
35 Identification unit
36 Output unit
100 Information processing system
101 CPU
102 ROM
103 RAM
104 Program group
105 Storage device
106 Drive device
107 Communication interface
108 Input/output interface
109 Bus
110 Storage medium
111 Communication network
121 Feature amount generation unit
122 Learning unit

Claims (15)

  1.  An information processing system comprising:
      a feature amount generation unit that generates, from section video data, which is video data divided into predetermined time sections, a first feature amount based on video data of a first section that is a part of the section, and a second feature amount based on video data of a second section that is the section other than the first section; and
      a learning unit that, when generating a learning model that outputs, in response to the input of a feature amount based on video data, the action corresponding to that feature amount, learns the action corresponding to the first feature amount and the action corresponding to the second feature amount.
  2.  The information processing system according to claim 1, wherein
      the learning unit learns so that the correct action set for the section video data corresponds to the first feature amount, and learns so that a plurality of actions correspond to the second feature amount.
  3.  The information processing system according to claim 2, wherein
      the learning unit learns so that the plurality of actions correspond equally to the second feature amount.
  4.  The information processing system according to claim 2 or 3, wherein,
      when generating a learning model that outputs, in response to the input of a feature amount based on video data, the degree of correspondence of each action to that feature amount, the learning unit learns so that the degree of correspondence of the correct action to the first feature amount becomes high and so that the degrees of correspondence of the plurality of actions to the second feature amount each become low.
  5.  The information processing system according to any one of claims 1 to 4, wherein
      the feature amount generation unit generates one first feature amount and one second feature amount, each having a component in the time direction.
  6.  The information processing system according to claim 5, wherein
      the feature amount generation unit generates the first feature amount so that its size in the time direction becomes a preset size.
  7.  The information processing system according to claim 5 or 6, wherein,
      when the second section consists of a plurality of sections separated in the time direction, the feature amount generation unit generates the second feature amount based on video data of the second section obtained by concatenating the plurality of sections into one.
  8.  The information processing system according to any one of claims 1 to 7, wherein
      the feature amount generation unit generates a feature amount of the section video data based on the section video data, and generates the first feature amount and the second feature amount based on that feature amount and on the first section and the second section.
  9.  The information processing system according to any one of claims 1 to 8, further comprising
      a section detection unit that detects, based on the section video data, the first section and the second section in the section video data.
  10.  The information processing system according to any one of claims 1 to 9, further comprising:
      a target feature amount generation unit that generates, based on target video data subject to action identification, a target feature amount that is a feature amount of the target video data; and
      an identification unit that identifies the action in the target video data based on the action output from the learning model when the target feature amount is input to the learning model.
  11.  The information processing system according to claim 10, further comprising
      a target section detection unit that detects, based on the target video data, the first section in the target video data, wherein
      the target feature amount generation unit generates, based on the target feature amount and the first section of the target video data, a first target feature amount that is a feature amount corresponding to the first section, and
      the identification unit identifies the action in the target video data based on the action output from the learning model when the first target feature amount is input to the learning model.
  12.  An information processing method comprising:
      generating, from section video data, which is video data divided into predetermined time sections, a first feature amount based on video data of a first section that is a part of the section, and a second feature amount based on video data of a second section that is the section other than the first section; and
      when generating a learning model that outputs, in response to the input of a feature amount based on video data, the action corresponding to that feature amount, learning the action corresponding to the first feature amount and the action corresponding to the second feature amount.
  13.  The information processing method according to claim 12, comprising
      learning so that the correct action set for the section video data corresponds to the first feature amount, and learning so that a plurality of actions correspond to the second feature amount.
  14.  The information processing method according to claim 12 or 13, comprising:
      generating, based on target video data subject to action identification, a target feature amount that is a feature amount of the target video data; and
      identifying the action in the target video data based on the action output from the learning model when the target feature amount is input to the learning model.
  15.  A computer-readable storage medium storing a program for causing an information processing device to execute processing comprising:
      generating, from section video data, which is video data divided into predetermined time sections, a first feature amount based on video data of a first section that is a part of the section, and a second feature amount based on video data of a second section that is the section other than the first section; and
      when generating a learning model that outputs, in response to the input of a feature amount based on video data, the action corresponding to that feature amount, learning the action corresponding to the first feature amount and the action corresponding to the second feature amount.
PCT/JP2022/016510 2022-03-31 2022-03-31 Information processing system WO2023188264A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/016510 WO2023188264A1 (en) 2022-03-31 2022-03-31 Information processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/016510 WO2023188264A1 (en) 2022-03-31 2022-03-31 Information processing system

Publications (1)

Publication Number Publication Date
WO2023188264A1 true WO2023188264A1 (en) 2023-10-05

Family

ID=88199855

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/016510 WO2023188264A1 (en) 2022-03-31 2022-03-31 Information processing system

Country Status (1)

Country Link
WO (1) WO2023188264A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011177300A (en) * 2010-02-26 2011-09-15 Empire Technology Development LLC Feature transformation apparatus and feature transformation method
JP2016158954A (en) * 2015-03-03 2016-09-05 富士通株式会社 State detection method, state detection device, and state detection program
JP2019040465A (en) * 2017-08-25 2019-03-14 トヨタ自動車株式会社 Behavior recognition device, learning device, and method and program
JP2019159819A (en) * 2018-03-13 2019-09-19 オムロン株式会社 Annotation method, annotation device, annotation program, and identification system
JP2020021421A (en) * 2018-08-03 2020-02-06 株式会社東芝 Data dividing device, data dividing method, and program


Similar Documents

Publication Publication Date Title
US11450146B2 (en) Gesture recognition method, apparatus, and device
JP4966820B2 (en) Congestion estimation apparatus and method
US9824296B2 (en) Event detection apparatus and event detection method
JP4369961B2 (en) Abnormality detection device and abnormality detection program
US9092662B2 (en) Pattern recognition method and pattern recognition apparatus
JP2016159407A (en) Robot control device and robot control method
EP2309454B1 (en) Apparatus and method for detecting motion
KR102217253B1 (en) Apparatus and method for analyzing behavior pattern
US20150262068A1 (en) Event detection apparatus and event detection method
JP2019522297A (en) Method and system for finding precursor subsequences in time series
JP2020507177A (en) System for identifying defined objects
KR101708491B1 (en) Method for recognizing object using pressure sensor
KR102129771B1 (en) Cctv management system apparatus that recognizes behavior of subject of shooting in video from video taken through cctv camera and operating method thereof
JP6910786B2 (en) Information processing equipment, information processing methods and programs
US9256945B2 (en) System for tracking a moving object, and a method and a non-transitory computer readable medium thereof
KR101979375B1 (en) Method of predicting object behavior of surveillance video
KR20230069892A (en) Method and apparatus for identifying object representing abnormal temperatures
US20200342215A1 (en) Model learning device, model learning method, and recording medium
US20200012866A1 (en) System and method of video content filtering
US20160202065A1 (en) Object linking method, object linking apparatus, and storage medium
JP2022003526A (en) Information processor, detection system, method for processing information, and program
WO2023188264A1 (en) Information processing system
JP2019194758A (en) Information processing device, information processing method, and program
US20200311401A1 (en) Analyzing apparatus, control method, and program
CN113661516A (en) Information processing apparatus, information processing method, and recording medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22935409

Country of ref document: EP

Kind code of ref document: A1