WO2023188264A1 - Information processing system - Google Patents

Information processing system

Info

Publication number
WO2023188264A1
WO2023188264A1 (PCT/JP2022/016510)
Authority
WO
WIPO (PCT)
Prior art keywords
section
feature amount
video data
feature
information processing
Prior art date
Application number
PCT/JP2022/016510
Other languages
French (fr)
Japanese (ja)
Inventor
隆平 安藤
Original Assignee
NEC Corporation (日本電気株式会社)
Priority date
Filing date
Publication date
Application filed by NEC Corporation (日本電気株式会社)
Priority to PCT/JP2022/016510 priority Critical patent/WO2023188264A1/en
Publication of WO2023188264A1 publication Critical patent/WO2023188264A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis

Definitions

  • the present invention relates to an information processing system, an information processing method, and a program.
  • Patent Document 1 describes a method for extracting feature amounts of the behavior of a moving object from a video consisting of spatiotemporal information and estimating the behavior. Specifically, Patent Document 1 describes that a motion section from the start to the end of an action in a video is estimated, and the behavior is estimated based on the feature amount of the video of this motion section.
  • Patent Document 1 has a problem in that erroneous recognition may occur if the estimation of the motion section is not accurate. For example, if the estimated motion section includes a transition from one behavior to another, data not expected by the behavior estimation model is mixed in, making behavior estimation difficult and causing erroneous recognition.
  • an object of the present invention is to provide an information processing system that can solve the above-mentioned problem that erroneous recognition may occur when recognizing an action from a video.
  • An information processing system that is one form of the present invention includes: a feature amount generation unit that generates, from section video data that is video data divided into predetermined time sections, a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is a section other than the first section; and a learning unit that, when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, learns the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount.
  • An information processing method that is one form of the present invention includes: generating, from section video data that is video data divided into predetermined time sections, a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is a section other than the first section; and, when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, learning the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount.
  • A program that is one form of the present invention causes an information processing device to execute processing to: generate, from section video data that is video data divided into predetermined time sections, a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is a section other than the first section; and, when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, learn the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount.
  • the present invention can suppress erroneous recognition when recognizing an action from a video.
  • FIG. 1 is a block diagram showing the overall configuration of an action recognition system in Embodiment 1 of the present invention.
  • FIG. 2 is a block diagram showing the configuration of the learning device disclosed in FIG. 1.
  • FIG. 3 is a block diagram showing the configuration of the estimation device disclosed in FIG. 1.
  • FIG. 4 is a diagram showing a state of processing by the learning device disclosed in FIG. 1.
  • FIG. 5 is a diagram showing a state of processing by the learning device disclosed in FIG. 1.
  • FIG. 6 is a diagram showing a state of processing by the learning device disclosed in FIG. 1.
  • FIG. 7 is a flowchart showing the operation of the learning device disclosed in FIG. 1.
  • FIG. 8 is a flowchart showing the estimation operation disclosed in FIG. 1.
  • FIG. 9 is a block diagram showing the configuration of the estimation device in Embodiment 2 of the present invention.
  • FIG. 10 is a flowchart showing the operation of the estimation device in Embodiment 2 of the present invention.
  • FIG. 11 is a block diagram showing the hardware configuration of an information processing system in Embodiment 3 of the present invention.
  • FIG. 12 is a block diagram showing the configuration of an information processing system in Embodiment 3 of the present invention.
  • FIG. 13 is a flowchart showing the operation of the information processing system in Embodiment 3 of the present invention.
  • A first embodiment of the present invention will be described with reference to FIGS. 1 to 8. FIGS. 1 to 3 are diagrams for explaining the configuration of the behavior recognition system, and FIGS. 4 to 8 are diagrams for explaining its processing operation.
  • the behavior recognition system 1 of the present invention generates a behavior estimation model by machine learning in order to recognize the behavior of a person from video data, and uses the generated behavior estimation model to recognize the behavior of a person in new video data.
  • the behavior recognition system 1 can be used for safety management at a construction site, and can be used to recognize whether or not a worker has performed a safety confirmation behavior such as a pointing gesture.
  • the behavior recognition system 1 can record in a database, from video captured by surveillance cameras installed at construction sites, when, where, and how many times workers performed safety confirmation tasks such as pointing checks, and can send alerts to sites where safety confirmation work is not being performed.
  • the behavior recognition system 1 can also be used for man-hour management at construction sites; by recording which tasks the worker in the video performed, how many times, and for how long, it is possible to check whether the work is proceeding as expected.
  • actions recognized by the action recognition system 1 will be described using a person's pointing action, walking action, and crouching action as examples; however, any action may be recognized, and the system may be used for behavior recognition of any target, not only persons.
  • the behavior recognition system 1 includes a learning device 10, a storage device 20, and an estimation device 30, as shown in FIG.
  • the learning device 10 is a device that performs learning of a learning model used for estimating behavioral information from time-series data based on time-series data (also referred to as learning data) used for learning the learning model.
  • the storage device 20 is a device that can refer to and write data, and is a device that stores learning data, deep learning model parameters, and the like.
  • the estimation device 30 is a device that configures an estimator by referring to the learned parameters stored in the storage device 20 and generates (outputs) information regarding the behavior of the estimation target. Each device will be explained in detail below.
  • the storage device 20 is composed of one or more information processing devices including an arithmetic device and a storage device.
  • the storage device 20 includes a learning data storage section 21 and a parameter storage section 22, as shown in FIG.
  • the learning data storage unit 21 is a device that stores learning data for performing learning processing of the learning device 10.
  • the learning data, as shown by reference numeral D1 in FIG. 4, is video data consisting of a plurality of frames that are consecutive in time series, and is generated as time-series clips, that is, section video data divided into predetermined time sections, by cutting out windows of a predetermined width Sw.
  • as shown by reference numeral D2 in FIG. 4, frames are cut out while sliding the window at a sliding interval St, and time-series clips are sequentially generated and used as learning data.
  • this method is called a sliding window method.
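  • The following is an illustrative sketch of the sliding window segmentation described above (not part of the patent disclosure); the function `make_clips` and the parameter names `window_width_sw` and `slide_interval_st` are hypothetical stand-ins for the window width Sw and the sliding interval St.

```python
import numpy as np

def make_clips(frames: np.ndarray, window_width_sw: int, slide_interval_st: int):
    """Cut a frame sequence of shape (T, H, W, C) into fixed-width time-series clips.

    Each clip corresponds to one piece of section video data of width Sw, and
    consecutive clips are offset by the sliding interval St.
    """
    clips = []
    for start in range(0, frames.shape[0] - window_width_sw + 1, slide_interval_st):
        clips.append(frames[start:start + window_width_sw])
    return clips

# Example: 300 frames of 64x64 RGB video, window width Sw=32, sliding interval St=8.
video = np.zeros((300, 64, 64, 3), dtype=np.uint8)
clips = make_clips(video, window_width_sw=32, slide_interval_st=8)
print(len(clips), clips[0].shape)  # 34 (32, 64, 64, 3)
```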
  • inference data, which is video data input to the estimation device 30 from an external device, also has a similar configuration.
  • each piece of learning data is associated with correct answer information, which is the behavior information (correct behavior) to be estimated for the corresponding learning data.
  • the correct answer information includes identification information of an action that is a correct answer.
  • for example, the correct answer information associated with the target learning data includes identification information indicating that the person is walking.
  • the parameter storage unit 22 is a device that stores parameters obtained by learning a learning model.
  • the learning model may be a learning model based on a neural network, another type of learning model such as a support vector machine, or a combination thereof.
  • the parameters include the layer structure, the neuron structure of each layer, the number and size of filters in each layer, and the weight of each element of each filter. Note that before learning is executed, initial values of parameters to be applied to the learning model are stored in the parameter storage unit 22, and the parameters are updated each time learning is performed by the learning device 10, as will be described later.
  • the learning device 10 is composed of one or more information processing devices including an arithmetic device and a storage device. As shown in FIG. 2, the learning device 10 includes a feature extraction unit 11, an action section detection unit 12, an intra-section feature extraction unit 13, an out-of-section feature extraction unit 14, an identification unit 15, and a learning unit 16. Each function of the feature extraction unit 11, the action section detection unit 12, the intra-section feature extraction unit 13, the out-of-section feature extraction unit 14, the identification unit 15, and the learning unit 16 can be realized by the arithmetic device executing a program, stored in the storage device, for realizing the respective function. Each configuration will be explained in detail below.
  • the feature extraction unit 11 acquires the learning data of the window width as described above from the learning data storage unit 21, and converts the acquired learning data of the window width into the feature quantity F.
  • the feature amount F has a width in the time direction, as in the example shown in the figure.
  • the feature amount F is, for example, three-dimensional data in the time direction, the direction of the skeleton points of the person, and the dimensional direction of the vector amount calculated at each time and position, as calculated by a neural network as described in Non-Patent Document 1.
  • alternatively, it may be two-dimensional data in the time direction and the dimensional direction of the feature amount, collapsed by taking the maximum value or the average value along the direction of the skeleton points.
  • the time direction may be for each frame, or may be compressed by convolution processing of a neural network.
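  • As a hedged illustration of the feature amount F described above, the snippet below assumes hypothetical sizes (T time steps, K skeleton points, D feature dimensions) and shows how the skeleton-point axis can be collapsed by a maximum or an average to obtain two-dimensional data.

```python
import numpy as np

# Hypothetical sizes: T time steps, K skeleton points, D feature dimensions per point.
T, K, D = 32, 17, 64
feature_f = np.random.rand(T, K, D)  # three-dimensional feature amount F

# Collapse the skeleton-point axis by taking the maximum or the average, giving
# two-dimensional data in the time direction and the dimensional direction.
feature_f_2d_max = feature_f.max(axis=1)    # shape (T, D)
feature_f_2d_mean = feature_f.mean(axis=1)  # shape (T, D)
print(feature_f_2d_max.shape, feature_f_2d_mean.shape)
```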
  • the feature extraction unit 11 configures a feature extractor by applying the parameters stored in the parameter storage unit 22 to a learning model trained to output the feature amount F from the input learning data. Then, the feature extraction unit 11 supplies the feature amount F, obtained by inputting the learning data to the feature extractor, to the intra-section feature extraction unit 13 and the out-of-section feature extraction unit 14.
  • the action section detection unit 12 acquires learning data from the learning data storage unit 21, detects a section important for estimating behavior information (first section) from the acquired learning data, and outputs section information S.
  • the section important for estimating behavior information here is a section that is useful as a basis for judgment when estimating behavior, that is, a section for which the error is reduced when the estimate is later compared with the correct answer information linked to the learning data.
  • the action section detection unit 12 outputs the action section corresponding to the correct information of the learning data as the section information S.
  • the action section detection unit 12 configures an action section detector by applying the parameters stored in the parameter storage unit 22 to a learning model trained to output the section information S from the input learning data.
  • the action section detector estimates the action section to be recognized along the time direction of the feature amount F using, for example, a neural network that estimates deformation parameters of the feature amount as described in Non-Patent Document 2. Alternatively, a section detector that directly learns section detection by preparing correct values for the sections may be used.
  • for example, when the action section detection unit 12 sets the pointing action as the action to be recognized, the above-described learning causes the section in which the pointing action is performed to be output as the section information S, and as a result, sections in which other actions are performed are detected as being outside the action section (second section).
  • the frame section shown in gray is detected as the action section to be recognized, and the frame section indicated by reference numeral Da is detected as being outside the action section.
  • the action section detection section 12 supplies the obtained section information S to the intra-section feature extraction section 13 and the out-of-section feature extraction section 14, respectively.
  • the intra-section feature extraction unit 13 (feature amount generation unit) and the out-of-section feature extraction unit 14 (feature amount generation unit) respectively generate an intra-section feature amount F1 and an out-of-section feature amount F2 from the feature amount F supplied from the feature extraction unit 11 and the section information S supplied from the action section detection unit 12.
  • the intra-section feature extraction unit 13 extracts from the feature amount F the portion corresponding to the section information S, as in Non-Patent Document 2, and applies resizing processing, warping processing, or the like so that the size in the time direction is always constant regardless of the size of the section information S, thereby generating an intra-section feature amount F1 (first feature amount).
  • the intra-section feature amount F1 may be three-dimensional data in the time direction, the direction of the skeleton positions, and the dimensional direction of the vector amount calculated at each time and each position, or it may be two-dimensional data in the time direction and the dimensional direction.
  • the intra-section feature extraction unit 13 supplies the generated intra-section feature amount F1 to the identification unit 15.
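  • A minimal sketch of generating the intra-section feature amount F1, assuming a two-dimensional feature amount F of shape (time, dimension) and using nearest-neighbour resampling as one possible stand-in for the resizing or warping processing; the helper name `intra_section_feature` is hypothetical.

```python
import numpy as np

def intra_section_feature(feature_f: np.ndarray, start: int, end: int, target_len: int = 16):
    """Cut the section [start, end) out of the feature amount F (shape (T, D)) and
    resize its time axis to a fixed length, yielding an intra-section feature amount F1.

    Nearest-neighbour resampling stands in for the resizing/warping processing; any
    method that makes the temporal size constant would do.
    """
    section = feature_f[start:end]
    src_idx = np.round(np.linspace(0, section.shape[0] - 1, target_len)).astype(int)
    return section[src_idx]

f = np.random.rand(32, 64)                      # feature amount F
f1 = intra_section_feature(f, start=10, end=22)
print(f1.shape)  # (16, 64): constant temporal size regardless of the section length
```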
  • the out-of-section feature extraction unit 14 extracts from the feature amount F, in the same manner as described above, the feature amounts of the sections other than the section corresponding to the section information S, that is, the sections detected as being outside the action section, and generates an out-of-section feature amount F2 (second feature amount) from them. When there are a plurality of such sections, the out-of-section feature extraction unit 14 connects them in the time direction to generate a single out-of-section feature amount F2. Then, the out-of-section feature extraction unit 14 supplies the generated out-of-section feature amount F2 to the identification unit 15.
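  • A corresponding sketch for the out-of-section feature amount F2, assuming a single detected section [start, end) whose surroundings are connected in the time direction; the helper name `out_of_section_feature` is hypothetical.

```python
import numpy as np

def out_of_section_feature(feature_f: np.ndarray, start: int, end: int):
    """Concatenate, along the time axis, the parts of F before and after the detected
    action section [start, end), yielding a single out-of-section feature amount F2."""
    return np.concatenate([feature_f[:start], feature_f[end:]], axis=0)

f = np.random.rand(32, 64)                       # feature amount F
f2 = out_of_section_feature(f, start=10, end=22)
print(f2.shape)  # (20, 64): 10 frames before the section plus 10 frames after it
```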
  • in the above description, the feature amount F is generated from the learning data of the window width, and then the intra-section feature amount F1 and the out-of-section feature amount F2 are generated from it; however, the method for generating the intra-section feature amount F1 and the out-of-section feature amount F2 is not limited to this. For example, the intra-section feature extraction unit 13 may generate the intra-section feature amount F1 directly from the portion of the learning data that corresponds to the section information S, and the out-of-section feature extraction unit 14 may generate the out-of-section feature amount F2 directly from the portion of the learning data that does not correspond to the section information S.
  • the identification unit 15 generates information regarding the target behavior based on the intra-section feature amount F1 supplied from the intra-section feature extraction unit 13 and the out-of-section feature amount F2 supplied from the out-of-section feature extraction unit 14. At this time, the identification unit 15 configures a behavior information output device by applying the parameters stored in the parameter storage unit 22 to a learning model trained to output intra-section behavior information If1 and out-of-section behavior information If2 from the input intra-section feature amount F1 and out-of-section feature amount F2, respectively.
  • the intra-section behavior information If1 and the out-of-section behavior information If2 are, for example, the identification information of the behavior corresponding to the target learning data or the score values of the estimated behavior, and are vectors having the same number of dimensions as the number of defined behavior categories.
  • when the intra-section feature amount F1 and the out-of-section feature amount F2 are three-dimensional data, the maximum value or average value is taken along the direction of the skeleton points to collapse them into two dimensions, identification processing is performed for each dimension in the time direction, and then averaging processing is performed in the time direction. Therefore, the output of the identification unit 15 is a vector quantity having the same number of dimensions as the number of behavior categories to be recognized.
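  • The identification step described above could look roughly as follows; the linear scoring layer is an assumption standing in for the learned behavior information output device, and all names are hypothetical.

```python
import numpy as np

def identify(feature: np.ndarray, weights: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Collapse the skeleton axis if the input is three-dimensional, score each time
    step with a linear layer (a stand-in for the learned identifier), and average over
    time so that the output has as many dimensions as there are behaviour categories."""
    if feature.ndim == 3:                        # (T, K, D) -> (T, D)
        feature = feature.mean(axis=1)
    per_time_scores = feature @ weights + bias   # (T, num_classes)
    return per_time_scores.mean(axis=0)          # (num_classes,)

T, D, num_classes = 16, 64, 5
w, b = np.random.randn(D, num_classes) * 0.01, np.zeros(num_classes)
if1 = identify(np.random.rand(T, D), w, b)       # intra-section behaviour information If1
print(if1.shape)  # (5,)
```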
  • the identification unit 15 supplies the learning unit 16 with the in-section behavior information If1 and the out-of-section behavior information If2 obtained by inputting the in-section feature amount F1 and the out-of-section feature amount F2 to the behavior information output device, respectively.
  • the learning unit 16 acquires the correct answer information corresponding to the learning data input to the feature extraction unit 11 from the learning data storage unit 21. Then, the learning unit 16 trains the feature extraction unit 11, the action section detection unit 12, and the identification unit 15 based on the acquired correct answer information and the intra-section behavior information If1 and out-of-section behavior information If2 supplied from the identification unit 15. At this time, the learning unit 16 calculates a loss value L1 from the error between the behavior indicated by the intra-section behavior information If1 and the correct answer information, and a loss value L2 from the out-of-section behavior information If2, calculates a loss from these values, and updates the parameters of the feature extraction unit 11, the action section detection unit 12, and the identification unit 15 based on this loss.
  • the loss value L1 may be calculated using any loss function used in machine learning, such as softmax cross entropy error and mean square error.
  • the loss value L2 is calculated so that the output values become equal across all behavior categories. For example, the loss value L2 may be the average of the negative logarithm of the per-category average of the softmax values of the out-of-section behavior information If2, or a constraint may be imposed so that 0 is output for all behavior categories. For example, when the correct behavior is pointing, the loss L1 becomes smaller as the value of the dimension corresponding to the pointing behavior in the vector of the intra-section behavior information If1 becomes the largest, and the loss L2 becomes smaller as the values of all dimensions of the vector of the out-of-section behavior information If2 become more uniform.
  • the learning unit 16 determines each parameter so as to minimize these losses.
  • for example, the threshold value is set to 0.7; for an inference result exceeding 0.7, the threshold is subtracted from the result and the squared value is added to the loss, and for a result that is too small, similar processing is performed when the result falls below the threshold.
  • the algorithm for determining the parameters described above so as to minimize the loss may be any learning algorithm used in machine learning, such as gradient descent or error backpropagation.
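  • A minimal sketch of the loss computation, assuming per-time score matrices for If1 and If2; the softmax cross-entropy for L1 and the uniformity term for L2 follow the options named above, while the function names are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def loss_l1(if1_scores: np.ndarray, correct_class: int) -> float:
    """Softmax cross-entropy between the time-averaged intra-section scores If1 and
    the correct behaviour class (one of the loss functions mentioned above)."""
    p = softmax(if1_scores.mean(axis=0))
    return float(-np.log(p[correct_class] + 1e-12))

def loss_l2(if2_scores: np.ndarray) -> float:
    """Uniformity loss for the per-time out-of-section scores If2: the mean negative
    logarithm of the per-class softmax values averaged over time, which is smallest
    when every behaviour class is equally likely."""
    p = softmax(if2_scores, axis=-1).mean(axis=0)  # (num_classes,)
    return float(-np.log(p + 1e-12).mean())

if1_scores = np.random.randn(16, 5)  # (time, classes) scores for the intra-section feature
if2_scores = np.random.randn(12, 5)  # (time, classes) scores for the out-of-section feature
print(loss_l1(if1_scores, correct_class=2) + loss_l2(if2_scores))
```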
  • the learning section 16 stores the determined parameters of the feature extraction section 11, action section detection section 12, and identification section 15 in the parameter storage section 22.
  • the learning device 10 has the function of detecting an interval useful for behavior estimation in time-series data using the action interval detection unit 12.
  • the intra-section feature extraction unit 13 and the out-of-section feature extraction unit 14 cut out the intra-section feature amount F1 and the out-of-section feature amount F2 from the feature amount F output by the feature extraction unit 11.
  • when the intra-section feature amount F1 is passed through the identification unit 15, learning proceeds so that the score value corresponding to the correct behavior class becomes the largest, and when the out-of-section feature amount F2 is passed through the identification unit 15, learning proceeds so that the score values of all behavior classes become uniform and no class stands out.
  • the learning device 10 handles not only information within the section but also information outside the section at the same time; therefore, given sufficient data, the behavior estimation model can be made to operate so that, when data from a section of low importance is input during behavior estimation, the score value of no behavior class stands out significantly.
  • for example, when the recognition target is a person's pointing, walking, or crouching behavior and the behavior candidate section approaches a section where one behavior is transitioning to another, a conventional model does not take such a section into consideration, so a result in which the score value of a certain behavior class stands out may be output, and false detections occur.
  • in contrast, erroneous detection can be suppressed by learning to detect sections useful for behavior estimation and by proceeding with learning so that all score values become uniform outside such sections.
  • note that the action section detection unit 12 performs section detection based on the learning data received from the learning data storage unit 21, but it may also perform section detection using the feature amount F output by the feature extraction unit 11 as input.
  • the estimation device 30 is configured with one or more information processing devices including an arithmetic device and a storage device.
  • the estimation device 30 includes a feature extraction section 31, an identification section 35, and an output section 36, as shown in FIG.
  • the functions of the feature extraction section 31, the identification section 35, and the output section 36 can be realized by the arithmetic unit executing a program stored in the storage device for realizing each function.
  • the feature extraction unit 31 acquires time series data input from an external device, and converts the acquired time series data into a feature F (target feature).
  • the time-series data input from the external device is inference data to be targeted for behavior identification, and is video data (target video data) similar to the learning data described above. That is, as shown in FIG. 4, the inference data is video data consisting of a plurality of consecutive frames (image sequences) along the time series, and is a time series clip cut out with a window of a predetermined width Sw.
  • the inference data input to the estimation device 30 may be data obtained by further extracting information from the image sequence, such as skeletal information.
  • the external device that inputs the inference data may be a camera if the input is an image sequence, or it may be a storage device that stores a generated image sequence or information extracted from the image sequence.
  • the feature extraction unit 31 configures a feature extractor based on the parameters by referring to the parameters stored in the parameter storage unit 22 and obtained through the learning process by the learning device 10. Then, the feature extraction unit 31 supplies the feature amount F obtained by inputting the inference data to the feature extractor to the identification unit 35.
  • the identification unit 35 generates behavior information Ifa from the feature amount F supplied from the feature extraction unit 31.
  • the identification unit 35 performs estimation for each dimension of the feature amount F in the time direction.
  • the identification unit 35 configures the behavior information output device by referring to the parameters stored in the parameter storage unit 22. Then, the identification unit 35 supplies the behavior information Ifa obtained by inputting the feature amount F to the behavior information output device to the output unit 36.
  • the output unit 36 outputs identification information of the behavior to be extracted to an external device based on the behavior information Ifa. Since the behavior information Ifa is output for each dimension in the time direction, it is compressed in the time direction by averaging or summing the score values and output as a vector quantity whose number of dimensions equals the number of behavior classes to be recognized. This vector quantity becomes the identification information of the action at the time at the center of this window.
  • when the input time-series data is divided at fixed window intervals by the sliding window method, the behavior identification information based on the behavior information Ifa of each window is arranged in chronological order; for the score values arranged in this way, the time when a certain threshold is exceeded is set as the start point and the time when the score falls below a certain threshold is set as the end point, and the identification information of the action and its start and end points are output.
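  • The start-point and end-point determination by thresholding could be sketched as follows; the threshold values and the helper name `detect_action_spans` are hypothetical.

```python
import numpy as np

def detect_action_spans(scores: np.ndarray, start_thr: float = 0.6, end_thr: float = 0.4):
    """Turn a chronological series of per-window score values for one behaviour class
    into (start, end) index pairs: a span starts when the score exceeds start_thr and
    ends when it falls below end_thr, as described above."""
    spans, start = [], None
    for t, s in enumerate(scores):
        if start is None and s > start_thr:
            start = t
        elif start is not None and s < end_thr:
            spans.append((start, t))
            start = None
    if start is not None:
        spans.append((start, len(scores)))
    return spans

window_scores = np.array([0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.2, 0.1])
print(detect_action_spans(window_scores))  # [(2, 5)]
```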
  • note that the estimation processing performed by the estimation device 30 described above does not perform the section detection that is performed in the learning processing by the learning device 10. This is because, through the learning process, the behavior estimation model reacts and its score value increases only in sections characteristic of the behavior; therefore, when the time-series data is scanned in a sliding window manner, the score values are arranged in time series, and the start and end points are determined by thresholding, there is no need to perform section detection within the model.
  • the feature extraction unit 11 acquires learning data from the learning data storage unit 21 (step S1). At this time, the feature extraction unit 11 acquires learning data that has not yet been used for learning (that is, not acquired in step S1) from among the learning data stored in the learning data storage unit 21. Then, the feature extraction unit 11 generates a feature amount F from the learning data acquired in step S1 by configuring a feature extractor with reference to the parameters stored in the parameter storage unit 22 (step S2).
  • the action section detection section 12 generates section information S from the learning data by configuring an action section detector with reference to the parameters stored in the parameter storage section 22 (step S3).
  • the in-section feature extraction section 13 and the out-of-section feature extraction section 14 generate an in-section feature amount F1 and an outside-section feature amount F2, respectively (step S4).
  • the identification unit 15 refers to the parameters stored in the parameter storage unit 22 to configure a behavior information output device, and generates intra-section behavior information If1 from the intra-section feature amount F1 generated by the intra-section feature extraction unit 13 and out-of-section behavior information If2 from the out-of-section feature amount F2 generated by the out-of-section feature extraction unit 14 (step S5).
  • the learning unit 16 calculates a loss based on the intra-section behavior information If1 and out-of-section behavior information If2 generated by the identification unit 15 and the correct answer information stored in the learning data storage unit 21 in association with the target learning data (step S6). Further, the learning unit 16 updates the parameters used by each of the feature extraction unit 11, the action section detection unit 12, and the identification unit 15 based on the loss calculated in step S6 (step S7). At this time, the learning unit 16 stores the respective parameters used by the feature extraction unit 11, the action section detection unit 12, and the identification unit 15 in the parameter storage unit 22.
  • the learning device 10 determines whether the learning end condition is satisfied (step S8).
  • the learning device 10 may determine the learning end condition by determining whether a predetermined number of loops has been reached, whether learning has been performed on a preset number of pieces of learning data, whether the loss has fallen below a preset threshold, or whether the change in the loss has fallen below a preset threshold.
  • the determination in step S8 may be a combination of the above examples, or any other determination method may be used.
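  • The end-condition check of step S8 might be combined as in the following sketch, where every threshold value is an arbitrary placeholder and the function name is hypothetical.

```python
def learning_should_end(loop_count: int, samples_seen: int, loss: float, prev_loss: float,
                        max_loops: int = 1000, max_samples: int = 50000,
                        loss_thr: float = 1e-3, loss_change_thr: float = 1e-5) -> bool:
    """Combine the end conditions listed above: loop count, number of learning samples,
    loss below a threshold, or change in loss below a threshold."""
    return (loop_count >= max_loops
            or samples_seen >= max_samples
            or loss < loss_thr
            or abs(prev_loss - loss) < loss_change_thr)

print(learning_should_end(loop_count=10, samples_seen=320, loss=0.002, prev_loss=0.0021))
```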
  • if the learning end condition is satisfied in step S8 (Yes in step S8), the learning device 10 ends the processing of this flowchart. On the other hand, if the learning end condition is not satisfied (No in step S8), the learning device 10 returns the process to step S1. At this time, in step S1 the learning device 10 retrieves unused learning data from the learning data storage unit 21 and performs the processing from step S2 onward.
  • the learning device 10 learns the learning model used for behavior estimation from the learning data, and records the parameters of the learned learning model in the storage device 20.
  • the estimating device 30 repeatedly executes the process shown in the flowchart shown in FIG. 8 every time input data is input to the estimating device 30.
  • the input data is obtained by scanning video data, which is time-series data, in a sliding window manner.
  • the feature extraction unit 31 acquires input data supplied from an external device (step S11). Then, the feature extraction unit 31 generates the feature amount F from the input data acquired in step S11 by configuring a feature extractor with reference to the parameters stored in the parameter storage unit 22 (step S12). Next, the identification unit 35 generates behavior information Ifa from the feature amount F by configuring a behavior information output device with reference to the parameters stored in the parameter storage unit 22 (step S13). Then, the output unit 36 outputs the identification information of the action and its start and end points to the external device based on the behavior information Ifa generated by the identification unit 35 (step S14).
  • in this manner, the estimation device 30 refers to the stored learned parameters, constructs an inference model, uses this model to infer the behavior in the video data to be inferred, and outputs the inference result.
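  • A hedged sketch of the inference flow of steps S11 to S14, scanning the input video in a sliding window manner; the `model` callable and the toy example are stand-ins, not the patent's actual feature extractor or identifier.

```python
import numpy as np

def infer_scores(video: np.ndarray, window_sw: int, slide_st: int, model) -> np.ndarray:
    """Scan the video with a sliding window and return one score vector per window,
    mirroring steps S11 to S14; `model` stands in for the feature extractor plus the
    behaviour-information output device configured from the stored parameters."""
    scores = []
    for start in range(0, video.shape[0] - window_sw + 1, slide_st):
        clip = video[start:start + window_sw]   # input data for one window
        scores.append(model(clip))              # behaviour information Ifa for the window
    return np.stack(scores)                     # (num_windows, num_classes)

toy_model = lambda clip: np.array([clip.mean(), 1.0 - clip.mean(), 0.5])  # dummy 3-class model
video = np.random.rand(120, 8, 8)
print(infer_scores(video, window_sw=32, slide_st=8, model=toy_model).shape)  # (12, 3)
```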
  • as described above, the action recognition system in this embodiment simultaneously learns action section detection in video data when learning the action recognition model, and distinguishes between sections that are effective for action recognition and sections that are not. In effective sections, behavior recognition is performed with a high score, and in ineffective sections it is performed with a low score. Therefore, even when data that is difficult to judge, such as a transition between behaviors, is input within a video section that is a candidate for behavior recognition, behavior information with low reliability is output for such data, which suppresses false detections and allows more accurate recognition of behavior.
  • FIG. 9 is a diagram for explaining the configuration of the estimation device in the second embodiment, and FIG. 10 is a diagram for explaining the operation of the estimation device.
  • the behavior recognition system 1 according to this embodiment differs from the first embodiment described above in the configuration of the estimation device 30.
  • configurations that are different from Embodiment 1 will be mainly described.
  • the estimation device 30 in this embodiment includes a feature extraction section 31, an action section detection section 32, an intra-section feature extraction section 33, an identification section 35, and an output section 36.
  • each function of the feature extraction unit 31, the action section detection unit 32, the intra-section feature extraction unit 33, the identification unit 35, and the output unit 36 can be realized by the arithmetic device executing a program, stored in the storage device, for realizing the respective function.
  • the action section detection unit 32 (target section detection unit) and the intra-section feature extraction unit 33 (target feature amount generation unit) in this embodiment are configurations added to the estimation device 30 of the first embodiment, and have the same functions as the action section detection unit 12 and the intra-section feature extraction unit 13 included in the learning device 10 of the first embodiment. That is, the action section detection unit 32 generates section information S for the inference data in the same manner as described above, and the intra-section feature extraction unit 33 generates the intra-section feature amount F1 from the feature amount F generated from the inference data and the section information S.
  • the identification unit 35 in this embodiment generates intra-section behavior information Ifa based on the above-mentioned intra-section feature amount F1, and the output unit 36 outputs identification information of the behavior to be extracted to an external device based on the intra-section behavior information Ifa.
  • in this embodiment, since the action section is detected in the inference data, the start and end of the action section can be detected without scanning the input time-series data in a sliding window manner, and the output of the identification unit 35 can be used to determine the identification information of the action in that section. For example, among the dimensions of the intra-section behavior information Ifa, the behavior class corresponding to the dimension with the largest score value is output.
  • when the input time-series data is scanned in a sliding window manner, the action section detection unit 32 detects the action section and confirms whether the target action is included in the data. If the length of the detected section exceeds a threshold, identification information of the behavior to be extracted is output based on the intra-section behavior information Ifa. Since the intra-section behavior information Ifa is output for each dimension in the time direction within the window, it is compressed in the time direction by averaging or summing the score values and output as a vector quantity whose number of dimensions equals the number of behavior classes to be recognized. This vector quantity becomes the identification information of the action at the time at the center of the window.
  • otherwise, if the length of the detected section does not exceed the threshold, a vector of fixed values prepared for the number of behavior classes to be recognized (for example, all zeros) is used as the behavior identification information.
  • then, the behavior identification information based on the behavior information Ifa of each window is arranged in chronological order; the time when a certain threshold is exceeded is set as the start point and the time when the score falls below a certain threshold is set as the end point, and the identification information of the action and its start and end points are output.
  • note that the action section detection unit 32 performs section detection based on the input data input from the external device, but it may also receive the feature amount F output from the feature extraction unit 31 and perform section detection based on it.
  • the estimation device 30 repeatedly executes the process of the flowchart shown in FIG. 10 every time video data serving as input data is input to the estimation device 30.
  • the input data may be time series data input as is, or may be input data scanned in a sliding window manner.
  • the feature extraction unit 31 acquires input data supplied from an external device (step S21). Then, the feature extraction unit 31 generates the feature amount F from the input data acquired in step S21 by configuring a feature extractor with reference to the parameters stored in the parameter storage unit 22 (step S22).
  • the action section detection unit 32 generates section information S from the input data by configuring an action section detector with reference to the parameters stored in the parameter storage unit 22 (step S23).
  • the intra-section feature extraction unit 33 generates the intra-section feature amount F1 from the feature amount F and the section information S (step S24).
  • the identification unit 35 refers to the parameters stored in the parameter storage unit 22 and configures a behavior information output device to extract behavior information Ifa from the intra-section feature quantity F1 generated by the intra-section feature extraction unit 33. generated (step S25). Then, the output unit 36 outputs the identification information, the start point, and the end point of the action to the external device based on the action information Ifa output by the identification unit 35 (step S26).
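  • A hedged sketch of the Embodiment 2 style identification described above, combining the section-length check with classification of the intra-section feature; the threshold `min_len` and the `classifier` stand-in are hypothetical.

```python
import numpy as np

def identify_with_section(feature_f: np.ndarray, section, classifier,
                          num_classes: int, min_len: int = 4) -> np.ndarray:
    """Classify only the detected action section; if the section is shorter than the
    threshold (the target action is judged absent), return a fixed all-zero vector as
    the behaviour identification information."""
    start, end = section
    if end - start < min_len:
        return np.zeros(num_classes)
    return classifier(feature_f[start:end])      # intra-section feature amount F1 -> Ifa

toy_classifier = lambda f1: f1.mean(axis=0)[:3]  # stand-in for the learned identifier
f = np.random.rand(32, 8)
print(identify_with_section(f, section=(10, 22), classifier=toy_classifier, num_classes=3))
print(identify_with_section(f, section=(10, 12), classifier=toy_classifier, num_classes=3))
```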
  • in this way, since the estimation device 30 includes the action section detection unit 32, when the input data is the entire time-series data, section detection can be performed without threshold processing. When the input time-series data is scanned using a sliding window method, it is possible to determine whether the target action is included within the window based on the length of the detected section. At this time, even if the section detection is inaccurate and data such as a transition between behaviors is mixed into the section, false detections are unlikely to occur because the scores are trained to become uniform for such data during learning.
  • FIGS. 11 and 12 are block diagrams showing the configuration of an information processing system according to the third embodiment, and FIG. 13 is a flowchart showing the operation of the information processing system. Note that this embodiment shows an outline of the configuration of the information processing system and the information processing method described in the above embodiments.
  • the information processing system 100 is configured with a general information processing device, and is equipped with the following hardware configuration as an example.
  • CPU (Central Processing Unit) 101
  • ROM (Read Only Memory) 102
  • RAM (Random Access Memory) 103
  • Program group 104 loaded into the RAM 103
  • Storage device 105 that stores the program group 104
  • Drive device 106 that reads from and writes to a storage medium 110 external to the information processing device
  • Communication interface 107 that connects to a communication network 111 external to the information processing device
  • Input/output interface 108 that inputs and outputs data
  • Bus 109 that connects the components
  • the information processing system 100 can construct and be equipped with the feature amount generation unit 121 and the learning unit 122 shown in FIG. 12 by having the CPU 101 acquire and execute the program group 104.
  • the program group 104 is stored in advance in the storage device 105 or ROM 102, for example, and is loaded into the RAM 103 and executed by the CPU 101 as needed.
  • the program group 104 may be supplied to the CPU 101 via the communication network 111, or may be stored in the storage medium 110 in advance, and the drive device 106 may read the program and supply it to the CPU 101.
  • the above-mentioned feature amount generation section 121 and learning section 122 may be constructed of a dedicated electronic circuit for realizing such means.
  • FIG. 11 shows an example of the hardware configuration of an information processing device that is the information processing system 100, and the hardware configuration of the information processing device is not limited to the above-described case.
  • the information processing device may be configured from part of the configuration described above, such as not having the drive device 106.
  • the information processing system 100 executes the information processing method shown in the flowchart of FIG. 13 by the functions of the feature value generation section 121 and the learning section 122 constructed by the program as described above.
  • that is, the information processing system 100 executes processing to: generate, from section video data that is video data divided into predetermined time sections, a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is a section other than the first section (step S101); and, when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, learn the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount (step S102).
  • by being configured as described above, the present invention generates a feature amount for a section of the video data that is effective for action recognition and a feature amount for a section that is not effective, and learns the action corresponding to the feature amount of the effective section and the action corresponding to the feature amount of the ineffective section. At this time, for example, learning is performed so that the correct action has a high score in an effective section and so that a plurality of actions all have low scores in an ineffective section. As a result, even when difficult-to-judge data such as a transition between behaviors is input within a video section that is a candidate for behavior recognition, behavior information with low reliability is output for such data, which suppresses false detections and allows more accurate recognition of actions.
  • Non-transitory computer-readable media include various types of tangible storage media.
  • Examples of non-transitory computer-readable media include magnetic recording media (e.g., flexible disks, magnetic tapes, hard disk drives), magneto-optical recording media (e.g., magneto-optical disks), CD-ROMs (Read Only Memory), CD-Rs, CD-R/W, semiconductor memory (eg, mask ROM, PROM (Programmable ROM), EPROM (Erasable PROM), flash ROM, RAM (Random Access Memory)).
  • the program may also be supplied to the computer via various types of transitory computer readable media. Examples of transitory computer-readable media include electrical signals, optical signals, and electromagnetic waves.
  • the transitory computer-readable medium can provide the program to the computer via a wired communication channel such as an electrical wire or optical fiber, or via a wireless communication channel.
  • although the present invention has been described above with reference to the above-described embodiments, the present invention is not limited to the above-described embodiments.
  • the configuration and details of the present invention can be modified in various ways within the scope of the present invention by those skilled in the art.
  • at least one or more of the functions of the feature amount generation unit 121 and the learning unit 122 described above may be executed by an information processing device installed at and connected from any location on the network, that is, may be performed by so-called cloud computing.
  • the feature amount generation unit generates the first feature amount so that the size in the time direction becomes a preset size.
  • (Appendix 7) The information processing system according to appendix 5 or 6, wherein, when the second section is a plurality of sections divided in the time direction, the feature amount generation unit generates the second feature amount based on video data of the second section that is a combination of the plurality of sections.
  • (Appendix 8) The information processing system according to any one of appendices 1 to 7, wherein the feature amount generation unit generates a feature amount of the section video data based on the section video data, and generates the first feature amount and the second feature amount based on that feature amount and the first section and the second section.
  • (Appendix 9) The information processing system according to any one of appendices 1 to 8, comprising a section detection unit that detects the first section and the second section of the section video data based on the section video data.
  • (Appendix 10) An information processing system comprising: a target feature amount generation unit that generates a target feature amount, which is a feature amount of target video data that is the target of behavior identification, based on the target video data; and an identification unit that identifies the behavior of the target video data based on the behavior output from the learning model in response to input of the target feature amount to the learning model.
  • (Appendix 11) The information processing system according to appendix 10, comprising a target section detection unit that detects the first section of the target video data based on the target video data, wherein the target feature amount generation unit generates a first target feature amount, which is a feature amount corresponding to the first section, based on the target feature amount and the first section of the target video data, and the identification unit identifies the behavior of the target video data based on the behavior output from the learning model in response to input of the first target feature amount to the learning model.
  • A computer-readable storage medium storing a program for causing an information processing device to execute processing to: generate, from section video data that is video data divided into predetermined time sections, a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is a section other than the first section; and, when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, learn the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount.
  • 1 Behavior recognition system, 10 Learning device, 11 Feature extraction unit, 12 Action section detection unit, 13 Intra-section feature extraction unit, 14 Out-of-section feature extraction unit, 15 Identification unit, 16 Learning unit, 20 Storage device, 21 Learning data storage unit, 22 Parameter storage unit, 30 Estimation device, 31 Feature extraction unit, 32 Action section detection unit, 33 Intra-section feature extraction unit, 35 Identification unit, 36 Output unit, 100 Information processing system, 101 CPU, 102 ROM, 103 RAM, 104 Program group, 105 Storage device, 106 Drive device, 107 Communication interface, 108 Input/output interface, 109 Bus, 110 Storage medium, 111 Communication network, 121 Feature amount generation unit, 122 Learning unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

This information processing device 100 comprises: a feature quantity generation unit for generating a first feature quantity based on video data of a first section, which is a section in one part, and a second feature quantity based on video data of a second section, which is a section other than the first section, from section video data divided into prescribed time sections; and a training unit that, when generating a learning model for outputting an action that corresponds to a feature quantity based on video data in response to input of the feature quantity, trains the learning model for an action that corresponds to the first feature quantity and for an action that corresponds to the second feature quantity.

Description

Information processing system
The present invention relates to an information processing system, an information processing method, and a program.
Patent Document 1 describes a method for extracting feature amounts of the behavior of a moving object from a video consisting of spatiotemporal information and estimating the behavior. Specifically, Patent Document 1 describes that a motion section from the start to the end of an action in a video is estimated, and the behavior is estimated based on the feature amount of the video of this motion section.
JP 2021-179728 A
However, the technique disclosed in Patent Document 1 described above has a problem in that erroneous recognition may occur if the estimation of the motion section is not accurate. For example, if the estimated motion section includes a transition from one behavior to another, data not expected by the behavior estimation model is mixed in, making behavior estimation difficult and causing erroneous recognition.
Therefore, an object of the present invention is to provide an information processing system that can solve the above-mentioned problem that erroneous recognition may occur when recognizing an action from a video.
An information processing system that is one form of the present invention includes: a feature amount generation unit that generates, from section video data that is video data divided into predetermined time sections, a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is a section other than the first section; and a learning unit that, when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, learns the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount.
Further, an information processing method that is one form of the present invention includes: generating, from section video data that is video data divided into predetermined time sections, a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is a section other than the first section; and, when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, learning the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount.
Further, a program that is one form of the present invention causes an information processing device to execute processing to: generate, from section video data that is video data divided into predetermined time sections, a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is a section other than the first section; and, when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, learn the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount.
By being configured as described above, the present invention can suppress erroneous recognition when recognizing an action from a video.
FIG. 1 is a block diagram showing the overall configuration of the action recognition system in Embodiment 1 of the present invention.
FIG. 2 is a block diagram showing the configuration of the learning device disclosed in FIG. 1.
FIG. 3 is a block diagram showing the configuration of the estimation device disclosed in FIG. 1.
FIG. 4 is a diagram showing processing by the learning device disclosed in FIG. 1.
FIG. 5 is a diagram showing processing by the learning device disclosed in FIG. 1.
FIG. 6 is a diagram showing processing by the learning device disclosed in FIG. 1.
FIG. 7 is a flowchart showing the operation of the learning device disclosed in FIG. 1.
FIG. 8 is a flowchart showing the estimation operation disclosed in FIG. 1.
FIG. 9 is a block diagram showing the configuration of the estimation device in Embodiment 2 of the present invention.
FIG. 10 is a flowchart showing the operation of the estimation device in Embodiment 2 of the present invention.
FIG. 11 is a block diagram showing the hardware configuration of the information processing system in Embodiment 3 of the present invention.
FIG. 12 is a block diagram showing the configuration of the information processing system in Embodiment 3 of the present invention.
FIG. 13 is a flowchart showing the operation of the information processing system in Embodiment 3 of the present invention.
<Embodiment 1>
A first embodiment of the present invention will be described with reference to FIGS. 1 to 8. FIGS. 1 to 3 are diagrams for explaining the configuration of the action recognition system, and FIGS. 4 to 8 are diagrams for explaining the processing operation of the action recognition system.
[Configuration]
The action recognition system 1 of the present invention generates a behavior estimation model by machine learning in order to recognize the behavior of a person from video data, and uses the generated behavior estimation model to recognize the behavior of a person in new video data. For example, the action recognition system 1 can be used for safety management at a construction site, to recognize whether or not a worker has performed a safety confirmation action such as a pointing-and-checking gesture. Specifically, the action recognition system 1 can record in a database, from video captured by surveillance cameras or the like installed at a construction site, when, where, and how many times workers performed safety confirmation work such as pointing and checking, and can issue alerts to sites where safety confirmation work is not being performed. The action recognition system 1 can also be used for man-hour management at a construction site: by recording which work the workers shown in the video performed, how many times, and for how long, it can be confirmed whether the work is being performed as expected. Note that in this embodiment, the actions recognized by the action recognition system 1 are described using a person's pointing action, walking action, and crouching action as examples; however, any action may be a recognition target, and the system may be used for action recognition of any target, not only persons.
As shown in FIG. 1, the action recognition system 1 includes a learning device 10, a storage device 20, and an estimation device 30. The learning device 10 is a device that learns, based on time-series data used for learning (also called learning data), a learning model used for estimating behavior information from time-series data. The storage device 20 is a device from which data can be read and to which data can be written, and stores the learning data, the parameters of the deep learning model, and so on. The estimation device 30 is a device that, when input data is input from an external device, configures an output (estimation) unit by referring to the learned parameters stored in the storage device 20 and generates information regarding the behavior of the estimation target. Each device is described in detail below.
The storage device 20 is composed of one or more information processing devices each including an arithmetic device and a storage device. As shown in FIG. 1, the storage device 20 includes a learning data storage unit 21 and a parameter storage unit 22.
The learning data storage unit 21 is a device that stores learning data used for the learning processing of the learning device 10. The learning data stored in the learning data storage unit 21 will be explained with reference to FIG. 4. As indicated by reference numeral D1 in FIG. 4, the learning data is video data consisting of a plurality of frames that are consecutive in time series, and is generated as time-series clips, which are section video data divided into predetermined time sections, by cutting the data out with a window of a predetermined width Sw. Then, as indicated by reference numeral D2 in FIG. 4, frames are cut out while the window is slid by a slide interval St, and time-series clips are sequentially generated and used as learning data. This scheme is referred to here as the sliding window method. As an example, the video data has a frame rate of 60 FPS, and time-series clips are sequentially created by sliding the window with a width Sw = 120 and a slide interval St = 1. Note that, as described later, the inference data, which is the video data input to the estimation device 30 from an external device, has the same structure.
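As an illustration only, the sliding window method described above can be sketched as follows in Python; the function name, the array layout, and the use of NumPy are assumptions made for this sketch and are not part of the described system.

```python
import numpy as np

def make_clips(frames: np.ndarray, sw: int = 120, st: int = 1) -> np.ndarray:
    """Cut a frame sequence of shape (T, ...) into time-series clips of width Sw,
    sliding the window by St frames at a time (the sliding window method)."""
    clips = [frames[start:start + sw] for start in range(0, len(frames) - sw + 1, st)]
    return np.stack(clips)   # (number of clips, Sw, ...)

# Example: 10 seconds of 60 FPS video with 18 skeleton points (x, y) per frame.
video = np.zeros((600, 18, 2))
clips = make_clips(video, sw=120, st=1)
print(clips.shape)           # (481, 120, 18, 2)
```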
Further, correct answer information, which is the behavior information (correct behavior) to be estimated for the corresponding learning data, is stored in association with the learning data. The correct answer information includes identification information of the behavior that is the correct answer. For example, in the case of time-series data of skeleton information extracted from an image sequence showing a walking person, the correct answer information associated with that learning data includes identification information indicating walking.
The parameter storage unit 22 is a device that stores parameters obtained by learning the learning model. The learning model may be a learning model based on a neural network, another type of learning model such as a support vector machine, or a combination of these. For example, when the learning model is a neural network such as a convolutional neural network, the parameters include the layer structure, the neuron structure of each layer, the number and size of filters in each layer, and the weight of each element of each filter. Note that, before learning is executed, initial values of the parameters to be applied to the learning model are stored in the parameter storage unit 22, and the parameters are updated every time learning is performed by the learning device 10, as described later.
The learning device 10 is composed of one or more information processing devices each including an arithmetic device and a storage device. As shown in FIG. 2, the learning device 10 includes a feature extraction unit 11, an action section detection unit 12, an in-section feature extraction unit 13, an out-of-section feature extraction unit 14, an identification unit 15, and a learning unit 16. The functions of these units can be realized by the arithmetic device executing a program, stored in the storage device, for realizing each function. Each component is described in detail below.
The feature extraction unit 11 (feature amount generation unit) acquires learning data of the above-described window width from the learning data storage unit 21 and converts the acquired learning data into a feature amount F. As illustrated in FIG. 6, the feature amount F has a width in the time direction. The feature amount F may be, for example, three-dimensional data over the time direction, the direction of the skeleton points of the person, and the dimension direction of the vector calculated at each time and each position, as computed by a neural network such as that of Non-Patent Document 1, or it may be two-dimensional data over the time direction and the dimension direction, obtained by collapsing the skeleton-point direction by taking the maximum or the average. The time direction may be kept per frame, or may be compressed by the convolution processing of the neural network.
For example, when the sliding window width is Sw = 120, there are 18 skeleton points, the vector calculated at each time and each position has 256 dimensions, and the time direction is compressed by half by the convolution processing, the feature amount F becomes three-dimensional data of 60 × 18 × 256. Here, the feature extraction unit 11 configures a feature extractor by applying the parameters stored in the parameter storage unit 22 to a learning model that is trained to output the feature amount F from the input learning data. The feature extraction unit 11 then supplies the feature amount F, obtained by inputting the learning data to the feature extractor, to the in-section feature extraction unit 13 and the out-of-section feature extraction unit 14.
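The tensor shape given in this example (60 × 18 × 256 after halving the time direction) can be reproduced with the following minimal PyTorch sketch; the single strided 1D convolution is an assumed stand-in for the actual network and is not the architecture of Non-Patent Document 1.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Maps a skeleton clip of shape (B, Sw, J, C_in) to a feature F of shape (B, Sw/2, J, 256)."""
    def __init__(self, c_in: int = 2, c_out: int = 256):
        super().__init__()
        # A 1D temporal convolution with stride 2 halves the time axis (Sw -> Sw/2).
        self.conv = nn.Conv1d(c_in, c_out, kernel_size=3, stride=2, padding=1)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        b, t, j, c = clip.shape
        x = clip.permute(0, 2, 3, 1).reshape(b * j, c, t)   # treat each joint as a sequence
        x = self.conv(x)                                    # (B*J, 256, T/2)
        x = x.reshape(b, j, -1, x.shape[-1]).permute(0, 3, 1, 2)
        return x                                            # (B, T/2, J, 256)

extractor = FeatureExtractor()
F = extractor(torch.zeros(1, 120, 18, 2))
print(F.shape)   # torch.Size([1, 60, 18, 256])
```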
The action section detection unit 12 (section detection unit) acquires learning data from the learning data storage unit 21, detects from the acquired learning data a section that is important for estimating behavior information (first section), and outputs section information S. A section that is important for estimating behavior information here is a section that is useful as a basis for judgment when estimating the behavior, that is, a section for which the error becomes small when the result is later compared with the correct answer information associated with the learning data. In other words, the action section detection unit 12 outputs, as the section information S, the section of the behavior corresponding to the correct answer information of the learning data. At this time, the action section detection unit 12 configures an action section detector by applying the parameters stored in the parameter storage unit 22 to a learning model trained to output the section information S from the input learning data. The action section detector may estimate the action section of the recognition target with respect to the time direction of the feature amount F using a neural network that estimates deformation parameters of a feature amount as in Non-Patent Document 2, or a section detector that directly learns section detection from prepared ground-truth sections may be used.
As an example, when the pointing action is the action section to be recognized, the action section detection unit 12 outputs, through the learning described above, the section in which that action is performed as the section information S, and as a result the sections in which other actions are performed are detected as being outside the action section (second section). For example, in the window W indicated by reference numeral D4 in FIG. 5, the frame section shown in gray is detected as the action section of the recognition target, and the frame section indicated by reference numeral Da is detected as being outside the action section. The action section detection unit 12 then supplies the obtained section information S to the in-section feature extraction unit 13 and the out-of-section feature extraction unit 14.
The in-section feature extraction unit 13 (feature amount generation unit) and the out-of-section feature extraction unit 14 (feature amount generation unit) generate an in-section feature amount F1 and an out-of-section feature amount F2, respectively, using the feature amount F supplied from the feature extraction unit 11 and the section information S supplied from the action section detection unit 12. Specifically, as shown in FIG. 6, the in-section feature extraction unit 13 cuts out the section corresponding to the section information S from the feature amount F as in Non-Patent Document 2 and applies resizing, warping, or similar processing so that the time direction is always adjusted to a constant size regardless of the size of the section information S, thereby generating the in-section feature amount F1 (first feature amount). Like the feature amount F, the in-section feature amount F1 may be three-dimensional data over the time direction, the skeleton-position direction, and the dimension direction of the vector calculated at each time and each position, or two-dimensional data over the time and dimension directions. The in-section feature extraction unit 13 supplies the generated in-section feature amount F1 to the identification unit 15.
Further, as shown in FIG. 6, after the feature amount corresponding to the section indicated by the section information S has been cut out of the feature amount F as described above, the out-of-section feature extraction unit 14 generates the out-of-section feature amount F2 (second feature amount) from the feature amounts of the sections that were not cut out, that is, the sections detected as being outside the action section. At this time, as shown in FIG. 5, when there are multiple sections that were not cut out and are separated in the time direction, the out-of-section feature extraction unit 14 connects them in the time direction to generate a single out-of-section feature amount F2. The out-of-section feature extraction unit 14 then supplies the generated out-of-section feature amount F2 to the identification unit 15.
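A minimal sketch of how the in-section feature amount F1 and the out-of-section feature amount F2 might be produced from the feature amount F and the section information S is shown below; the fixed output length of 32 and the use of linear interpolation for the resize step are assumptions, and warping or other adjustments mentioned above are omitted.

```python
import torch
import torch.nn.functional as Fn

def split_features(F: torch.Tensor, start: int, end: int, fixed_len: int = 32):
    """F: (T, J, C) feature of one clip; [start, end) is the detected action section S.
    Returns the resized in-section feature F1 and the concatenated out-of-section feature F2."""
    t, j, c = F.shape
    inside = F[start:end]                                  # (Ts, J, C)
    # Resize the in-section part to a fixed temporal size regardless of the section length.
    x = inside.permute(1, 2, 0).reshape(1, j * c, -1)      # (1, J*C, Ts)
    x = Fn.interpolate(x, size=fixed_len, mode='linear', align_corners=False)
    F1 = x.reshape(j, c, fixed_len).permute(2, 0, 1)       # (fixed_len, J, C)
    # Connect the parts before and after the section along the time axis.
    F2 = torch.cat([F[:start], F[end:]], dim=0)            # (T - Ts, J, C)
    return F1, F2

F1, F2 = split_features(torch.zeros(60, 18, 256), start=20, end=45)
print(F1.shape, F2.shape)   # torch.Size([32, 18, 256]) torch.Size([35, 18, 256])
```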
Here, the case where the feature amount F is generated from the window-width learning data and the in-section feature amount F1 and the out-of-section feature amount F2 are then generated from it has been given as an example, but the method of generating F1 and F2 is not limited to this. For example, the in-section feature extraction unit 13 may generate the in-section feature amount F1 from the portion of the learning data corresponding to the section information S, and the out-of-section feature extraction unit 14 may generate the out-of-section feature amount F2 from the portion of the learning data not corresponding to the section information S.
The identification unit 15 generates information about the behavior of the target based on the in-section feature amount F1 supplied from the in-section feature extraction unit 13 and the out-of-section feature amount F2 supplied from the out-of-section feature extraction unit 14. At this time, the identification unit 15 configures a behavior information output unit by applying the parameters stored in the parameter storage unit 22 to a learning model trained to output in-section behavior information If1 and out-of-section behavior information If2 from the input in-section feature amount F1 and out-of-section feature amount F2, respectively. The in-section behavior information If1 and the out-of-section behavior information If2 are, for example, identification information of the behavior corresponding to the target learning data or score values of the estimated behavior, and are vectors whose number of dimensions equals the number of defined behavior categories. When the in-section feature amount F1 and the out-of-section feature amount F2 are three-dimensional data, the maximum or average is taken in the skeleton-point direction to collapse them into two dimensions, identification processing is performed for each step in the time direction, and averaging is then performed over the time direction. Accordingly, the output of the identification unit 15 is a vector whose number of dimensions equals the number of behavior categories to be recognized. The identification unit 15 supplies the in-section behavior information If1 and the out-of-section behavior information If2, obtained by inputting the in-section feature amount F1 and the out-of-section feature amount F2 to the behavior information output unit, to the learning unit 16.
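The identification processing described above (collapsing the skeleton-point direction by taking the maximum, classifying each time step, and averaging over time) could look roughly like the following sketch; the single linear classifier and the three example classes are assumptions.

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Turns an interval feature (T, J, C) into a score vector with one value per action class."""
    def __init__(self, c: int = 256, num_classes: int = 3):   # e.g. pointing / walking / crouching
        super().__init__()
        self.fc = nn.Linear(c, num_classes)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = feat.max(dim=1).values        # collapse the skeleton-point axis -> (T, C)
        scores = self.fc(x)               # per-time-step class scores -> (T, num_classes)
        return scores.mean(dim=0)         # average over time -> (num_classes,)

classifier = Classifier()
If1 = classifier(torch.zeros(32, 18, 256))   # in-section behavior information
If2 = classifier(torch.zeros(35, 18, 256))   # out-of-section behavior information
print(If1.shape)                             # torch.Size([3])
```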
The learning unit 16 acquires the correct answer information corresponding to the learning data input to the feature extraction unit 11 from the learning data storage unit 21. Based on the acquired correct answer information and the in-section behavior information If1 and out-of-section behavior information If2 supplied from the identification unit 15, the learning unit 16 performs learning of the feature extraction unit 11, the action section detection unit 12, and the identification unit 15. At this time, the learning unit 16 obtains a loss value L1, calculated from the error between the behavior information indicated by the in-section behavior information If1 and the correct answer information, and a loss value L2, calculated from the out-of-section behavior information If2, computes a loss from these values, and updates the parameters of the feature extraction unit 11, the action section detection unit 12, and the identification unit 15 based on this loss. The loss value L1 may be calculated using any loss function used in machine learning, such as the softmax cross-entropy error or the mean squared error. The loss value L2 is calculated so that the values become equal across all behavior categories: for example, the average of the negative logarithms of the per-category averages of the softmax values of the out-of-section behavior information If2 may be used as the loss value L2, or a constraint may be imposed so that 0 is output for all behavior categories. For example, when the pointing action, walking action, and crouching action are the actions to be recognized and the correct behavior of the input learning data is the pointing action, the loss L1 becomes smaller as the value of the dimension of the in-section behavior information If1 corresponding to the pointing action becomes the largest, and the loss L2 becomes smaller as the values of all dimensions of the out-of-section behavior information If2 become more uniform. The learning unit 16 determines each parameter so as to minimize these losses.
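Assuming softmax cross entropy for the loss value L1 and, for the loss value L2, the average negative logarithm of the softmax values of the out-of-section behavior information (one of the possibilities permitted by the description above), the combined loss could be sketched as follows.

```python
import torch
import torch.nn.functional as Fn

def training_loss(if1: torch.Tensor, if2: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """if1, if2: class-score vectors of shape (num_classes,); label: index of the correct action."""
    # L1: the in-section scores should single out the correct action class.
    l1 = Fn.cross_entropy(if1.unsqueeze(0), label.unsqueeze(0))
    # L2: the out-of-section softmax values should be uniform over all action classes;
    # the mean negative log-probability is smallest when every class has equal probability.
    p2 = torch.softmax(if2, dim=0)
    l2 = (-torch.log(p2 + 1e-8)).mean()
    return l1 + l2

loss = training_loss(torch.randn(3), torch.randn(3), torch.tensor(1))
print(loss)
```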
Note that, using the section information S obtained so far, a penalty may be imposed on detection results exceeding a certain threshold and added to the loss, in order to prevent the section information S from always becoming very large or, conversely, always very small. For example, when the size of the section information S with respect to the time direction of the data is 0.9 (where 1 is the entire data), the threshold is set to 0.7, and for an inference result exceeding 0.7, the threshold is subtracted from the inference result and the squared value is added to the loss. For small results, the same processing is performed when the result falls below a threshold. The algorithm for determining the above parameters so as to minimize the loss may be any learning algorithm used in machine learning, such as gradient descent or error backpropagation. The learning unit 16 stores the determined parameters of the feature extraction unit 11, the action section detection unit 12, and the identification unit 15 in the parameter storage unit 22.
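Under the numerical example above (section size expressed as a fraction of the whole clip, upper threshold 0.7), the penalty could be sketched as follows; the lower threshold of 0.1 is an assumed value, since the description only says that small results are treated in the same way.

```python
import torch

def section_size_penalty(s_ratio: torch.Tensor,
                         upper: float = 0.7, lower: float = 0.1) -> torch.Tensor:
    """s_ratio: detected section length as a fraction of the clip (1.0 = the whole clip).
    Adds a squared penalty when the detected section is too large or too small."""
    over = torch.clamp(s_ratio - upper, min=0.0)    # e.g. 0.9 -> (0.9 - 0.7)^2
    under = torch.clamp(lower - s_ratio, min=0.0)
    return over ** 2 + under ** 2

print(section_size_penalty(torch.tensor(0.9)))      # tensor(0.0400)
```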
As described above, the learning device 10 has, in the action section detection unit 12, a function of detecting a section of the time-series data that is useful for behavior estimation. In the learning processing executed by the learning device 10, the in-section feature extraction unit 13 and the out-of-section feature extraction unit 14 cut the in-section feature amount F1 and the out-of-section feature amount F2 out of the feature amount F output by the feature extraction unit 11. Learning then proceeds so that, when the in-section feature amount F1 is passed through the identification unit 15, the score value corresponding to the correct behavior class becomes the largest, and so that, when the out-of-section feature amount F2 is passed through the identification unit 15, the score values of all behavior classes become uniform and no class stands out. By handling not only the information inside the section but also the information outside the section in this way, the learning device 10 can, given sufficient data, guarantee that the behavior estimation model behaves so that no behavior class produces a prominently large score value when data from a section of low importance for behavior estimation is input. For example, when a person's pointing action, walking action, and crouching action are the recognition targets, a transition from walking to pointing requires the person to slow down in order to stop and prepare for the pointing action. When a behavior candidate section overlaps such a transition section, conventional models, which do not take this section into account, may output a result in which the score value of some behavior class stands out, causing false detection. In contrast, in the present invention, false detection can be suppressed by learning to detect sections useful for behavior estimation and by training so that all score values become uniform outside those sections.
Here, the action section detection unit 12 performs section detection based on the learning data received from the learning data storage unit 21, but section detection may instead be performed with the feature amount F output by the feature extraction unit 11 as input.
Next, the configuration of the estimation device 30 will be described. The estimation device 30 is composed of one or more information processing devices each including an arithmetic device and a storage device. As shown in FIG. 3, the estimation device 30 includes a feature extraction unit 31, an identification unit 35, and an output unit 36. The functions of the feature extraction unit 31, the identification unit 35, and the output unit 36 can be realized by the arithmetic device executing a program, stored in the storage device, for realizing each function. Each component is described in detail below.
The feature extraction unit 31 (target feature amount generation unit) acquires time-series data input from an external device and converts the acquired time-series data into a feature amount F (target feature amount). The time-series data input from the external device is inference data to be subjected to behavior identification, and is video data (target video data) of the same form as the learning data described above. That is, as shown in FIG. 4, the inference data is video data consisting of a plurality of frames (an image sequence) that are consecutive in time series, cut out as time-series clips with a window of the predetermined width Sw.
However, the inference data input to the estimation device 30 may be data obtained by further extracting information from the image sequence, such as skeleton information. The external device that inputs the inference data may be a camera when an image sequence is input, or a device that stores generated image sequences or information extracted from image sequences when such data is input.
The feature extraction unit 31 then refers to the parameters obtained by the learning processing of the learning device 10 and stored in the parameter storage unit 22, and configures a feature extractor based on those parameters. The feature extraction unit 31 supplies the feature amount F, obtained by inputting the inference data to the feature extractor, to the identification unit 35.
The identification unit 35 generates behavior information Ifa from the feature amount F supplied from the feature extraction unit 31. In the identification unit 35, estimation is performed for each step of the feature amount F in the time direction. At this time, the identification unit 35 configures a behavior information output unit by referring to the parameters stored in the parameter storage unit 22. The identification unit 35 then supplies the behavior information Ifa, obtained by inputting the feature amount F to the behavior information output unit, to the output unit 36.
The output unit 36 outputs identification information of the behavior to be extracted to an external device based on the behavior information Ifa. Since the behavior information Ifa is output for each step in the time direction, it is compressed in the time direction, for example by averaging or summing the score values, into a vector whose number of dimensions equals the number of behavior classes to be recognized. This vector serves as the identification information of the behavior at the time of the center of the window. Using this behavior identification information, when the input time-series data divided at fixed window intervals in a sliding window manner is used as the input data, the score values of the behavior identification information based on the behavior information Ifa of each window are arranged in time series; the time at which a certain threshold is exceeded is defined as the start point and the time at which the score falls below a certain threshold is defined as the end point, and the behavior identification information together with its start point and end point is output.
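A minimal sketch of this post-processing, in which the per-window scores of one action class are arranged in time order and a start point and an end point are determined by threshold crossings, is given below; the threshold value and the function name are assumptions.

```python
import numpy as np

def detect_intervals(scores: np.ndarray, threshold: float = 0.5):
    """scores: time-ordered score values of one action class, one value per window centre.
    Returns (start, end) index pairs where the score rises above / falls back below the threshold."""
    intervals, start = [], None
    for t, s in enumerate(scores):
        if s > threshold and start is None:
            start = t                      # score exceeded the threshold -> start point
        elif s <= threshold and start is not None:
            intervals.append((start, t))   # score dropped below the threshold -> end point
            start = None
    if start is not None:
        intervals.append((start, len(scores)))
    return intervals

scores = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.1])
print(detect_intervals(scores))            # [(2, 5)]
```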
Note that the estimation processing executed by the estimation device 30 described above does not perform the section detection performed in the learning processing executed by the learning device 10. This is because, through the learning processing, the behavior estimation model reacts and its score value rises only in the characteristic sections of a behavior; therefore, when the time-series data is scanned in a sliding window manner, the scores are arranged in time series, and the start point and end point are determined by threshold judgment, section detection within the model is not required.
[Operation]
Next, the operation of the action recognition system 1 described above will be explained mainly with reference to the flowcharts of FIGS. 7 and 8. First, the operation of the learning device 10 in the learning mode will be explained with reference to the flowchart of FIG. 7.
The feature extraction unit 11 acquires learning data from the learning data storage unit 21 (step S1). At this time, the feature extraction unit 11 acquires, from among the learning data stored in the learning data storage unit 21, learning data that has not yet been used for learning (that is, not yet acquired in step S1). The feature extraction unit 11 then configures a feature extractor by referring to the parameters stored in the parameter storage unit 22, and thereby generates the feature amount F from the learning data acquired in step S1 (step S2).
Subsequently, the action section detection unit 12 configures an action section detector by referring to the parameters stored in the parameter storage unit 22, and thereby generates the section information S from the learning data (step S3). Then, from the feature amount F and the section information S, the in-section feature extraction unit 13 and the out-of-section feature extraction unit 14 generate the in-section feature amount F1 and the out-of-section feature amount F2, respectively (step S4). Subsequently, the identification unit 15 configures a behavior information output unit by referring to the parameters stored in the parameter storage unit 22, and thereby generates the behavior information If1 from the in-section feature amount F1 generated by the in-section feature extraction unit 13 and the behavior information If2 from the out-of-section feature amount F2 generated by the out-of-section feature extraction unit 14 (step S5).
Then, the learning unit 16 calculates the loss based on the in-section behavior information If1 and out-of-section behavior information If2 generated by the identification unit 15 and the correct answer information stored in the learning data storage unit 21 in association with the target learning data (step S6). The learning unit 16 further updates the parameters used by the feature extraction unit 11, the action section detection unit 12, and the identification unit 15 based on the loss calculated in step S6 (step S7). At this time, the learning unit 16 stores the parameters used by the feature extraction unit 11, the action section detection unit 12, and the identification unit 15 in the parameter storage unit 22.
Subsequently, the learning device 10 determines whether a learning termination condition is satisfied (step S8). The termination condition may be judged, for example, by whether a preset number of loops has been reached, by whether learning has been executed on a preset number of learning data, by whether the loss has fallen below a preset threshold, or by whether the change in the loss has fallen below a preset threshold. Step S8 may also use a combination of these examples or any other judgment method. When the learning termination condition is satisfied (Yes in step S8), the learning device 10 ends the flowchart. On the other hand, when the termination condition is not satisfied (No in step S8), the learning device 10 returns the processing to step S1. At this time, the learning device 10 retrieves unused learning data from the learning data storage unit 21 in step S1 and performs the processing from step S2 onward.
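Putting steps S1 to S8 together, the learning loop could look roughly like the sketch below; it reuses the hypothetical helpers from the earlier sketches (split_features, training_loss), uses a simple loop-count termination condition, and is not the actual implementation.

```python
import torch

def train(loader, extractor, detector, classifier, optimizer, max_epochs: int = 10):
    """loader yields (clip, label); extractor, detector and classifier correspond to the
    feature extraction, action-section detection and identification units, respectively."""
    for epoch in range(max_epochs):                         # step S8: simple loop-count condition
        for clip, label in loader:                          # step S1: fetch learning data
            F = extractor(clip)                             # step S2: feature amount F
            start, end = detector(clip)                     # step S3: section information S
            F1, F2 = split_features(F[0], start, end)       # step S4: in-/out-of-section features
            if1, if2 = classifier(F1), classifier(F2)       # step S5: behavior information If1, If2
            loss = training_loss(if1, if2, label)           # step S6: loss L1 + L2
            optimizer.zero_grad()
            loss.backward()                                 # step S7: update the parameters
            optimizer.step()
```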
In this way, the learning device 10 learns the learning model used for behavior estimation from the learning data and records the parameters of the learned model in the storage device 20.
Next, the operation of the estimation device 30 in the inference mode will be explained with reference to the flowchart of FIG. 8. The estimation device 30 repeatedly executes the processing of the flowchart shown in FIG. 8 every time input data is input to the estimation device 30. As described above, it is assumed that the input data are obtained by scanning video data, which is time-series data, in a sliding window manner.
The feature extraction unit 31 acquires input data supplied from an external device (step S11). The feature extraction unit 31 then configures a feature extractor by referring to the parameters stored in the parameter storage unit 22, and thereby generates the feature amount F from the input data acquired in step S11 (step S12). Next, the identification unit 35 configures a behavior information output unit by referring to the parameters stored in the parameter storage unit 22, and thereby generates the behavior information Ifa from the feature amount F (step S13). Then, based on the behavior information Ifa generated by the identification unit 35, the output unit 36 outputs the identification information of the behavior and its start point and end point to the external device (step S14).
In this way, the estimation device 30 refers to the stored learned parameters, configures an inference model, uses this model to infer the behavior for the video data to be inferred, and outputs the inference result.
As described above, in the action recognition system of this embodiment, detection of the action section in the video data is learned at the same time as the action recognition model, so that sections of the video data that are effective for action recognition are distinguished from sections that are not, with action recognition performed with high scores in effective sections and with low scores in ineffective sections. Therefore, even when data that is difficult to judge, such as a transition between actions, is input within a video section that is a candidate for action recognition, behavior information is output with low confidence for such data, so that false detection can be suppressed and actions can be recognized more correctly.
<Embodiment 2>
Next, a second embodiment of the present invention will be described with reference to FIGS. 9 and 10. FIG. 9 is a diagram for explaining the configuration of the estimation device, and FIG. 10 is a diagram for explaining the operation of the estimation device.
[Configuration]
The action recognition system 1 in this embodiment differs from Embodiment 1 described above in the configuration of the estimation device 30. The configurations that differ from Embodiment 1 are mainly described below.
As shown in FIG. 9, the estimation device 30 in this embodiment includes a feature extraction unit 31, an action section detection unit 32, an in-section feature extraction unit 33, an identification unit 35, and an output unit 36. The functions of these units can be realized by the arithmetic device executing a program, stored in the storage device, for realizing each function.
The action section detection unit 32 (target section detection unit) and the in-section feature extraction unit 33 (target feature amount generation unit) in this embodiment are components added to the estimation device 30 of Embodiment 1, and have the same functions as the action section detection unit 12 and the in-section feature extraction unit 13 of the learning device 10 of Embodiment 1. That is, the action section detection unit 32 generates the section information S for the inference data in the same manner as described above, and the in-section feature extraction unit 33 generates the in-section feature amount F1 from the feature amount F generated from the inference data and the section information S.
The identification unit 35 in this embodiment then generates in-section behavior information Ifa based on the in-section feature amount F1 described above, and the output unit 36 outputs identification information of the behavior to be extracted to an external device based on this in-section behavior information Ifa. In this way, compared with the first embodiment, this embodiment detects the action section within the inference data, so that the start and end of an action can be detected without scanning the input time-series data in a sliding window manner, and the identification information of the behavior in that section can be obtained using the output of the identification unit 35. For example, the behavior class corresponding to the dimension of the in-section behavior information Ifa with the largest score value is output.
When the input time-series data is scanned in a sliding window manner, the action section detection unit 32 performs action section detection to confirm whether the target action is contained in the data. If the length of the detected section exceeds a threshold, identification information of the behavior to be extracted is output based on the in-section behavior information Ifa. Since the in-section behavior information Ifa is output for each step in the time direction within the window, it is compressed in the time direction, for example by averaging or summing the score values, into a vector whose number of dimensions equals the number of behavior classes to be recognized; this vector serves as the identification information of the behavior at the time of the center of the window. If the length of the section does not exceed the threshold, fixed values, one per behavior class to be recognized (for example, all zeros), are used as the behavior identification information. Using this behavior identification information, when the input time-series data divided at fixed window intervals in a sliding window manner is used as the input data, the score values of the behavior identification information based on the behavior information Ifa of each window are arranged in time series; the time at which a certain threshold is exceeded is defined as the start point and the time at which the score falls below a certain threshold is defined as the end point, and the behavior identification information together with its start point and end point is output.
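For this window-by-window case, the choice between outputting class scores and outputting fixed values depending on the detected section length could be sketched as follows; the minimum length of 30 time steps and the helper names (reused from the earlier sketches) are assumptions.

```python
import torch

def window_behavior_info(clip, extractor, detector, classifier,
                         min_len: int = 30, num_classes: int = 3) -> torch.Tensor:
    """Returns the behavior information Ifa for one window in the Embodiment 2 flow."""
    F = extractor(clip)                         # feature amount F for this window
    start, end = detector(clip)                 # section information S
    if end - start <= min_len:                  # target action not (sufficiently) contained
        return torch.zeros(num_classes)         # fixed values for all action classes
    F1, _ = split_features(F[0], start, end)    # in-section feature only
    return classifier(F1)                       # class-score vector for the window centre
```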
Here, the action section detection unit 32 performs section detection based on the input data input from the external device, but it may instead receive the feature amount F output by the feature extraction unit 31 and perform section detection based on it.
[Operation]
Next, the operation of the estimation device 30 in this embodiment will be explained with reference to the flowchart of FIG. 10. The estimation device 30 repeatedly executes the processing of the flowchart shown in FIG. 10 every time video data to be estimated is input to the estimation device 30 as input data. The input data may be the time-series data input as is, or data obtained by scanning it in a sliding window manner.
The feature extraction unit 31 acquires input data supplied from an external device (step S21). The feature extraction unit 31 then configures a feature extractor by referring to the parameters stored in the parameter storage unit 22, and thereby generates the feature amount F from the input data acquired in step S21 (step S22).
Subsequently, the action section detection unit 32 configures an action section detector by referring to the parameters stored in the parameter storage unit 22, and thereby generates the section information S from the input data (step S23). Then, from the feature amount F and the section information S, the in-section feature extraction unit 33 generates the in-section feature amount F1 (step S24).
Subsequently, the identification unit 35 configures a behavior information output unit by referring to the parameters stored in the parameter storage unit 22, and thereby generates the behavior information Ifa from the in-section feature amount F1 generated by the in-section feature extraction unit 33 (step S25). Then, based on the behavior information Ifa output by the identification unit 35, the output unit 36 outputs the identification information of the behavior and its start point and end point to the external device (step S26).
As described above, in this embodiment the estimation device 30 includes the action section detection unit 32, so that when the input data is the entire time-series data, section detection can be performed without threshold processing. When the input time-series data is scanned in a sliding window manner, whether the target action is contained in the window can be judged from the length of the detected section. Even if the section detection is slightly off and data such as a transition between actions is mixed into the section, false detection is unlikely to occur because the model has been trained so that the scores become uniform for such data.
<Embodiment 3>
Next, a third embodiment of the present invention will be described with reference to FIGS. 11 to 13. FIGS. 11 and 12 are block diagrams showing the configuration of the information processing system in Embodiment 3, and FIG. 13 is a flowchart showing the operation of the information processing system. This embodiment shows an outline of the configurations of the information processing system and the information processing method described in the above embodiments.
First, with reference to FIG. 11, the hardware configuration of the information processing system 100 in this embodiment will be described. The information processing system 100 is configured as a general information processing device and is equipped with, as an example, the following hardware configuration:
・CPU (Central Processing Unit) 101 (arithmetic device)
・ROM (Read Only Memory) 102 (storage device)
・RAM (Random Access Memory) 103 (storage device)
・Program group 104 loaded into the RAM 103
・Storage device 105 that stores the program group 104
・Drive device 106 that reads from and writes to a storage medium 110 external to the information processing device
・Communication interface 107 that connects to a communication network 111 external to the information processing device
・Input/output interface 108 that inputs and outputs data
・Bus 109 connecting the components
The information processing system 100 can construct and be equipped with the feature amount generation unit 121 and the learning unit 122 shown in FIG. 12 by the CPU 101 acquiring and executing the program group 104. The program group 104 is, for example, stored in advance in the storage device 105 or the ROM 102 and is loaded into the RAM 103 and executed by the CPU 101 as needed. The program group 104 may also be supplied to the CPU 101 via the communication network 111, or may be stored in advance in the storage medium 110 and read out by the drive device 106 and supplied to the CPU 101. However, the feature amount generation unit 121 and the learning unit 122 described above may instead be constructed with dedicated electronic circuits for realizing these means.
 Note that FIG. 11 shows an example of the hardware configuration of an information processing device serving as the information processing system 100, and the hardware configuration of the information processing device is not limited to this example. For example, the information processing device may be configured with only part of the configuration described above, such as omitting the drive device 106.
 The information processing system 100 then executes the information processing method shown in the flowchart of FIG. 13 by the functions of the feature amount generation unit 121 and the learning unit 122 constructed by the program as described above.
 As shown in FIG. 13, the information processing system 100 executes the following processing:
 it generates, from section video data, which is video data divided into predetermined time sections, a first feature amount based on the video data of a first section that is a part of the section, and a second feature amount based on the video data of a second section that is the section other than the first section (step S101); and
 when generating a learning model that recognizes, in response to the input of a feature amount based on video data, the action corresponding to that feature amount, it learns the action corresponding to the first feature amount and the action corresponding to the second feature amount (step S102).
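 Step S101 can be illustrated with a short sketch. It assumes that per-frame features of shape (T, D) have already been extracted for the section video data and that the first section is given as a frame index range; the function name and the simple slice-and-concatenate scheme are illustrative assumptions. The case where the second section falls into two pieces (before and after the first section) is handled by concatenating them along the time axis, which is one way to realize the combination described later in the supplementary notes.

```python
import numpy as np

def split_section_features(section_features, first_start, first_end):
    """Split a time-ordered feature sequence for one section of video into the
    first-section (action) part and the second-section (remainder) part.

    section_features : array of shape (T, D), one D-dimensional feature per frame.
    first_start, first_end : frame indices [first_start, first_end) of the first section.
    """
    first = section_features[first_start:first_end]
    # The remainder may consist of two pieces; concatenate them along the time axis.
    second = np.concatenate(
        [section_features[:first_start], section_features[first_end:]], axis=0
    )
    return first, second
```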
 With the configuration described above, the present invention generates a feature amount for a section of the video data that is effective for action recognition and a feature amount for a section that is not effective, and learns the action corresponding to the effective section's feature amount and the action corresponding to the ineffective section's feature amount. At this time, for example, the model is trained so that the correct action receives a high score in the effective section and so that a plurality of actions each receive a low score in the ineffective section. As a result, even when data that is difficult to judge, such as a transition between actions, is input within a video section that is a candidate for action recognition, the model outputs action information with low confidence for such data, so that false detections are suppressed and actions can be recognized more correctly.
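 One way to realize this training behavior is sketched below, assuming a PyTorch-style classifier head. Interpreting "a plurality of actions each receive a low score" as matching a uniform distribution over the action classes is an assumption of this sketch, not a statement of the disclosed implementation, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def training_loss(classifier, first_feat, second_feat, correct_label, num_classes):
    """Illustrative loss for the scheme described above.

    classifier    : maps features of shape (B, D) to logits of shape (B, C).
    first_feat    : features from the first (effective) section, shape (B, D).
    second_feat   : features from the second (ineffective) section, shape (B, D).
    correct_label : ground-truth action indices, shape (B,).
    """
    # Effective section: the correct action should receive a high score.
    logits_in = classifier(first_feat)
    loss_in = F.cross_entropy(logits_in, correct_label)

    # Ineffective section: all actions should receive uniformly low scores,
    # expressed here as matching a uniform distribution over the classes.
    logits_out = classifier(second_feat)
    uniform = torch.full_like(logits_out, 1.0 / num_classes)
    loss_out = F.kl_div(F.log_softmax(logits_out, dim=-1), uniform, reduction="batchmean")

    return loss_in + loss_out
```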
 The above-described program can be stored using various types of non-transitory computer readable media and supplied to a computer. Non-transitory computer readable media include various types of tangible storage media. Examples of non-transitory computer readable media include magnetic recording media (e.g., flexible disks, magnetic tapes, hard disk drives), magneto-optical recording media (e.g., magneto-optical disks), CD-ROM (Read Only Memory), CD-R, CD-R/W, and semiconductor memory (e.g., mask ROM, PROM (Programmable ROM), EPROM (Erasable PROM), flash ROM, RAM (Random Access Memory)). The program may also be supplied to the computer via various types of transitory computer readable media. Examples of transitory computer readable media include electrical signals, optical signals, and electromagnetic waves. A transitory computer readable medium can supply the program to the computer via a wired communication path such as an electric wire or an optical fiber, or via a wireless communication path.
 Although the present invention has been described above with reference to the above embodiments, the present invention is not limited to those embodiments. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention. In addition, at least one of the functions of the feature amount generation unit 121 and the learning unit 122 described above may be executed by an information processing device installed and connected at any location on a network, that is, by so-called cloud computing.
 <Additional Notes>
 Part or all of the above embodiments may also be described as in the following additional notes. The outline of the configurations of the information processing system, the information processing method, and the program according to the present invention is given below. However, the present invention is not limited to the following configurations.
(Additional note 1)
 An information processing system comprising:
 a feature amount generation unit that generates, from section video data, which is video data divided into predetermined time sections, a first feature amount based on video data of a first section that is a part of the section, and a second feature amount based on video data of a second section that is the section other than the first section; and
 a learning unit that, when generating a learning model that outputs, in response to the input of a feature amount based on video data, the action corresponding to that feature amount, learns the action corresponding to the first feature amount and the action corresponding to the second feature amount.
(Additional note 2)
 The information processing system according to additional note 1, wherein
 the learning unit learns so that the correct action set for the section video data corresponds to the first feature amount, and learns so that a plurality of actions correspond to the second feature amount.
(Additional note 3)
 The information processing system according to additional note 2, wherein
 the learning unit learns so that the plurality of actions correspond equally to the second feature amount.
(Additional note 4)
 The information processing system according to additional note 2 or 3, wherein,
 when generating a learning model that outputs, in response to the input of a feature amount based on video data, the degree of correspondence of each action to that feature amount, the learning unit learns so that the degree of correspondence of the correct action to the first feature amount becomes high and so that the degrees of correspondence of the plurality of actions to the second feature amount each become low.
(Additional note 5)
 The information processing system according to any one of additional notes 1 to 4, wherein
 the feature amount generation unit generates one first feature amount and one second feature amount, each having a component in the time direction.
(Additional note 6)
 The information processing system according to additional note 5, wherein
 the feature amount generation unit generates the first feature amount so that its size in the time direction becomes a preset size.
(Additional note 7)
 The information processing system according to additional note 5 or 6, wherein,
 when the second section consists of a plurality of sections separated in the time direction, the feature amount generation unit generates the second feature amount based on video data of the second section obtained by concatenating the plurality of sections into one.
(Additional note 8)
 The information processing system according to any one of additional notes 1 to 7, wherein
 the feature amount generation unit generates a feature amount of the section video data based on the section video data, and generates the first feature amount and the second feature amount based on that feature amount and on the first section and the second section.
(Additional note 9)
 The information processing system according to any one of additional notes 1 to 8, further comprising
 a section detection unit that detects, based on the section video data, the first section and the second section in the section video data.
(Additional note 10)
 The information processing system according to any one of additional notes 1 to 9, further comprising:
 a target feature amount generation unit that generates, based on target video data subject to action identification, a target feature amount that is a feature amount of the target video data; and
 an identification unit that identifies the action in the target video data based on the action output from the learning model when the target feature amount is input to the learning model.
(Additional note 11)
 The information processing system according to additional note 10, further comprising
 a target section detection unit that detects, based on the target video data, the first section in the target video data, wherein
 the target feature amount generation unit generates, based on the target feature amount and the first section of the target video data, a first target feature amount that is a feature amount corresponding to the first section, and
 the identification unit identifies the action in the target video data based on the action output from the learning model when the first target feature amount is input to the learning model.
(Additional note 12)
 An information processing method comprising:
 generating, from section video data, which is video data divided into predetermined time sections, a first feature amount based on video data of a first section that is a part of the section, and a second feature amount based on video data of a second section that is the section other than the first section; and
 when generating a learning model that outputs, in response to the input of a feature amount based on video data, the action corresponding to that feature amount, learning the action corresponding to the first feature amount and the action corresponding to the second feature amount.
(Additional note 13)
 The information processing method according to additional note 12, comprising
 learning so that the correct action set for the section video data corresponds to the first feature amount, and learning so that a plurality of actions correspond to the second feature amount.
(Additional note 14)
 The information processing method according to additional note 12 or 13, comprising:
 generating, based on target video data subject to action identification, a target feature amount that is a feature amount of the target video data; and
 identifying the action in the target video data based on the action output from the learning model when the target feature amount is input to the learning model.
(Additional note 15)
 A computer-readable storage medium storing a program for causing an information processing device to execute processing comprising:
 generating, from section video data, which is video data divided into predetermined time sections, a first feature amount based on video data of a first section that is a part of the section, and a second feature amount based on video data of a second section that is the section other than the first section; and
 when generating a learning model that outputs, in response to the input of a feature amount based on video data, the action corresponding to that feature amount, learning the action corresponding to the first feature amount and the action corresponding to the second feature amount.
1 Action recognition system
10 Learning device
11 Feature extraction unit
12 Action section detection unit
13 In-section feature extraction unit
14 Out-of-section feature extraction unit
15 Identification unit
16 Learning unit
20 Storage device
21 Learning data storage unit
22 Parameter storage unit
30 Estimation device
31 Feature extraction unit
32 Action section detection unit
33 In-section feature extraction unit
35 Identification unit
36 Output unit
100 Information processing system
101 CPU
102 ROM
103 RAM
104 Program group
105 Storage device
106 Drive device
107 Communication interface
108 Input/output interface
109 Bus
110 Storage medium
111 Communication network
121 Feature amount generation unit
122 Learning unit

Claims (15)

  1.  An information processing system comprising:
      a feature amount generation unit that generates, from section video data, which is video data divided into predetermined time sections, a first feature amount based on video data of a first section that is a part of the section, and a second feature amount based on video data of a second section that is the section other than the first section; and
      a learning unit that, when generating a learning model that outputs, in response to the input of a feature amount based on video data, the action corresponding to that feature amount, learns the action corresponding to the first feature amount and the action corresponding to the second feature amount.
  2.  The information processing system according to claim 1, wherein
      the learning unit learns so that the correct action set for the section video data corresponds to the first feature amount, and learns so that a plurality of actions correspond to the second feature amount.
  3.  The information processing system according to claim 2, wherein
      the learning unit learns so that the plurality of actions correspond equally to the second feature amount.
  4.  The information processing system according to claim 2 or 3, wherein,
      when generating a learning model that outputs, in response to the input of a feature amount based on video data, the degree of correspondence of each action to that feature amount, the learning unit learns so that the degree of correspondence of the correct action to the first feature amount becomes high and so that the degrees of correspondence of the plurality of actions to the second feature amount each become low.
  5.  The information processing system according to any one of claims 1 to 4, wherein
      the feature amount generation unit generates one first feature amount and one second feature amount, each having a component in the time direction.
  6.  The information processing system according to claim 5, wherein
      the feature amount generation unit generates the first feature amount so that its size in the time direction becomes a preset size.
  7.  The information processing system according to claim 5 or 6, wherein,
      when the second section consists of a plurality of sections separated in the time direction, the feature amount generation unit generates the second feature amount based on video data of the second section obtained by concatenating the plurality of sections into one.
  8.  The information processing system according to any one of claims 1 to 7, wherein
      the feature amount generation unit generates a feature amount of the section video data based on the section video data, and generates the first feature amount and the second feature amount based on that feature amount and on the first section and the second section.
  9.  The information processing system according to any one of claims 1 to 8, further comprising
      a section detection unit that detects, based on the section video data, the first section and the second section in the section video data.
  10.  The information processing system according to any one of claims 1 to 9, further comprising:
      a target feature amount generation unit that generates, based on target video data subject to action identification, a target feature amount that is a feature amount of the target video data; and
      an identification unit that identifies the action in the target video data based on the action output from the learning model when the target feature amount is input to the learning model.
  11.  The information processing system according to claim 10, further comprising
      a target section detection unit that detects, based on the target video data, the first section in the target video data, wherein
      the target feature amount generation unit generates, based on the target feature amount and the first section of the target video data, a first target feature amount that is a feature amount corresponding to the first section, and
      the identification unit identifies the action in the target video data based on the action output from the learning model when the first target feature amount is input to the learning model.
  12.  An information processing method comprising:
      generating, from section video data, which is video data divided into predetermined time sections, a first feature amount based on video data of a first section that is a part of the section, and a second feature amount based on video data of a second section that is the section other than the first section; and
      when generating a learning model that outputs, in response to the input of a feature amount based on video data, the action corresponding to that feature amount, learning the action corresponding to the first feature amount and the action corresponding to the second feature amount.
  13.  The information processing method according to claim 12, comprising
      learning so that the correct action set for the section video data corresponds to the first feature amount, and learning so that a plurality of actions correspond to the second feature amount.
  14.  The information processing method according to claim 12 or 13, comprising:
      generating, based on target video data subject to action identification, a target feature amount that is a feature amount of the target video data; and
      identifying the action in the target video data based on the action output from the learning model when the target feature amount is input to the learning model.
  15.  A computer-readable storage medium storing a program for causing an information processing device to execute processing comprising:
      generating, from section video data, which is video data divided into predetermined time sections, a first feature amount based on video data of a first section that is a part of the section, and a second feature amount based on video data of a second section that is the section other than the first section; and
      when generating a learning model that outputs, in response to the input of a feature amount based on video data, the action corresponding to that feature amount, learning the action corresponding to the first feature amount and the action corresponding to the second feature amount.
PCT/JP2022/016510 2022-03-31 2022-03-31 Information processing system WO2023188264A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/016510 WO2023188264A1 (en) 2022-03-31 2022-03-31 Information processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/016510 WO2023188264A1 (en) 2022-03-31 2022-03-31 Information processing system

Publications (1)

Publication Number Publication Date
WO2023188264A1 true WO2023188264A1 (en) 2023-10-05

Family

ID=88199855

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/016510 WO2023188264A1 (en) 2022-03-31 2022-03-31 Information processing system

Country Status (1)

Country Link
WO (1) WO2023188264A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011177300A (en) * 2010-02-26 2011-09-15 Empire Technology Development LLC Feature transformation apparatus and feature transformation method
JP2016158954A (en) * 2015-03-03 2016-09-05 富士通株式会社 State detection method, state detection device, and state detection program
JP2019040465A (en) * 2017-08-25 2019-03-14 トヨタ自動車株式会社 Behavior recognition device, learning device, and method and program
JP2019159819A (en) * 2018-03-13 2019-09-19 オムロン株式会社 Annotation method, annotation device, annotation program, and identification system
JP2020021421A (en) * 2018-08-03 2020-02-06 株式会社東芝 Data dividing device, data dividing method, and program


Similar Documents

Publication Publication Date Title
US11450146B2 (en) Gesture recognition method, apparatus, and device
JP4966820B2 (en) Congestion estimation apparatus and method
US9824296B2 (en) Event detection apparatus and event detection method
JP4369961B2 (en) Abnormality detection device and abnormality detection program
US9092662B2 (en) Pattern recognition method and pattern recognition apparatus
JP2016159407A (en) Robot control device and robot control method
EP2309454B1 (en) Apparatus and method for detecting motion
KR102217253B1 (en) Apparatus and method for analyzing behavior pattern
US20150262068A1 (en) Event detection apparatus and event detection method
JP2019522297A (en) Method and system for finding precursor subsequences in time series
JP2020507177A (en) System for identifying defined objects
KR101708491B1 (en) Method for recognizing object using pressure sensor
KR102129771B1 (en) Cctv management system apparatus that recognizes behavior of subject of shooting in video from video taken through cctv camera and operating method thereof
JP6910786B2 (en) Information processing equipment, information processing methods and programs
US9256945B2 (en) System for tracking a moving object, and a method and a non-transitory computer readable medium thereof
KR101979375B1 (en) Method of predicting object behavior of surveillance video
KR20230069892A (en) Method and apparatus for identifying object representing abnormal temperatures
US20200342215A1 (en) Model learning device, model learning method, and recording medium
US20200012866A1 (en) System and method of video content filtering
US20160202065A1 (en) Object linking method, object linking apparatus, and storage medium
JP2022003526A (en) Information processor, detection system, method for processing information, and program
WO2023188264A1 (en) Information processing system
JP2019194758A (en) Information processing device, information processing method, and program
US20200311401A1 (en) Analyzing apparatus, control method, and program
CN113661516A (en) Information processing apparatus, information processing method, and recording medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22935409

Country of ref document: EP

Kind code of ref document: A1