WO2023188264A1 - Information processing system - Google Patents

Information processing system

Info

Publication number
WO2023188264A1
WO2023188264A1 PCT/JP2022/016510 JP2022016510W WO2023188264A1 WO 2023188264 A1 WO2023188264 A1 WO 2023188264A1 JP 2022016510 W JP2022016510 W JP 2022016510W WO 2023188264 A1 WO2023188264 A1 WO 2023188264A1
Authority
WO
WIPO (PCT)
Prior art keywords
section
feature amount
video data
feature
information processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2022/016510
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
隆平 安藤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp
Priority to JP2024511022A (JP7754289B2)
Priority to PCT/JP2022/016510 (WO2023188264A1)
Publication of WO2023188264A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis

Definitions

  • the present invention relates to an information processing system, an information processing method, and a program.
  • Patent Document 1 describes a method for extracting feature amounts of the behavior of a moving object from a video consisting of spatiotemporal information and estimating the behavior. Specifically, Patent Document 1 describes that a motion section from the start to the end of an action in a video is estimated, and the behavior is estimated based on the feature amount of the video of this motion section.
  • Patent Document 1 has a problem in that erroneous recognition may occur if the estimation of the motion section is not accurate. For example, if the estimated motion section includes a transition from one behavior to another, data that the behavior estimation model does not expect is mixed in, which makes behavior estimation difficult and can cause erroneous recognition.
  • an object of the present invention is to provide an information processing system that can solve the above-mentioned problem that erroneous recognition may occur when recognizing an action from a video.
  • An information processing system that is one form of the present invention includes: a feature amount generation unit that generates, from section video data that is video data divided into predetermined time sections, a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is a part of the section other than the first section; and a learning unit that, when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, learns the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount.
  • an information processing method that is one form of the present invention includes: generating, from section video data that is video data divided into predetermined time sections, a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is a part of the section other than the first section; and, when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, learning the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount.
  • a program that is one form of the present invention causes an information processing device to execute processing of: generating, from section video data that is video data divided into predetermined time sections, a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is a part of the section other than the first section; and, when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, learning the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount.
  • the present invention can suppress erroneous recognition when recognizing an action from a video.
  • FIG. 1 is a block diagram showing the overall configuration of an action recognition system in Embodiment 1 of the present invention.
  • FIG. 2 is a block diagram showing the configuration of the learning device disclosed in FIG. 1.
  • FIG. 3 is a block diagram showing the configuration of the estimation device disclosed in FIG. 1.
  • FIG. 4 is a diagram showing a state of processing by the learning device disclosed in FIG. 1.
  • FIG. 5 is a diagram showing a state of processing by the learning device disclosed in FIG. 1.
  • FIG. 6 is a diagram showing a state of processing by the learning device disclosed in FIG. 1.
  • FIG. 7 is a flowchart showing the operation of the learning device disclosed in FIG. 1.
  • FIG. 8 is a flowchart showing the operation of the estimation device disclosed in FIG. 1.
  • FIG. 3 is a block diagram showing the hardware configuration of an information processing system in Embodiment 3 of the present invention.
  • FIG. 3 is a block diagram showing the configuration of an information processing system in Embodiment 3 of the present invention.
  • A further figure is a flowchart showing the operation of the information processing system in Embodiment 3 of the present invention.
  • A first embodiment of the present invention will be described with reference to FIGS. 1 to 8. FIGS. 1 to 3 are diagrams for explaining the configuration of the behavior recognition system, and FIGS. 4 to 8 are diagrams for explaining the processing operation of the behavior recognition system.
  • the behavior recognition system 1 of the present invention generates a behavior estimation model by machine learning in order to recognize the behavior of a person from video data, and uses the generated behavior estimation model to recognize the behavior of a person in new video data.
  • the behavior recognition system 1 can be used for safety management at a construction site, and can be used to recognize whether or not a worker has performed a safety confirmation behavior such as a pointing gesture.
  • the behavior recognition system 1 can use images captured by surveillance cameras installed at construction sites to record in a database when, where, and how many times workers perform safety confirmation tasks such as pointing and checking, and to send alerts to sites where safety confirmation work is not being performed.
  • the behavior recognition system 1 can also be used to manage man-hours at construction sites: by recording which work the worker in the video did, how many times, and for how long, it can be used to check whether the work is proceeding as expected.
  • actions recognized by the action recognition system 1 will be described using a person's pointing, walking, and crouching actions as examples; however, any action may be recognized, and the system may be used for behavior recognition not only of persons but of any target.
  • the behavior recognition system 1 includes a learning device 10, a storage device 20, and an estimation device 30, as shown in FIG.
  • the learning device 10 is a device that performs learning of a learning model used for estimating behavioral information from time-series data based on time-series data (also referred to as learning data) used for learning the learning model.
  • the storage device 20 is a device that can refer to and write data, and is a device that stores learning data, deep learning model parameters, and the like.
  • the estimation device 30 is a device that configures an output (estimation) unit by referring to the learned parameters stored in the storage device 20, and generates and outputs information regarding the behavior of the estimation target. Each device will be explained in detail below.
  • the storage device 20 is composed of one or more information processing devices including an arithmetic device and a storage device.
  • the storage device 20 includes a learning data storage section 21 and a parameter storage section 22, as shown in FIG.
  • the learning data storage unit 21 is a device that stores learning data for performing learning processing of the learning device 10.
  • the learning data, as shown by reference numeral D1 in FIG. 4, is video data consisting of a plurality of consecutive frames in time series. It is generated as time-series clips, that is, section video data divided into predetermined time intervals by cutting out a window with a predetermined width Sw.
  • as shown by reference numeral D2 in FIG. 4, frames are cut out while sliding the window at a sliding interval St, and time-series clips are sequentially generated and used as learning data.
  • this method is called a sliding window method.
  • inference data which is video data input to the estimation device 30 from an external device, also has a similar configuration.
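The sliding window method described above can be sketched as follows. This is a minimal illustration only: the function name `sliding_window_clips` and the choice to drop a trailing remainder shorter than the window are assumptions, not details taken from the specification. `window_width` plays the role of Sw and `stride` the role of St.

```python
def sliding_window_clips(frames, window_width, stride):
    """Cut a frame sequence into fixed-width time-series clips.

    window_width corresponds to Sw and stride to St in the text.
    Returns a list of (start_index, clip) pairs. A trailing remainder
    shorter than window_width is dropped (an assumption for this sketch).
    """
    if window_width <= 0 or stride <= 0:
        raise ValueError("window_width and stride must be positive")
    clips = []
    for start in range(0, len(frames) - window_width + 1, stride):
        clips.append((start, frames[start:start + window_width]))
    return clips


frames = list(range(10))  # stand-in for 10 consecutive video frames
clips = sliding_window_clips(frames, window_width=4, stride=2)
for start, clip in clips:
    print(start, clip)
```

With a stride smaller than the window width, consecutive clips overlap, so the same frame contributes to several clips.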
  • each piece of learning data is associated with correct answer information, which is the behavioral information (correct behavior) to be estimated for that learning data.
  • the correct answer information includes identification information of the action that is the correct answer.
  • for example, when the target learning data shows a person walking, the correct answer information associated with that learning data includes identification information indicating that the person is walking.
  • the parameter storage unit 22 is a device that stores parameters obtained by learning a learning model.
  • the learning model may be a learning model based on a neural network, another type of learning model such as a support vector machine, or a combination thereof.
  • the parameters include the layer structure, the neuron structure of each layer, the number and size of filters in each layer, and the weight of each element of each filter. Note that before learning is executed, initial values of parameters to be applied to the learning model are stored in the parameter storage unit 22, and the parameters are updated each time learning is performed by the learning device 10, as will be described later.
  • the learning device 10 is composed of one or more information processing devices including an arithmetic device and a storage device. As shown in FIG. 2, the learning device 10 includes a feature extraction section 11, an action section detection section 12, an in-section feature extraction section 13, an out-of-section feature extraction section 14, an identification section 15, and a learning section 16. The functions of the feature extraction unit 11, the action section detection unit 12, the intra-section feature extraction unit 13, the out-of-section feature extraction unit 14, the identification unit 15, and the learning unit 16 can be realized by the arithmetic device executing a program, stored in the storage device, for realizing each function. Each configuration will be explained in detail below.
  • the feature extraction unit 11 acquires the learning data of the window width as described above from the learning data storage unit 21, and converts the acquired learning data of the window width into the feature quantity F.
  • the feature amount F has a width in the time direction, an example of which is shown in FIG.
  • the feature amount F is, for example, three-dimensional data in the time direction, the direction of the skeleton points of the person, and the dimensional direction of the vector amount calculated at each time and position, as calculated by a neural network as described in Non-Patent Document 1.
  • alternatively, it may be two-dimensional data, in the time direction and the dimensional direction of the feature amount, obtained by collapsing the direction of the skeleton points by taking the maximum value or the average value.
  • the time direction may be for each frame, or may be compressed by convolution processing of a neural network.
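As a small illustration of collapsing the skeleton-point direction, the sketch below reduces a three-dimensional feature (time × skeleton point × vector dimension) to two dimensions (time × vector dimension). The nested-list layout and the name `collapse_skeleton_axis` are assumptions made for this example; the specification does not prescribe a data layout.

```python
def collapse_skeleton_axis(feature, mode="max"):
    """Collapse feature[t][j][c] (time, skeleton point, channel) to result[t][c].

    mode="max" keeps the largest value over the skeleton points,
    mode="mean" keeps their average.
    """
    if mode not in ("max", "mean"):
        raise ValueError("mode must be 'max' or 'mean'")
    collapsed = []
    for per_time in feature:
        row = []
        for channel_values in zip(*per_time):  # values of one channel over all joints
            if mode == "max":
                row.append(max(channel_values))
            else:
                row.append(sum(channel_values) / len(channel_values))
        collapsed.append(row)
    return collapsed


# 2 time steps, 2 skeleton points, 2 channels (illustrative values)
feature = [
    [[1.0, 4.0], [3.0, 2.0]],
    [[0.0, 0.0], [2.0, 6.0]],
]
print(collapse_skeleton_axis(feature, "max"))
print(collapse_skeleton_axis(feature, "mean"))
```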
  • the feature extraction unit 11 configures a feature extractor by applying the parameters stored in the parameter storage unit 22 to the learning model that is trained to output the feature amount F from the input learning data. Then, the feature extraction unit 11 supplies the feature amount F obtained by inputting the learning data to the feature extractor to the intra-interval feature extraction unit 13 and the out-of-interval feature extraction unit 14, respectively.
  • the behavior section detection section 12 acquires learning data from the learning data storage section 21, detects a section that is important in estimating behavior information (first section) from the acquired learning data, and outputs section information S.
  • the important interval in estimating behavioral information here is an interval that is useful as a criterion for judgment when estimating behavior, and is an interval for which the error becomes small when later compared with the correct answer information linked to the learning data.
  • the action section detection unit 12 outputs the action section corresponding to the correct information of the learning data as the section information S.
  • the action section detection section 12 configures an action section detector by applying the parameters stored in the parameter storage section 22 to the learning model trained to output section information S from the input learning data.
  • for example, the action section detector estimates the action section to be recognized with respect to the time direction of the feature amount F using a neural network that estimates the deformation parameter of the feature amount, as described in Non-Patent Document 2.
  • alternatively, an interval detector that directly learns interval detection by preparing correct values for the interval may be used.
  • for example, when the action section detection section 12 treats pointing behavior as the action to be recognized, the above-mentioned learning causes it to output the section in which that action is performed as the section information S; as a result, sections in which other actions are performed are detected as outside the action section (second section).
  • in the illustrated example, the frame section shown in gray is detected as the action section to be recognized, and the frame section indicated by reference numeral Da is detected as outside the action section.
  • the action section detection section 12 supplies the obtained section information S to the intra-section feature extraction section 13 and the out-of-section feature extraction section 14, respectively.
  • the intra-section feature extraction unit 13 (feature generation unit) and the out-of-section feature extraction unit 14 (feature generation unit) generate an in-section feature amount F1 and an out-of-section feature amount F2, respectively, from the feature amount F supplied from the feature extraction unit 11 and the section information S supplied from the action section detection unit 12.
  • the intra-section feature extraction unit 13 extracts the section corresponding to the section information S from the feature amount F as in Non-Patent Document 2, and performs resizing processing, warping processing, or the like so that the time direction always has a constant size regardless of the size of the section information S, thereby generating an intra-section feature amount F1 (first feature amount).
  • the intra-interval feature amount F1 may be three-dimensional data in the time direction, the direction of the skeletal position, the dimensional direction of the vector amount calculated at each time and each position, or the two-dimensional data in the time and dimensional direction. It may be data.
  • the intra-section feature extraction unit 13 supplies the generated intra-section feature amount F1 to the identification unit 15.
  • the out-of-section feature extraction unit 14 removes from the feature amount F the portion corresponding to the section indicated by the section information S.
  • it then generates an out-of-section feature amount F2 (second feature amount) from the feature amount of the section detected as being outside the action section.
  • when the portion outside the action section consists of multiple pieces, the out-of-section feature extraction unit 14 connects them in the time direction to generate a single out-of-section feature amount F2. Then, the out-of-interval feature extraction unit 14 supplies the generated out-of-interval feature amount F2 to the identification unit 15.
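The generation of F1 and F2 from F and S can be sketched as below. This is a simplified illustration under stated assumptions: the section information S is represented as a hypothetical `(start, end)` index pair with `end` exclusive, and the resizing is done by nearest-neighbour resampling, which merely stands in for the resizing or warping processing of Non-Patent Document 2. The function names are invented for this example.

```python
def resize_time(sequence, target_len):
    """Nearest-neighbour resampling along time to exactly target_len steps.

    This is a stand-in for the resizing/warping step; the actual method
    of Non-Patent Document 2 is not reproduced here.
    """
    if not sequence:
        raise ValueError("cannot resize an empty sequence")
    n = len(sequence)
    return [sequence[min(n - 1, (i * n) // target_len)] for i in range(target_len)]


def split_by_section(feature, start, end, target_len):
    """Split per-time features into in-section F1 and out-of-section F2.

    (start, end) stands in for the section information S, end exclusive.
    F1 is resized to a constant length; F2 joins the parts before and
    after the section in the time direction.
    """
    f1 = resize_time(feature[start:end], target_len)
    f2 = feature[:start] + feature[end:]
    return f1, f2


feature = [[float(t)] for t in range(8)]  # 8 time steps, 1 channel
f1, f2 = split_by_section(feature, start=2, end=6, target_len=2)
print(f1)
print(f2)
```

Because F1 always has `target_len` time steps, the downstream identifier sees a constant-size input however long the detected section is.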
  • in the above description, the feature amount F is generated from the learning data of the window width, and then the in-interval feature amount F1 and the out-of-interval feature amount F2 are generated from it; however, the method for generating the in-interval feature amount F1 and the out-of-interval feature amount F2 is not limited to the above method.
  • for example, the intra-interval feature extraction unit 13 may generate the intra-interval feature amount F1 directly from the learning data portion that corresponds to the interval information S, and the out-of-interval feature extraction unit 14 may generate the out-of-section feature amount F2 directly from the learning data portion that does not correspond to the interval information S.
  • the identification unit 15 generates information regarding the target behavior based on the intra-interval feature amount F1 supplied from the intra-interval feature extraction unit 13 and the out-of-interval feature amount F2 supplied from the out-of-interval feature extraction unit 14. At this time, the identification unit 15 configures an action information output device by applying the parameters stored in the parameter storage unit 22 to the learning model that is trained to output the in-section behavior information If1 and the out-of-section behavior information If2 from the input in-section feature amount F1 and out-of-section feature amount F2, respectively.
  • the in-section behavior information If1 and the out-of-section behavior information If2 are, for example, the identification information of the behavior corresponding to the target learning data or the score value of the estimated behavior, and have the same number of dimensions as the number of defined behavior categories. It is a vector.
  • when the intra-interval feature amount F1 and the out-of-interval feature amount F2 are three-dimensional data, the maximum value or average value is taken in the direction of the skeleton points to collapse them into two dimensions, identification processing is performed for each dimension in the time direction, and averaging processing is then performed in the time direction. Therefore, the output of the identification unit 15 is a vector quantity having the same number of dimensions as the number of behavior categories to be recognized.
  • the identification unit 15 supplies the learning unit 16 with the in-section behavior information If1 and the out-of-section behavior information If2 obtained by inputting the in-section feature amount F1 and the out-of-section feature amount F2 to the behavior information output device, respectively.
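The identification step (per-time-step identification followed by averaging in the time direction) might look like the following sketch. The single linear layer, its weights, and the name `classify_over_time` are assumptions for illustration; the specification only requires that the output be a vector with one entry per behavior category.

```python
def classify_over_time(feature_2d, weights, bias):
    """Score every time step with a linear layer, then average over time.

    feature_2d[t][c] is a feature already collapsed to (time, channel).
    weights[k][c] and bias[k] define one hypothetical linear score per
    behavior category k. Returns a vector with one score per category.
    """
    per_time_scores = []
    for x in feature_2d:
        scores = [sum(w * v for w, v in zip(row, x)) + b
                  for row, b in zip(weights, bias)]
        per_time_scores.append(scores)
    num_steps = len(per_time_scores)
    return [sum(step[k] for step in per_time_scores) / num_steps
            for k in range(len(bias))]


# 2 time steps, 2 channels, 3 behavior categories (illustrative values)
feature_2d = [[1.0, 0.0], [3.0, 0.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
bias = [0.0, 0.0, 0.0]
scores = classify_over_time(feature_2d, weights, bias)
print(scores)
```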
  • the learning unit 16 acquires correct answer information corresponding to the learning data input to the feature extraction unit 11 from the learning data storage unit 21. Then, the learning section 16 trains the feature extraction section 11, the action section detection section 12, and the identification section 15 based on the acquired correct answer information and on the within-section behavior information If1 and the out-of-section behavior information If2 supplied from the identification section 15. At this time, the learning unit 16 calculates a loss value L1 from the error between the action information indicated by the intra-section action information If1 and the correct answer information, and a loss value L2 from the out-of-section action information If2. A loss is calculated from these values, and each parameter of the feature extraction section 11, action section detection section 12, and identification section 15 is updated based on this loss.
  • the loss value L1 may be calculated using any loss function used in machine learning, such as softmax cross entropy error and mean square error.
  • the loss value L2 is calculated so that it becomes small when the values for all behavior categories are equal.
  • the loss value L2 may be the average of the negative logarithm of the average value of each category of the softmax value of the out-of-section behavior information If2, or a constraint may be imposed so that 0 is output for all behavior categories.
  • for example, when the correct action is pointing, the error L1 becomes smaller when the value of the dimension of the intra-section behavior information If1 vector corresponding to pointing becomes the largest, and the error L2 becomes smaller as the values of all dimensions of the out-of-section action information If2 vector become more uniform.
  • the learning unit 16 determines each parameter so as to minimize these losses.
  • for example, with the threshold value set to 0.7, for inference results exceeding 0.7 the threshold is subtracted and the squared value is added to the loss. For small results, similar processing is performed when the result falls below a threshold.
  • the algorithm for determining the parameters described above so as to minimize the loss may be any learning algorithm used in machine learning, such as gradient descent or error backpropagation.
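The two loss terms can be sketched as follows. This is one possible reading, not the definitive formulation: L1 is taken as softmax cross entropy, L2 as the mean over categories of the negative logarithm of the softmax of If2 (which is smallest when all categories are equal), and the total loss as a weighted sum. The weighted sum, the `out_weight` parameter, and all function names are assumptions; the text only states that a loss is calculated from L1 and L2.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def loss_l1(in_scores, correct_index):
    """Softmax cross entropy between If1 and the correct behavior class."""
    return -math.log(softmax(in_scores)[correct_index])


def loss_l2(out_scores):
    """Mean negative log of softmax(If2); smallest when all classes are equal."""
    probs = softmax(out_scores)
    return sum(-math.log(p) for p in probs) / len(probs)


def threshold_penalty(values, threshold=0.7):
    """Squared excess over the threshold, summed over values above it."""
    return sum((v - threshold) ** 2 for v in values if v > threshold)


def total_loss(in_scores, correct_index, out_scores, out_weight=1.0):
    """Weighted sum of L1 and L2 (the weighting is an assumption)."""
    return loss_l1(in_scores, correct_index) + out_weight * loss_l2(out_scores)


# Classes: 0 = pointing, 1 = walking, 2 = crouching (illustrative order).
uniform_outside = total_loss([4.0, 0.0, 0.0], 0, [0.0, 0.0, 0.0])
peaked_outside = total_loss([4.0, 0.0, 0.0], 0, [4.0, 0.0, 0.0])
print(uniform_outside < peaked_outside)
```

By the inequality of arithmetic and geometric means, the L2 term defined this way is bounded below by the logarithm of the number of categories and reaches that bound only for a uniform distribution, which matches the behavior described for the out-of-section scores.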
  • the learning section 16 stores the determined parameters of the feature extraction section 11, action section detection section 12, and identification section 15 in the parameter storage section 22.
  • the learning device 10 has the function of detecting an interval useful for behavior estimation in time-series data using the action interval detection unit 12.
  • the intra-interval feature extractor 13 and the extra-interval feature extractor 14 cut out the intra-interval feature F1 and the extra-interval feature F2 from the feature F output by the feature extractor 11.
  • when the intra-interval feature F1 is passed through the identification unit 15, learning proceeds so that the score value corresponding to the correct behavior class becomes the largest; when the out-of-interval feature F2 is passed through the identification unit 15, learning proceeds so that the score values of all behavior classes are uniform and no class stands out.
  • the learning device 10 handles not only information within the section but also information outside the section at the same time, so that if there is sufficient data, when data from a section with low importance is input during behavior estimation, It is possible to ensure that the behavior estimation model operates so that the score values of all behavior classes do not stand out significantly.
  • for example, suppose that the recognition targets are a person's pointing, walking, and crouching behaviors.
  • when the behavior candidate interval approaches an interval in which the behavior is transitioning, a conventional model, which does not take this interval into consideration, may output a result in which the score value of a certain behavior class stands out, so that false positives occur.
  • in contrast, erroneous detection can be suppressed by learning to detect intervals useful for behavior estimation, and by proceeding with learning so that all score values are uniform outside such intervals.
  • in the above description, the action section detecting section 12 performs section detection based on the learning data received from the learning data storage section 21, but it may instead perform section detection using the feature amount F output by the feature extracting section 11 as input.
  • the estimation device 30 is configured with one or more information processing devices including an arithmetic device and a storage device.
  • the estimation device 30 includes a feature extraction section 31, an identification section 35, and an output section 36, as shown in FIG.
  • the functions of the feature extraction section 31, the identification section 35, and the output section 36 can be realized by the arithmetic unit executing a program stored in the storage device for realizing each function.
  • the feature extraction unit 31 acquires time series data input from an external device, and converts the acquired time series data into a feature F (target feature).
  • the time-series data input from the external device is inference data to be targeted for behavior identification, and is video data (target video data) similar to the learning data described above. That is, as shown in FIG. 4, the inference data is video data consisting of a plurality of consecutive frames (image sequences) along the time series, and is a time series clip cut out with a window of a predetermined width Sw.
  • the inference data input to the estimation device 30 may be data obtained by further extracting information from the image sequence, such as skeletal information.
  • the external device that inputs the inference data may be a camera if the input is an image sequence, or it may be a device that stores a generated image sequence or information extracted from the image sequence if the input is such stored data.
  • the feature extraction unit 31 configures a feature extractor based on the parameters by referring to the parameters stored in the parameter storage unit 22 and obtained through the learning process by the learning device 10. Then, the feature extraction unit 31 supplies the feature amount F obtained by inputting the inference data to the feature extractor to the identification unit 35.
  • the identification unit 35 generates behavior information Ifa from the feature amount F supplied from the feature extraction unit 31.
  • the identification unit 35 performs estimation for each dimension of the feature amount F in the time direction.
  • the identification unit 35 configures the behavior information output device by referring to the parameters stored in the parameter storage unit 22. Then, the identification unit 35 supplies the behavior information Ifa obtained by inputting the feature amount F to the behavior information output device to the output unit 36.
  • the output unit 36 outputs identification information of the behavior to be extracted to an external device based on the behavior information Ifa. Since the behavior information Ifa is output for each dimension in the time direction, it is compressed in the time direction by averaging or summing the score values, and is output as a vector quantity whose number of dimensions equals the number of behavior classes to be recognized. This vector quantity becomes the identification information of the action at the time at the center of this window.
  • when the input time-series data is divided at fixed window intervals using the sliding window method, the score values of the behavior identification information obtained from the behavior information Ifa of each window are arranged in chronological order. The time at which a score value exceeds a certain threshold is set as the start point, the time at which it falls below a certain threshold is set as the end point, and the identification information of the action and its start and end points are output.
  • the estimation processing performed by the estimation device 30 described above does not perform the section detection that is performed in the learning processing by the learning device 10. This is because, as a result of the learning process, the behavior estimation model reacts and raises the score value only in the characteristic sections of a behavior. When the time-series data is scanned in a sliding window manner, the score values are arranged in time series, and the start and end points are determined by thresholding, there is therefore no need to perform section detection within the model.
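The start-point and end-point determination by thresholding can be sketched as below for the score sequence of a single behavior class. The function name, the use of two separate thresholds, and the handling of an action still in progress at the end of the data are assumptions for this example.

```python
def detect_action_intervals(scores, start_threshold, end_threshold):
    """Find (start, end) window indices from time-ordered class scores.

    A start point is the first index whose score exceeds start_threshold;
    the end point is the next index whose score falls below end_threshold.
    An action still active at the end of the data is closed at the last
    index (an assumption; the text does not cover this case).
    """
    intervals = []
    start = None
    for t, score in enumerate(scores):
        if start is None and score > start_threshold:
            start = t
        elif start is not None and score < end_threshold:
            intervals.append((start, t))
            start = None
    if start is not None:
        intervals.append((start, len(scores) - 1))
    return intervals


# One score per sliding-window position for, say, the pointing class.
scores = [0.1, 0.2, 0.8, 0.9, 0.6, 0.2, 0.1, 0.75, 0.8]
intervals = detect_action_intervals(scores, start_threshold=0.7, end_threshold=0.3)
print(intervals)
```

Using a lower end threshold than start threshold is a design choice of this sketch: it keeps a brief dip in the score (0.6 above) from splitting one action into two.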
  • the feature extraction unit 11 acquires learning data from the learning data storage unit 21 (step S1). At this time, the feature extraction unit 11 acquires learning data that has not yet been used for learning (that is, not acquired in step S1) from among the learning data stored in the learning data storage unit 21. Then, the feature extraction unit 11 generates a feature amount F from the learning data acquired in step S1 by configuring a feature extractor with reference to the parameters stored in the parameter storage unit 22 (step S2).
  • the action section detection section 12 generates section information S from the learning data by configuring an action section detector with reference to the parameters stored in the parameter storage section 22 (step S3).
  • the in-section feature extraction section 13 and the out-of-section feature extraction section 14 generate an in-section feature amount F1 and an outside-section feature amount F2, respectively (step S4).
  • the identification unit 15 refers to the parameters stored in the parameter storage unit 22 and configures a behavior information output device, and thereby generates in-section behavior information If1 from the in-section feature amount F1 generated by the in-section feature extraction unit 13 and out-of-section behavior information If2 from the out-of-interval feature amount F2 generated by the out-of-interval feature extraction unit 14 (step S5).
  • the learning unit 16 calculates the loss based on the intra-section behavior information If1 and the out-of-section behavior information If2 generated by the identification unit 15, and on the correct answer information stored in the learning data storage unit 21 in association with the target learning data (step S6). Further, the learning unit 16 updates the parameters used by each of the feature extraction unit 11, action segment detection unit 12, and identification unit 15 based on the loss calculated in step S6 (step S7). At this time, the learning section 16 stores the respective parameters used by the feature extraction section 11, the action section detection section 12, and the identification section 15 in the parameter storage section 22.
  • the learning device 10 determines whether the learning end condition is satisfied (step S8).
  • the learning device 10 may determine the end condition for learning by determining whether or not a predetermined number of loops has been reached, or by determining whether or not a predetermined number of loops have been reached, or by determining the end condition for learning based on a preset number of learning data. This may be done by determining whether learning has been performed, or by determining whether the loss is below a preset threshold, or whether the change in loss is below a preset threshold. This may be done by determining whether or not it has been completed.
  • The determination in step S8 may be a combination of the above-mentioned examples, or may use any other determination method.
  • If the learning end condition is satisfied (Yes in step S8), the learning device 10 ends the flowchart. On the other hand, if the learning end condition is not satisfied (No in step S8), the learning device 10 returns the process to step S1. At this time, the learning device 10 retrieves unused learning data from the learning data storage unit 21 in step S1, and performs the processing from step S2 onwards.
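  • The end-condition check of step S8 could be sketched as follows. This is only an illustrative sketch: the function name `should_stop`, its parameters, and the default values are assumptions and do not appear in the description above.

```python
def should_stop(iteration, samples_seen, losses,
                max_iterations=1000, max_samples=100000,
                loss_threshold=1e-3, delta_threshold=1e-5):
    """Return True if any of the example end conditions of step S8 holds.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    # A predetermined number of loops has been reached.
    if iteration >= max_iterations:
        return True
    # Learning has been performed on a preset number of learning data.
    if samples_seen >= max_samples:
        return True
    # The latest loss is below a preset threshold.
    if losses and losses[-1] < loss_threshold:
        return True
    # The change in loss between the last two loops is below a preset threshold.
    if len(losses) >= 2 and abs(losses[-1] - losses[-2]) < delta_threshold:
        return True
    return False
```

  • Here the conditions are combined with a logical OR, which is one possible combination of the examples given for step S8.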
  • the learning device 10 learns the learning model used for behavior estimation from the learning data, and records the parameters of the learned learning model in the storage device 20.
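  • As a rough illustration of the loss calculated in step S6, the sketch below combines a cross-entropy term for the intra-section behavior information If1 with a term that pulls the out-of-section behavior information If2 toward uniform scores. The description does not specify the form of the loss, so the softmax, the uniform target, the `weight` parameter, and all function names are assumptions.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def in_section_loss(if1_scores, correct_class):
    """Cross-entropy against the correct answer: small when the correct class scores high."""
    return -math.log(softmax(if1_scores)[correct_class])


def out_of_section_loss(if2_scores):
    """Cross-entropy against a uniform target: smallest when all classes score equally."""
    probs = softmax(if2_scores)
    return -sum(math.log(p) for p in probs) / len(probs)


def total_loss(if1_scores, if2_scores, correct_class, weight=1.0):
    """Hypothetical combined loss over If1 and If2 for one learning sample."""
    return in_section_loss(if1_scores, correct_class) + weight * out_of_section_loss(if2_scores)
```

  • With this choice, the out-of-section term reaches its minimum, log K for K behavior classes, exactly when no single behavior stands out in the ineffective section.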
  • the estimating device 30 repeatedly executes the process shown in the flowchart shown in FIG. 8 every time input data is input to the estimating device 30.
  • the input data is obtained by scanning video data, which is time-series data, in a sliding window manner.
  • the feature extraction unit 31 acquires input data supplied from an external device (step S11). Then, the feature extraction unit 31 generates the feature amount F from the input data acquired in step S11 by configuring a feature extractor with reference to the parameters stored in the parameter storage unit 22 (step S12). Next, the identification unit 35 generates behavior information Ifa from the feature amount F by configuring a behavior information output device with reference to the parameters stored in the parameter storage unit 22 (step S13). Then, the output unit 36 outputs the identification information of the action and its start and end points to the external device based on the action information Ifa generated by the identification unit 35 (step S14).
  • the estimation device 30 refers to the stored learned parameters, constructs an inference model, uses this model to infer behavior for the video data to be inferred, and outputs the inference result.
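  • Scanning time-series video data in a sliding window manner, as used to obtain the input data above, could look like the following minimal sketch. The generator name `sliding_windows` and the `window_size` and `stride` parameters are assumptions introduced for illustration.

```python
def sliding_windows(frames, window_size, stride):
    """Yield (start_index, window) pairs that cover the sequence in time order.

    frames: any sliceable time-ordered sequence (for example, a list of frames).
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    # The last window starts where a full window still fits in the sequence.
    for start in range(0, len(frames) - window_size + 1, stride):
        yield start, frames[start:start + window_size]
```

  • Each yielded window would then be passed through steps S11 to S14 as one item of input data.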
  • the action recognition system in this embodiment simultaneously learns action section detection in video data when learning the action recognition model, and distinguishes between effective and ineffective sections for action recognition in video data.
  • In effective sections, behavior recognition is performed with a high score.
  • In ineffective sections, behavior recognition is performed with a low score. Therefore, even when data that is difficult to judge, such as a change in behavior, is input within a video section that is a candidate for behavior recognition, behavior information is output with low reliability for such data, which suppresses false detections. This allows behavior to be recognized more accurately.
  • FIG. 9 is a diagram for explaining the configuration of the estimation device.
  • FIG. 10 is a diagram for explaining the operation of the estimation device.
  • The behavior recognition system 1 in this embodiment differs from the above-described first embodiment in the configuration of the estimation device 30.
  • configurations that are different from Embodiment 1 will be mainly described.
  • the estimation device 30 in this embodiment includes a feature extraction section 31, an action section detection section 32, an intra-section feature extraction section 33, an identification section 35, and an output section 36.
  • Each function of the feature extraction section 31, action section detection section 32, intra-section feature extraction section 33, identification section 35, and output section 36 can be realized by the arithmetic unit executing a program, stored in the storage device, for realizing that function.
  • The action section detection section 32 (target section detection section) and the intra-section feature extraction section 33 (target feature amount generation section) in this embodiment are configurations added to the estimation device 30 of the first embodiment, and have the same functions as the action section detection section 12 and the intra-section feature extraction section 13 included in the learning device 10 of the first embodiment. That is, the action section detection section 32 generates section information S for the inference data in the same manner as described above, and the intra-section feature extraction section 33 generates the in-section feature amount F1 from the feature amount F generated from the inference data and the section information S.
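  • The way a feature amount F is divided using section information S could be sketched as follows. Representing S as a per-time-step 0/1 mask, and F as a list of per-time-step feature vectors, are assumptions made for illustration; the function name `split_by_section` is likewise hypothetical.

```python
def split_by_section(features, section_mask):
    """Split per-time-step features into in-section (F1) and out-of-section (F2) parts.

    features: list of feature vectors, one per time step.
    section_mask: list of 0/1 flags of the same length (1 = inside the action section).
    """
    if len(features) != len(section_mask):
        raise ValueError("features and section_mask must have the same length")
    f1 = [f for f, inside in zip(features, section_mask) if inside]
    # Out-of-section parts before and after the action are joined in time order.
    f2 = [f for f, inside in zip(features, section_mask) if not inside]
    return f1, f2
```

  • In the estimation device 30 of this embodiment only the first return value would be used, since the identification section 35 operates on the in-section feature amount F1.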
  • the identification unit 35 in this embodiment generates intra-section behavior information Ifa based on the above-mentioned intra-section feature amount F1, and the output unit 36 generates identification information of the behavior to be extracted based on the intra-section behavior information Ifa. output to an external device.
  • Since the behavior section is detected in the inference data, the start and end of the behavior can be detected without scanning the input time-series data in a sliding window manner, and the output of the identification section 35 can be used to determine the identification information of the action in that section. For example, of the intra-section behavior information Ifa, the behavior class corresponding to the dimension with the largest score value is output.
  • The action section detection unit 32 detects the action section and confirms whether the target action is included in the data. If the length of the section exceeds the threshold, identification information of the behavior to be extracted is output based on the intra-section behavior information Ifa. Since the intra-section behavior information Ifa is output for each time step within the window, it is compressed in the time direction by averaging or summing the score values, and a vector quantity whose number of dimensions equals the number of behavior classes to be recognized is output. This vector quantity becomes the identification information of the action at the time at the center of the window.
  • If the length of the section does not exceed the threshold, a vector of a constant value with as many dimensions as the number of behavior classes to be recognized (for example, all set to 0) is prepared and used as the behavior identification information.
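  • The per-window compression described above could be sketched as follows, here using averaging. The function names, the argument layout, and the use of a strict "exceeds" comparison for the section length are assumptions for illustration.

```python
def window_identification(ifa, section_length, min_length, num_classes):
    """Compress per-time-step scores into one score vector for the window.

    ifa: list of per-time-step score vectors (one score per behavior class).
    Returns a constant zero vector when the detected section is too short.
    """
    if section_length <= min_length or not ifa:
        return [0.0] * num_classes
    # Average each class score over the time direction.
    return [sum(step[c] for step in ifa) / len(ifa) for c in range(num_classes)]


def best_class(vector):
    """Index of the behavior class with the largest score value."""
    return max(range(len(vector)), key=lambda c: vector[c])
```

  • Summing instead of averaging only changes the scale of the vector, so `best_class` would return the same index in either case.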
  • Behavior identification information is obtained from the behavior information Ifa in each window. For the score values arranged in chronological order, the time at which a score exceeds a certain threshold is set as the start point, the time at which it falls below a certain threshold is set as the end point, and the identification information of the action and its start and end points are output.
  • The action section detection section 32 performs section detection based on input data input from an external device, but it may instead receive the feature amount F output from the feature extraction section 31 and perform section detection based on it.
  • The estimating device 30 repeatedly executes the process of the flowchart shown in FIG. 10 every time video data to be inferred is input to the estimating device 30 as input data.
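  • Determining start and end points from chronologically ordered score values could be sketched as below for a single behavior class. The function name and the convention of closing a still-open interval at the last time step are assumptions; the description only states that thresholds are used.

```python
def detect_intervals(scores, start_threshold, end_threshold):
    """Return (start, end) index pairs for one behavior class.

    scores: score values of the class in chronological order, one per window.
    An interval starts when the score exceeds start_threshold and ends
    when it falls below end_threshold.
    """
    intervals = []
    start = None
    for t, score in enumerate(scores):
        if start is None and score > start_threshold:
            start = t
        elif start is not None and score < end_threshold:
            intervals.append((start, t))
            start = None
    # Assumption: an interval still open at the end is closed at the last index.
    if start is not None:
        intervals.append((start, len(scores) - 1))
    return intervals
```

  • Using a lower `end_threshold` than `start_threshold` gives hysteresis, so small score fluctuations do not split one action into several intervals.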
  • the input data may be time series data input as is, or may be input data scanned in a sliding window manner.
  • the feature extraction unit 31 acquires input data supplied from an external device (step S21). Then, the feature extraction unit 31 generates the feature amount F from the input data acquired in step S21 by configuring a feature extractor with reference to the parameters stored in the parameter storage unit 22 (step S22).
  • The action section detection section 32 generates section information S from the input data by configuring an action section detector with reference to the parameters stored in the parameter storage section 22 (step S23).
  • the intra-section feature extraction unit 33 generates the intra-section feature amount F1 from the feature amount F and the section information S (step S24).
  • The identification unit 35 configures a behavior information output device with reference to the parameters stored in the parameter storage unit 22, and generates behavior information Ifa from the intra-section feature amount F1 generated by the intra-section feature extraction unit 33 (step S25). Then, the output unit 36 outputs the identification information, the start point, and the end point of the action to the external device based on the action information Ifa output by the identification unit 35 (step S26).
  • Since the estimation device 30 includes the action section detection unit 32, when the input data is the entire time-series data, section detection can be performed without performing threshold processing.
  • When the input time-series data is scanned in a sliding window manner, whether the target action is included within the window can be determined based on the length of the section. At this time, even if the detection of a section is incorrect and data such as a change in behavior is mixed into the section, false detection is unlikely to occur because the scores for such data are trained to be uniform during learning.
  • FIGS. 11 and 12 are block diagrams showing the configuration of an information processing system according to the third embodiment, and FIG. 13 is a flowchart showing the operation of the information processing system. Note that this embodiment shows an outline of the configuration of the information processing system and the information processing method described in the above embodiments.
  • the information processing system 100 is configured with a general information processing device, and is equipped with the following hardware configuration as an example.
  • CPU (Central Processing Unit) 101
  • ROM (Read Only Memory) 102
  • RAM (Random Access Memory) 103
  • Program group 104 loaded into the RAM 103
  • Storage device 105 that stores the program group 104
  • Drive device 106 that reads from and writes to a storage medium 110 external to the information processing device
  • Communication interface 107 that connects to a communication network 111 outside the information processing device
  • Input/output interface 108 that inputs and outputs data
  • Bus 109 connecting each component
  • The information processing system 100 can construct and be equipped with the feature amount generation unit 121 and the learning unit 122 shown in FIG. 12 by having the CPU 101 acquire and execute the program group 104.
  • the program group 104 is stored in advance in the storage device 105 or ROM 102, for example, and is loaded into the RAM 103 and executed by the CPU 101 as needed.
  • the program group 104 may be supplied to the CPU 101 via the communication network 111, or may be stored in the storage medium 110 in advance, and the drive device 106 may read the program and supply it to the CPU 101.
  • the above-mentioned feature amount generation section 121 and learning section 122 may be constructed of a dedicated electronic circuit for realizing such means.
  • FIG. 11 shows an example of the hardware configuration of an information processing device that is the information processing system 100, and the hardware configuration of the information processing device is not limited to the above-described case.
  • the information processing device may be configured from part of the configuration described above, such as not having the drive device 106.
  • the information processing system 100 executes the information processing method shown in the flowchart of FIG. 13 by the functions of the feature value generation section 121 and the learning section 122 constructed by the program as described above.
  • The information processing system 100 executes the following process: among section video data that is video data divided into predetermined time sections, it generates a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and a second feature amount that is a feature amount based on video data of a second section that is another part of the section (step S101); then, when generating a learning model that recognizes the behavior corresponding to a feature amount in response to the input of a feature amount based on video data, it learns the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount (step S102).
  • By being configured as described above, the present invention generates feature amounts for sections that are effective for action recognition in video data and feature amounts for sections that are not effective, and learns both the behavior corresponding to the feature amounts of the effective sections and the behavior corresponding to the feature amounts of the ineffective sections. At this time, for example, learning is performed so that the correct action has a high score in an effective section and so that a plurality of actions have low scores in an ineffective section. As a result, even when difficult-to-determine data such as a change in behavior is input within a video section that is a candidate for behavior recognition, behavior information is output with low reliability for such data, which suppresses false detections. This allows actions to be recognized more accurately.
  • Non-transitory computer-readable media include various types of tangible storage media.
  • Examples of non-transitory computer-readable media include magnetic recording media (e.g., flexible disks, magnetic tapes, hard disk drives), magneto-optical recording media (e.g., magneto-optical disks), CD-ROMs (Read Only Memory), CD-Rs, CD-R/W, semiconductor memory (eg, mask ROM, PROM (Programmable ROM), EPROM (Erasable PROM), flash ROM, RAM (Random Access Memory)).
  • the program may also be supplied to the computer via various types of transitory computer readable media. Examples of transitory computer-readable media include electrical signals, optical signals, and electromagnetic waves.
  • The transitory computer-readable medium can provide the program to the computer via wired communication channels, such as electrical wires and optical fibers, or wireless communication channels.
  • Although the present invention has been described above with reference to the above-described embodiments, the present invention is not limited to the above-described embodiments.
  • the configuration and details of the present invention can be modified in various ways within the scope of the present invention by those skilled in the art.
  • At least one of the functions of the feature amount generation unit 121 and the learning unit 122 described above may be executed by an information processing device installed at and connected to any location on the network; that is, it may be performed by so-called cloud computing.
  • the feature quantity generation unit generates the first feature quantity so that the size in the time direction becomes a preset size.
  • Information processing system. (Appendix 7) The information processing system according to appendix 5 or 6, wherein, when the second section consists of a plurality of sections divided in the time direction, the feature amount generation unit generates the second feature amount based on video data of the second section obtained by combining the plurality of sections.
  • the information processing system according to any one of Supplementary Notes 1 to 7,
  • the feature amount generation unit generates a feature amount of the section video data based on the section video data, and generates the first feature amount and the second feature amount based on that feature amount and on the first section and the second section;
  • Information processing system (Appendix 9) The information processing system according to any one of Supplementary Notes 1 to 8, comprising a section detection unit that detects the first section and the second section of the section video data based on the section video data; Information processing system.
  • a target feature amount generation unit that generates a target feature amount that is a feature amount of the target video data based on the target video data that is the target of behavior identification; an identification unit that identifies the behavior of the target video data based on the behavior output from the learning model by inputting the target feature amount to the learning model;
  • An information processing system equipped with (Appendix 11) The information processing system according to appendix 10, comprising a target section detection unit that detects the first section of the target video data based on the target video data, The target feature generation unit generates a first target feature that is a feature corresponding to the first section based on the target feature and the first section of the target video data, The identification unit identifies the behavior of the target video data based on the behavior output from the learning model by inputting the first target feature amount to the learning model.
  • a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and
  • a second feature amount that is a feature amount based on video data of a second section that is another part of the section, are generated; and when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount are learned.
  • a first feature amount that is a feature amount based on video data of a first section that is a part of the section, and
  • a second feature amount that is a feature amount based on video data of a second section that is another part of the section, are generated; and when generating a learning model that outputs a behavior corresponding to a feature amount in response to input of a feature amount based on video data, the behavior corresponding to the first feature amount and the behavior corresponding to the second feature amount are learned.
  • A computer-readable storage medium that stores a program for executing the above processing.
  • 1 Behavior recognition system; 10 Learning device; 11 Feature extraction unit; 12 Action section detection unit; 13 In-section feature extraction unit; 14 Out-of-section feature extraction unit; 15 Identification unit; 16 Learning unit; 20 Storage device; 21 Learning data storage unit; 22 Parameter storage unit; 30 Estimation device; 31 Feature extraction unit; 32 Action section detection unit; 33 Intra-section feature extraction unit; 35 Identification unit; 36 Output unit; 100 Information processing system; 101 CPU; 102 ROM; 103 RAM; 104 Program group; 105 Storage device; 106 Drive device; 107 Communication interface; 108 Input/output interface; 109 Bus; 110 Storage medium; 111 Communication network; 121 Feature amount generation unit; 122 Learning unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
PCT/JP2022/016510 2022-03-31 2022-03-31 情報処理システム Ceased WO2023188264A1 (ja)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2024511022A JP7754289B2 (ja) 2022-03-31 2022-03-31 情報処理システム
PCT/JP2022/016510 WO2023188264A1 (ja) 2022-03-31 2022-03-31 情報処理システム

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/016510 WO2023188264A1 (ja) 2022-03-31 2022-03-31 情報処理システム

Publications (1)

Publication Number Publication Date
WO2023188264A1 true WO2023188264A1 (ja) 2023-10-05

Family

ID=88199855

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/016510 Ceased WO2023188264A1 (ja) 2022-03-31 2022-03-31 情報処理システム

Country Status (2)

Country Link
JP (1) JP7754289B2
WO (1) WO2023188264A1

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2025248770A1 (ja) * 2024-05-31 2025-12-04 富士通株式会社 行動区間検出プログラム及び方法
WO2025263147A1 (ja) * 2024-06-17 2025-12-26 パナソニックIpマネジメント株式会社 動画データ処理システム、動画データ処理方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011177300A (ja) * 2010-02-26 2011-09-15 Empire Technology Development LLC 特徴量変換装置、および特徴量変換方法
JP2016158954A (ja) * 2015-03-03 2016-09-05 富士通株式会社 状態検出方法、状態検出装置および状態検出プログラム
JP2019040465A (ja) * 2017-08-25 2019-03-14 トヨタ自動車株式会社 行動認識装置,学習装置,並びに方法およびプログラム
JP2019159819A (ja) * 2018-03-13 2019-09-19 オムロン株式会社 アノテーション方法、アノテーション装置、アノテーションプログラム及び識別システム
JP2020021421A (ja) * 2018-08-03 2020-02-06 株式会社東芝 データ分割装置、データ分割方法およびプログラム

Also Published As

Publication number Publication date
JP7754289B2 (ja) 2025-10-15
JPWO2023188264A1 2023-10-05

Similar Documents

Publication Publication Date Title
JP6494331B2 (ja) ロボット制御装置およびロボット制御方法
EP3509011B1 (en) Apparatuses and methods for recognizing a facial expression robust against change in facial expression
US9824296B2 (en) Event detection apparatus and event detection method
JP4966820B2 (ja) 混雑推定装置および方法
US9092662B2 (en) Pattern recognition method and pattern recognition apparatus
JP2019522297A (ja) 時系列内の前兆部分列を発見する方法及びシステム
KR102217253B1 (ko) 행동패턴 분석 장치 및 방법
US20150262068A1 (en) Event detection apparatus and event detection method
JP2020507177A (ja) 定義されたオブジェクトを識別するためのシステム
EP2309454B1 (en) Apparatus and method for detecting motion
US9256945B2 (en) System for tracking a moving object, and a method and a non-transitory computer readable medium thereof
JP2016085487A (ja) 情報処理装置、情報処理方法及びコンピュータプログラム
KR101708491B1 (ko) 압력 센서를 이용한 객체 인식 방법
KR101979375B1 (ko) 감시 영상의 객체 행동 예측 방법
WO2023188264A1 (ja) 情報処理システム
KR20230069892A (ko) 이상 온도를 나타내는 객체를 식별하는 방법 및 장치
US9752880B2 (en) Object linking method, object linking apparatus, and storage medium
US10929688B2 (en) System and method of video content filtering
KR102129771B1 (ko) Cctv 카메라를 통해 촬영된 영상으로부터 영상 내 촬영 대상자의 행위 인식을 수행하는 cctv 관리 시스템 장치 및 그 동작 방법
KR102290857B1 (ko) 채널상태정보를 이용한 인공지능 기반의 스마트 사용자 검출 방법 및 장치
WO2016038872A1 (ja) 情報処理装置、表示方法およびプログラム記憶媒体
JP2019194758A (ja) 情報処理装置、情報処理方法、およびプログラム
JP7414660B2 (ja) 異常行動検出システム及び異常行動検出方法
JP6564756B2 (ja) 流動状況計測装置、方法、及びプログラム
CN113989853A (zh) 文物保护区异常状态识别方法、装置、终端设备及介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22935409

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2024511022

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22935409

Country of ref document: EP

Kind code of ref document: A1