WO2024127554A1 - Information processing device, inference method, inference program, and method for generating a feature generation model


Info

Publication number
WO2024127554A1
Authority
WO
WIPO (PCT)
Application number
PCT/JP2022/046044
Inventor
あずさ 澤田
尚司 谷内田
Original Assignee
日本電気株式会社
Application filed by 日本電気株式会社

Description

  • One aspect of the present invention aims to realize an information processing device or the like that is capable of performing inference that takes into account context while keeping computational costs down.
  • An information processing device includes a feature generating means for generating, for each of a plurality of frame images extracted from a video of an object moving along a predetermined context, a feature corresponding to the context of the object appearing in the frame image using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image, and an inference means for making a predetermined inference regarding the object based on the feature.
  • An inference method includes at least one processor generating, for each of a plurality of frame images extracted from a video of an object moving along a predetermined context, features of the object appearing in the frame images according to the context using position information indicating the detected position of the object in the frame images and time information indicating the timing at which the frame images were captured, and performing a predetermined inference regarding the object based on the features.
  • An inference program causes a computer to function as a feature generating means for generating, for each of a plurality of frame images extracted from a video of an object moving along a predetermined context, a feature corresponding to the context of the object appearing in the frame image, using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image, and an inference means for making a predetermined inference regarding the object based on the feature.
  • A method for generating a feature generation model includes: at least one processor inputting, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, and performing a predetermined inference regarding the object based on the calculated features; and updating the feature generation model so that the result of the inference approaches predetermined ground truth data.
  • FIG. 1 is a block diagram showing the configuration of information processing devices according to a first exemplary embodiment of the present invention.
  • FIG. 2 is a flow diagram showing the flow of a method for generating a feature generation model and an inference method according to exemplary embodiment 1 of the present invention.
  • FIG. 3 is a diagram for explaining a method of inspection for checking for foreign matter.
  • FIG. 4 is a diagram for explaining an overview of an inference method according to exemplary embodiment 2 of the present invention.
  • FIG. 5 is a block diagram showing an example of the configuration of an information processing device according to exemplary embodiment 2 of the present invention.
  • FIG. 6 is a diagram showing an example in which a difference occurs between contexts during learning and inference.
  • FIG. 7 is a flowchart showing the flow of processing performed by an information processing device according to exemplary embodiment 2 of the present invention during learning.
  • FIG. 8 is a flowchart showing the flow of processing performed during inference by an information processing device according to exemplary embodiment 2 of the present invention.
  • FIG. 9 is a diagram showing an example of a computer that executes instructions of a program, which is software that realizes the functions of each device according to each exemplary embodiment of the present invention.
  • Exemplary embodiment 1: A first exemplary embodiment of the present invention will be described in detail with reference to the drawings. This exemplary embodiment is a basic form of the exemplary embodiments described below. First, information processing devices 1 and 2 according to this exemplary embodiment will be described with reference to Fig. 1. Fig. 1 is a block diagram showing the configuration of the information processing devices 1 and 2.
  • the information processing device 1 includes an inference unit 11 and a learning unit 12.
  • the inference unit 11 performs a predetermined inference regarding an object moving according to a predetermined context, based on features calculated by inputting time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context.
  • the learning unit 12 updates the feature generation model so that the result of the inference by the inference unit 11 approaches the predetermined correct answer data.
  • the information processing device 1 includes an inference unit 11 that performs a predetermined inference regarding the object based on features calculated by inputting, for each of a plurality of frame images extracted from a video capturing an object moving according to a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, and a learning unit 12 that updates the feature generation model so that the result of the inference by the inference unit 11 approaches predetermined correct answer data.
  • the above configuration makes it possible to generate a feature generation model that can generate features according to the context from time information and location information. This has the effect of making it possible to perform inference that takes into account the context while reducing computational costs compared to conventional techniques that generate context features from the entire video.
  • the information processing device 2 includes a feature generating unit 21 and an inference unit 22. For each of a plurality of frame images extracted from a moving image capturing an object moving along a predetermined context, the feature generating unit 21 generates a feature corresponding to the context of the object captured in the frame image by using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image.
  • the inference unit 22 performs a predetermined inference regarding the object based on the features generated by the feature generation unit 21.
  • the information processing device 2 includes a feature generation unit 21 that generates features according to the context of an object appearing in a plurality of frame images extracted from a video capturing an object moving along a predetermined context, using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image, and an inference unit 22 that performs a predetermined inference regarding the object based on the features generated by the feature generation unit 21.
  • An object that moves according to a specific context is affected differently at each point in time based on that context. Therefore, with the above configuration, it is possible to generate features according to the context of the object and perform inference based on those features.
  • the information processing device 2 has the advantage of being able to perform inference that takes context into account while keeping computational costs down compared to conventional techniques that generate context features from the entire video image.
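  • As an illustrative sketch of this two-stage structure (not taken from the publication; the names FeatureGenerating, Inferring and infer_about_object are hypothetical), the feature generating means and the inference means of the information processing device 2 can be pictured as follows:

```python
from typing import List, Protocol, Tuple

class FeatureGenerating(Protocol):
    def generate(self, time: float, position: Tuple[float, float]) -> List[float]:
        """Feature corresponding to the context, computed from time and position information only."""
        ...

class Inferring(Protocol):
    def infer(self, features: List[List[float]]) -> str:
        """Predetermined inference regarding the object, based on the per-frame features."""
        ...

def infer_about_object(frames: List[Tuple[float, Tuple[float, float]]],
                       feature_means: FeatureGenerating,
                       inference_means: Inferring) -> str:
    """frames: (capture time, detected position) for each frame image in which the object appears."""
    features = [feature_means.generate(t, pos) for t, pos in frames]  # one feature per frame image
    return inference_means.infer(features)  # e.g. "air bubble" or "foreign object"
```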
  • the above-mentioned functions of the information processing device 1 can also be realized by a program.
  • the learning program according to this exemplary embodiment causes a computer to function as an inference unit 11 that performs a predetermined inference regarding the object based on a feature calculated by inputting time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image into a feature generation model for generating a feature according to the context, for each of a plurality of frame images extracted from a video image capturing an object moving along a predetermined context, and a learning unit 12 that updates the feature generation model so that the result of the inference by the inference unit 11 approaches a predetermined correct answer data.
  • a feature generation model capable of generating a feature according to a context from time information and position information can be generated, and thus an effect is obtained in which it is possible to perform inference taking into account the context while suppressing calculation costs.
  • the inference program causes a computer to function as a feature amount generating unit 21 that generates a feature amount according to the context of an object appearing in a plurality of frame images extracted from a video in which an object moving along a predetermined context is captured, using time information indicating the timing at which the frame image was captured and position information indicating the detection position of the object in the frame image, and an inference unit 22 that performs a predetermined inference regarding the object based on the feature amount generated by the feature amount generating unit 21.
  • This inference program provides the effect of being able to perform inference taking into account the context while suppressing calculation costs.
  • Fig. 2 is a flow diagram showing the flow of the method for generating a feature generation model and of the inference method. Note that each step of the methods shown in Fig. 2 may be executed by a processor provided in the information processing device 1 or 2 or by a processor provided in another device, and the individual steps may be executed by processors provided in different devices.
  • the flow diagram shown on the left side of FIG. 2 illustrates a method for generating a feature generation model according to this exemplary embodiment.
  • In S11, at least one processor inputs, for each of a plurality of frame images extracted from a video of an object moving along a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, and performs a predetermined inference regarding the object based on the calculated features.
  • In S12, at least one processor updates the feature generation model so that the result of the inference in S11 approaches the predetermined ground truth data.
  • the method for generating a feature generation model includes at least one processor inputting, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating features according to the context, thereby performing a predetermined inference regarding the object based on the calculated features (S11), and updating the feature generation model so that the result of the inference in S11 approaches predetermined ground truth data (S12).
  • a feature generation model capable of generating features according to the context from the time information and position information, which has the effect of making it possible to perform inference taking into account the context while suppressing calculation costs.
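  • A minimal sketch of this generation method (S11 followed by S12), assuming a small PyTorch model for both the feature generation model and an inference head; the architecture, dimensions and optimizer below are illustrative assumptions, not the claimed configuration:

```python
import torch
from torch import nn, optim

# Hypothetical models: the feature generation model maps (time, x, y) to a feature vector,
# and an inference head maps the pooled feature to class scores (e.g. bubble vs. foreign object).
feature_model = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 8))
inference_head = nn.Linear(8, 2)
optimizer = optim.SGD(list(feature_model.parameters()) + list(inference_head.parameters()), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def generation_step(time_pos: torch.Tensor, label: torch.Tensor) -> float:
    """One S11+S12 iteration for a single trajectory.
    time_pos: (num_frames, 3) tensor of normalized time and detected position per frame image.
    label:    (1,) tensor holding the ground-truth class."""
    features = feature_model(time_pos)                            # S11: features from time/position
    logits = inference_head(features.mean(dim=0, keepdim=True))   # S11: predetermined inference
    loss = loss_fn(logits, label)                                 # distance to ground-truth data
    optimizer.zero_grad(); loss.backward(); optimizer.step()      # S12: update the feature generation model
    return loss.item()

# Example call with dummy data:
# generation_step(torch.rand(10, 3), torch.tensor([1]))
```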
  • the flow diagram shown on the right side of FIG. 2 illustrates an inference method according to this exemplary embodiment.
  • In S21, at least one processor generates, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, a feature quantity corresponding to the context of the object appearing in the frame image, using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image.
  • In S22, at least one processor performs a predetermined inference regarding the object based on the features generated in S21.
  • the inference method includes at least one processor generating, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, feature values of the object appearing in the frame images according to the above-mentioned context using position information indicating the detected position of the object in the frame images and time information indicating the timing at which the frame images were captured (S21), and making a predetermined inference about the object based on the feature values generated in S21 (S22).
  • This provides the effect of being able to make inferences that take into account the context while keeping computational costs down.
  • Exemplary embodiment 2: A second exemplary embodiment of the present invention will be described in detail with reference to the drawings.
  • an inference method according to this exemplary embodiment (hereinafter, referred to as this inference method) is used in an inspection to check whether a foreign object is present in a liquid (e.g., medicine, beverage, etc.) sealed in a transparent container (hereinafter, referred to as a foreign object confirmation inspection).
  • FIG. 3 is a diagram for explaining the method of foreign object inspection.
  • In the foreign object confirmation inspection, a container filled with a predetermined liquid, which is the item to be inspected, is fixed in a device (not shown in FIG. 3), and the device is used to rock the container.
  • a control sequence for rocking the container is determined in advance.
  • the control sequence in the example of FIG. 3 is such that the container is rotated in a vertical plane for a predetermined time, then stopped for a predetermined time, and then rotated in a horizontal plane for a predetermined time. This control sequence may be repeated multiple times.
  • the container is rocked using the above-mentioned control while a moving image of the liquid inside the container is captured.
  • frame images are extracted from the captured moving image, and object detection is performed on each frame image.
  • Then, for each object detected by this object detection, it is determined whether the object is an air bubble or a foreign body; if no object is determined to be a foreign body, the inspected item is determined to be a good product, and if even one object is determined to be a foreign body, it is determined to be a defective product.
  • the device rocks the container according to a specific control pattern, so that the object inside the container (an air bubble or a foreign body) moves along a specific context based on this pattern.
  • the flow of liquid inside the container accelerates for a while after the container starts to rotate, so the object also accelerates along this flow.
  • the object moves in the direction of the container's rotation.
  • the flow rate of the liquid gradually slows down and stabilizes in a steady state.
  • the speed and direction of the object's movement during this time also follow the flow rate and direction of the liquid. The same applies to subsequent controls.
  • the inference in this inference method is to determine whether the object is an air bubble or a foreign object.
  • this inference method makes it possible to perform inference that takes context into account while keeping computational costs down compared to conventional techniques that generate context features from the entire video image.
  • Fig. 4 is a diagram for explaining an overview of this inference method.
  • Prior to execution of this inference method, frame images are first extracted from a video image capturing an object.
  • In the example of Fig. 4, n frame images FR1 to FRn are extracted.
  • In the frame image FR1, a container filled with liquid is captured, as well as objects OB1 and OB2. The same is true for the other frame images.
  • the objects are air bubbles and foreign objects. Since both are small in size and have similar appearances, it is difficult to accurately identify whether a detected object is an air bubble or a foreign object based on only one frame image.
  • trajectory data is generated that indicates the trajectory of the movement of the objects.
  • Fig. 4 shows schematic diagrams of trajectory data A1 of the object OB1 and trajectory data A2 of the object OB2.
  • the trajectory data A1 includes time information indicating the timing at which the frame image in which the object OB1 was detected was shot, and position information indicating the detected position of the object OB1 in the frame image. For example, assume that the object OB1 was detected in each of the frame images FR1 to FR10. In this case, the trajectory data A1 includes time information indicating the shooting timing of each of the frame images FR1 to FR10, and position information indicating the detected position of the object OB1 in the frame images FR1 to FR10.
  • the trajectory data A1 may also include information indicating the characteristics of the detected object OB1.
  • the trajectory data A1 may include an image patch that is an image cut out of an area in the frame image that includes the object OB1, or feature amounts extracted from the image patch, information indicating the size of the detected object, and information indicating the moving speed of the detected object.
  • normalized position information may be used to avoid being affected by differences in container size and liquid volume.
  • normalized position information can be generated by applying at least one of translation, rotation, and scale transformation to position information indicating the position of the object on the frame image.
  • the time information may also be normalized.
  • the time information of the frame image FR1 may be set to 0, the time information of the frame image FRn captured at the timing when the series of control sequences ends may be set to 1, and the time information of the frame images FR2 to FRn-1 may be determined based on these values.
  • trajectory data A2 which includes time information indicating the timing at which the frame image in which the object OB2 was detected was captured, and position information indicating the detected position of the object OB2 in the frame image.
  • the trajectory data A2 may also include information indicating the characteristics of the detected object OB2.
  • The trajectory data A1 and A2 may differ from each other. Also, only the trajectory data A1 and A2 for the two objects OB1 and OB2 are shown here, but a larger number of objects may be detected in an actual foreign object confirmation inspection.
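  • For illustration, the trajectory data described above can be represented roughly as follows; the class and function names are hypothetical, and the normalization conventions (time scaled to [0, 1] over the control sequence, position translated and rescaled) follow the description above:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class TrackPoint:
    time: float                            # timing at which the frame image was captured
    position: Tuple[float, float]          # detected position of the object in the frame image
    appearance: Optional[list] = None      # optional: image patch features, size, moving speed

@dataclass
class TrajectoryData:
    points: List[TrackPoint] = field(default_factory=list)

def normalize_time(traj: TrajectoryData, t_start: float, t_end: float) -> None:
    """Scale capture timings so that the first frame maps to 0 and the end of the control sequence to 1."""
    for p in traj.points:
        p.time = (p.time - t_start) / (t_end - t_start)

def normalize_position(traj: TrajectoryData, center: Tuple[float, float], scale: float) -> None:
    """Translate by the container center and rescale, to absorb differences in container size and liquid volume."""
    cx, cy = center
    for p in traj.points:
        x, y = p.position
        p.position = ((x - cx) / scale, (y - cy) / scale)
```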
  • the time information and position information contained in each of the trajectory data A1 and A2 are input into a feature generation model to generate feature quantities B1 and B2 according to the context.
  • These feature quantities are generated for each time. For example, a feature quantity at time t1 is generated from the time information and position information at that time t1.
  • a previously generated judgment function may be used to judge whether the detected position in the frame image is inside or outside the liquid region, and a value indicating the judgment result may be included in the feature quantities B1 and B2. This makes it possible to eliminate the influence of objects detected outside the liquid region and obtain valid inference results.
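  • A toy example of such a judgment (the threshold and coordinate convention are assumptions; the actual judgment function would be generated in advance from the container geometry):

```python
def liquid_region_flag(y: float, liquid_surface_y: float = 0.2) -> float:
    """Return 1.0 if the normalized detected position lies inside the liquid region, else 0.0.
    Image coordinates are assumed to grow downward, so 'inside' means y at or below the surface line."""
    return 1.0 if y >= liquid_surface_y else 0.0
```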
  • the feature generation model makes it possible to generate features according to the context using time and position information, which are significantly smaller in data size than video images.
  • the trajectory data and features generated as described above are integrated to generate integrated data, and the integrated data is input to the inference model to output the inference result.
  • the inference result indicates whether each of the objects OB1 and OB2 is an air bubble or a foreign body.
  • the objects OB1 and OB2 detected in the above-mentioned foreign body confirmation inspection are both small in size and similar in appearance. For this reason, it is difficult to accurately determine whether a detected object is an air bubble or a foreign body.
  • this inference method uses integrated data that reflects features generated using a feature generation model to perform estimation that takes into account the context, making it possible to perform difficult estimations with high accuracy.
  • Fig. 5 is a block diagram showing an example of the configuration of the information processing device 3.
  • the information processing device 3 includes a control unit 30 that controls each unit of the information processing device 3, and a storage unit 31 that stores various data used by the information processing device 3.
  • the information processing device 3 also includes a communication unit 32 that allows the information processing device 3 to communicate with other devices, an input unit 33 that accepts input of various data to the information processing device 3, and an output unit 34 that allows the information processing device 3 to output various data.
  • the control unit 30 also includes an object detection unit 301, a trajectory data generation unit 302, a feature generation unit 303, an integration unit 304, an inference unit 305, a learning unit 306, a difference identification unit 307, an adjustment unit 308, and a similarity calculation unit 309.
  • the memory unit 31 stores trajectory data 311, a feature generation model 312, an inference model 313, and teacher data 314.
  • the learning unit 306, the similarity calculation unit 309, and the teacher data 314 will be described later in the section "About learning", and the difference identification unit 307 and the adjustment unit 308 will be described later in the section "About absorbing context differences".
  • the object detection unit 301 detects a specific object from each of multiple frame images extracted from a video. If the target video has been shot at a high frame rate, the object detection unit 301 can detect the object from each frame image with relatively lightweight image processing by utilizing the positional continuity. There are no particular limitations on the method of detecting the object. For example, the object detection unit 301 may detect the object using a detection model that has been trained to detect the object using an image of the object as training data. There are no particular limitations on the algorithm of the detection model. For example, the object detection unit 301 may use a detection model such as a convolutional neural network, a recursive neural network, or a transformer, or a detection model that combines a plurality of these.
  • the trajectory data generation unit 302 generates trajectory data indicating the trajectory of the movement of an object based on the detection result of the object from multiple frame images by the object detection unit 301.
  • the trajectory data includes time information indicating the timing when the frame image in which the object is detected was captured, and position information indicating the detected position of the object in the frame image, and may also include image patches or the like as information indicating the characteristics of the detected object.
  • the time information may also include information indicating the time difference between the frame images.
  • the generated trajectory data is stored in the memory unit 31 as trajectory data 311.
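  • The publication does not specify how per-frame detections are linked into one trajectory; as one possibility that exploits the positional continuity mentioned above, a greedy nearest-neighbour association between consecutive frames could look like this (an illustrative assumption, not the claimed method):

```python
import math

def link_detections(frames, max_dist=20.0):
    """frames: list (over time) of lists of (x, y) detections from one video.
    Returns a list of trajectories, each a list of (frame_index, (x, y))."""
    trajectories = []   # all tracks, finished or ongoing
    active = []         # tracks extended in the previous frame
    for t, detections in enumerate(frames):
        new_active = []
        used = set()
        for track in active:
            _, (px, py) = track[-1]
            best, best_d = None, max_dist
            for i, (x, y) in enumerate(detections):
                if i in used:
                    continue
                d = math.hypot(x - px, y - py)
                if d < best_d:
                    best, best_d = i, d
            if best is not None:           # extend the track with the closest nearby detection
                used.add(best)
                track.append((t, detections[best]))
                new_active.append(track)
        for i, det in enumerate(detections):
            if i not in used:              # unmatched detections start new tracks
                track = [(t, det)]
                trajectories.append(track)
                new_active.append(track)
        active = new_active
    return trajectories
```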
  • the feature generation unit 303 generates features according to the context of an object appearing in each of a plurality of frame images extracted from a video capturing an object moving along a specific context, using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image. Specifically, the feature generation unit 303 generates features by inputting the time information and position information indicated in the trajectory data 311 to the feature generation model 312.
  • the feature generation model 312 is a trained model that has been trained to generate features according to a context. More specifically, the feature generation model 312 is generated by learning the relationship between time information indicating the timing at which an object moving in accordance with a specified context was photographed, and position information indicating the detected position of the object in the image photographed at that timing, and the feature of the object at that timing.
  • the feature generation model 312 may be a function that uses the above-mentioned feature as a response variable and time information and position information as explanatory variables.
  • the context that influences the movement of the object used in learning may be the same as or similar to the context that influences the movement of the target object that is the subject of inference.
  • the algorithm of the feature generation model 312 is not particularly limited.
  • the feature generation model 312 may be a model such as a convolutional neural network, a recurrent neural network, or a transformer, or may be a model that combines two or more of these.
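  • As one concrete possibility (the publication only requires a model that maps time and position information to a context-dependent feature; the layer sizes below are assumptions), the feature generation model 312 could be a small multilayer perceptron:

```python
import torch
from torch import nn

class FeatureGenerationModel(nn.Module):
    """Maps (normalized time, x, y) of one detection to a feature according to the context."""
    def __init__(self, feature_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, feature_dim),
        )

    def forward(self, time_pos: torch.Tensor) -> torch.Tensor:
        # time_pos: (num_frames, 3); returns (num_frames, feature_dim), one feature per frame image
        return self.net(time_pos)
```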
  • the integration unit 304 integrates the trajectory data 311 generated by the trajectory data generation unit 302 and the features generated by the feature generation unit 303 to generate integrated data.
  • the integration method is not particularly limited.
  • the integration unit 304 may combine the feature as an additional dimension with respect to each time component in the trajectory data 311 (specifically, position information or an image patch associated with one piece of time information, etc.).
  • the integration unit 304 may generate integrated data reflecting the feature by adding the feature to each time component or multiplying the feature by each time component.
  • the integration unit 304 may reflect the feature in each time component by an attention mechanism.
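  • A minimal sketch of two of the integration variants mentioned above, assuming the trajectory data and the features have already been arranged as per-frame tensors (the attention-based variant is omitted):

```python
import torch

def integrate_concat(trajectory: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
    """Combine the feature as additional dimensions of each time component.
    trajectory: (num_frames, d_traj), features: (num_frames, d_feat) -> (num_frames, d_traj + d_feat)."""
    return torch.cat([trajectory, features], dim=-1)

def integrate_additive(trajectory: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
    """Reflect the feature in each time component by addition (both tensors must share a shape)."""
    return trajectory + features
```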
  • the inference unit 305 performs a predetermined inference regarding the object based on the features generated by the feature generation unit 303. Specifically, the inference unit 305 inputs integrated data reflecting the features generated by the feature generation unit 303 to the inference model 313, thereby obtaining an inference result, i.e., a determination result as to whether the object is an air bubble or a foreign object.
  • the inference model 313 is a model generated by using integrated data generated from an image showing air bubbles or foreign objects to learn whether an object shown in the image is an air bubble or a foreign object.
  • the algorithm of the inference model 313 is not particularly limited.
  • the inference model 313 may be a model such as a convolutional neural network, a recursive neural network, or a transformer, or may be a model that combines two or more of these.
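  • One possible shape for the inference model 313, using a recurrent encoder over the integrated data followed by a two-class head (air bubble vs. foreign object); the architecture is an assumption:

```python
import torch
from torch import nn

class InferenceModel(nn.Module):
    """Classifies the integrated data of one trajectory as 'air bubble' (0) or 'foreign object' (1)."""
    def __init__(self, in_dim: int, hidden: int = 32):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, integrated: torch.Tensor) -> torch.Tensor:
        # integrated: (batch, num_frames, in_dim) -> class logits of shape (batch, 2)
        _, h_n = self.encoder(integrated)
        return self.head(h_n[-1])
```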
  • the information processing device 3 includes a feature generating unit 303 that generates features according to the context of an object appearing in a frame image using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, and an inference unit 305 that performs a predetermined inference regarding the object based on the features generated by the feature generating unit 303.
  • This provides the effect of being able to perform inference that takes into account the context while suppressing calculation costs, similar to the information processing device 2 according to the exemplary embodiment 1.
  • still images captured continuously in a time series are also included in the category of "a plurality of frame images extracted from a video".
  • the feature generation unit 303 may generate features using a feature generation model 312 that has learned the relationship between time information indicating the timing at which an object moving along a context that is the same as or similar to the context in which the target object moves was photographed, position information indicating the detected position of the object in the image photographed at that timing, and features according to the context of the object at that timing. This provides the effect of being able to generate appropriate features based on the learning results, in addition to the effect provided by the information processing device 2 according to exemplary embodiment 1.
  • the information processing device 3 includes a trajectory data generation unit 302 that generates trajectory data 311 indicating the trajectory of the movement of an object based on the detection result of the object from a plurality of frame images, and an integration unit 304 that integrates the trajectory data 311 and the feature amount generated by the feature amount generation unit 303 to generate integrated data, the feature amount generation unit 303 generates the feature amount using the position information and time information extracted from the trajectory data 311, and the inference unit 305 performs inference using the integrated data.
  • the teacher data 314 is data in which information indicating whether an object is an air bubble or a foreign object is associated as correct answer data with the trajectory data of an object.
  • the teacher data 314 may also include multiple frame images that are the source of the trajectory data.
  • the feature generation unit 303 generates features from the time information and position information included in the trajectory data, and the integration unit 304 integrates the generated features with the trajectory data to generate integrated data.
  • the inference unit 305 then performs inference using the integrated data, thereby obtaining an inference result based on the trajectory data included in the teacher data 314.
  • the learning unit 306 updates the inference model 313 and the feature generation model 312 so that the result of inference based on the trajectory data included in the teacher data 314 approaches the predetermined correct answer data indicated in the teacher data 314.
  • the learning unit 306 may use a gradient descent method to update each of the inference model 313 and the feature generation model 312 so as to minimize a loss function that is the sum of the errors between the inference result and the correct answer data.
  • In this exemplary embodiment, video images and frame images are not used as they are to generate features, but frame images may be used during learning.
  • the similarity between frame images may be used to update the feature generation model 312.
  • the similarity calculation unit 309 calculates the similarity between frame images, and is used when that similarity is reflected in updating the feature generation model 312.
  • the learning unit 306 updates the feature generation model 312 so that the similarity between multiple frame images is reflected in the similarity between the features generated by the feature generation model 312 for those frame images. For example, by adding a regularization term to the loss function described above, the feature generation model 312 can be updated so that the similarity between the features becomes closer to the similarity between the frame images.
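  • A sketch of such a loss, assuming the inference error is a cross-entropy term and the regularization term pulls the cosine similarity of two generated features toward the measured similarity of the corresponding frame images (the exact form is not specified in the publication):

```python
import torch
import torch.nn.functional as F

def training_loss(logits, labels, feats_a, feats_b, frame_similarity, weight: float = 0.1):
    """logits/labels: inference results and correct answer data for a batch of trajectories.
    feats_a, feats_b: features generated for pairs of frame images.
    frame_similarity: similarity of those frame image pairs, computed beforehand, in [0, 1]."""
    inference_error = F.cross_entropy(logits, labels)
    feature_similarity = F.cosine_similarity(feats_a, feats_b, dim=-1)
    regularization = ((feature_similarity - frame_similarity) ** 2).mean()
    return inference_error + weight * regularization
```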
  • the context that influences the movement of the object used in learning may be the same as or similar to the context that influences the movement of the target object that is the subject of inference. Furthermore, the contexts need only be at least partially the same or similar, and do not have to be entirely the same or similar.
  • the difference identification unit 307 and adjustment unit 308 are used when there is a difference between the context of the movement of the object used in learning and the context of the movement of the target object that is the subject of inference.
  • the adjustment unit 308 adjusts at least one of the time information and the position information used to generate the features so as to absorb the difference between the context of the movement of the object used in learning and the context of the movement of the target object to be inferred.
  • the difference identification unit 307 identifies the difference between the context in which the object used for learning moves and the context in which the target object moves, based on the difference between the environment surrounding the object used for learning and the environment surrounding the target object.
  • the difference identification unit 307 may also identify the difference between the context in which the object used for learning moves and the context in which the target object moves, based on the difference between the object used for learning and the target object to be inferred.
  • FIG. 6 is a diagram showing an example in which a difference occurs between contexts.
  • In example EX1 shown in FIG. 6, as in FIG. 3, it is assumed that a foreign object confirmation inspection is performed using a control sequence of rotation, rest, and rotation, and that the controls included in the control sequence and their execution order are the same during learning and inference. However, the rest period during inference is shorter than during learning. More specifically, in example EX1, during both learning and inference, rotation starts at time t1 and ends at time t2 to enter a rest state, and the movement of the liquid in the container becomes steady at time t3.
  • During learning, the second rotation starts at time t4 and ends at time t5.
  • During inference, however, the second rotation starts at time t4', which is Δt earlier than time t4.
  • The time at which the second rotation ends during inference is likewise Δt earlier than time t5, at time t5'.
  • In this case, the adjustment unit 308 performs an adjustment by adding Δt to the time indicated in each piece of time information corresponding to the period from time t4' to time t5' among the time information used to generate the features. This makes it possible to absorb the difference in context between learning and inference. Note that, contrary to example EX1, if the rest period during inference is longer by Δt than during learning, the adjustment unit 308 can perform an adjustment by subtracting Δt from the time indicated in each piece of time information after the end of the rest period.
  • the adjustment unit 308 can absorb the difference in context by adjusting the time information accordingly.
  • the adjustment unit 308 can adjust the position information used to generate features so as to absorb the difference.
  • For example, when the movement pattern of the target object is a left-right inversion of the movement pattern during learning, the adjustment unit 308 may absorb the difference between contexts by performing a left-right inversion process on the position information of the target object. For example, when coordinate values are used as position information, the adjustment unit 308 may apply a conversion that inverts each coordinate value left-right about a specified axis during the period in which the movement pattern is inverted. Also, for example, when the movement pattern of the target object and the movement pattern of the object during learning are in a rotationally symmetric relationship, the adjustment unit 308 may absorb the difference between contexts by rotationally transforming the position information of the target object. Note that the adjustment unit that adjusts the time information and the adjustment unit that adjusts the position information may be separate blocks.
  • the time information corresponding to each frame image used during inference may differ from the time information corresponding to each frame image used for learning.
  • For example, suppose that the time of the frame image at the start of the first rotation among the frame images used for learning is t1.
  • If the time of the frame image at the start of the first rotation among the frame images used for inference is t1' (t1' < t1), a difference of (t1 - t1') will occur between the contexts.
  • the adjustment unit 308 can perform an adjustment by adding the value of (t1 - t1') to the time indicated in each piece of time information used for inference. Furthermore, if the time of the frame image at the start of the first rotation among the frame images used in inference is t1" (t1" > t1), the adjustment unit 308 can perform an adjustment by subtracting the value of (t1" - t1) from the time indicated in each piece of time information used in inference. By making such an adjustment, the relationship between the time indicated in the time information and the control timing can be aligned with that during learning, thereby allowing the feature generation model 312 to output appropriate features.
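  • The time-shift and left-right inversion adjustments described above could be realized roughly as follows (plain helper functions with hypothetical names; Δt, t1, t1', t4' and t5' are the quantities defined above):

```python
from typing import List, Optional, Tuple

def shift_times(times: List[float], offset: float,
                start: Optional[float] = None, end: Optional[float] = None) -> List[float]:
    """Add `offset` to every capture time in [start, end] (or to all times if no range is given),
    e.g. offset = delta_t for the period from t4' to t5', or offset = t1 - t1' for a start offset."""
    adjusted = []
    for t in times:
        in_range = (start is None or t >= start) and (end is None or t <= end)
        adjusted.append(t + offset if in_range else t)
    return adjusted

def flip_positions(positions: List[Tuple[float, float]], axis_x: float) -> List[Tuple[float, float]]:
    """Left-right inversion of position information about the vertical axis x = axis_x."""
    return [(2 * axis_x - x, y) for (x, y) in positions]
```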
  • factors that cause differences in context are not limited to control sequences.
  • differences between contexts can arise when there is a difference between the object used in learning and the subject of inference, or when there is a difference between the environment surrounding the object used in learning and the environment surrounding the subject of inference.
  • the difference identification unit 307 identifies the difference between the context in which the object used in learning moves and the context in which the target object moves, based on at least one of the differences described above, i.e., the difference between the object used in learning and the target object to be inferred, and the difference between the environment surrounding the object used in learning and the environment surrounding the target object. Therefore, according to the information processing device 3 of this exemplary embodiment, in addition to the effect achieved by the information processing device 2 of exemplary embodiment 1, the effect of being able to automatically identify context differences can be obtained. Furthermore, since the information processing device 3 is equipped with the adjustment unit 308, it is possible to cause the adjustment unit 308 to make adjustments that absorb the differences identified by the difference identification unit 307.
  • example EX2 in Figure 6 shows a case where the viscosity of the liquid sealed in the container is different during learning and inference in a foreign object confirmation inspection that is performed in a sequence of rotation, rest, and rotation. That is, in example EX2, the environment surrounding the object used in learning is different from the environment surrounding the target object. Specifically, the liquid in the container used in inference has a higher viscosity than the liquid used in learning, and therefore, during inference, the time from when the container is brought to rest until the inside of the container becomes steady is shorter than during learning. In other words, the time t3' when the container becomes steady during inference is earlier than the time t3 when the container becomes steady during learning (t3 > t3').
  • the difference identification unit 307 identifies the time t3' at which the liquid stabilized, which is the difference between the contexts at the time of learning and the time of inference, based on the viscosity of the liquid in the container used for inference. If the relationship between the viscosity and the time required for the liquid to stabilize is identified and modeled in advance, the difference identification unit 307 can identify the time t3' using the model and the viscosity of the liquid in the container used for inference.
  • the adjustment unit 308 absorbs the above-mentioned difference by adjusting the time information based on the result of the identification by the difference identification unit 307. Specifically, the adjustment unit 308 performs an adjustment by adding the value of (t3-t3') to the time indicated in each piece of time information from time t3' to time t4.
  • the adjustment unit 308 may adjust all times in the steady state to the same value. In this case, the adjustment unit 308 may replace the times indicated in each piece of time information from time t3' to t4 with, for example, time t3. In this way, the adjustment unit 308 may set the time information for a period in which the context is constant during inference to a constant value. Furthermore, the constant value may be selected from the time values for a period in which an object moved according to a context that was the same as or similar to the above context during learning (the period from time t3 to t4 in the above example).
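  • The steady-state adjustment of example EX2 could be sketched as follows (an assumption; t3, t3' and t4 are as defined above, and the clamp option corresponds to setting the steady-period time information to a constant value):

```python
from typing import List

def absorb_steady_state_difference(times: List[float], t3_infer: float, t3_learn: float,
                                   t4: float, clamp: bool = False) -> List[float]:
    """Adjust inference-time capture timings in the steady period [t3_infer, t4]:
    shift them by (t3_learn - t3_infer), or replace them with the constant t3_learn when clamp=True."""
    out = []
    for t in times:
        if t3_infer <= t <= t4:
            out.append(t3_learn if clamp else t + (t3_learn - t3_infer))
        else:
            out.append(t)
    return out
```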
  • the information processing device 3 includes an adjustment unit 308 that adjusts at least one of the time information and the position information used to generate features so as to absorb the difference between the context of the movement of the object used in learning and the context of the movement of the target object to be inferred. Therefore, according to the information processing device 3 of this exemplary embodiment, even if there is a difference between the context of the movement of the target object and the context of the movement of the object used in learning, appropriate features can be generated using the same feature generation model 312.
  • (Learning process flow) FIG. 7 is a flow diagram showing the flow of processing performed by the information processing device 3 during learning. Note that, when learning is performed, the teacher data 314 and the feature generation model 312 are stored in advance in the storage unit 31.
  • the feature generation model 312 stored in the storage unit 31 may have parameters in an initial state, or may be a model in which learning has progressed to a certain degree.
  • In S31, the learning unit 306 acquires the teacher data 314 stored in the storage unit 31.
  • the teacher data 314 is data in which information indicating whether an object is an air bubble or a foreign object is associated as correct answer data with respect to the trajectory data of the object.
  • the teacher data 314 also includes multiple frame images that are the basis of the trajectory data.
  • In S32, the feature generation unit 303 uses the teacher data 314 acquired in S31 to generate features according to the context of the object used in learning. More specifically, the feature generation unit 303 generates the features by inputting the time information and position information indicated in the trajectory data included in the teacher data 314 acquired in S31 to the feature generation model 312.
  • In S33, the integration unit 304 integrates the trajectory data included in the teacher data 314 acquired in S31 with the features generated in S32 to generate integrated data. Then, in S34, the inference unit 305 performs a predetermined inference using the integrated data generated in S33. Specifically, the inference unit 305 inputs the integrated data generated in S33 to the inference model 313 to obtain an inference result, i.e., a determination result as to whether the object is an air bubble or a foreign object.
  • In S35, the similarity calculation unit 309 calculates the similarity between the frame images included in the teacher data 314 acquired in S31.
  • the frame images may be clipped from around a location corresponding to the position information indicated in the trajectory data.
  • the similarity calculation unit 309 may calculate the similarity for each combination of multiple frame images (corresponding to one trajectory data) included in the teacher data 314, or may calculate the similarity for some combinations.
  • the process of S35 only needs to be performed before S36; it may be performed, for example, before S32 or in parallel with the processes of S32 to S34.
  • In S36, the learning unit 306 updates the feature generation model 312 so that the result of the inference in S34 approaches the predetermined correct answer data indicated in the teacher data 314.
  • the learning unit 306 updates the feature generation model 312 so that the similarity between the frame images calculated in S35 is reflected in the similarity between the features generated by the feature generation model 312 for the frame images.
  • In S37, the learning unit 306 determines whether or not to end learning.
  • the condition for ending learning may be determined in advance, and learning may end, for example, when the number of updates of the feature generation model 312 reaches a predetermined number. If the learning unit 306 determines NO in S37, it returns to the process of S31 and acquires new teacher data 314. On the other hand, if the learning unit 306 determines YES in S37, it stores the updated feature generation model 312 in the memory unit 31 and ends the process of FIG. 7.
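  • Putting the steps S31 to S37 together, a simplified training loop could look like the following sketch (the models, the integrate function and the data layout are the illustrative assumptions introduced earlier; the similarity term of S35 is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def learn(feature_model, inference_model, integrate, teacher_data, optimizer, num_updates: int = 1000):
    """teacher_data: list of (time_pos, trajectory_tensor, label) per trajectory, with
    time_pos (N, 3), trajectory_tensor (N, d_traj) and label a scalar long tensor."""
    for step in range(num_updates):                                           # S37: fixed number of updates
        time_pos, trajectory, label = teacher_data[step % len(teacher_data)]  # S31: acquire teacher data
        features = feature_model(time_pos)                                    # S32: features from time/position
        integrated = integrate(trajectory, features).unsqueeze(0)             # S33: integrated data (1, N, d)
        logits = inference_model(integrated)                                  # S34: predetermined inference
        loss = F.cross_entropy(logits, label.unsqueeze(0))                    # S36: approach correct answer data
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    return feature_model, inference_model
```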
  • Note that, instead of the teacher data 314, a video image showing a predetermined object (specifically, at least one of an air bubble and a foreign object) or frame images extracted from the video image may be acquired in S31.
  • the object detection unit 301 detects the object from the acquired frame image, and the trajectory data generation unit 302 generates trajectory data 311 of the detected object.
  • the teacher data 314 is generated by labeling this trajectory data 311 with the correct answer data.
  • the processing after the teacher data 314 is generated is the same as the processing from S32 onwards described above.
  • the method for generating the feature generation model 312 includes: for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, inputting time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into the feature generation model 312 for generating features according to the context, performing a predetermined inference regarding the object based on the calculated features (S34); and updating the feature generation model 312 so that the result of the inference approaches predetermined correct answer data (S36).
  • The above configuration makes it possible to generate a feature generation model 312 capable of generating features according to a context from time information and position information, and thus has the effect of making it possible to perform inference that takes the context into account while suppressing computational costs.
  • the method for generating the feature generation model 312 includes calculating the similarity between a plurality of frame images (S35), and in updating the feature generation model 312, the feature generation model 312 is updated so that the similarity between a plurality of frame images is reflected in the similarity between the features generated by the feature generation model 312 for the frame images. Since similar frame images are considered to have similar contexts, the above configuration makes it possible to generate a feature generation model 312 that can generate more valid features that take into account the similarity between frame images.
  • Fig. 8 is a flow diagram showing the flow of processing (inference method) performed by the information processing device 3 during inference.
  • Fig. 8 shows processing after a plurality of frame images extracted from a moving image to be inferred are input to the information processing device 3.
  • the moving image shows an object to be determined as being an air bubble or a foreign object.
  • the information processing device 3 may also perform the processing of extracting frame images from the moving image.
  • In S41, the object detection unit 301 detects an object from each of the frame images. Then, in S42, the trajectory data generation unit 302 generates trajectory data 311 indicating the trajectory of the object's movement based on the object detection result in S41. Note that the following describes the case in which one object is detected and one piece of trajectory data 311 indicating the trajectory of its movement is generated. When multiple pieces of trajectory data 311 are generated, the processes of S43 to S47 described below are performed for each piece of trajectory data 311.
  • In S43, the difference identification unit 307 identifies the difference between the context in which the object used in learning moves and the context in which the target object moves, based on at least one of the difference between the object used in learning and the target object to be inferred, and the difference between the environment surrounding the object used in learning and the environment surrounding the target object. For example, if the viscosity of the liquid sealed in the container differs between learning and inference, the difference identification unit 307 may calculate the time at which the liquid in the container becomes steady based on the difference in viscosity, and calculate the difference between this time and the time at which the liquid in the container becomes steady during learning.
  • In S44, the adjustment unit 308 adjusts at least one of the time information and the position information used to generate the features so as to absorb the difference between the contexts identified in S43. For example, if a difference in the time at which the liquid becomes steady is calculated in S43 as described above, the adjustment unit 308 adjusts the time information so as to absorb that time difference. Note that if there is no difference between the contexts, the processes of S43 and S44 are omitted. Furthermore, if at least one of the time information and the position information was normalized during learning, the adjustment unit 308 similarly normalizes the time information and the position information used to generate the features.
  • In S45, the feature generation unit 303 generates features according to the context. Specifically, for each of a plurality of frame images corresponding to one piece of trajectory data 311, the feature generation unit 303 extracts the position information and time information of the object appearing in the frame image from the trajectory data 311. The feature generation unit 303 then inputs the extracted time information and position information into the feature generation model 312 to generate features. As a result, for each frame image, features according to the context of the object appearing in that frame image are generated.
  • In S46, the integration unit 304 integrates the trajectory data 311 generated in S42 with the features generated in S45 to generate integrated data. Then, in S47, the inference unit 305 performs a predetermined inference regarding the object based on the features generated in S45. Specifically, the inference unit 305 obtains an inference result by inputting the integrated data reflecting the features generated in S45 to the inference model 313, and the processing in FIG. 8 ends. Note that the inference unit 305 may output the inference result via the output unit 34 or the like, or may store it in the storage unit 31 or the like.
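  • For reference, the inference-time flow S41 to S47 can be strung together as below, reusing the hypothetical helpers from the earlier sketches (a per-frame detect function, link_detections, FeatureGenerationModel and InferenceModel); the context-difference steps S43 and S44 are omitted:

```python
import torch

def infer_video(frames, detect, link_detections, feature_model, inference_model) -> list:
    """frames: the frame images extracted from one video. Returns one label per detected object."""
    detections_per_frame = [detect(f) for f in frames]              # S41: object detection
    results = []
    for track in link_detections(detections_per_frame):             # S42: trajectory data 311
        time_pos = torch.tensor([[t, x, y] for t, (x, y) in track], dtype=torch.float32)
        features = feature_model(time_pos)                          # S45: features according to context
        integrated = torch.cat([time_pos, features], dim=-1)        # S46: integrated data
        logits = inference_model(integrated.unsqueeze(0))           # S47: predetermined inference
        results.append("foreign object" if logits.argmax(dim=-1).item() == 1 else "air bubble")
    return results
```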
  • the inference method includes generating, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, features corresponding to the context of the object appearing in the frame images using position information indicating the detected position of the object in the frame images and time information indicating the timing at which the frame images were captured (S45), and making a predetermined inference about the object based on the generated features (S47).
  • This has the effect of making it possible to perform inference that takes into account the context while keeping computational costs down.
  • each process described in the above exemplary embodiment is arbitrary and is not limited to the above example.
  • an information processing system having the same functions as the information processing devices 1 to 3 can be constructed by a plurality of devices that can communicate with each other.
  • the process in the flow chart of FIG. 7 and the process in the flow chart of FIG. 8 may be executed by different information processing devices (or processors).
  • each process in the flow chart shown in FIG. 7 or FIG. 8 can be shared and executed by a plurality of information processing devices (or processors).
  • the content of the predetermined inference executed by the inference units 11, 22, and 305 is not particularly limited as long as it is related to the object.
  • it may be prediction, conversion, etc.
  • the factors that give rise to a context are also arbitrary.
  • With the information processing device 2 or 3, it is possible to perform inference that takes into account the context for an object that moves in accordance with a context arising from various devices whose operations change at a predetermined cycle, from natural phenomena that change at a predetermined cycle, or the like.
  • With the information processing device 1 or 3, it is possible to generate a feature generation model that makes it possible to perform inference that takes into account such a context.
  • the movement of moving objects (vehicles, people, etc.) around a traffic light is affected by the periodic light emission control of the traffic light.
  • the moving objects move according to the context resulting from the light emission control of the traffic light.
  • the information processing device 1 or 3 performs a predetermined inference regarding the moving object based on the feature calculated by inputting time information and position information into the feature generation model for each of a plurality of frame images extracted from a video image capturing a moving object moving along the context, and updates the feature generation model so that the inference result approaches predetermined correct answer data.
  • The information processing device 2 or 3 then performs a predetermined inference regarding the moving object based on features generated using the feature generation model generated in this way, and thereby obtains a highly valid inference result that takes the context into account.
  • The content of the inference is not particularly limited; it may be, for example, prediction of the position of the moving object after a predetermined time, classification of the behavior of the moving object, or detection of abnormal behavior of the moving object. It is preferable that these inferences also take into account interactions between vehicles and pedestrians, between vehicles, and so on.
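  • Purely as an illustration, the different kinds of inference mentioned above could be realized as alternative output heads on one shared encoding of the integrated data; the head structure below is an assumption, not something defined in the publication.

```python
import torch.nn as nn

class InferenceHeads(nn.Module):
    """Alternative output heads sharing one encoding of the integrated data."""
    def __init__(self, in_dim: int, num_behaviors: int = 5):
        super().__init__()
        self.position = nn.Linear(in_dim, 2)              # position after a given time
        self.behavior = nn.Linear(in_dim, num_behaviors)  # behavior classification logits
        self.anomaly = nn.Linear(in_dim, 1)               # abnormal-behavior score

    def forward(self, encoded):
        return {
            "position": self.position(encoded),
            "behavior": self.behavior(encoded),
            "anomaly": self.anomaly(encoded).sigmoid(),
        }
```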
  • Some or all of the functions of the information processing devices 1 to 3 may be realized by hardware such as an integrated circuit (IC chip), or may be realized by software.
  • In the latter case, the information processing devices 1 to 3 are realized, for example, by a computer that executes the instructions of a program (an inference program or a learning program), which is software realizing each function.
  • An example of such a computer (hereinafter referred to as computer C) is shown in FIG. 9.
  • Computer C has at least one processor C1 and at least one memory C2.
  • Memory C2 stores program P for operating computer C as any one of information processing devices 1 to 3.
  • The processor C1 reads the program P from the memory C2 and executes it, thereby realizing the functions of any one of the information processing devices 1 to 3.
  • The processor C1 may be, for example, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a DSP (Digital Signal Processor), an MPU (Micro Processing Unit), an FPU (Floating point number Processing Unit), a PPU (Physics Processing Unit), a microcontroller, or a combination of these.
  • The memory C2 may be, for example, a flash memory, an HDD (Hard Disk Drive), an SSD (Solid State Drive), or a combination of these.
  • Computer C may further include a RAM (Random Access Memory) for expanding program P during execution and for temporarily storing various data.
  • Computer C may further include a communications interface for sending and receiving data to and from other devices.
  • Computer C may further include an input/output interface for connecting input/output devices such as a keyboard, mouse, display, and printer.
  • The program P can also be recorded on a non-transitory, tangible recording medium M that can be read by the computer C.
  • Such a recording medium M can be, for example, a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit.
  • The computer C can obtain the program P via such a recording medium M.
  • The program P can also be transmitted via a transmission medium.
  • Such a transmission medium can be, for example, a communication network or broadcast waves.
  • The computer C can also obtain the program P via such a transmission medium.
  • (Appendix 1) An information processing device comprising: a feature generation means for generating, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, a feature corresponding to the context of the object appearing in the frame image, using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image; and an inference means for making a predetermined inference regarding the object based on the feature.
  • (Appendix 2) The information processing device described in Appendix 1, wherein the feature generation means generates the feature using a feature generation model that has learned the relationship between time information indicating the timing at which an object moving in accordance with a context identical or similar to the above context was photographed, position information indicating the detected position of that object in the image photographed at that timing, and a feature corresponding to the context of that object at that timing.
  • (Appendix 3) The information processing device described in Appendix 1, further comprising: a trajectory data generation means for generating trajectory data indicating a trajectory of movement of the object based on a detection result of the object from the plurality of frame images; and an integration means for integrating the trajectory data and the feature generated by the feature generation means to generate integrated data, wherein the feature generation means generates the feature using the position information and the time information extracted from the trajectory data, and the inference means performs the inference using the integrated data.
  • (Appendix 5) The information processing device described in Appendix 4, further comprising a difference identification means for identifying a difference between the context in the movement of the object and the context in the movement of the target object, based on at least one of a difference between the object and the target object and a difference between an environment surrounding the object and an environment surrounding the target object.
  • An inference method comprising: at least one processor generating, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, features of the object appearing in the frame images corresponding to the context, using position information indicating the detection position of the object in the frame images and time information indicating the timing when the frame images were captured; and making a predetermined inference regarding the object based on the features.
  • An inference program that causes a computer to function as: a feature generation means that generates, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, a feature corresponding to the context of the object appearing in the frame image, using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image; and an inference means that makes a predetermined inference regarding the object based on the feature.
  • (Appendix 8) A method for generating a feature generation model, comprising: at least one processor inputting, for each of a plurality of frame images extracted from a video capturing an object moving in accordance with a predetermined context, time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image into a feature generation model for generating a feature according to the context, and making a predetermined inference about the object based on the calculated feature; and updating the feature generation model so that the result of the inference approaches predetermined ground truth data.
  • (Appendix 9) The method for generating a feature generation model described in Appendix 8, further comprising: the at least one processor calculating a similarity between the plurality of frame images; and, in updating the feature generation model, updating the feature generation model such that the similarity between the plurality of frame images is reflected in the similarity between the features generated by the feature generation model for those frame images.
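  • One hedged way to realize the update described in Appendix 9 is an auxiliary loss term that pulls the pairwise similarity of the generated features toward the pairwise similarity of the frame images; the publication does not define the similarity measure, so cosine similarity over precomputed frame embeddings is assumed here.

```python
import torch
import torch.nn.functional as F

def similarity_consistency_loss(frame_embeddings: torch.Tensor,
                                features: torch.Tensor) -> torch.Tensor:
    """Penalize mismatch between the pairwise similarity of the frame images
    (represented here by precomputed embeddings) and the pairwise similarity of
    the features generated for those frames."""
    def pairwise_cos(x: torch.Tensor) -> torch.Tensor:
        x = F.normalize(x.flatten(1), dim=1)
        return x @ x.t()                        # (T, T) cosine-similarity matrix
    return F.mse_loss(pairwise_cos(features), pairwise_cos(frame_embeddings))

# This term would be added to the main inference loss when the feature
# generation model is updated.
```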
  • An information processing device including at least one processor, the processor executing, for each of a plurality of frame images extracted from a video capturing an object moving along a predetermined context, a process of generating a feature of the object appearing in the frame image according to the context using time information indicating the timing at which the frame image was captured and position information indicating the detected position of the object in the frame image, and a process of making a predetermined inference regarding the object based on the feature.
  • The information processing device may further include a memory, and the memory may store an inference program for causing the processor to execute the process of generating the feature and the process of performing the predetermined inference.
  • The inference program may also be recorded on a computer-readable, non-transitory, tangible recording medium.

Landscapes

  • Image Analysis (AREA)

Abstract

In order to enable context-aware inference while keeping computational costs down, this information processing device (2) comprises: a feature generation unit (21) that generates a feature according to the context using time information indicating the timing at which a frame image was captured and position information indicating the detected position of an object in the frame image; and an inference unit (22) that performs an inference based on the generated feature.
PCT/JP2022/046044 2022-12-14 2022-12-14 Dispositif de traitement d'informations, procédé d'inférence, programme d'inférence et procédé de génération de modèle de génération de valeur caractéristique WO2024127554A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/046044 WO2024127554A1 (fr) 2022-12-14 2022-12-14 Dispositif de traitement d'informations, procédé d'inférence, programme d'inférence et procédé de génération de modèle de génération de valeur caractéristique

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/046044 WO2024127554A1 (fr) 2022-12-14 2022-12-14 Dispositif de traitement d'informations, procédé d'inférence, programme d'inférence et procédé de génération de modèle de génération de valeur caractéristique

Publications (1)

Publication Number Publication Date
WO2024127554A1 true WO2024127554A1 (fr) 2024-06-20

Family

ID=91484702

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/046044 WO2024127554A1 (fr) 2022-12-14 2022-12-14 Dispositif de traitement d'informations, procédé d'inférence, programme d'inférence et procédé de génération de modèle de génération de valeur caractéristique

Country Status (1)

Country Link
WO (1) WO2024127554A1 (fr)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021214994A1 (fr) * 2020-04-24 2021-10-28 日本電気株式会社 Système d'inspection
JP7138264B1 (ja) * 2021-10-08 2022-09-15 楽天グループ株式会社 情報処理装置、情報処理方法、情報処理システム、およびプログラム
