CN118160012A - Adaptive artificial intelligence for three-dimensional object detection using synthetic training data - Google Patents

Adaptive artificial intelligence for three-dimensional object detection using synthetic training data

Info

Publication number
CN118160012A
CN118160012A (application CN202280071065.8A)
Authority
CN
China
Prior art keywords
interest
item
composite
machine learning
learning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280071065.8A
Other languages
Chinese (zh)
Inventor
W·M·帕里
M·E·因池奥萨
L·艾伦
D·J·海恩斯
M·A·W·海德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing LLC filed Critical Microsoft Technology Licensing LLC
Publication of CN118160012A publication Critical patent/CN118160012A/en


Abstract

Embodiments described herein relate to an adaptive AI model for 3D object detection using synthetic training data. For example, the ML model is trained to detect certain items of interest based on a training set that is synthetically generated in real-time during a training process. The training set includes a plurality of images depicting containers virtually holding items of interest. Each image of the training set is a composite of an image including a container containing items of non-interest and an image including items of interest scanned in isolation. A plurality of such images are generated during any given training iteration of the ML model. Once trained, the ML model is configured to detect items of interest in an actual container and output a classification indicating a likelihood that the container includes the items of interest.

Description

Adaptive artificial intelligence for three-dimensional object detection using synthetic training data
Background
Security checkpoints (e.g., at airports, courthouses, etc.) are often equipped with X-ray scanners that enable officers to check passenger baggage for prohibited items (e.g., explosives, liquids, firearms, sharp objects, parts of protected species). Because it relies heavily on human operators, the screening process is slow, expensive, and inaccurate. The main challenge in developing Artificial Intelligence (AI)-based solutions is the need for very large manually labeled datasets. Suppliers and regulatory organizations in the industry have spent months and years curating such training sets for small subsets of items of interest. This requirement to curate large datasets for training AI models is a major impediment to algorithm development, making it impossible to respond quickly to emerging threats (e.g., 3D-printed weapons).
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The methods, systems, apparatus, and computer-readable storage media described herein relate to an adaptive AI model for three-dimensional (3D) object detection using synthetic training data. According to embodiments described herein, a machine learning model is trained to detect certain items of interest based on training sets that are synthetically generated in real-time during a training process. The training set includes a plurality of images depicting containers (e.g., luggage, bags, vanity cases, etc.) virtually holding the items of interest. Each image of the training set is a composite of an image including a container containing items of non-interest and an image including an item of interest scanned in isolation. To generate the composite image, the image including the item of interest may be modified or transformed (e.g., scaled, rotated, etc.) and then virtually placed at a random location in the container depicted in the other image. A plurality of such images are generated during any given training iteration of the machine learning model. Once trained, the machine learning model is configured to detect items of interest in an actual container and output a classification indicating a likelihood that the container includes the items of interest.
Further features and advantages, as well as the structure and operation of various example embodiments, are described in detail below with reference to the accompanying drawings. It is noted that example implementations are not limited to the specific embodiments described herein. Such example embodiments are presented herein for illustrative purposes only. Additional implementations will be apparent to those skilled in the relevant art(s) based on the teachings contained herein.
Drawings
The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate exemplary embodiments of the present application and, together with the description, further serve to explain the principles of the exemplary embodiments and to enable a person skilled in the pertinent art to make and use the exemplary embodiments.
FIG. 1 illustrates a block diagram of an example system for generating an adaptive Artificial Intelligence (AI) model for three-dimensional (3D) object detection in accordance with an example embodiment.
Fig. 2 depicts a diagram of an autoencoder, according to an example embodiment.
FIG. 3 is a block diagram of a system for generating synthetic training data according to an example embodiment.
FIG. 4 illustrates a flowchart of a method for training a machine learning model using a synthetic training data set, according to an example embodiment.
FIG. 5 illustrates a flowchart of a method for generating a plurality of composite three-dimensional images, according to an example embodiment.
Fig. 6 depicts a block diagram of a 3D image projector according to an example embodiment.
FIG. 7 illustrates a flowchart of a method for training a machine learning model to detect items of interest in different types of containers, according to an example embodiment.
FIG. 8 illustrates a flowchart of a method for selecting an item of interest to train a machine learning model, according to an example embodiment.
FIG. 9 is a block diagram of a system configured to classify new data items via a machine learning model according to an example embodiment.
FIG. 10 illustrates a flowchart of a method for detecting and classifying an item of interest via a machine learning model, according to an example embodiment.
FIG. 11 is a block diagram of an example processor-based computer system that may be used to implement various embodiments.
Features and advantages of implementations described herein will become more apparent from the detailed description set forth below when taken in conjunction with the drawings in which like reference characters identify corresponding elements throughout. In the drawings, like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements. The drawing in which an element first appears is indicated by the leftmost digit(s) in the corresponding reference number.
Detailed Description
I. Introduction
The specification and drawings disclose a number of example implementations. The scope of the application is not limited to the disclosed implementations, but also covers combinations of the disclosed implementations and modifications to the disclosed implementations. References in the specification to "one implementation," "an example embodiment," "an example implementation," etc., indicate that the implementation described may include a particular feature, structure, or characteristic, but every implementation may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same implementation. Further, when a particular feature, structure, or characteristic is described in connection with an implementation, it is submitted that it is within the knowledge of one skilled in the relevant art(s) to effect such feature, structure, or characteristic in connection with other implementations whether or not explicitly described.
In the discussion, unless otherwise indicated, terms (such as "substantially" and "about") modifying a condition or a relational feature of one or more features of an implementation of the disclosure should be understood to mean that the condition or feature is defined to be within an operationally acceptable tolerance of the implementation of the intended application.
Moreover, it should be understood that the spatial descriptions (e.g., "above," "below," "upward," "left," "right," "downward," "top," "bottom," "vertical," "horizontal," etc.) used herein are for illustrative purposes only, and that actual implementations of the structures described herein may be spatially arranged in any orientation or manner.
Many example embodiments are described below. It is noted that any chapter/sub-chapter titles provided herein are not intended to be limiting. Implementations are described throughout this document, and any type of implementation may be included under any section/sub-section. Furthermore, the implementations disclosed in any section/sub-section may be combined in any manner with any other implementations described in the same section/sub-section and/or a different section/sub-section.
Example implementation
Embodiments described herein relate to an adaptive AI model for three-dimensional (3D) object detection using synthetic training data. According to embodiments described herein, a machine learning model is trained to detect certain items of interest based on training sets that are synthetically generated in real-time during a training process. The training set includes a plurality of images depicting containers (e.g., luggage, bags, vanity cases, etc.) virtually holding the items of interest. Each image of the training set is a composite of an image including a container containing items of non-interest and an image including an item of interest scanned in isolation. To generate the composite image, the image including the item of interest may be modified or transformed (e.g., scaled, rotated, etc.) and then virtually placed at a random location in the container depicted in the other image. A plurality of such images are generated during any given training iteration of the machine learning model. Once trained, the machine learning model is configured to detect items of interest in an actual container and output a classification indicating a likelihood that the container includes the items of interest.
The techniques described herein advantageously improve the field of image-based screening by eliminating the need to curate large manually labeled training data sets, reducing response time to emerging threats from months or even years to just a few days, and requiring only a small number of source images to generate a training set of virtually packed images. This provides a significant savings in terms of development cost and time. The techniques described herein may be used to identify any number of items of interest in many different contexts.
As described herein, certain techniques are used to minimize the amount of computing resources required to synthesize training data. For example, after the composite image is generated, the composite image may be cropped, and a predetermined number of voxels from the cropped image may be sampled. The sampled points are spaced apart, thereby reducing the likelihood that voxels comprising the edges of the item of interest are sampled. This not only speeds up processing (because each individual voxel is not sampled), but also saves computational resources (e.g., processing cycles, memory, storage, etc.). In addition, it also improves the accuracy of the machine learning model because using such samples reduces the chance that the machine learning model simply learns to identify boundaries of pasted objects (which may lead to inaccurate classification).
Past attempts to improve the screening process have used only 2.5-dimensional methods: the bag is rendered in 3D and rotated in front of a simulated camera lens to create two-dimensional (2D) images of the bag from different angles. This 2.5D approach is deficient in that it allows criminals to hide objects in cluttered bags.
FIG. 1 illustrates a block diagram of an example system 100 for generating an adaptive Artificial Intelligence (AI) model for three-dimensional (3D) object detection in accordance with an example embodiment. As shown in fig. 1, the system 100 includes a synthetic training data generator 102 and a machine learning model 104. The synthetic training data generator 102 is configured to dynamically generate a plurality of 3D images that are used to train the machine learning model 104. Each 3D image of the plurality of 3D images is generated based on a composite of the different images. The machine learning model 104 is configured to detect particular entities or items (also referred to herein as "items of interest") in particular containers depicted in the 3D images. Examples of items of interest include, but are not limited to, animals, bones, weapons, fruits, vegetables, and the like. Examples of containers include, but are not limited to, various types of luggage, boxes, bottles, jars, bags, sacks, and the like.
The synthetic training data generator 102 may be configured to generate an artificial (or synthetic) 3D image depicting an item of interest included in a container. For example, the synthetic training data generator 102 may be configured to obtain a first 3D image of the item of interest and obtain a second 3D image of a container that does not include the item of interest. The synthetic training data generator 102 then generates a new 3D image in which the item of interest from the first 3D image is virtually added to the container of the second 3D image. The item of interest may be placed at a randomly selected location within the container. In addition, a transformation may be performed on the item of interest prior to positioning the item of interest at a particular location in the container. Examples of transformations include, but are not limited to, scaling the item of interest to a different size, rotating the item of interest a certain number of degrees, flipping (or reflecting) the item of interest, and so forth. Using this technique, the synthetic training data generator 102 may generate any number of synthetic 3D images, where in each synthetic 3D image the item of interest is placed at a different location within the container and/or transformed in a different manner. The synthetic training data generator 102 generates a training data set 106 based on the generated synthetic 3D images and provides the training data set 106 to the machine learning model 104. The training data set 106 includes the generated composite 3D images, which may be represented via one or more feature vectors, each feature vector including a plurality of features (such as, but not limited to, edges, curves, colors, shapes, etc.).
The machine learning model 104 may be an Artificial Neural Network (ANN) configured to learn to classify various items of interest included in different types of containers using the training data set 106. According to an embodiment, the machine learning model 104 is an autoencoder-based ANN. The autoencoder-based ANN is configured to learn a data encoding representing the training data set 106 in a semi-supervised manner. The purpose of an autoencoder-based ANN is to learn a lower-dimensional representation (e.g., a semantic representation) of higher-dimensional data (i.e., training data set 106), typically for dimensionality reduction, by training the ANN to capture the most important or relevant portions of the 3D images represented by training data set 106.
For example, fig. 2 depicts a diagram of an autoencoder 200, according to an example embodiment. The autoencoder 200 is an example of an autoencoder that may be used for the machine learning model 104. The autoencoder 200 is configured to learn, for example in a semi-supervised manner, a data encoding representing features of the composite 3D images of the training dataset 106. The purpose of the autoencoder 200 is to learn a lower-dimensional representation (e.g., a semantic representation) of the higher-dimensional data (i.e., training data set 106). As shown in fig. 2, the autoencoder 200 includes a plurality of nodes 202 to 244. Nodes 202, 204, 206, 208, 210, and 212 may represent the input layer through which autoencoder 200 receives the feature vector(s) based on training dataset 106.
The autoencoder 200 generally includes three parts: an encoder, a bottleneck, and a decoder, each of which includes one or more nodes. The encoder may be represented by nodes 202 through 220. The encoder (or encoder network) encodes the input data (i.e., the input feature vector(s) 108) into lower and lower dimensions. That is, the encoder is configured to compress the input data (i.e., the input feature vector(s) 108) into an encoded representation that is typically several orders of magnitude smaller than the input data. The encoder may perform a set of convolution and pooling operations that compress the input data into the bottleneck. The bottleneck (represented by nodes 222 and 224) is configured to limit the flow of data from the encoder to the decoder to force a compressed knowledge representation of the input feature vector(s) 108. The decoder may be represented by nodes 226 through 244. The decoder (or decoder network) is configured to decode the input feature vector(s) 108 into increasingly higher dimensions. That is, the decoder is configured to decompress the knowledge representation and reconstruct the input feature vector(s) 108 from the encoded form. The decoder may perform a series of up-sampling and transposed convolution operations that reconstruct the compressed knowledge representation output from the bottleneck back into the form of the 3D images represented by the training data set 106. Nodes 234 through 244 may represent the output layer by which the reconstructed data (representing the feature vector(s) based on training data set 106) is represented and/or provided.
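To make the encoder/bottleneck/decoder structure concrete, the following is a minimal sketch (not the claimed implementation) of an autoencoder written in PyTorch. The layer widths, the input dimensionality, the class name, and the use of simple fully connected layers in place of the convolution/pooling operations described above are all illustrative assumptions.

# Minimal autoencoder sketch (illustrative; layer sizes and input size are placeholders).
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim: int = 4096, bottleneck_dim: int = 64):
        super().__init__()
        # Encoder: compresses the input feature vector into lower and lower dimensions.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
        )
        # Bottleneck: limits data flow to force a compressed knowledge representation.
        self.bottleneck = nn.Linear(128, bottleneck_dim)
        # Decoder: decompresses the encoding back toward the original dimensionality.
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 128), nn.ReLU(),
            nn.Linear(128, 512), nn.ReLU(),
            nn.Linear(512, input_dim),
        )

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        encoding = self.bottleneck(self.encoder(x))
        reconstruction = self.decoder(encoding)
        return reconstruction, encoding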
An autoencoder (such as autoencoder 200) is a deep learning technique; specifically, an autoencoder is a type of artificial neural network. The loss function used to train an autoencoder (e.g., autoencoder 200) is also referred to as the reconstruction loss or error because it is a check of how well the feature vector(s) of the training data set 106 are reconstructed. Each of the nodes 202-244 is associated with a weight that emphasizes the importance of the particular node (also referred to as a neuron). For example, assume that the neural network is configured to classify whether a synthesized 3D image includes ivory. In this case, nodes capturing characteristics of ivory will be weighted more heavily than nodes capturing characteristics atypical of ivory. The weights of the neural network are learned by training on the training dataset 106. The neural network iterates multiple times, changing its weights by back-propagation with respect to the loss function. Essentially, the neural network tests data, makes predictions, and determines a score that represents its accuracy. It then uses the score to make itself slightly more accurate by updating the weights accordingly. Through this process, the neural network learns to improve its prediction accuracy.
The reconstruction loss or error is typically a mean squared error (e.g., the distance between the feature vector(s) of the training data set 106 and their reconstructed versions). Each layer of the autoencoder 200 may apply an affine transformation (e.g., Wx + b, where x corresponds to a column vector of a sample from a data set (e.g., training data set 106) provided to the autoencoder 200, W corresponds to a weight matrix, and b corresponds to a bias vector), followed by a nonlinear function (e.g., a rectified linear unit (ReLU) function, which forces negative values to zero and leaves non-negative values unchanged). In the forward pass, the predicted values are calculated and the loss is then computed, where all weights of nodes 202 through 244 are initially set to random values and are iteratively updated. In the next step, gradients are calculated to alter the weights in the direction that reduces the loss. This process (also known as stochastic gradient descent) is repeated until convergence.
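As a rough, non-authoritative illustration of this training procedure (mean squared reconstruction error minimized by stochastic gradient descent with backpropagation), a self-contained PyTorch sketch follows; the network shape, learning rate, batch size, and the random placeholder data are all assumptions.

import torch
import torch.nn as nn

# Tiny stand-in autoencoder: encoder, bottleneck, decoder (sizes are illustrative).
model = nn.Sequential(
    nn.Linear(4096, 256), nn.ReLU(),   # encoder
    nn.Linear(256, 64), nn.ReLU(),     # bottleneck
    nn.Linear(64, 256), nn.ReLU(),     # decoder
    nn.Linear(256, 4096),
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # stochastic gradient descent
loss_fn = nn.MSELoss()                                    # mean squared reconstruction error

for step in range(100):                                   # training iterations
    x = torch.randn(64, 4096)                             # placeholder batch of feature vectors
    reconstruction = model(x)
    loss = loss_fn(reconstruction, x)                     # distance between input and reconstruction
    optimizer.zero_grad()
    loss.backward()                                       # backpropagation with respect to the loss
    optimizer.step()                                      # update weights to reduce the loss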
Referring again to fig. 1, after training the machine learning model 104 to identify the item of interest, a new data item 108 (e.g., a new 3D image) may be provided to the machine learning model 104, and the machine learning model 104 attempts to classify the new data item 108. For each item of interest detected for a particular new data item, the machine learning model 104 may output a classification 110 that includes a probability (e.g., a value between 0 and 1) that indicates a likelihood that the new data item includes the particular item of interest. For example, the machine learning model 104 may output probabilities of a first item of interest (e.g., gorilla skull) detected in the 3D image and may output probabilities of a second item of interest (e.g., ivory) detected in the 3D image. In addition, the machine learning model 104 may also output the location of the item in the baggage depicted in the 3D image.
FIG. 3 is a block diagram of a system 300 for generating synthetic training data according to an example embodiment. The following description is described with reference to embodiments in which baggage is scanned for certain items of interest. It is noted, however, that the techniques described herein may be used for other purposes.
As shown in fig. 3, system 300 includes a synthetic training data generator 302, a Computed Tomography (CT) scanner 305, a preprocessor 306, a machine learning model 304, and a performance analyzer 316. The synthetic training data generator 302 and the machine learning model 304 are examples of the synthetic training data generator 102 and the machine learning model 104, respectively, as described above with reference to fig. 1. CT scanner 305 is configured to generate CT scans of objects (e.g., containers) disposed therein. CT scanner 305 may measure X-ray attenuation using a rotating X-ray tube and a row of detectors placed in its gantry. The X-ray beam is attenuated as photons are absorbed while it passes through the various items stored in the container. A reconstruction algorithm is then used to process the plurality of X-ray measurements taken from different angles to produce tomographic (cross-sectional) images of the container. CT scanner 305 may be any CT scanner known in the art. CT scanner 305 may provide as output one or more image files comprising 3D image data. The image file(s) may be formatted in accordance with the DICOM (Digital Imaging and Communications in Medicine) format, the Visualization Toolkit (VTK) format, the Insight Segmentation and Registration Toolkit (ITK) meta-image format (e.g., an MHA file), etc.
CT scanner 305 may be used to perform CT scans of two types of entities: (1) baggage that has been determined not to include any items of interest (shown as "cleared baggage"), and (2) items of interest. The baggage may include a plurality of different containers (e.g., packed by passengers with various items) that have been cleared as not including any items of interest (e.g., via a screening process, such as an airport screening process). Each item of interest provided to CT scanner 305 is scanned in isolation (i.e., with no other items in its vicinity). Each item of interest may be placed in a box (e.g., a cardboard box with a supporting material (such as foam) surrounding the item of interest). For each item of baggage scanned, CT scanner 305 outputs a 3D image file 308. For each isolated item of interest scanned, CT scanner 305 outputs a 3D image file 310. Because image files 308 and 310 are 3D image files, these files comprise voxel data. A voxel is the 3D analog of a pixel. A voxel represents a value in three-dimensional space. Thus, each voxel of an image file may include a particle density at an X-coordinate, a Y-coordinate, and a Z-coordinate, which represent the location of the voxel within the image. The combined information of voxel coordinates and specific density values may be used to distinguish between different types of materials including, but not limited to, paper, metal, cloth, bone, etc.
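For illustration only, a scan stored in the ITK meta-image (MHA) format could be loaded into a dense voxel array along the following lines. The use of SimpleITK and the file name are assumptions rather than requirements of the embodiments; any DICOM/VTK/MHA reader producing a 3D array would serve equally well.

import numpy as np
import SimpleITK as sitk  # assumed reader; any MHA/DICOM/VTK reader would do

# Load a CT volume into a dense array of particle densities, indexed as (z, y, x).
volume = sitk.GetArrayFromImage(sitk.ReadImage("cleared_bag_0001.mha"))  # hypothetical file name

# Each voxel is identified by its (x, y, z) coordinates plus its particle density value.
z, y, x = 10, 20, 30
print(f"voxel ({x}, {y}, {z}) has particle density {volume[z, y, x]}")

# Voxels with zero density correspond to empty space; non-zero voxels carry material information.
occupied = np.argwhere(volume > 0)
print(f"{len(occupied)} occupied voxels out of {volume.size}")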
A plurality of cleared baggage items may be scanned by CT scanner 305 to generate a cleared baggage image library 312. Library 312 may be maintained in a data store, which may be any type of storage device or array of devices. Similarly, a plurality of isolated items of interest may be scanned by the CT scanner 305 to generate an item of interest image library 314. Library 314 may be maintained in a data store, which may be any type of storage device or array of devices. According to an embodiment, an image 310 may be provided to the preprocessor 306 before the image file 310 is stored in the library 314. The preprocessor 306 is configured to remove noise from the image 310. The noise may include the supporting material and/or the box. For example, the preprocessor 306 may perform any of a Gaussian smoothing-based noise reduction technique, a thresholding-based noise reduction technique, a convex hull-based noise reduction technique, etc., to remove various types of noise from the image 310. The processed images are stored in the library 314.
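A minimal sketch of the smoothing- and thresholding-based noise reduction mentioned above is given below; the density threshold and filter width are hypothetical values, and the convex-hull variant is omitted.

import numpy as np
from scipy.ndimage import gaussian_filter

def denoise_item_scan(volume: np.ndarray,
                      density_threshold: float = 0.05,  # hypothetical cut-off for foam/cardboard
                      sigma: float = 1.0) -> np.ndarray:
    """Remove low-density supporting material and box from an isolated item-of-interest scan."""
    smoothed = gaussian_filter(volume.astype(np.float32), sigma=sigma)  # Gaussian smoothing
    return np.where(smoothed > density_threshold, volume, 0.0)          # thresholding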
The synthetic training data generator 302 is configured to generate training data based on the images stored in the libraries 312 and 314 to train the machine learning model 304. The synthetic training data generator 302 includes an image selector 318, a 3D image projector 320, a clipper 322, and a point sampler 324. The image selector 318 is configured to select an image file from the library 312 and an image file from the library 314, and provide the pair of images to the 3D image projector 320. For any given training iteration, image selector 318 may select multiple pairs of images (where each pair includes an image from library 312 and an image from library 314) to generate a batch training set. According to an embodiment, the image selector 318 may select 64 pairs of images.
The image selector 318 may select images from the library 314 in a random manner. Alternatively, image selector 318 may select images from library 314 according to a curriculum learning-based technique. According to this technique, items of interest that are difficult for the machine learning model 304 to identify have a higher chance of being selected, which accelerates the training process, while the parameters used to load those items into the container are adjusted to make the task somewhat easier. The image selector 318 may utilize a weighting scheme in which the images 310 including such items of interest are weighted more heavily, thereby increasing the likelihood that such images are selected for training. For example, the performance analyzer 316 may be configured to determine a classification performance score for each item of interest on which the machine learning model 304 is trained. Each classification performance score indicates a performance level of the machine learning model 304 with respect to classifying a particular item of interest within a particular container. Each classification performance score may be based on an F-score (also referred to as an F1-score) of the machine learning model 304, which is a measure of the accuracy of the machine learning model 304 on the data set (i.e., training data set 106). The F-score may be defined as the harmonic mean of the precision and recall of the machine learning model 304. A relatively low classification performance score for a particular item of interest may mean that the classifications generated by the machine learning model 304 for that item of interest are relatively inaccurate and that it is difficult for the machine learning model 304 to identify that particular item of interest. A relatively high classification performance score for a particular item of interest may mean that the classifications generated by the machine learning model 304 for that item of interest are relatively accurate. The image selector 318 may be configured to select, from the library 314, images that include items of interest that are difficult for the machine learning model 304 to classify, based on the classification performance scores determined with respect to those items of interest. For example, the image selector 318 may select such images with a probability inversely related to the classification performance score, wherein the lower the classification performance score, the higher the probability that the image selector 318 will select such images. For example, performance analyzer 316 may provide command 338 to image selector 318. Command 338 may specify the classification performance score(s) determined for the different item(s) of interest. In response to receiving command 338, image selector 318 may update its weights based on the classification performance score(s). For example, image selector 318 may increase the weight, for selection from library 314, of an image including item(s) of interest having a relatively low classification performance score.
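One way (among many) to realize such a weighting scheme is to sample item-of-interest images with probability inversely related to the model's F-score on each item, as in the following sketch; the specific weighting rule and the example scores are assumptions, not the claimed technique.

import numpy as np

def select_item_index(f_scores: np.ndarray, rng: np.random.Generator) -> int:
    """Pick an item-of-interest image index; low F-scores (hard items) are favored."""
    weights = np.clip(1.0 - f_scores, 1e-3, None)  # harder items receive larger weights
    probabilities = weights / weights.sum()
    return int(rng.choice(len(f_scores), p=probabilities))

rng = np.random.default_rng(0)
f_scores = np.array([0.95, 0.80, 0.30])  # hypothetical per-item F1-scores
print(select_item_index(f_scores, rng))  # item 2 is selected most often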
According to an embodiment, the image selector 318 may select images from the library 312 in a random manner. It is noted that images comprising the same container (e.g., the same item of baggage) may be selected in successive iterations. According to another embodiment, the image selector 318 may select images from the library 312 according to a curriculum learning-based technique, wherein the probability of selecting images including different containers is proportional to the average performance of the machine learning model 304 across all categories (e.g., the items of interest on which the machine learning model 304 is trained). That is, initially, the same images (including the same baggage) may be used for training until it is determined that the performance of the machine learning model 304 with respect to identifying items of interest virtually contained in that baggage is relatively high (e.g., the machine learning model 304 is able to correctly classify items of interest contained in a particular type of baggage more than 90% of the time). For example, the performance analyzer 316 may be configured to determine an average classification score based on an average of the classification performance scores generated for the different items of interest. A relatively high average classification score may indicate that the machine learning model 304 is relatively accurate in classifying items of interest within a particular container. A relatively low average classification score may indicate that the machine learning model 304 is relatively inaccurate in classifying items of interest within a particular container. The image selector 318 may be configured to select images from the library 312 that include different types of containers as the machine learning model 304 becomes better at classifying the items of interest within the particular type of container. For example, the image selector 318 may select such images 308 with a probability corresponding to the average classification performance score, where the higher the average classification performance score, the higher the probability that the image selector 318 will select images from the library 312 that include different types of containers. For example, performance analyzer 316 may provide command 340 to image selector 318. Command 340 may specify the average classification performance score. In response to receiving command 340, image selector 318 may update its weights in proportion to the average classification performance score. For example, as the average classification score increases, the image selector 318 may increase its weight for selecting images from the library 312 that include different types of containers.
Each pair of images selected for a particular training iteration is provided to 3D image projector 320. The 3D image projector 320 is configured to generate an image 326, the image 326 depicting a synthetically (or artificially) packed piece of luggage that includes the item of interest. That is, image 326 is a composite of an image depicting a cleared item of baggage (selected from library 312) and an image depicting an item of interest (selected from library 314).
To generate the composite image 326, 3D image projector 320 may convert each image provided thereto into a three-dimensional matrix, where each cell in the matrix corresponds to a particular voxel of the respective image. Each cell in the matrix specifies an X-coordinate, a Y-coordinate, a Z-coordinate, and the particle density associated with the voxel. The 3D image projector 320 may randomly select a set of adjacent cells of the three-dimensional matrix generated for the image corresponding to the cleared item of luggage and adjust the values stored therein using the values of the cells of the three-dimensional matrix generated for the image corresponding to the item of interest. According to an embodiment, the item of interest may be transformed prior to being combined with the cleared bag. For example, the item of interest may be rotated by a randomly determined number of degrees, scaled according to a randomly determined scaling factor, and/or reflected about one or more randomly selected axes (e.g., the X-axis, Y-axis, and/or Z-axis).
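Treating the adjustment of the bag voxels as an additive combination of particle densities, a minimal NumPy sketch of the virtual packing step might look as follows; the additive combination and the assumption that the item fits entirely inside the bag volume are made only for illustration.

import numpy as np

def insert_item(bag: np.ndarray, item: np.ndarray, rng: np.random.Generator):
    """Virtually pack an item-of-interest volume into a random location of a cleared-bag volume."""
    bz, by, bx = bag.shape
    iz, iy, ix = item.shape
    # Pick a random corner so that the item fits entirely inside the bag volume.
    z0 = int(rng.integers(0, bz - iz + 1))
    y0 = int(rng.integers(0, by - iy + 1))
    x0 = int(rng.integers(0, bx - ix + 1))
    packed = bag.copy()
    packed[z0:z0 + iz, y0:y0 + iy, x0:x0 + ix] += item   # adjust bag densities with item densities
    center = (z0 + iz // 2, y0 + iy // 2, x0 + ix // 2)  # later passed to the clipper for windowing
    return packed, center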
According to the curriculum learning-based technique described above, as the machine learning model 304 becomes better at identifying the item of interest, the amount by which the item of interest is transformed increases. For example, the performance analyzer 316 may monitor the classification performance score (when attempting to learn to classify a particular item of interest) and determine whether the classification performance score increases or decreases. As the classification performance score increases (i.e., the machine learning model 304 classifies the particular item of interest better), the performance analyzer 316 may send one or more commands 342 to the 3D image projector 320, the command(s) 342 causing the 3D image projector 320 to increase the amount by which the particular item of interest is transformed (e.g., scaled and/or rotated). For example, 3D image projector 320 may utilize a scaling factor to determine how much a particular item of interest is to be scaled and a rotation factor (e.g., a defined number of degrees) to determine how much the particular item of interest is to be rotated. Command(s) 342 may provide new values for the scaling factor and/or the rotation factor. Alternatively, command(s) 342 may signal to 3D image projector 320 that the scaling factor and/or rotation factor is to be updated. The amount by which the scaling factor and/or rotation factor will be changed may depend on the value of the reconstruction error 336, with the scaling factor and/or rotation factor increasing as the value of the reconstruction error 336 decreases. The foregoing effectively challenges the machine learning model 304 to learn new scenarios for classifying particular items of interest.
According to an embodiment, after generating composite image 326, 3D image projector 320 may perform various post-processing thereon. For example, 3D image projector 320 may apply a natural logarithm to the particle density at each voxel, normalize the particle density values (e.g., by subtracting the mean, dividing by the standard deviation, etc.), and/or normalize the particle density values such that all particle density values are in the range between 0 and 1.
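A short sketch of such post-processing is shown below; using log1p to avoid taking the logarithm of zero-density (empty) voxels and the order of the normalization steps are implementation assumptions.

import numpy as np

def postprocess(volume: np.ndarray) -> np.ndarray:
    """Log-transform particle densities, standardize them, and rescale to the [0, 1] range."""
    v = np.log1p(volume)                                 # natural logarithm of the densities
    v = (v - v.mean()) / (v.std() + 1e-8)                # subtract the mean, divide by the std
    return (v - v.min()) / (v.max() - v.min() + 1e-8)    # rescale so all values lie in [0, 1]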
As described above, to generate a batch training set, multiple pairs of images (e.g., 64 pairs) are provided to 3D image projector 320. Thus, for any given training iteration, 3D image projector 320 generates a plurality of composite images 326 (e.g., 64), each composite image 326 including a particular type of item of interest virtually fit into a random location of a particular cleared piece of luggage. In addition, each item of interest virtually contained in a particular cleared item of luggage may have a different orientation and/or size due to the transformations performed thereon. In training the machine learning model 304, hundreds of thousands of such composite images may be generated. The large training dataset is generated based on a relatively small number of images (i.e., images stored in libraries 312 and 314).
Each composite image 326 generated during a training iteration is provided to the clipper 322. The clipper 322 is configured to window or crop each composite image 326 (e.g., to one-fourth of the size of the baggage in the image in each dimension) around the item of interest included therein to generate a cropped image 330. Because 3D image projector 320 performs the insertion of the item of interest into the cleared item of luggage, clipper 322 can obtain the center and location of the item of interest within each composite image 326 from 3D image projector 320. The 3D image projector 320 may provide such information (e.g., the voxel coordinates corresponding to the center and location of the item of interest) to the clipper 322. Initially (e.g., during earlier training iterations), the crop is centered on the item of interest. However, as the machine learning model 304 improves (e.g., as the reconstruction error 336 of the machine learning model 304 decreases), the clipper 322 effectively adds noise to the center of the window (i.e., the crop is offset from the center), thereby adding more background within the window (i.e., other areas of the cleared piece of luggage that do not include the item of interest). The rationale for adding noise is that, during inference, the locations of items of interest in an actually packed piece of luggage are not known. Thus, a full search of the bag is performed with windows of the same size as the window size on which the machine learning model 304 was trained. The cropped image 330 is provided to the point sampler 324.
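A rough sketch of this windowing step, including the optional random offset of the window center, follows; the window size, the jitter magnitude, and the clipping of the window to the volume bounds are illustrative assumptions.

import numpy as np

def crop_window(volume: np.ndarray, center: tuple, window: tuple,
                jitter: int, rng: np.random.Generator) -> np.ndarray:
    """Crop a fixed-size window around the item center, optionally offset by random noise."""
    offsets = rng.integers(-jitter, jitter + 1, size=3) if jitter > 0 else np.zeros(3, dtype=int)
    starts = []
    for c, w, o, dim in zip(center, window, offsets, volume.shape):
        starts.append(int(np.clip(c + o - w // 2, 0, dim - w)))  # keep the window inside the volume
    z0, y0, x0 = starts
    wz, wy, wx = window
    return volume[z0:z0 + wz, y0:y0 + wy, x0:x0 + wx]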
The point sampler 324 is configured to sample a predetermined number of voxels (e.g., 50,000) from each cropped image 330 (e.g., the X-coordinate, Y-coordinate, Z-coordinate, and particle density of each sampled voxel). The sampled points are spaced apart, thereby reducing the likelihood that voxels comprising the edges of the item of interest are sampled. This is performed to speed up processing and save computing resources (e.g., processing cycles, memory, storage, etc.). In addition, it also improves the accuracy of the machine learning model 304 because using such samples reduces the chance that the machine learning model 304 simply learns to identify the boundaries of pasted objects (which may lead to inaccurate classifications). The procedure for selecting the voxels to be sampled may be performed, to some extent, deterministically. For example, at the beginning of training, a heat map for sampling is generated and used throughout the training process. The same heat map is used for all cropped images 330 generated during training. For each voxel, the heat map contains the probability that the voxel is included in the sample. When sampling voxels according to the heat map, voxels with zero particle density (i.e., empty space) are ignored. For a relatively empty piece of baggage, voxels with a low sampling probability according to the heat map will nevertheless be sampled; this is a non-deterministic aspect of the sampling process. Because the window is in a different position each time (the heat map moves with the window), a new point cloud sample is obtained each time the process is performed. The sample points (shown as sample points 332) are provided to the machine learning model 304 for training.
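The heat-map-driven sampling of non-empty voxels could be realized roughly as follows; constructing the heat map as a fixed random probability field and the exact handling of nearly empty windows are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)

def make_heat_map(window_shape: tuple) -> np.ndarray:
    """Fixed per-voxel sampling probabilities, generated once at the start of training."""
    heat = rng.random(window_shape)
    return heat / heat.sum()

def sample_points(window: np.ndarray, heat_map: np.ndarray, n_points: int = 50_000) -> np.ndarray:
    """Sample up to n_points non-empty voxels as (x, y, z, density) rows."""
    density = window.ravel()
    probs = heat_map.ravel() * (density > 0)        # ignore empty space (zero particle density)
    if probs.sum() == 0:
        return np.empty((0, 4))
    probs /= probs.sum()
    n = min(n_points, int((density > 0).sum()))
    idx = rng.choice(density.size, size=n, replace=False, p=probs)
    z, y, x = np.unravel_index(idx, window.shape)
    return np.stack([x, y, z, density[idx]], axis=1)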
The point sampler 324 may also be configured to mark each sampled voxel as being in the background or foreground. A voxel marked as being in the foreground indicates that the voxel includes an item of interest. A voxel marked as being in the background indicates that the voxel does not include the item of interest. Briefly, the item of interest is considered to be in the foreground of the actual packaged piece of luggage, while everything else is considered to be in the background. Such tags (shown as tag 334) are also provided to the machine learning model 304 for training. It is noted that the tag may be generated earlier in the synthetic data generation process. For example, the tag 334 may be generated by the 3D image projector 320 or the clipper 322.
As described above, machine learning model 304 may be trained using curriculum learning-based techniques. Initially, the machine learning model 304 is trained to identify all items of interest against the same context (i.e., a single type of randomly selected baggage). As the performance of the machine learning model 304 increases (i.e., as the reconstruction error 336 decreases), the rate at which the baggage is changed increases. With continued training, the machine learning model 304 learns to identify items of interest against any background. Similarly, the machine learning model 304 may be initially trained to identify items of interest in a particular orientation (e.g., an upright orientation), with increasing amounts of rotation (and/or scaling) about the three axes as training progresses and performance improves.
The machine learning model 304 is configured to receive the sample points 332 and labels 334 (e.g., 64 sets of sample points 332 and associated labels 334) generated during a training iteration. Such data is ultimately provided (e.g., in the form of feature vector(s)) to the bottleneck of an autoencoder (e.g., autoencoder 200 shown in fig. 2) of machine learning model 304, where an encoded representation of the data is determined. The machine learning model 304 then reconstructs the sampling points. The reconstructed sample points are compared to the original sample points (sample points 332) to determine how well the reconstruction process performed. Based on this performance, the weights of the nodes of the autoencoder of machine learning model 304 are updated so that performance is further improved in the next iteration. For example, as described above, based on the classification predicted by machine learning model 304 and the actual labels (labels 334), machine learning model 304 outputs a reconstruction error 336. The machine learning model 304 is updated to attempt to reduce the reconstruction error 336 during subsequent training iterations. For example, the weights used to generate data for subsequent layers of the neural network of the machine learning model 304 may be updated.
Thus, the machine learning model 304 may be trained using the synthetic training data set in a number of ways. For example, fig. 4 shows a flowchart 400 of a method for training a machine learning model with a synthetic training data set, according to an example embodiment. The steps of flowchart 400 occur during each of one or more training iterations of a training session of the machine learning model. In an embodiment, the flowchart 400 may be implemented by the system 300 of fig. 3. Accordingly, flowchart 400 will be described with reference to FIG. 3. Other structural and operational embodiments will be apparent to those skilled in the relevant art(s) based on the discussion of flowchart 400 and system 300 of fig. 3.
Flowchart 400 begins with step 402. In step 402, a first three-dimensional image is selected. The first three-dimensional image includes the container and does not include the item of interest. For example, referring to FIG. 3, image selector 318 selects a three-dimensional image from library 312.
In step 404, a second three-dimensional image is selected. The second three-dimensional image includes an object of interest. For example, referring to FIG. 3, image selector 318 selects a three-dimensional image from library 314.
In step 406, a plurality of composite three-dimensional images are generated based on the first three-dimensional image and the second three-dimensional image, each of the plurality of composite three-dimensional images including the item of interest. For example, referring to fig. 3, 3D image projector 320 generates a plurality of composite three-dimensional images 326 including an item of interest. The items of interest in each of the composite three-dimensional images 326 may have different orientations and/or be positioned in different locations within the container. Additional details regarding the generation of the plurality of composite three-dimensional images 326 are described below with reference to fig. 5.
In step 408, for each of the plurality of composite three-dimensional images, the composite three-dimensional image is cropped around the item of interest included in the composite three-dimensional image to generate a cropped image 330. For example, referring to FIG. 3, the clipper 322 clips the composite three-dimensional image 326 around the item of interest included therein. Thus, the cropped image 330 includes only the item of interest (and/or a relatively small portion of the three-dimensional image 326 surrounding the item of interest).
In step 410, for each of the plurality of composite three-dimensional images, a plurality of voxels associated with the cropped composite three-dimensional image are sampled. For example, referring to fig. 3, point sampler 324 samples a plurality of voxels associated with cropped image 330.
In step 412, a plurality of voxels sampled from each of the plurality of composite three-dimensional images is provided as a training data set to a machine learning model. A machine learning model is trained to detect an item of interest based on a plurality of voxels sampled from each of a plurality of composite three-dimensional images. For example, referring to fig. 3, machine learning model 304 receives as a training dataset a plurality of voxels sampled from each composite three-dimensional image 326 (shown as sampling points 332). A label 334 (indicating whether a particular voxel is in the foreground or background) for each of the sample points 332 is also provided to the machine learning model 304. The machine learning model 304 is trained to detect items of interest based on the sampling points 332 and the tags 334.
Fig. 5 shows a flowchart 500 of a method for generating a plurality of composite three-dimensional images, according to an example embodiment. In an embodiment, flowchart 500 may be implemented by a 3D image projector, such as 3D image projector 600 of fig. 6. Accordingly, the flowchart 500 will be described with reference to fig. 6. Fig. 6 depicts a block diagram of a 3D image projector 600 according to an example embodiment. The 3D image projector 600 is an example of the 3D image projector 320, as described above with reference to fig. 3. As shown in fig. 6, the 3D image projector 600 includes a transformer 602, a transformed item inserter 614, a position determiner 612, and a factor adjuster 620. Other structural and operational embodiments will be apparent to those skilled in the relevant art(s) based on the discussion of flowchart 500 and 3D image projector 600 of fig. 6.
Flowchart 500 begins with step 502. In step 502, for each of a plurality of iterations, the item of interest is transformed. For example, referring to fig. 6, the transformer 602 is configured to receive an image 610 including an item of interest. Image 610 is an example of image 310 and may be received from library 314 as described above with reference to FIG. 3. The transformer 602 is configured to transform the item of interest included in the image 610.
According to one or more embodiments, transforming the item of interest includes at least one of scaling the item of interest according to a scaling factor or rotating the item of interest according to a rotation factor. For example, referring to fig. 6, transformer 602 may include a scaler 604 and/or a rotator 606. The image 610 may be provided to the scaler 604, which scales the image 610 (i.e., the item of interest included therein) according to a scaling factor 616. The scaling factor 616 may include a multiplier value by which the voxel values corresponding to the item of interest are multiplied. Examples of scaling include, but are not limited to, increasing the size of the item of interest or decreasing the size of the item of interest. According to an embodiment, a relatively small amount of noise may be added to the item of interest, making the machine learning model 304 resilient to slight variations in material (e.g., a gorilla skull with reduced bone density) or in the performance of CT scanner 305. Noise may be added by multiplying the voxel values corresponding to the item of interest by a noise factor. The scaled image may then be rotated by the rotator 606 according to a rotation factor 618. The rotation factor 618 may include a degree value that specifies the number of degrees by which the image 610 is to be rotated. It is noted that, alternatively, the image 610 may be first rotated by the rotator 606 and then scaled by the scaler 604. The transformed item of interest (shown as transformed item of interest 622) is provided to the transformed item inserter 614.
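The following sketch illustrates one way such scaling, rotation, and slight multiplicative noise could be applied to a volumetric item scan; the use of scipy.ndimage, the choice of rotation axes, and the noise magnitude are assumptions for illustration and not the claimed implementation.

import numpy as np
from scipy.ndimage import rotate, zoom

def transform_item(item: np.ndarray, scale: float, degrees: float,
                   noise_std: float, rng: np.random.Generator) -> np.ndarray:
    """Scale, rotate, and lightly perturb an isolated item-of-interest volume."""
    scaled = zoom(item, zoom=scale, order=1)                      # apply the scaling factor
    rotated = rotate(scaled, angle=degrees, axes=(1, 2),          # apply the rotation factor
                     reshape=True, order=1)
    noise = 1.0 + rng.normal(0.0, noise_std, size=rotated.shape)  # slight material/scanner variation
    return np.clip(rotated * noise, 0.0, None)                    # densities remain non-negative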
In step 504, for each of the plurality of iterations, the transformed item of interest is inserted into a location within the container of the first three-dimensional image to generate a composite three-dimensional image of the plurality of composite three-dimensional images. For example, referring to fig. 6, the location determiner 612 is configured to determine a location (e.g., a random location) within the container included in the image 608. Image 608 is an example of image 308 and may be received from library 312 as described above with reference to FIG. 3. The determined location (shown as location 624) is provided to the transformed item inserter 614. The transformed item inserter 614 is configured to insert the transformed item of interest 622 at the location 624 of the container included in the image 608 to generate a composite three-dimensional image 626. Composite three-dimensional image 626 is an example of composite three-dimensional image 326, as described above with reference to fig. 3.
In accordance with one or more embodiments, to generate composite image 626, 3D image projector 600 may convert images 608 and 610 into three-dimensional matrices, where each cell in a matrix corresponds to a particular voxel of the respective image. Each voxel comprises a particle density at the corresponding X-, Y-, and Z-coordinates. The 3D image projector 600 may randomly select a set of adjacent cells of the three-dimensional matrix generated for the image 608 (corresponding to the cleared item of luggage) and adjust the values stored therein using the values of the cells of the three-dimensional matrix generated for the image 610 (corresponding to the item of interest).
According to one or more embodiments, as the reconstruction error of the machine learning model decreases, the amount by which the item of interest is transformed increases. For example, referring to FIG. 6, according to the curriculum learning-based technique described above, as a machine learning model (e.g., machine learning model 304 shown in FIG. 3) becomes better at identifying the item of interest, the amount by which the item of interest is transformed increases. For example, when the performance analyzer detects a reduction in a reconstruction error (e.g., reconstruction error 336 shown in fig. 3), the factor adjuster 620 may receive one or more commands 642 from the performance analyzer (e.g., performance analyzer 316 described above with reference to fig. 3). Command(s) 642 are examples of command(s) 342, as described above with reference to fig. 3. In response to receiving command(s) 642, factor adjuster 620 increases the value of scaling factor 616 and/or rotation factor 618. The amount by which the scaling factor 616 and/or the rotation factor 618 will be changed may depend on the value of the reconstruction error 336, with the scaling factor 616 and/or rotation factor 618 increasing as the value of the reconstruction error 336 decreases.
According to one or more embodiments, the machine learning model 304 switches to detecting items of interest in different types of containers based on a reconstruction error of the machine learning model 304. For example, fig. 7 shows a flowchart 700 of a method for training a machine learning model to detect items of interest in different types of containers, according to an example embodiment. In an embodiment, flowchart 700 may be implemented by system 300 of fig. 3. Accordingly, flowchart 700 will be described with reference to FIG. 3. Other structural and operational embodiments will be apparent to those skilled in the relevant art(s) based on the discussion of flowchart 700 and system 300 of fig. 3.
Flowchart 700 begins with step 702. In step 702, an average classification performance score of the machine learning model is determined, the average classification performance score being based on an average of a plurality of classification performance scores, each classification performance score of the plurality of classification performance scores being indicative of a classification performance of the machine learning model relative to a particular item of interest of the plurality of items of interest. For example, referring to fig. 3, performance analyzer 316 may determine an average classification performance score.
In step 704, a third three-dimensional image is selected that includes another container and does not include the item of interest with a probability corresponding to the average classification performance score. For example, referring to fig. 3, the performance analyzer 316 may provide a command 340 (e.g., including an average classification performance score) to the image selector 318, the command 340 causing the image selector 318 to select an image from the library 312 with a probability corresponding to the average classification performance score.
In step 706, a plurality of second composite three-dimensional images are generated based on the third three-dimensional image and the second three-dimensional image. For example, referring to fig. 3, 3D image projector 320 generates a plurality of second composite three-dimensional images based on the third three-dimensional image and the second three-dimensional image. The method then continues in a similar manner as described above with reference to fig. 4, wherein steps 408, 410 and 412 are performed based on the second composite three-dimensional images.
In accordance with one or more embodiments, a three-dimensional image including an item of interest is selected based on a reconstruction error of the machine learning model 304, wherein three-dimensional images including items of interest that are difficult for the machine learning model 304 to identify have a higher chance of being selected, to accelerate the training process. For example, FIG. 8 illustrates a flowchart 800 of a method for selecting an item of interest to train a machine learning model, according to an example embodiment. In an embodiment, flowchart 800 may be implemented by system 300 of fig. 3. Accordingly, flowchart 800 will be described with reference to FIG. 3. Other structural and operational embodiments will be apparent to those skilled in the relevant art(s) based on the discussion of flowchart 800 and system 300 of fig. 3. The following is described in the context of selecting the second three-dimensional image at step 404.
Flowchart 800 begins with step 802. In step 802, a classification performance score of a machine learning model is determined. For example, referring to fig. 3, performance analyzer 316 may determine a classification performance score for machine learning model 304.
In step 804, a third three-dimensional image is selected with a probability proportional to the classification performance score. For example, referring to fig. 3, performance analyzer 316 may provide a command 338 (including a classification performance score) to image selector 318, which command 338 causes image selector 318 to select a three-dimensional image including the item of interest from library 314 with a probability proportional to the classification performance score.
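The weighted draw in step 804 could be sketched as follows; the per-item score list and the small floor placed on the weights are illustrative assumptions rather than details from the disclosure.

import random

def select_item_image(item_images, per_item_scores, rng=random):
    """item_images[i] is a 3D scan of item i; per_item_scores[i] is its score."""
    weights = [max(score, 1e-6) for score in per_item_scores]  # avoid zero weights
    return rng.choices(item_images, weights=weights, k=1)[0]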
After training is completed, the machine learning model 304 is applied to a test set of real packages to establish appropriate thresholds on the confidence level of the machine learning model 304. The threshold may be set individually for each category of item of interest. For example, for more threatening objects (e.g., firearms), an organization may be willing to tolerate a higher false positive rate but require a minimal false negative rate. On the other hand, for contraband items that do not pose a direct threat to human life, a relatively high false negative rate is acceptable, while tolerance for a high false positive rate is low. The confidence thresholds may be dynamically adjusted in the deployed solution in response to a change in the threat level (e.g., an overall increase in the threat level) or based on the identity of the bag's owner.
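The per-category thresholding and dynamic adjustment described here might look like the following sketch; the category names, threshold values, and the threat_level_offset parameter are illustrative assumptions rather than values from the disclosure.

THRESHOLDS = {
    "firearm": 0.30,   # high-threat: tolerate more false positives
    "sharps": 0.50,
    "ivory": 0.85,     # contraband with no direct threat: keep false positives low
}

def flag_categories(classification, thresholds=THRESHOLDS, threat_level_offset=0.0):
    """Return the categories whose probability meets the (possibly adjusted) threshold."""
    return [
        category
        for category, probability in classification.items()
        if probability >= max(0.0, thresholds.get(category, 0.9) - threat_level_offset)
    ]

Lower thresholds make the detector more sensitive, so a rising threat level can be expressed as a positive offset subtracted from every category's threshold.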
After training of the machine learning model 304 is completed, the machine learning model 304 is deployed (e.g., at an airport) and used to classify new data items. FIG. 9 is a block diagram of a system 900 configured to classify new data items via a machine learning model, according to an example embodiment. As shown in FIG. 9, system 900 includes a CT scanner 905, a cropper 922, a point sampler 924, and a machine learning model 904. CT scanner 905, cropper 922, point sampler 924, and machine learning model 904 are examples of CT scanner 305, cropper 322, point sampler 324, and machine learning model 304, respectively, as described above with reference to FIG. 3.
As shown in FIG. 9, a new data item (e.g., an item of baggage) is provided to CT scanner 905, which scans the baggage and outputs a 3D image 902. The 3D image 902 is provided to cropper 922. Cropper 922 is configured to generate a plurality of segmented windows 906 (or partial 3D images) of the 3D image 902 by cutting the 3D image 902 at predetermined distances along each of the X, Y, and Z axes (e.g., into quarters along each axis). For example, as shown in FIG. 9, the 3D image 902 is segmented into 64 windows 906. It is noted that the 3D image 902 may be segmented into any number of windows, and the number of windows described herein is purely exemplary. Each of the windows 906 is provided to point sampler 924.
The point sampler 924 is configured to sample a predetermined number of voxels (e.g., 50,000) from each window 906 in a manner similar to that described above with reference to the point sampler 324 of fig. 3. The sampling points (or voxels) (shown as sampling points 908) are provided to the machine learning model 904.
For each window 906, the machine learning model 904 is configured to analyze each of its sampling points 908 and make a determination (e.g., generate a classification) as to whether each of the sampling points 908 is in the foreground (i.e., is part of an item of interest) or in the background (i.e., is not part of an item of interest). Based on the analysis of the sampling points 908 of the one or more windows 906, the machine learning model 904 outputs a final classification 910. The classification 910 includes one or more probabilities. Each of the probability(s) indicates a likelihood that the luggage (corresponding to the 3D image 902) includes the respective item of interest (e.g., a 90% probability that the luggage includes illegal ivory, a 5% probability that the luggage includes an illegal gorilla skull, etc.). According to an embodiment, the classification 910 may be based on the classifications generated for the individual sampling points 908 of each window 906. For example, the classifications generated for the sampling points 908 of a given window 906 may be averaged together to generate a classification for that window 906. The classifications generated for the respective windows 906 may then be averaged together to generate classification 910. It is noted that other techniques may be utilized to determine the classification 910 based on the analysis of the sampling points 908 of the windows 906. The classification 910 is provided to an alert generator 912.
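The window-splitting, point-sampling, and averaging flow described above can be sketched as follows. The 4x4x4 split, the 50,000-point sample, and mean aggregation follow the description; the NumPy representation of the scanned volume and the model callable (assumed to return per-point class probabilities) are assumptions of this example.

import numpy as np

def split_into_windows(volume, splits=4):
    """Cut a 3D volume into splits**3 sub-volumes along X, Y, and Z."""
    windows = []
    for x_block in np.array_split(volume, splits, axis=0):
        for y_block in np.array_split(x_block, splits, axis=1):
            windows.extend(np.array_split(y_block, splits, axis=2))
    return windows  # 64 windows when splits=4

def sample_points(window, n_points=50_000, rng=None):
    """Sample up to n_points non-empty voxels (coordinates + intensity)."""
    rng = rng or np.random.default_rng()
    coords = np.argwhere(window > 0)  # voxels with non-zero density
    if len(coords) == 0:
        return None
    idx = rng.choice(len(coords), size=min(n_points, len(coords)), replace=False)
    chosen = coords[idx]
    intensities = window[tuple(chosen.T)].reshape(-1, 1)
    return np.hstack([chosen, intensities]).astype(np.float32)

def classify_bag(volume, model):
    """Average per-point probabilities within each window, then across windows."""
    window_probs = []
    for window in split_into_windows(volume):
        points = sample_points(window)
        if points is not None:
            window_probs.append(model(points).mean(axis=0))  # per-window average
    return np.mean(window_probs, axis=0)  # final classification over all windows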
The alert generator 912 may be configured to generate an alert in response to the classification 910 indicating that a probability satisfies a threshold condition (e.g., an equal-to condition, a greater-than condition, a less-than condition, etc.). If it is determined that the probability meets the threshold condition (e.g., meets or exceeds a predetermined threshold of 90%), an alert 914 may be generated. The alert 914 may be provided to one or more computing devices, displayed via a Graphical User Interface (GUI) of such computing device(s), and/or played back via the computing device(s). For example, alert 914 may include an audio signal played back on a speaker coupled to such computing device(s), activation of one or more light sources (e.g., light bulbs, Light Emitting Diodes (LEDs), etc.), a Short Message Service (SMS) message or email message sent to a user's mobile device, or a telephone call placed to the user's mobile device, etc. Examples of such computing device(s) include, but are not limited to, any type of fixed or mobile computing device, including mobile computers or mobile computing devices (e.g., laptop computers, notebook computers, tablet computers such as the Apple iPad™, netbooks, etc.), wearable computing devices (e.g., head-mounted devices including smart glasses such as Google® Glass™, etc.), or fixed computing devices such as a desktop computer or PC (personal computer).
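A compact sketch of the alert path is given below; the notify callable stands in for whichever of the channels above (GUI, SMS, email, audio, lights) a deployment wires up, and the function names and example values are assumptions for illustration.

def generate_alerts(classification, thresholds, notify=print):
    """Emit one alert per item-of-interest category whose probability meets its threshold."""
    for category, probability in classification.items():
        if probability >= thresholds.get(category, 0.9):
            notify(f"ALERT: possible {category} detected (p={probability:.2f})")

# Example: only the ivory probability clears its threshold, so one alert fires.
generate_alerts({"firearm": 0.07, "ivory": 0.93},
                {"firearm": 0.30, "ivory": 0.85})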
Thus, a machine learning model may be utilized to detect and classify items of interest in a number of ways. For example, FIG. 10 shows a flowchart 1000 of a method for detecting and classifying an item of interest via a machine learning model, according to an example embodiment. In an embodiment, flowchart 1000 may be implemented by system 900 of FIG. 9. Accordingly, flowchart 1000 will be described with continued reference to FIG. 9. Other structural and operational embodiments will be apparent to those skilled in the relevant art(s) based on the discussion of flowchart 1000 and system 900 of FIG. 9.
Flowchart 1000 begins at step 1002. In step 1002, a first three-dimensional image depicting a container for storing items is received. For example, referring to FIG. 9, cropper 922 receives the three-dimensional image 902 generated by CT scanner 905. The three-dimensional image 902 depicts a container (e.g., luggage) for storing items.
In step 1004, the first three-dimensional image is segmented into a plurality of segmentation windows. For example, referring to FIG. 9, cropper 922 segments the three-dimensional image 902 into a plurality of segmentation windows 906.
In step 1006, a predetermined number of voxels are sampled from each of a plurality of segment windows. For example, referring to fig. 9, the point sampler 924 samples a predetermined number of voxels from each of the windows 906.
In step 1008, voxels sampled from each of the plurality of segment windows are provided as input to a machine learning model configured to generate classifications for the provided voxels, each classification including a probability as to whether the corresponding voxel includes at least a portion of the item of interest. For example, referring to FIG. 9, machine learning model 904 receives the sampled voxels 908 as input. The machine learning model 904 is configured to generate a classification for the sampled voxels 908 of each window 906. Each classification includes a probability as to whether the corresponding voxel from the sampled voxels 908 includes at least a portion of the item of interest.
According to one or more embodiments, the machine learning model is an artificial neural network-based machine learning model. For example, referring to fig. 9, the machine learning model 904 is an artificial neural network-based machine learning model.
In step 1010, a final classification is output as to whether the first three-dimensional image includes an item of interest based on the generated classifications. For example, referring to FIG. 9, the machine learning model 904 outputs a final classification 910 as to whether the three-dimensional image 902 includes an item of interest. The final classification 910 is provided to alert generator 912.
In step 1012, it is determined that the final classification satisfies a threshold condition. For example, referring to fig. 9, the alert generator 912 determines that the final classification 910 meets the threshold condition.
In step 1014, an alert is generated that the item of interest has been detected in the container. For example, referring to fig. 9, alert generator 912 generates an alert 914 indicating that an item of interest has been detected in the container (i.e., luggage).
Example Computer System Implementation
The systems and methods described above with reference to FIGS. 1-10 may be implemented in hardware, or in hardware combined with one or both of software and/or firmware. For example, the system 1100 of FIG. 11 may be used to implement any of the synthetic training data generator 102, the machine learning model 104, the autoencoder 200, the preprocessor 306, the synthetic training data generator 302, the image selector 318, the 3D image projector 320, the cropper 322, the point sampler 324, the machine learning model 304, the performance analyzer 316, the 3D image projector 600, the transformer 602, the scaler 604, the rotator 606, the transformed item inserter 614, the position determiner 612, the cropper 922, the point sampler 924, and/or the machine learning model 904, and/or any of the components respectively described therein, and/or flowcharts 400, 500, 700, 800, and/or 1000, which may be implemented as computer program code/instructions configured to be executed by one or more processors and stored in a computer-readable storage medium. Alternatively, any of the synthetic training data generator 102, the machine learning model 104, the autoencoder 200, the preprocessor 306, the synthetic training data generator 302, the image selector 318, the 3D image projector 320, the cropper 322, the point sampler 324, the machine learning model 304, the performance analyzer 316, the 3D image projector 600, the transformer 602, the scaler 604, the rotator 606, the transformed item inserter 614, the position determiner 612, the cropper 922, the point sampler 924, and/or the machine learning model 904, and/or any of the components respectively described therein, and/or flowcharts 400, 500, 700, 800, and/or 1000 may be implemented in one or more SoCs (systems on chip). An SoC may include an integrated circuit chip that includes one or more of a processor (e.g., a Central Processing Unit (CPU), microcontroller, microprocessor, Digital Signal Processor (DSP), etc.), memory, one or more communication interfaces, and/or further circuits, and may optionally execute received program code and/or include embedded firmware to perform functions. The description of system 1100 provided herein is provided for purposes of illustration and is not intended to be limiting. Embodiments may be implemented in further types of computer systems, as known to persons skilled in the relevant art(s).
As shown in fig. 11, system 1100 includes a processing unit 1102, a system memory 1104, and a bus 1106 that couples various system components including system memory 1104 to processing unit 1102. The processing unit 1102 may include one or more circuits, microprocessors, or microprocessor cores. Bus 1106 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. The system memory 1104 includes Read Only Memory (ROM) 1108 and Random Access Memory (RAM) 1110. A basic input/output system 1112 (BIOS) is stored in ROM 1108.
The system 1100 also has one or more of the following drivers: a hard disk drive 1114 for reading from and writing to a hard disk, a magnetic disk drive 1116 for reading from or writing to a removable magnetic disk 1118, and an optical disk drive 1120 for reading from or writing to a removable optical disk 1122 (such as a CD ROM, DVD ROM, BLU-RAY TM disk, or other optical media). The hard disk drive 1114, magnetic disk drive 1116 and optical disk drive 1120 can be connected to the bus 1106 by a hard disk drive interface 1124, a magnetic disk drive interface 1126 and an optical drive interface 1128, respectively. The drives and their associated computer-readable media provide nonvolatile storage of computer-readable instructions, data structures, program modules and other data for the computer. Although a hard disk, a removable magnetic disk, and a removable optical disk are described, other types of computer readable memory devices and storage structures can be used to store data, such as solid state drives, flash memory cards, digital video disks, random Access Memories (RAMs), read Only Memories (ROMs), and the like.
A number of program modules may be stored on the hard disk, magnetic disk, optical disk, ROM, or RAM. These program modules include an operating system 1130, one or more application programs 1132, other program modules 1134, and program data 1136. According to various embodiments, the program modules may include computer program logic that is executable by the processing unit 1102 to perform any or all of the functions and features of the synthetic training data generator 102, the machine learning model 104, the autoencoder 200, the preprocessor 306, the synthetic training data generator 302, the image selector 318, the 3D image projector 320, the cropper 322, the point sampler 324, the machine learning model 304, the performance analyzer 316, the 3D image projector 600, the transformer 602, the scaler 604, the rotator 606, the transformed item inserter 614, the position determiner 612, the cropper 922, the point sampler 924, and/or the machine learning model 904, and/or any of the components respectively described therein, as described above. The program modules may also include computer program logic that, when executed by the processing unit 1102, causes the processing unit 1102 to perform any of the steps of flowcharts 400, 500, 700, 800, and/or 1000 of FIGS. 4, 5, 7, 8, and/or 10, as described above.
A user may enter commands and information into the system 1100 through input devices such as a keyboard 1138 and pointing device 1140 (e.g., a mouse). Other input devices (not shown) may include a microphone, joystick, game controller, scanner, or the like. In one embodiment, a touch screen is provided in conjunction with display 1144 to allow a user to provide user input via application of a touch (e.g., by a finger or stylus) to one or more points on the touch screen. These and other input devices are often connected to the processing unit 1102 through a serial port interface 1142 that is coupled to the bus 1106, but may be connected by other interfaces, such as a parallel port, game port or a Universal Serial Bus (USB). Such an interface may be a wired or wireless interface.
A display 1144 is connected to bus 1106 via an interface, such as a video adapter 1146. In addition to the display 1144, the system 1100 may include other peripheral output devices (not shown), such as speakers and printers.
The system 1100 is connected to a network 1148 (e.g., a local or wide area network such as the internet) through a network interface 1150, a modem 1152, or other suitable means for establishing communications over the network. The modem 1152, which can be internal or external, is connected to the bus 1106 via the serial port interface 1142.
As used herein, the terms "computer program medium," "computer-readable medium," and "computer-readable storage medium" are used to generally refer to memory devices or storage structures, such as the hard disk associated with hard disk drive 1114, removable magnetic disk 1118, removable optical disk 1122, and other memory devices or storage structures (such as flash memory cards, digital video disks, Random Access Memories (RAMs), Read-Only Memories (ROMs), and the like). Such computer-readable storage media are distinguished from, and do not overlap with (do not include), communication media and propagated signals. Communication media typically embodies computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wireless media such as acoustic, RF, infrared, and other wireless media. Embodiments are also directed to such communication media, which are separate from and non-overlapping with embodiments directed to computer-readable storage media.
As mentioned above, computer programs and modules (including application programs 1132 and other program modules 1134) may be stored on the hard disk, magnetic disk, optical disk, ROM, or RAM. Such computer programs may also be received via network interface 1150, serial port interface 1142, or any other interface type. Such computer programs, when executed or loaded by an application, enable the system 1100 to implement features of embodiments discussed herein. Such computer programs are, therefore, representative of the controllers of system 1100.
Embodiments also relate to computer program products that include software stored on any computer usable medium. Such software, when executed in one or more data processing devices, causes the data processing device(s) to operate as described herein. Embodiments may employ any computer-usable or computer-readable medium known now or in the future. Examples of computer readable media include, but are not limited to, memory devices and storage structures such as RAM, hard drives, solid state drives, floppy disks, CD ROMs, DVD ROMs, compact disks, magnetic tapes, magnetic storage devices, optical storage devices, MEM, nanotechnology-based storage devices, and the like.
Other Example Embodiments
A system for detecting an item of interest in a container is described herein. The system includes at least one processor circuit; and at least one memory storing program code configured to be executed by the at least one processor circuit, the program code comprising: a cropper configured to: receive a first three-dimensional image depicting a container for storing items; and segment the first three-dimensional image into a plurality of segmentation windows; a point sampler configured to: sample a predetermined number of voxels from each of the plurality of segmentation windows; and provide the voxels sampled from each of the plurality of segmentation windows as input to a machine learning model, the machine learning model configured to generate classifications for the provided voxels, each classification including a probability as to whether the respective voxel includes at least a portion of the item of interest, the machine learning model configured to output a final classification as to whether the first three-dimensional image includes the item of interest based on the generated classifications; and an alert generator configured to: determine that the final classification meets a threshold condition; and generate an alert indicating that the item of interest has been detected in the container in response to a determination that the final classification meets the threshold condition.
In an implementation of the system, the machine learning model is an artificial neural network-based machine learning model.
In an implementation of the system, the system further comprises: a synthetic training data generator configured to, during each iteration of a training session of the machine learning model: selecting a second three-dimensional image that includes the container and does not include the item of interest; selecting a third three-dimensional image comprising the item of interest; generating a plurality of composite three-dimensional images based on the second three-dimensional image and the third three-dimensional image, each composite three-dimensional image of the plurality of composite three-dimensional images including the item of interest; for each of a plurality of composite three-dimensional images: cropping the composite three-dimensional image around an object of interest included in the composite three-dimensional image; and sampling a plurality of voxels associated with the cropped composite three-dimensional image; and providing the plurality of voxels sampled from each of the plurality of composite three-dimensional images as a training data set to a machine learning model trained to detect the item of interest based on the plurality of voxels sampled from each of the plurality of composite three-dimensional images.
In an implementation of the system, the synthetic training data generator is configured to generate the plurality of composite three-dimensional images by: for each of a plurality of iterations: transforming the item of interest; and inserting the transformed item of interest into a position within the container of the second three-dimensional image to generate a composite three-dimensional image of the plurality of composite three-dimensional images.
In an implementation of the system, the synthetic training data generator is configured to transform the item of interest by performing at least one of: scaling the item of interest according to a scaling factor; or rotating the item of interest according to a rotation factor.
In an implementation of the system, the synthetic training data generator is configured to increase an amount of transformation when transforming the item of interest as the classification performance score of the machine learning model increases.
In an implementation of the system, the system further comprises a performance analyzer configured to determine an average classification performance score of the machine learning model, the average classification performance score based on an average of a plurality of classification performance scores, each classification performance score of the plurality of classification performance scores indicative of classification performance of the machine learning model relative to a particular item of interest of the plurality of items of interest, and wherein the synthetic training data generator is configured to: selecting a fourth three-dimensional image that includes another container and does not include the item of interest with a probability corresponding to the average classification performance score; and generating a plurality of second composite three-dimensional images based on the fourth three-dimensional image and the third three-dimensional image.
In an implementation of the system, the system further comprises a performance analyzer configured to determine a classification performance score of the machine learning model, and wherein the synthetic training data generator is configured to: the third three-dimensional image is selected with a probability proportional to the classification performance score.
A method for detecting an item of interest in a container is also described herein. The method comprises the following steps: receiving a first three-dimensional image depicting a container for storing items; segmenting the first three-dimensional image into a plurality of segmentation windows; sampling a predetermined number of voxels from each of a plurality of segment windows; providing voxels sampled from each of the plurality of segment windows as input to a machine learning model configured to generate classifications for the provided voxels, each classification including a probability as to whether the respective voxel includes at least a portion of the item of interest; outputting a final classification as to whether the first three-dimensional image includes the item of interest based on the generated classification; determining that the final classification meets a threshold condition; and generating an alert indicating that the item of interest has been detected in the container in response to the determination that the final classification meets a threshold condition.
In one implementation of the method, the machine learning model is an artificial neural network-based machine learning model.
In one implementation of the method, the method further comprises: during each iteration of the training session of the machine learning model: selecting a second three-dimensional image that includes the container and does not include the item of interest; selecting a third three-dimensional image comprising the item of interest; generating a plurality of composite three-dimensional images based on the second three-dimensional image and the third three-dimensional image, each composite three-dimensional image of the plurality of composite three-dimensional images including the item of interest; for each of a plurality of composite three-dimensional images: cropping the composite three-dimensional image around an object of interest included in the composite three-dimensional image; and sampling a plurality of voxels associated with the cropped composite three-dimensional image; and providing the plurality of voxels sampled from each of the plurality of composite three-dimensional images as a training data set to a machine learning model trained to detect the item of interest based on the plurality of voxels sampled from each of the plurality of composite three-dimensional images.
In one implementation of the method, generating a plurality of composite three-dimensional images includes: for each of a plurality of iterations: transforming the item of interest; and inserting the transformed item of interest into a position within the container of the second three-dimensional image to generate a composite three-dimensional image of the plurality of composite three-dimensional images.
In one implementation of the method, transforming the item of interest includes at least one of: scaling the item of interest according to a scaling factor; or rotating the item of interest according to a rotation factor.
In one implementation of the method, the amount of transformation when transforming the item of interest is increased as the classification performance score of the machine learning model increases.
In one implementation of the method, the method further comprises: determining an average classification performance score for the machine learning model, the average classification performance score based on an average of a plurality of classification performance scores, each classification performance score of the plurality of classification performance scores indicating a classification performance of the machine learning model relative to a particular item of interest of the plurality of items of interest; selecting a fourth three-dimensional image that includes another container and does not include the item of interest with a probability corresponding to the average classification performance score; and generating a plurality of second composite three-dimensional images based on the fourth three-dimensional image and the third three-dimensional image.
In one implementation of the method, selecting the third three-dimensional image comprising the item of interest comprises: determining a classification performance score of the machine learning model; and selecting the third three-dimensional image with a probability proportional to the classification performance score.
A computer readable storage medium having program instructions recorded thereon, which when executed by a processor of a computing device, perform a method for detecting an item of interest in a container. The method comprises the following steps: receiving a first three-dimensional image depicting a container for storing items; segmenting the first three-dimensional image into a plurality of segmentation windows; sampling a predetermined number of voxels from each of a plurality of segment windows; providing voxels sampled from each of the plurality of segment windows as input to a machine learning model configured to generate classifications for the provided voxels, each classification including a probability as to whether the respective voxel includes at least a portion of the item of interest; outputting a final classification as to whether the first three-dimensional image includes the item of interest based on the generated classification; determining that the final classification meets a threshold condition; and generating an alert indicating that the item of interest has been detected in the container in response to the determination that the final classification meets a threshold condition.
In an implementation of the computer-readable storage medium, the machine learning model is an artificial neural network-based machine learning model.
In an implementation of the computer readable storage medium, the method further comprises: during each iteration of the training session of the machine learning model: selecting a second three-dimensional image that includes the container and does not include the item of interest; selecting a third three-dimensional image comprising the item of interest; generating a plurality of composite three-dimensional images based on the second three-dimensional image and the third three-dimensional image, each composite three-dimensional image of the plurality of composite three-dimensional images including the item of interest; for each of a plurality of composite three-dimensional images: cropping the composite three-dimensional image around an object of interest included in the composite three-dimensional image; and sampling a plurality of voxels associated with the cropped composite three-dimensional image; and providing the plurality of voxels sampled from each of the plurality of composite three-dimensional images as a training data set to a machine learning model trained to detect the item of interest based on the plurality of voxels sampled from each of the plurality of composite three-dimensional images.
In an implementation of the computer-readable storage medium, generating a plurality of composite three-dimensional images includes: for each of a plurality of iterations: transforming the item of interest; and inserting the transformed item of interest into a position within the container of the second three-dimensional image to generate a composite three-dimensional image of the plurality of composite three-dimensional images.
V. Conclusion
While various exemplary embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be understood by those of ordinary skill in the relevant art(s) that various changes in form and details may be made therein without departing from the spirit and scope of the embodiments as defined by the following claims. Thus, the breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims (15)

1. A system for detecting an item of interest in a container, comprising:
at least one processor circuit; and
At least one memory storing program code configured to be executed by the at least one processor circuit, the program code comprising:
A cropper configured to:
receiving a first three-dimensional image depicting a container for storing items; and
Segmenting the first three-dimensional image into a plurality of segmentation windows;
A point sampler configured to:
sampling a predetermined number of voxels from each of the plurality of segmentation windows; and
Providing the voxels sampled from each of the plurality of segment windows as input to a machine learning model configured to generate classifications for the provided voxels, each classification including a probability as to whether the respective voxel includes at least a portion of the item of interest, the machine learning model configured to output a final classification as to whether the first three-dimensional image includes the item of interest based on the generated classifications; and
An alert generator configured to:
Determining that the final classification meets a threshold condition; and
In response to a determination that the final classification meets the threshold condition, an alert is generated indicating that the item of interest has been detected in the container.
2. The system of claim 1, wherein the machine learning model is an artificial neural network-based machine learning model.
3. The system of claim 1, further comprising:
A synthetic training data generator configured to, during each iteration of a training session for the machine learning model:
selecting a second three-dimensional image that includes the container and does not include the item of interest;
selecting a third three-dimensional image comprising the item of interest;
Generating a plurality of composite three-dimensional images based on the second three-dimensional image and the third three-dimensional image, each composite three-dimensional image of the plurality of composite three-dimensional images including the item of interest;
For each of the plurality of composite three-dimensional images:
Cropping the composite three-dimensional image around the item of interest included in the composite three-dimensional image; and
Sampling a plurality of voxels associated with the cropped composite three-dimensional image; and
Providing the plurality of voxels sampled from each of the plurality of composite three-dimensional images as a training data set to the machine learning model, the machine learning model being trained to detect the item of interest based on the plurality of voxels sampled from each of the plurality of composite three-dimensional images.
4. The system of claim 3, wherein the synthetic training data generator is configured to generate the plurality of composite three-dimensional images by:
For each of a plurality of iterations:
Transforming the item of interest; and
Inserting the transformed item of interest into a position within the container of the second three-dimensional image to generate a composite three-dimensional image of the plurality of composite three-dimensional images.
5. The system of claim 4, wherein the synthetic training data generator is configured to transform the item of interest by performing at least one of:
scaling the item of interest according to a scaling factor; or
Rotating the item of interest according to a rotation factor.
6. The system of claim 4, wherein the synthetic training data generator is configured to increase an amount of transformation in transforming the item of interest when a classification performance score of the machine learning model is increased.
7. The system of claim 3, further comprising:
A performance analyzer configured to determine an average classification performance score of the machine learning model, the average classification performance score based on an average of a plurality of classification performance scores, each classification performance score of the plurality of classification performance scores indicative of the classification performance of the machine learning model relative to a particular item of interest of a plurality of items of interest, wherein the synthetic training data generator is configured to:
selecting a fourth three-dimensional image comprising another container and not comprising the item of interest with a probability corresponding to the average classification performance score; and
A plurality of second composite three-dimensional images are generated based on the fourth three-dimensional image and the third three-dimensional image.
8. The system of claim 3, further comprising:
A performance analyzer configured to determine a classification performance score of the machine learning model, and wherein the synthetic training data generator is configured to:
the third three-dimensional image is selected with a probability proportional to the classification performance score.
9. A method for detecting an item of interest in a container, comprising:
receiving a first three-dimensional image depicting a container for storing items;
Segmenting the first three-dimensional image into a plurality of segmentation windows;
Sampling a predetermined number of voxels from each of the plurality of segmentation windows;
Providing the voxels sampled from each of the plurality of segment windows as input to a machine learning model configured to generate classifications for the provided voxels, each classification including a probability as to whether the respective voxel includes at least a portion of the item of interest;
Outputting a final classification as to whether the first three-dimensional image includes the item of interest based on the generated classification;
Determining that the final classification meets a threshold condition; and
In response to the determination that the final classification meets the threshold condition, an alert is generated indicating that the item of interest has been detected in the container.
10. The method of claim 9, wherein the machine learning model is an artificial neural network-based machine learning model.
11. The method of claim 9, further comprising:
during each iteration of a training session for the machine learning model:
selecting a second three-dimensional image that includes the container and does not include the item of interest;
selecting a third three-dimensional image comprising the item of interest;
Generating a plurality of composite three-dimensional images based on the second three-dimensional image and the third three-dimensional image, each composite three-dimensional image of the plurality of composite three-dimensional images including the item of interest;
For each of the plurality of composite three-dimensional images:
Cropping the composite three-dimensional image around the item of interest included in the composite three-dimensional image; and
Sampling a plurality of voxels associated with the cropped composite three-dimensional image; and
Providing the plurality of voxels sampled from each of the plurality of composite three-dimensional images as a training data set to the machine learning model, the machine learning model being trained to detect the item of interest based on the plurality of voxels sampled from each of the plurality of composite three-dimensional images.
12. The method of claim 11, wherein generating the plurality of composite three-dimensional images comprises:
For each of a plurality of iterations:
Transforming the item of interest; and
Inserting the transformed item of interest into a position within the container of the second three-dimensional image to generate a composite three-dimensional image of the plurality of composite three-dimensional images.
13. The method of claim 12, wherein transforming the item of interest comprises at least one of:
scaling the item of interest according to a scaling factor; or
Rotating the item of interest according to a rotation factor.
14. The method of claim 12, wherein an amount of transformation in transforming the item of interest increases as a classification performance score of the machine learning model is increased.
15. A computer readable storage medium having program instructions recorded thereon, which when executed by at least one processor, perform the method according to any of claims 9 to 14.