TWI773797B - System, method and computer program product for tracking multi-joint subjects in an area of real space - Google Patents

System, method and computer program product for tracking multi-joint subjects in an area of real space

Info

Publication number
TWI773797B
TWI773797B (application TW107126341A)
Authority
TW
Taiwan
Prior art keywords
joint
images
image
real space
joints
Prior art date
Application number
TW107126341A
Other languages
Chinese (zh)
Other versions
TW201911119A (en)
Inventor
喬丹 費雪
丹尼爾 菲奇帝
布蘭登 歐格
約翰 諾瓦克
凱爾 多曼
肯尼士 木原
喬安 拉席拉
大衛 瓦德曼
Original Assignee
美商標準認知公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US15/847,796 (US10055853B1)
Priority claimed from US15/907,112 (US10133933B1)
Priority claimed from US15/945,466 (US10127438B1)
Priority claimed from US15/945,473 (US10474988B2)
Application filed by 美商標準認知公司
Publication of TW201911119A
Application granted granted Critical
Publication of TWI773797B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00Payment architectures, schemes or protocols
    • G06Q20/08Payment architectures
    • G06Q20/20Point-of-sale [POS] network systems
    • G06Q20/203Inventory monitoring
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24143Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/08Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q10/087Inventory or stock management, e.g. order filling, procurement or balancing against orders
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00Payment architectures, schemes or protocols
    • G06Q20/08Payment architectures
    • G06Q20/20Point-of-sale [POS] network systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00Payment architectures, schemes or protocols
    • G06Q20/08Payment architectures
    • G06Q20/20Point-of-sale [POS] network systems
    • G06Q20/208Input by product or record sensing, e.g. weighing or scanner processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G07CHECKING-DEVICES
    • G07GREGISTERING THE RECEIPT OF CASH, VALUABLES, OR TOKENS
    • G07G1/00Cash registers
    • G07G1/0036Checkout procedures
    • G07G1/0045Checkout procedures with a code reader for reading of an identifying code of the article to be registered, e.g. barcode reader or radio-frequency identity [RFID] reader
    • G07G1/0054Checkout procedures with a code reader for reading of an identifying code of the article to be registered, e.g. barcode reader or radio-frequency identity [RFID] reader with control of supplementary check-parameters, e.g. weight or number of articles
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N17/00Diagnosis, testing or measuring for television systems or their details
    • H04N17/002Diagnosis, testing or measuring for television systems or their details for television cameras
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/90Arrangement of cameras or camera modules, e.g. multiple cameras in TV studios or sports stadiums

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Evolutionary Computation (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Economics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • General Engineering & Computer Science (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Game Theory and Decision Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Resources & Organizations (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)

Abstract

Systems and techniques are provided for tracking puts and takes of inventory items by subjects in an area of real space. A plurality of cameras with overlapping fields of view produce respective sequences of images of corresponding fields of view in the real space. In one embodiment, the system includes first image processors, including subject image recognition engines, receiving corresponding sequences of images from the plurality of cameras. The first image processors process images to identify subjects represented in the images in the corresponding sequences of images. The system includes second image processors, including background image recognition engines, receiving corresponding sequences of images from the plurality of cameras. The second image processors mask the identified subjects to generate masked images. Following this, the second image processors process the masked images to identify and classify background changes represented in the images in the corresponding sequences of images.

Description

System, method and computer program product for tracking multi-joint subjects in an area of real space

The present invention relates to a system, and components thereof, that can be used for cashier-less checkout.

A difficult problem in image processing arises when images from many cameras deployed over a large space are used to identify and track the actions of subjects.

Tracking the actions of subjects, such as people in a shopping store, within an area of real space presents many technical challenges. For example, consider such an image processing system deployed in a shopping store, in which many customers move through the aisles between the shelves and in the open spaces of the store. Customers take items from the shelves and put those items into their respective shopping carts or baskets. Customers may also put items back on a shelf if they do not want them.

While customers are performing these actions, different parts of the customers and different parts of the shelves or other display structures holding the store's inventory will be occluded in the images from different cameras, because of the presence of other customers, shelves, product displays and so on. Also, there can be many customers in the store at any given time, making it difficult to identify and track individuals and their actions over time.

It is desirable to provide a system that can more effectively and automatically identify and track the puts and takes of subjects in large spaces, and that performs other processes supporting complex interactions of subjects with their environment, including functions such as cashier-less checkout.

A system, and a method for operating the system, are provided that track changes made by subjects, such as people, in an area of real space, and other complex interactions of the subjects with their environment, using image processing. This function of tracking changes by image processing presents complex problems of computer engineering, relating to the type of image data to be processed, what processing of the image data should be performed, and how actions can be determined from the image data with high reliability. The system described herein can use only images from cameras placed overhead in the real space, so that retrofitting of store shelves and floor space with sensors and the like is not required for deployment in a given setting.

A system and method are provided for tracking puts and takes of inventory items by subjects in an area of real space including inventory display structures, comprising using a plurality of cameras disposed above the inventory display structures to produce respective sequences of images of the inventory display structures in corresponding fields of view in the real space, the field of view of each camera overlapping with the field of view of at least one other camera of the plurality of cameras. Using these sequences of images, a system and method are described for detecting puts and takes of inventory items by semantically identifying significant changes in the sequences of images that relate to inventory items on the inventory display structures, and associating those semantically significant changes with subjects represented in the sequences of images.

A system and method are provided for tracking puts and takes of inventory items by subjects in an area of real space, comprising using a plurality of cameras disposed above inventory display structures to produce respective sequences of images of the inventory display structures in corresponding fields of view in the real space, the field of view of each camera overlapping with the field of view of at least one other camera of the plurality of cameras. Using these sequences of images, a system and method are described for detecting puts and takes of inventory items by processing foreground data in the sequences of images to identify gestures of the subjects and inventory items associated with the gestures.

Also, a system and method are described that combine foreground processing and background processing of the same sequences of images. In this combined approach, the system and method provided include using the sequences of images to detect puts and takes of inventory items by processing foreground data in the sequences of images to identify gestures of the subjects and inventory items associated with the gestures; and using the sequences of images to detect puts and takes of inventory items by semantically identifying significant changes in the sequences of images that relate to inventory items on the inventory display structures, and associating those semantically significant changes with subjects represented in the sequences of images.

In the embodiments described herein, the system uses a plurality of cameras to produce respective sequences of images of corresponding fields of view in the real space. The field of view of each camera overlaps with the field of view of at least one other camera of the plurality of cameras. The system includes first image processors, including subject image recognition engines, receiving corresponding sequences of images from the plurality of cameras. The first image processors process images to identify subjects represented in the images in the corresponding sequences of images. The system further includes second image processors, including background image recognition engines, receiving corresponding sequences of images from the plurality of cameras. The second image processors mask the identified subjects to generate masked images, and process the masked images to identify and classify background changes represented in the images in the corresponding sequences of images.

In one embodiment, the background image recognition engines comprise convolutional neural networks. The system includes logic to associate identified background changes with identified subjects.

In one embodiment, the second image processors include background image stores that store background images for the corresponding sequences of images. The second image processors further include mask logic to process images in the sequences of images to replace foreground image data representing the identified subjects with background image data. The background image data is collected from the background images of the corresponding sequences of images to provide the masked images.
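
As an illustrative sketch of this mask logic, foreground pixels belonging to identified subjects can be overwritten with pixels taken from the stored background image. The function shape, the use of per-subject boolean pixel masks, and the NumPy representation are assumptions for illustration, not details taken from the specification.

```python
import numpy as np

def mask_subjects(image: np.ndarray,
                  subject_masks: list,
                  background_image: np.ndarray) -> np.ndarray:
    """Replace foreground pixels of identified subjects with stored background pixels.

    image            -- current frame, shape (H, W, 3)
    subject_masks    -- one boolean array of shape (H, W) per identified subject
    background_image -- stored background image for this camera, shape (H, W, 3)
    """
    masked = image.copy()
    for subject_mask in subject_masks:
        # Wherever an identified subject occludes the scene, substitute background data.
        masked[subject_mask] = background_image[subject_mask]
    return masked
```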

In one embodiment, the mask logic combines sets of N masked images in the sequences of images to generate sequences of factored images for each camera. The second image processors identify and classify background changes by processing the sequences of factored images.
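
A minimal sketch of producing a factored image from a set of N masked images follows. The specification does not state the combination operator at this point; pixel-wise averaging over the N masked frames is used here purely as an assumed example of suppressing transient noise so that persistent shelf changes stand out.

```python
import numpy as np

def factored_image(masked_images: list) -> np.ndarray:
    """Combine a set of N masked images from one camera into a single factored image."""
    stack = np.stack(masked_images).astype(np.float32)  # shape (N, H, W, 3)
    return stack.mean(axis=0).astype(np.uint8)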

In one embodiment, the second image processors include logic to produce change data structures for the corresponding sequences of images. The change data structures include coordinates in the masked images of identified background changes, identifiers of the inventory items that are the subject of the identified background changes, and classifications of the identified background changes. The second image processors further include coordination logic to process change data structures from sets of cameras having overlapping fields of view to locate the identified background changes in real space.
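
A change data structure of the kind described could be represented as below; the field names and types are illustrative assumptions chosen to mirror the listed elements (coordinates in the masked image, an inventory item identifier, and a change classification).

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ChangeDataStructure:
    camera_id: int
    frame_id: int
    bounding_box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in the masked image
    inventory_item_id: str                   # identifier of the inventory item subject to the change
    change_class: str                        # e.g. "item_added" or "item_removed"
```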

In one embodiment, the classifications of identified background changes in the change data structures indicate whether an identified inventory item has been added or removed relative to the background image.

In another embodiment, the classifications of identified background changes in the change data structures indicate whether an identified inventory item has been added or removed relative to the background image. The system further includes logic to associate the background changes with identified subjects. Finally, the system includes logic that makes detections of takes of inventory items by the identified subjects and detections of puts of inventory items on inventory display structures by the identified subjects.

In another embodiment, the system includes logic to associate background changes with identified subjects. The system further includes logic that makes detections of takes of inventory items by the identified subjects and detections of puts of inventory items on inventory display structures by the identified subjects.

The system can include third image processors as described herein, including foreground image recognition engines, receiving corresponding sequences of images from the plurality of cameras. The third image processors process images to identify and classify foreground changes represented in the images in the corresponding sequences of images.

A system, and a method for operating the system, are provided for tracking multi-joint subjects, such as people, in real space. The system uses a plurality of cameras to produce respective sequences of images of corresponding fields of view in the real space. The field of view of each camera overlaps with the field of view of at least one other camera of the plurality of cameras. The system processes images in the sequences of images to generate arrays of joint data structures corresponding to each image. An array of joint data structures corresponding to a particular image classifies elements of the particular image by joint type, by the time of the particular image, and by the coordinates of the elements in the particular image. The system then translates the coordinates of the elements in the arrays of joint data structures corresponding to images in different sequences into candidate joints having coordinates in the real space. Finally, the system identifies clusters of candidate joints, where the clusters comprise respective sets of candidate joints having coordinates in real space, as multi-joint subjects in the real space.
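
A joint data structure of the kind described, classifying an element of a particular image by joint type, image time, and image coordinates, could be sketched as follows; the field names and the example joint type names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class JointDataStructure:
    joint_type: str       # e.g. "left_wrist", "neck" (names are assumptions)
    x: int                # column of the element in the particular image
    y: int                # row of the element in the particular image
    frame_time: float     # time of the particular image
    confidence: float     # confidence of the joint type classification
    camera_id: int        # camera whose image the element came from
```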

In one embodiment, the image recognition engines comprise convolutional neural networks. Processing of an image by an image recognition engine includes generating confidence arrays for elements of the image. A confidence array for a particular element of the image includes confidence values for a plurality of joint types for that particular element. The confidence arrays are used to select the joint type for the joint data structure of the particular element according to the confidence array.
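
Concretely, if each image element carries one confidence value per joint type, the joint type recorded in its joint data structure can simply be the highest-scoring entry. The sketch below assumes that reading; the particular joint type names are illustrative, not taken from the specification.

```python
import numpy as np

JOINT_TYPES = ["neck", "left_shoulder", "right_shoulder",
               "left_wrist", "right_wrist", "left_ankle", "right_ankle",
               "not_a_joint"]

def select_joint_type(confidence_array: np.ndarray) -> str:
    """Pick the joint type for one image element from its confidence array.

    confidence_array -- one confidence value per entry in JOINT_TYPES.
    """
    return JOINT_TYPES[int(np.argmax(confidence_array))]
```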

In one embodiment of the system for tracking multi-joint subjects, identifying sets of candidate joints comprises applying heuristic functions, based on physical relationships among the joints of subjects in real space, to identify sets of candidate joints as multi-joint subjects. The processing includes storing the sets of joints identified as multi-joint subjects. Identifying sets of candidate joints includes determining whether a candidate joint identified in an image taken at a particular time corresponds to a member of one of the sets of candidate joints identified as multi-joint subjects from preceding images.
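
The heuristic functions are not enumerated at this point in the text. The sketch below assumes one plausible instance, attributing a candidate joint in the current frame to the previously identified subject whose corresponding joint is nearest in real space, subject to a maximum plausible displacement; both the rule and the threshold are assumptions for illustration.

```python
import math

MAX_DISPLACEMENT_M = 0.5  # assumed: how far a joint can plausibly move between frames

def assign_to_subject(candidate_xyz, joint_type, subjects):
    """Attach a candidate joint to an existing subject, or report no match.

    candidate_xyz -- (x, y, z) of the candidate joint in real space
    joint_type    -- e.g. "left_wrist"
    subjects      -- list of dicts mapping joint_type -> (x, y, z) from previous frames
    """
    best_subject, best_distance = None, MAX_DISPLACEMENT_M
    for subject in subjects:
        previous = subject.get(joint_type)
        if previous is None:
            continue
        distance = math.dist(candidate_xyz, previous)
        if distance < best_distance:
            best_subject, best_distance = subject, distance
    return best_subject  # None suggests the joint may belong to a new subject
```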

In one embodiment, the sequences of images are synchronized, so that the images in each of the sequences of images captured by the plurality of cameras represent the real space at a single point in time on the time scale of the subjects' movement through the space.

The coordinates in real space of the members of a set of candidate joints identified as a multi-joint subject identify locations of the multi-joint subject in the area. In some embodiments, the processing includes simultaneous tracking of the locations of a plurality of multi-joint subjects in the area of real space. In some embodiments, the processing includes determining when one of the plurality of multi-joint subjects leaves the area of real space. In some embodiments, the processing includes determining the direction in which the multi-joint subject is facing at a given point in time. In the embodiments described herein, the system uses a plurality of cameras to produce respective sequences of images of corresponding fields of view in the real space. The field of view of each camera overlaps with the field of view of at least one other camera of the plurality of cameras. The system processes images in the sequences of images received from the plurality of cameras to identify subjects represented in the images and to produce classifications of the identified subjects. Finally, the system processes the classifications of identified subjects for sets of images in the sequences of images to detect takes of inventory items by identified subjects and puts of inventory items on shelves by identified subjects.

In one embodiment, the classification identifies whether the identified subject is holding an inventory item. The classification also identifies whether a hand of the identified subject is near a shelf, or whether a hand of the identified subject is near the identified subject. The classification of whether the hand is near the identified subject can include whether the hand of the identified subject is near a basket associated with the identified subject, and whether it is near the body of the identified subject.
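
These classifications might be encoded roughly as follows. The enum values and field names are illustrative assumptions that mirror the categories listed above (holding an item, hand near a shelf, near an associated basket, or near the subject's own body).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class HandLocation(Enum):
    NEAR_SHELF = "near_shelf"
    NEAR_BASKET = "near_basket"   # basket associated with the identified subject
    NEAR_BODY = "near_body"       # near the identified subject's own body
    ELSEWHERE = "elsewhere"

@dataclass
class HandClassification:
    subject_id: int
    holding_inventory_item: bool      # whether the hand is holding an item
    inventory_item_id: Optional[str]  # which item, when one is held
    location: HandLocation
```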

By the technology described, images representing the hands of subjects in the fields of view can be processed to produce classifications of the subjects' hands over a plurality of images in a time sequence. The classifications of the hands from the sequences of images can be processed, using convolutional neural networks in some embodiments, to identify actions by the subjects. The actions can be puts and takes of inventory items, as in the embodiments described herein, or other types of actions that can be deciphered by processing images of the hands.

By the technology described, images are processed to identify subjects in the fields of view and to locate the joints of those subjects. The locations of the joints of the subjects can be processed, as described herein, to identify bounding boxes in the corresponding images that include the hands of the subjects. The data within the bounding boxes can be processed to produce classifications of the subjects' hands in the corresponding images. The classifications of the hands of an identified subject, produced in this manner from sequences of images, can be processed to identify actions by the subject.
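
As a minimal sketch of deriving such a bounding box from a located hand (wrist) joint, a fixed-size box centered on the joint and clipped to the image can be used; the box size here is an assumed parameter, not a value from the specification.

```python
def hand_bounding_box(wrist_x: int, wrist_y: int,
                      image_width: int, image_height: int,
                      box_size: int = 64):
    """Return (x_min, y_min, x_max, y_max) of a square box centered on a wrist joint,
    clipped to the image boundaries."""
    half = box_size // 2
    x_min = max(0, wrist_x - half)
    y_min = max(0, wrist_y - half)
    x_max = min(image_width, wrist_x + half)
    y_max = min(image_height, wrist_y + half)
    return x_min, y_min, x_max, y_max
```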

In a system including a plurality of image recognition engines, such as both foreground and background image recognition engines, the system can make a first set of detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects, and a second set of detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects. Selection logic that processes the first and second sets of detections can be used to generate a log data structure. The log data structure includes a list of inventory items for each identified subject.
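
The selection rule itself is not specified in this summary. The sketch below assumes one possible rule: detections that appear in both the foreground-based and background-based sets are accepted with high confidence, detections appearing in only one set are accepted but flagged as unconfirmed, and all accepted detections are accumulated into a per-subject log structure.

```python
from collections import defaultdict

def build_log(foreground_detections, background_detections):
    """Combine two redundant sets of put/take detections into per-subject logs.

    Each detection is a tuple (subject_id, item_id, event), with event in {"take", "put"}.
    The agreement-then-flag rule here is an assumption; the actual selection logic
    of the system may differ.
    """
    both = set(foreground_detections) & set(background_detections)
    either = set(foreground_detections) | set(background_detections)

    logs = defaultdict(lambda: defaultdict(int))  # subject_id -> item_id -> quantity
    unconfirmed = []
    for detection in either:
        subject_id, item_id, event = detection
        logs[subject_id][item_id] += 1 if event == "take" else -1
        if detection not in both:
            unconfirmed.append(detection)
    return logs, unconfirmed
```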

In the embodiments described herein, the sequences of images from the cameras of the plurality of cameras are synchronized. The same cameras and the same sequences of images are used by both the foreground and background image processors in a preferred implementation. As a result, redundant detections of puts and takes of inventory items are performed using the same input data, allowing high confidence, and high accuracy, in the resulting data.

In one technique described herein, the system comprises logic to detect puts and takes of inventory items by identifying gestures of the subjects and inventory items associated with the gestures represented in the sequences of images. This can be accomplished using foreground image recognition engines in coordination with the subject image recognition engines as described herein.

In another technique described herein, the system comprises logic to detect puts and takes of inventory items by semantically identifying significant changes in the inventory items on inventory display structures, such as shelves, over time, and associating those semantically significant changes with subjects represented in the sequences of images. This can be accomplished using background image recognition engines in coordination with the subject image recognition engines as described herein.

In applications of the system described herein, both gesture analysis and semantic difference analysis can be combined and executed on the same sequences of synchronized images from the array of cameras.

Methods and computer program products which can be executed by computer systems are also described herein.

Other aspects and advantages of the present invention can be seen on review of the drawings, the detailed description, and the claims, which follow.

100: System
101a, 101b, 101c: Network nodes
102: Network node
110: Tracking engine
112a-112n: Image recognition engines
114: Cameras
116a: Aisle
116b: Aisle
116n: Aisle
120: Calibrator
140: Subject database
150: Training database
160: Heuristics database
170: Calibration database
181: Network
202: Shelf A
204: Shelf B
206: Camera A
208: Camera B
216: Field of view
218: Field of view
220: Floor
230: Roof
412-418: Cameras
422-428: Ethernet-based connectors
430: Storage subsystem
432: Host memory subsystem
434: Random access memory (RAM)
436: Read only memory (ROM)
440: File storage subsystem
442: RAID 0
444: Solid state drive (SSD)
446: Hard disk drive (HDD)
450: Processor subsystem
454: Bus subsystem
462: GPU 1
464: GPU 2
466: GPU 3
481: Network
510: Input image
520: Filters
530: Convolutional layers
540: Output matrix
702: Global metric calculator
800: Subject data structure
1411: Video processes
1415: Scene process
1452: Output
1453: Input
1457: Output
1502: Circular buffer
1504: Bounding box generator
1506: WhatCNN
1508: WhenCNN
1510: Shopping cart data structure
1520: Image channels
1522: Coordination logic module
2002: Heuristics
2111: Input
2113: First convolutional layer
2115: Second convolutional layer
2117: Box
2119, 2121: Layers
2123: Eight convolutional layers
2125, 2127: Convolutional layers
2129: Eight convolutional layers
2135: Fully connected layer
2210: "Single hand" model
2212: Convolutional layer
2214: Pooling layer
2216: Block 0
2218: Block 1
2220: Block 2
2222: Block 3
2310: Box
2312: Convolutional layer
2314: Batch normalization layer
2316: ReLU non-linearity
2318: conv1
2320: conv2
2322: conv3
2324: Addition operation
2326: Skip connection
2410: Fully connected layer (FC)
2412: Reshape operator
2420: Next layer
2422: MatMul
2424: Operator
2426: Output
2502: First part
2504: Second part
2506: Third part
2508: Fourth part
2510: Fifth part
2512: Sixth part
2602: First image processors subsystem
2604: Second image processors subsystem
2606: Third image processors subsystem
2608: Selection logic component
2702: Mask logic component
2704: Background image store
2706: Factored images
2710: Bit mask calculator
2714a-2714n: ChangeCNN
2718: Coordination logic component
2720: Log generator
2724: Mask generator

Figure 1 illustrates an architectural level schematic of a system in which a tracking engine tracks subjects using joint data generated by image recognition engines.
Figure 2 is a side view of an aisle in a shopping store, illustrating a camera arrangement.
Figure 3 is a top view of the aisle of Figure 2 in the shopping store, illustrating the camera arrangement.
Figure 4 is a camera and computer hardware arrangement configured to host the image recognition engines of Figure 1.
Figure 5 illustrates a convolutional neural network illustrating identification of joints in the image recognition engines of Figure 1.
Figure 6 shows an example data structure for storing joint information.
Figure 7 illustrates the tracking engine of Figure 1 with a global metric calculator.
Figure 8 shows an example data structure for storing a subject, including information about its associated joints.
Figure 9 is a flowchart illustrating process steps for tracking subjects by the system of Figure 1.
Figure 10 is a flowchart showing more detailed process steps of the camera calibration step of Figure 9.
Figure 11 is a flowchart showing more detailed process steps of the video process step of Figure 9.
Figure 12A is a flowchart showing a first part of more detailed process steps of the scene process of Figure 9.
Figure 12B is a flowchart showing a second part of more detailed process steps of the scene process of Figure 9.
Figure 13 is an illustration of an environment in which an embodiment of the system of Figure 1 is used.
Figure 14 is an illustration of video and scene processes in an embodiment of the system of Figure 1.
Figure 15A is a schematic showing a pipeline with multiple convolutional neural networks (CNNs), including a joints CNN, a WhatCNN and a WhenCNN, to generate a shopping cart data structure per subject in the real space.
Figure 15B shows multiple image channels from multiple cameras and coordination logic for the subjects and their respective shopping cart data structures.
Figure 16 is a flowchart illustrating process steps for identifying and updating subjects in the real space.
Figure 17 is a flowchart showing process steps for processing hand joints of subjects to identify inventory items.
Figure 18 is a flowchart showing process steps for time series analysis of inventory items per hand joint to create a shopping cart data structure per subject.
Figure 19 is an illustration of a WhatCNN model in an embodiment of the system of Figure 15A.
Figure 20 is an illustration of a WhenCNN model in an embodiment of the system of Figure 15A.
Figure 21 presents an example architecture of a WhatCNN model, identifying dimensions of the convolutional layers.
Figure 22 presents a high-level block diagram of an embodiment of a WhatCNN used for classification of hand images.
Figure 23 presents details of a first block of the high-level block diagram of the WhatCNN model presented in Figure 22.
Figure 24 presents operators in a fully connected layer of the example WhatCNN model presented in Figure 22.
Figure 25 is an example name of an image file stored as part of a training data set for the WhatCNN model.
Figure 26 is a high-level architecture of a system for tracking changes by subjects in an area of real space, in which selection logic selects between first detections using background semantic diffing and redundant detections using foreground region proposals.
Figure 27 presents components of subsystems implementing the system of Figure 26.
Figure 28A is a flowchart showing a first part of detailed process steps for determining inventory events and generation of a shopping cart data structure.
Figure 28B is a flowchart showing a second part of detailed process steps for determining inventory events and generation of a shopping cart data structure.

The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

System Overview

A system and various implementations of the subject technology are described with reference to Figures 1-28A/28B. The system and processes are described with reference to Figure 1, an architectural level schematic of a system in accordance with an implementation. Because Figure 1 is an architectural diagram, certain details are omitted to improve the clarity of the description.

The discussion of Figure 1 is organized as follows. First, the elements of the system are described, followed by their interconnections. Then, the use of the elements in the system is described in greater detail.

Figure 1 provides a block diagram level illustration of a system 100. The system 100 includes cameras 114, network nodes hosting image recognition engines 112a, 112b, and 112n, a tracking engine 110 deployed in a network node (or nodes) on the network, a calibrator 120, a subject database 140, a training database 150, a heuristics database 160 (for joints heuristics, for put and take heuristics, and for other heuristics used to coordinate and combine the outputs of multiple image recognition engines as described below), a calibration database 170, and a communication network or networks 181. A network node can host only one image recognition engine, or several image recognition engines, as described herein. The system can also include an inventory database and other supporting data.

As used herein, a network node is an addressable hardware device or virtual device that is attached to a network, and is capable of sending, receiving, or forwarding information over a communication channel to or from other network nodes. Examples of electronic devices which can be deployed as hardware network nodes include all varieties of computers, workstations, laptop computers, handheld computers, and smartphones. Network nodes can be implemented in a cloud-based server system. More than one virtual device configured as a network node can be implemented using a single physical device.

For the sake of simplicity, only three network nodes hosting image recognition engines are shown in the system 100. However, any number of network nodes hosting image recognition engines can be connected to the tracking engine 110 through the network(s) 181. Also, the image recognition engines, the tracking engine, and other processing engines described herein can execute using more than one network node in a distributed architecture.

The interconnection of the elements of system 100 will now be described. The network(s) 181 couples the network nodes 101a, 101b, and 101c, respectively hosting the image recognition engines 112a, 112b, and 112n, the network node 102 hosting the tracking engine 110, the calibrator 120, the subject database 140, the training database 150, the joints heuristics database 160, and the calibration database 170. The cameras 114 are connected to the tracking engine 110 through the network nodes hosting the image recognition engines 112a, 112b, and 112n. In one embodiment, the cameras 114 are installed in a shopping store, such as a supermarket, such that sets of cameras 114 (two or more) with overlapping fields of view are positioned over each aisle to capture images of the real space in the store. In Figure 1, two cameras are arranged over aisle 116a, two cameras are arranged over aisle 116b, and three cameras are arranged over aisle 116n. The cameras 114 are installed over the aisles with overlapping fields of view. In such an embodiment, the cameras are configured with the goal that customers moving in the aisles of the shopping store are present in the fields of view of two or more cameras at any moment in time.

The cameras 114 can be synchronized in time with each other, so that images are captured simultaneously, or close in time, and at the same image capture rate. The cameras 114 can send respective continuous streams of images at a predetermined rate to the network nodes hosting the image recognition engines 112a-112n. Images captured in all the cameras covering an area of real space at the same moment in time, or close in time, are synchronized in the sense that the synchronized images can be identified in the processing engines as representing different views of subjects having fixed positions in the real space. For example, in one embodiment, the cameras send image frames at a rate of 30 frames per second (fps) to the respective network nodes hosting the image recognition engines 112a-112n. Each frame has a timestamp, an identity of the camera (abbreviated as "camera_id"), and a frame identity (abbreviated as "frame_id"), along with the image data.
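
Each frame delivered to an image recognition engine therefore carries at least a timestamp, a camera_id, and a frame_id along with the pixel data; the record below is an assumed sketch of that per-frame structure, not a definition from the specification.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ImageFrame:
    camera_id: int        # identity of the camera that captured the frame
    frame_id: int         # frame identity within that camera's stream
    timestamp: float      # capture time, on a time base shared across cameras
    image: np.ndarray     # pixel data, e.g. shape (720, 1280, 3), delivered at 30 fps
```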

Cameras installed over an aisle are connected to respective image recognition engines. For example, in Figure 1, the two cameras installed over aisle 116a are connected to the network node 101a hosting the image recognition engine 112a. Likewise, the two cameras installed over aisle 116b are connected to the network node 101b hosting the image recognition engine 112b. Each image recognition engine 112a-112n hosted in a network node or nodes 101a-101n separately processes the image frames received from one camera each in the example illustrated.

In one embodiment, each image recognition engine 112a, 112b, and 112n is implemented as a deep learning algorithm such as a convolutional neural network (abbreviated CNN). In such an embodiment, the CNN is trained using the training database 150. In the embodiments described herein, image recognition of subjects in the real space is based on identifying and grouping joints recognizable in the images, where the groups of joints can be attributed to an individual subject. For this joints-based analysis, the training database 150 has a large collection of images for each of the different types of joints of subjects. In the example embodiment of a shopping store, the subjects are customers moving in the aisles between the shelves. In one example embodiment, during training of the CNN, the system 100 is referred to as a "training system". After training the CNN using the training database 150, the CNN is switched to production mode to process images of customers in the shopping store in real time. In one example embodiment, during production, the system 100 is referred to as a runtime system (also referred to as an inference system). The CNN in each image recognition engine produces arrays of joint data structures for images in its respective stream of images. In an embodiment as described herein, an array of joint data structures is produced for each processed image, so that each image recognition engine 112a-112n produces an output stream of arrays of joint data structures. These arrays of joint data structures from cameras having overlapping fields of view are further processed to form groups of joints, and to identify such groups of joints as subjects.

The cameras 114 are calibrated before the CNN is switched to production mode. The calibrator 120 calibrates the cameras and stores the calibration data in the calibration database 170.

The tracking engine 110, hosted on the network node 102, receives continuous streams of arrays of joint data structures for the subjects from the image recognition engines 112a-112n. The tracking engine 110 processes the arrays of joint data structures and translates the coordinates of the elements in the arrays of joint data structures corresponding to images in different sequences into candidate joints having coordinates in the real space. For each set of synchronized images, the combination of candidate joints identified throughout the real space can be considered, for the purposes of analogy, to be like a galaxy of candidate joints. For each succeeding point in time, movement of the candidate joints is recorded so that the galaxy changes over time. The output of the tracking engine 110 is stored in the subject database 140.
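
The coordinate transform is not spelled out at this point in the text. With calibrated cameras, one common way to turn matching 2D joint detections from two overlapping views into a candidate joint with real space coordinates is linear (DLT) triangulation, sketched below as an assumption rather than as the system's stated method.

```python
import numpy as np

def triangulate(P_a, P_b, xy_a, xy_b):
    """Least-squares triangulation of one candidate joint from two views.

    P_a, P_b -- 3x4 projection matrices of two calibrated cameras.
    xy_a, xy_b -- (x, y) pixel coordinates of the same joint type in
                  synchronized images from the two cameras.
    Returns the (x, y, z) coordinates of the candidate joint in real space.
    """
    rows = []
    for P, (x, y) in ((P_a, xy_a), (P_b, xy_b)):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # Homogeneous solution: right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```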

The tracking engine 110 uses logic to identify groups or sets of candidate joints having coordinates in real space as subjects in the real space. For the purposes of analogy, each set of candidate points is like a cluster of candidate joints at each point in time. The clusters of candidate joints can move over time.

The logic to identify sets of candidate joints comprises heuristic functions based on physical relationships among the joints of subjects in real space. These heuristic functions are used to identify sets of candidate joints as subjects. The heuristic functions are stored in the heuristics database 160. The output of the tracking engine 110 is stored in the subject database 140. Thus, the sets of candidate joints comprise individual candidate joints that have relationships according to the heuristic parameters with other individual candidate joints, and subsets of the candidate joints in a given set that have been identified, or can be identified, as individual subjects.

The actual communication path through the network 181 can be point-to-point over public and/or private networks. The communications can occur over a variety of networks 182, e.g., private networks, VPNs, MPLS circuits, or the Internet, and can use appropriate application programming interfaces (APIs) and data interchange formats, e.g., Representational State Transfer (REST), JavaScript™ Object Notation (JSON), Extensible Markup Language (XML), Simple Object Access Protocol (SOAP), Java™ Message Service (JMS), and/or Java Platform Module System. All of the communications can be encrypted. The communication is generally over a network such as a LAN (local area network), WAN (wide area network), telephone network (Public Switched Telephone Network (PSTN)), Session Initiation Protocol (SIP), wireless network, point-to-point network, star network, token ring network, hub network, or the Internet, including the mobile Internet, via protocols such as EDGE, 3G, 4G LTE, Wi-Fi, and WiMAX. Additionally, a variety of authorization and authentication techniques, such as username/password, Open Authorization (OAuth), Kerberos, SecureID, digital certificates and more, can be used to secure the communications.

The technology disclosed herein can be implemented in the context of any computer-implemented system, including a database system, a multi-tenant environment, or a relational database implementation like an Oracle™ compatible database implementation, an IBM DB2 Enterprise Server™ compatible relational database implementation, a MySQL™ or PostgreSQL™ compatible relational database implementation, or a Microsoft SQL Server™ compatible relational database implementation; or a NoSQL™ non-relational database implementation such as a Vampire™ compatible non-relational database implementation, an Apache Cassandra™ compatible non-relational database implementation, a BigTable™ compatible non-relational database implementation, or an HBase™ or DynamoDB™ compatible non-relational database implementation. In addition, the technology disclosed can be implemented using different programming models like MapReduce™, bulk synchronous programming, MPI primitives, etc., or different scalable batch and stream management systems like Apache Storm™, Apache Spark™, Apache Kafka™, Apache Flink™, Truviso™, Amazon Elasticsearch Service™, Amazon Web Services™ (AWS), IBM Info-Sphere™, Borealis™, and Yahoo! S4™.

相機配置 Camera configuration

相機114被配置以追蹤三維(縮寫為3D)真實空間中之多關節單體(或主體)。於購物商店之範例實施例中,真實空間可包括其中銷售項目被堆疊於貨架中的購物商店之區域。真實空間中的點可由(x,y,z)座標系統來表示。該系統所被部署之真實空間的區域中之各點係由二或更多相機114之觀看域所涵蓋。 The camera 114 is configured to track a multi-joint monolith (or body) in three-dimensional (abbreviated 3D) real space. In an example embodiment of a shopping store, the real space may include an area of the shopping store in which sale items are stacked in shelves. A point in real space can be represented by an (x, y, z) coordinate system. Each point in the area of real space where the system is deployed is covered by the viewing field of two or more cameras 114 .

於購物商店中,貨架及其他存貨展示結構可被配置以多種方式,諸如沿著購物商店之牆壁、或者於形成走道之列中或者兩種配置之組合。圖2顯示貨架(其形成走道116a)之配置,從走道116a之一端所見。兩個相機(相機A 206及相機B 208)被置於走道116a上方,以一段離開存貨展示結構(諸如貨架)之上的購物商店之屋頂230及地板220的預定距離。相機114包含配置於上方並具有涵蓋真實空間中之存貨展示結構及地板區域的各別部分之觀看域的相機。被識別為主體的候選關節之集合的成員之真實空間中的座標係識別該主體之地板區域中的位置。於購物商店之範例實施例中,真實空間可包括購物商店中之所有地板220,以供存貨可從該處被存取。相機114被放置且定向以致其地板220及貨架的區域可由至少兩個相機所看見。相機114亦覆蓋貨架202和204之至少部分以及貨架202和204前方之地板空間。相機角度被選擇以具有陡峭觀點(筆直向下)、及有角度的觀點(其提供消費者之更完整的身體影像)兩者。於一範例實施例中,相機114被組態以八(8)英尺高或更高,遍及該購物商店。圖13提出此一實施例之圖示。 In a shopping store, shelves and other inventory display structures can be arranged in a variety of ways, such as along the walls of the shopping store, or in rows forming aisles, or a combination of the two arrangements. Figure 2 shows an arrangement of shelves, forming an aisle 116a, viewed from one end of the aisle 116a. Two cameras (camera A 206 and camera B 208) are positioned over the aisle 116a at a predetermined distance from the roof 230 and floor 220 of the shopping store, above the inventory display structures such as the shelves. The cameras 114 comprise cameras disposed over and having fields of view encompassing respective parts of the inventory display structures and floor area in the real space. The coordinates in real space of members of a set of candidate joints identified as a subject identify the position of the subject in the floor area. In the example embodiment of the shopping store, the real space can include the entire floor 220 in the shopping store from which inventory can be accessed. The cameras 114 are placed and oriented such that areas of the floor 220 and shelves can be seen by at least two cameras. The cameras 114 also cover at least part of the shelves 202 and 204 and the floor space in front of the shelves 202 and 204. Camera angles are selected to have both steep perspectives, straight down, and angled perspectives that give more complete body images of the customers. In one example embodiment, the cameras 114 are configured at a height of eight (8) feet or higher throughout the shopping store. Figure 13 presents an illustration of such an embodiment.

於圖2中,相機206及208具有重疊觀看域,其覆蓋介於貨架A 202與貨架B 204之間的空間,各別地以重疊觀看域216及218。真實空間中之位置被表示為真實空間座標系統中之(x,y,z)點。「x」及「y」代表二維(2D)平面(其可為購物商店之地板220)上之位置。值「z」為一種組態中在地板220上之2D平面上方的點之高度。 In FIG. 2, cameras 206 and 208 have overlapping viewing fields, which cover the space between shelf A 202 and shelf B 204, with overlapping viewing fields 216 and 218, respectively. A position in real space is represented as a (x, y, z) point in a real space coordinate system. "x" and "y" represent locations on a two-dimensional (2D) plane, which may be the floor 220 of a shopping store. The value "z" is the height of the point above the 2D plane on the floor 220 in one configuration.

圖3闡明從圖2之頂部所觀看的走道116a,其進一步顯示走道116a上方之相機206及208的位置之範例配置。相機206及208被放置更接近於走道116a之相反端。相機A 206被置於離開貨架A 202一段預定距離,而相機B 208被置於離開貨架B 204一段預定距離。於另一實施例中,其中多於兩個相機被置於走道上方,該些相機被置於與彼此相等的距離上。於此一實施例中,兩個相機被放置接近於相反端而第三個相機被置於該走道的中間。應理解:數個不同的相機配置是可能的。 FIG. 3 illustrates the aisle 116a viewed from the top in FIG. 2, further showing an example arrangement of the positions of the cameras 206 and 208 over the aisle 116a. The cameras 206 and 208 are positioned closer to opposite ends of the aisle 116a. Camera A 206 is positioned at a predetermined distance from shelf A 202 and camera B 208 is positioned at a predetermined distance from shelf B 204. In another embodiment, in which more than two cameras are positioned over an aisle, the cameras are positioned at equal distances from each other. In such an embodiment, two cameras are placed close to the opposite ends and a third camera is placed in the middle of the aisle. It is to be understood that a number of different camera arrangements are possible.

相機調校 camera adjustment

相機調校器120履行兩種類型的調校:內部及外部。於內部調校中,相機114之內部參數被調校。內部相機參數之範例包括聚焦長度、主要點、偏斜、魚眼係數,等等。用於內部相機調校之多種技術可被使用。一此種技術係由張(Zhang)所提出於「用於相機調校之彈性新技術(A flexible new technique for camera calibration)」,其係發佈於IEEE Transactions on Pattern Analysis and Machine Intelligence,Volume 22,No.11,2000年11月。 The camera calibrator 120 performs two types of calibration: internal and external. In internal calibration, the internal parameters of the cameras 114 are calibrated. Examples of internal camera parameters include focal length, principal point, skew, fisheye coefficients, etc. A variety of techniques for internal camera calibration can be used. One such technique is presented by Zhang in "A flexible new technique for camera calibration", published in IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 22, No. 11, November 2000.
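A hedged sketch of Zhang-style internal calibration is shown below. The specification does not name a library or a calibration target; the use of OpenCV, a checkerboard pattern, and the file paths are illustrative assumptions. The sketch recovers the intrinsic matrix K and the distortion coefficients per camera from multiple views of the target.

import cv2
import numpy as np
import glob

# Illustrative sketch of Zhang-style internal calibration using OpenCV
# (the specification does not name a library; a checkerboard target is assumed).
PATTERN = (9, 6)          # inner corners of the assumed checkerboard
SQUARE_SIZE = 0.025       # assumed square size in meters

# 3D coordinates of the checkerboard corners in the board's own plane (z = 0).
object_points = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
object_points[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_SIZE

obj_pts, img_pts = [], []
for path in glob.glob("calibration_images/cam01/*.png"):  # hypothetical path
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        obj_pts.append(object_points)
        img_pts.append(corners)

# Solve for the intrinsic matrix K and the distortion coefficients.
ret, K, dist_coeffs, rvecs, tvecs = cv2.calibrateCamera(
    obj_pts, img_pts, gray.shape[::-1], None, None)
print("intrinsic matrix:\n", K)
print("distortion coefficients:", dist_coeffs.ravel())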

於外部調校中,外部相機參數被調校以產生用以將2D影像資料變換為真實空間中之3D座標的映射參數。於一實施例中,一主體(諸如人)被引入真實空間。該主體移動穿越該真實空間,於一通過每個相機114之觀看域的路徑上。在該真實空間中之任何既定點上,該主體係出現於其形成3D場景之至少兩個相機的觀看域中。然而,該兩個相機具有相同3D場景之不同的觀點於其各別的二維(2D)影像平面中。3D場景中之特徵(諸如該主體之左手腕)係由其各別2D影像平面中之不同位置上的兩個相機所觀看。 In external calibration, the external camera parameters are calibrated in order to generate mapping parameters for transforming 2D image data into 3D coordinates in real space. In one embodiment, one subject, such as a person, is introduced into the real space. The subject moves through the real space on a path that passes through the field of view of each of the cameras 114. At any given point in the real space, the subject is present in the fields of view of at least two cameras forming a 3D scene. The two cameras, however, have different views of the same 3D scene in their respective two-dimensional (2D) image planes. A feature in the 3D scene, such as the left wrist of the subject, is viewed by the two cameras at different positions in their respective 2D image planes.

點對應被建立於每一對相機與既定場景的重疊觀看域之間。因為各相機具有相同3D場景之不同觀點, 所以點對應為二像素位置(一位置係來自具有重疊觀看域之各相機的位置),其代表3D場景中之相同點的投影。許多點對應係使用影像辨識引擎112a-112n的結果而針對各3D場景被識別,以供外部調校之目的。影像辨識引擎將關節之位置識別為各別相機114之2D影像平面中的像素之(x,y)座標,諸如列與行數。於一實施例中,關節為該主體之19個不同類型的關節之一。隨著該主體移動通過不同相機之觀看域,追蹤引擎110接收其用於來自每影像之相機114的調校之該主體的19個不同類型的關節之各者的(x,y)座標。 Point correspondences are established between each pair of cameras and the overlapping viewing fields of a given scene. Because each camera has a different view of the same 3D scene, So a point corresponds to a two-pixel location (one location from each camera with overlapping viewing fields) that represents the projection of the same point in the 3D scene. A number of point correspondences are identified for each 3D scene using the results of the image recognition engines 112a-112n for external calibration purposes. The image recognition engine recognizes the positions of the joints as (x,y) coordinates, such as column and row numbers, of pixels in the 2D image plane of the respective camera 114 . In one embodiment, the joint is one of 19 different types of joints in the body. As the subject moves through the viewing fields of the different cameras, the tracking engine 110 receives its (x,y) coordinates for each of the subject's 19 different types of joints for calibration from the camera 114 of each image.

例如,考量來自相機A之影像及來自相機B之影像,兩者均在相同時點被取得且具有重疊觀看域。有來自相機A之影像中的像素,其係相應於來自相機B之同步化影像中的像素。考量其有某物件或表面之特定點於相機A和相機B之觀點中以及該點被擷取於兩影像框之像素中。於外部相機調校中,多數此種點被識別且被稱為相應點。因為有一主體於調校期間之相機A和相機B的觀看域中,所以此主體之關鍵關節被識別,例如,左手腕的中心。假如這些關鍵關節可見於來自相機A和相機B兩者之影像框中,則假設這些代表相應點。此程序被重複於許多影像框以建立針對具有重疊觀看域之所有相機對的對應點之大型集合。於一實施例中,影像係以30 FPS(每秒框數)或更多之速率及全RGB(紅、綠、及藍)顏色之720像素的解析度來被串流出所有相機。這些影像具有一維陣列(亦稱為平坦陣列)之形式。 For example, consider an image from camera A and an image from camera B, both taken at the same moment in time and with overlapping fields of view. There are pixels in the image from camera A that correspond to pixels in the synchronized image from camera B. Consider that there is a particular point of some object or surface in view of both camera A and camera B and that point is captured in pixels of both image frames. In external camera calibration, a number of such points are identified and referred to as corresponding points. Since there is a subject in the fields of view of camera A and camera B during calibration, key joints of this subject are identified, for example, the center of the left wrist. If these key joints are visible in image frames from both camera A and camera B, then it is assumed that these represent corresponding points. This process is repeated for many image frames to build up a large collection of corresponding points for all pairs of cameras with overlapping fields of view. In one embodiment, images are streamed off of all the cameras at a rate of 30 FPS (frames per second) or more and a resolution of 720 pixels in full RGB (red, green, and blue) color. These images are in the form of one-dimensional arrays (also referred to as flat arrays).

以上針對主體而收集的大量影像可被用以判定介於具有重疊觀看域的相機之間的對應點。考量具有重疊觀看域之兩個相機A和B。通過相機A和B之中心以及3D場景中之關節位置(亦稱為特徵點)的平面被稱為「核面(epipolar plane)」。該核面與相機A和B之2D影像平面的交點係定義「核線」。給定這些對應點,判定一變換,其可準確地將來自相機A之對應點映射至其被確保交叉相機B之影像框中的該對應點之相機B的觀看域中之核線。使用針對主體而收集如上之影像框,該變換被產生。於本技術中已知其此變換為非線性的。一般形式係進一步已知為需要針對各相機之鏡頭的徑向形變之補償,以及移動至和自投射空間之非線性座標變換。於外部相機調校中,對於理想的非線性變換之近似係藉由解決非線性最佳化問題來判定。此非線性最佳化函數係由追蹤引擎110所使用以識別不同影像辨識引擎112a-112n之輸出(關節資料結構之陣列)中的相同關節,處理具有重疊觀看域之相機114的影像。內部及外部相機調校之結果被儲存於調校資料庫170中。 The large number of images collected above for a subject can be used to determine corresponding points between cameras with overlapping fields of view. Consider two cameras A and B with overlapping fields of view. The plane passing through the centers of cameras A and B and a joint location (also referred to as a feature point) in the 3D scene is called the "epipolar plane". The intersection of the epipolar plane with the 2D image planes of cameras A and B defines the "epipolar line". Given these corresponding points, a transformation is determined that accurately maps a corresponding point from camera A to an epipolar line in camera B's field of view that is guaranteed to intersect the corresponding point in the image frame of camera B. Using the image frames collected above for a subject, the transformation is generated. It is known in the art that this transformation is non-linear. The general form is furthermore known to require compensation for the radial distortion of each camera's lens, as well as non-linear coordinate transformations moving to and from the projective space. In external camera calibration, an approximation to the ideal non-linear transformation is determined by solving a non-linear optimization problem. This non-linear optimization function is used by the tracking engine 110 to identify the same joints in the outputs (arrays of joint data structures) of different image recognition engines 112a-112n, processing images of cameras 114 with overlapping fields of view. The results of the internal and external camera calibration are stored in the calibration database 170.
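A hedged sketch of this step using OpenCV is shown below; the specification does not name a library, and the RANSAC-based estimator, the file names, and the variable names are illustrative assumptions. Given corresponding joint locations collected for a camera pair (A, B), it estimates the transformation that maps a point observed by camera A to an epipolar line in camera B, and measures how far the corresponding observation in camera B falls from that line.

import cv2
import numpy as np

# pts_a and pts_b are hypothetical Nx2 arrays of corresponding joint pixel
# locations collected while a single subject walked through both fields of view.
pts_a = np.load("correspondences_camA.npy")   # assumed file
pts_b = np.load("correspondences_camB.npy")   # assumed file

# Estimate the fundamental matrix relating camera A and camera B.
F, inlier_mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC)

# For each point observed in camera A, compute the epipolar line in camera B
# on which the same 3D point must project (each line is (a, b, c) with ax+by+c=0).
lines_in_b = cv2.computeCorrespondEpilines(pts_a.reshape(-1, 1, 2), 1, F).reshape(-1, 3)

# Distance of the corresponding point in camera B from its epipolar line; small
# residuals indicate a consistent calibration for this camera pair.
a, b, c = lines_in_b[:, 0], lines_in_b[:, 1], lines_in_b[:, 2]
residuals = np.abs(a * pts_b[:, 0] + b * pts_b[:, 1] + c) / np.sqrt(a ** 2 + b ** 2)
print("mean epipolar residual (pixels):", residuals.mean())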

可使用多種技術以判定真實空間中之相機114的影像中之點的相對位置。例如,Longuet-Higgins出版了「用以從兩個投影重建場景之電腦演算法」於Nature,Volume 293,1981年九月10日。此論文係提出了從相關對的透視投影計算場景之三維結構,當介於兩投影之間的空間關係是未知的時。Longuet-Higgins的論文提出了一種技術以判定真實空間中之各相機相對於其他相機的位置。此外,他們的技術容許真實空間中之主體的三角測量,其係使用來自具有重疊觀看域之相機114的影像以識別z座標(距離地板的高度)之值。真實空間中之任意點(例如,真實空間之一角落中的貨架之末端)被指定為真實空間之(x,y,z)座標系統上的(0,0,0)點。 A variety of techniques can be used to determine the relative positions of points in the images of the cameras 114 in real space. For example, Longuet-Higgins published "A computer algorithm for reconstructing a scene from two projections" in Nature, Volume 293, 10 September 1981. This paper presents computing the three-dimensional structure of a scene from a correlated pair of perspective projections, when the spatial relationship between the two projections is unknown. The Longuet-Higgins paper presents a technique to determine the position of each camera in real space with respect to the other cameras. In addition, their technique allows triangulation of a subject in real space, using images from cameras 114 with overlapping fields of view, to identify the value of the z coordinate (height from the floor). An arbitrary point in the real space, for example the end of a shelf in one corner of the real space, is designated as the (0, 0, 0) point on the (x, y, z) coordinate system of the real space.

於本技術之一實施例中,外部調校之參數被儲存於兩個資料結構中。第一資料結構係儲存本質參數。本質參數係表示從3D座標變為2D影像座標之投影變換。第一資料結構含有每相機之本質參數,如以下所示。資料值均為數值的浮點數字。此資料結構係儲存3x3本質矩陣,表示為「K」及形變係數。形變係數包括六個徑向形變係數及兩個切向形變係數。徑向形變係發生在當光射線更接近於鏡頭之邊緣而彎曲(相較於在其光學中心之彎曲)時。切向形變係發生在當鏡頭與影像平面並非平行時。以下資料結構僅顯示第一相機之值。類似的資料係針對所有相機114而被儲存。 In one embodiment of the technology, the parameters of the external calibration are stored in two data structures. The first data structure stores intrinsic parameters. The intrinsic parameters represent a projective transformation from 3D coordinates into 2D image coordinates. The first data structure contains intrinsic parameters per camera, as shown below. The data values are all numeric floating point numbers. This data structure stores a 3x3 intrinsic matrix, represented as "K", and distortion coefficients. The distortion coefficients include six radial distortion coefficients and two tangential distortion coefficients. Radial distortion occurs when light rays bend more near the edges of a lens than they do at its optical center. Tangential distortion occurs when the lens and the image plane are not parallel. The data structure below shows values for the first camera only. Similar data is stored for all cameras 114.

Figure 107126341-A0305-02-0027-1 (the per-camera data structure of intrinsic parameters, rendered as an image in the original document)
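Since the drawing itself is not reproduced in this text, the sketch below shows one plausible shape for such a record, based only on the description above (a 3x3 matrix K plus six radial and two tangential distortion coefficients). The key names and the JSON-style nesting are assumptions, not the patent's actual layout; x stands for a numeric floating point value, as in the specification.

x = 0.0  # placeholder for a numeric floating point value, as in the specification

# Hypothetical shape of the per-camera intrinsic calibration record described
# above; key names and JSON-style nesting are illustrative assumptions.
intrinsic_parameters = {
    "camera_1": {
        "K": [[x, x, x],                    # 3x3 intrinsic matrix
              [x, x, x],
              [x, x, x]],
        "radial_distortion": [x, x, x, x, x, x],   # six radial coefficients
        "tangential_distortion": [x, x],           # two tangential coefficients
    },
    # ... similar entries for every other camera 114
}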

第二資料結構係儲存每對相機:3x3基礎矩陣(F)、3x3基本矩陣(E)、3x4投影矩陣(P)、3x3旋轉矩陣(R)及3x1變換向量(t)。此資料被用以將一相機的參考框中之點轉換至另一相機的參考框。針對各對相機,八個單應性係數亦被儲存以映射地板220之平面從一相機至另一相機。基礎矩陣為介於相同場景的兩個影像之間的關係,其係限制來自該場景之點的投影可發生於兩個影像中的何處。基本矩陣亦為介於相同場景的兩個影像之間的關係,在其相機被調校之條件下。投影矩陣係從3D真實空間提供向量空間投影至主體。旋轉矩陣被用以履行歐幾里德空間中之旋轉。變換向量「t」代表幾何變換,其係以相同距離移動一圖形或一空間之每一點於既定方向。單應性_地板_係數被用以結合其由具有重疊觀看域之相機所觀看到的地板220上之主體的特徵之影像。第二資料結構被顯示於下。類似的資料係針對所有對相機而被儲存。如先前所指示,x代表數值的浮點數字。 The second data structure stores, per pair of cameras: a 3x3 fundamental matrix (F), a 3x3 essential matrix (E), a 3x4 projection matrix (P), a 3x3 rotation matrix (R), and a 3x1 translation vector (t). This data is used to convert points in one camera's reference frame to another camera's reference frame. For each pair of cameras, eight homography coefficients are also stored to map the plane of the floor 220 from one camera to the other. The fundamental matrix is a relationship between two images of the same scene that constrains where the projection of a point from the scene can occur in the two images. The essential matrix is also a relationship between two images of the same scene, under the condition that the cameras are calibrated. The projection matrix provides a vector space projection from 3D real space to the subject. The rotation matrix is used to perform a rotation in Euclidean space. The translation vector "t" represents a geometric translation that moves every point of a figure or a space by the same distance in a given direction. The homography_floor_coefficients are used to combine images of features of subjects on the floor 220 viewed by cameras with overlapping fields of view. The second data structure is shown below. Similar data is stored for all pairs of cameras. As indicated previously, x represents a numeric floating point number.

Figure 107126341-A0305-02-0028-2 (the per-camera-pair data structure of external calibration parameters, rendered as an image in the original document)
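As with the first data structure, the drawing is not reproduced here; the sketch below shows one plausible shape for the per-camera-pair record, following the matrix shapes given in the text (F and E are 3x3, P is 3x4, R is 3x3, t is 3x1, plus eight homography floor coefficients). Key names and the tuple key are illustrative assumptions.

x = 0.0  # placeholder for a numeric floating point value

# Hypothetical shape of the per-camera-pair record described above; key names
# are illustrative. Matrix shapes follow the text: F and E are 3x3, P is 3x4,
# R is 3x3, t is 3x1, and eight homography coefficients map the floor plane.
camera_pair_parameters = {
    ("camera_1", "camera_2"): {
        "F": [[x] * 3 for _ in range(3)],   # fundamental matrix
        "E": [[x] * 3 for _ in range(3)],   # essential matrix
        "P": [[x] * 4 for _ in range(3)],   # projection matrix
        "R": [[x] * 3 for _ in range(3)],   # rotation matrix
        "t": [[x], [x], [x]],               # translation vector
        "homography_floor_coefficients": [x] * 8,
    },
    # ... similar entries for every other pair of cameras with overlapping views
}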

網路組態 network configuration

圖4提出一種主控影像辨識引擎之網路的架構400。該系統包括複數網路節點101a-101n於所示的實施例中。於此一實施例中,網路節點亦被稱為處理平台。處理平台101a-101n及相機412、414、416、...418被連接至網路481。 FIG. 4 presents an architecture 400 of a network that hosts an image recognition engine. The system includes a plurality of network nodes 101a-101n in the embodiment shown. In this embodiment, the network node is also referred to as a processing platform. The processing platforms 101a - 101n and cameras 412 , 414 , 416 , . . . 418 are connected to a network 481 .

圖4顯示其連接至網路之複數相機412、414、416、...418。大量相機可被部署於特定系統中。於一實施例中,相機412至418係各別地使用乙太網路為基的連接器422、424、426、及428而被連接至網路481。於此一實施例中,乙太網路為基的連接器具有每秒十億位元之資料轉移速度,亦稱為十億位元乙太網路。應理解:於其他實施例中,相機114被連接至該網路,使用其可具有比十億位元乙太網路更快或更慢的資料轉移速率之其他類型的網路連接。同時,於替代實施例中,一組相機可被直接地連接至各處理平台,且該些處理平台可被耦合至網路。 Figure 4 shows a plurality of cameras 412, 414, 416, . . . 418 connected to the network. Numerous cameras can be deployed in a particular system. In one embodiment, cameras 412-418 are connected to network 481 using Ethernet-based connectors 422, 424, 426, and 428, respectively. In this embodiment, the Ethernet-based connector has a data transfer speed of gigabits per second, also known as Gigabit Ethernet. It should be understood that in other embodiments, the camera 114 is connected to the network using other types of network connections that may have faster or slower data transfer rates than Gigabit Ethernet. Also, in alternative embodiments, a set of cameras may be directly connected to each processing platform, and the processing platforms may be coupled to a network.

儲存子系統430係儲存基本編程及資料架構,其提供本發明之某些實施例的功能。例如,實施複數影像辨識引擎之功能的各個模組可被儲存於儲存子系統430中。儲存子系統430為電腦可讀取記憶體之範例,其包含非暫態資料儲存媒體,具有儲存於記憶體中之電腦指令,其可由電腦所執行以履行文中所述之資料處理及影像處理的所有或任何組合,包括邏輯以:識別真實空間中之改變、追蹤主體及檢測真實空間之區域中的存貨項目之放下及取走,藉由如文中所述之程序。於其他範例中,電腦指令可被儲存於其他類型的記憶體中,包括可攜式記憶體,其包含可由電腦所讀取之非暫態資料儲存媒體或媒體。 The storage subsystem 430 stores the basic programming and data constructs that provide the functionality of certain embodiments of the present invention. For example, the various modules implementing the functionality of the plurality of image recognition engines may be stored in the storage subsystem 430. The storage subsystem 430 is an example of a computer readable memory comprising a non-transitory data storage medium, having computer instructions stored in the memory executable by a computer to perform all or any combination of the data processing and image processing described herein, including logic to identify changes in the real space, to track subjects, and to detect puts and takes of inventory items in an area of real space, by processes as described herein. In other examples, the computer instructions can be stored in other types of memory, including portable memory, comprising a non-transitory data storage medium or media readable by a computer.

這些軟體模組通常係由處理器子系統450所執行。主機記憶體子系統432通常包括數個記憶體,包括主隨機存取記憶體(RAM)434(用於程式執行期間儲存指令及資料)及唯讀記憶體(ROM)436(其中係儲存固定指令)。於一實施例中,RAM 434被使用為緩衝器,用以儲存來自其被連接至平台101a之相機114的視頻串流。 These software modules are typically executed by processor subsystem 450 . Host memory subsystem 432 typically includes several memories, including main random access memory (RAM) 434 (for storing instructions and data during program execution) and read only memory (ROM) 436 (where fixed instructions are stored) ). In one embodiment, RAM 434 is used as a buffer to store video streams from cameras 114 which are connected to platform 101a.

檔案儲存子系統440提供用於程式及資料檔案之持久儲存。於一範例實施例中,儲存子系統440包括四個120十億位元組(GB)固態硬碟(SSD)於RAID 0(獨立硬碟之冗餘陣列)配置(以數字442所識別)中。於範例實施例(其中CNN被用以識別主體之關節)中,RAID 0 442被用以儲存訓練資料。於訓練期間,其非於RAM 434中之訓練資料被讀取自RAID 0 442。類似地,當影像被記錄以供訓練之目的時,其非於RAM 434中之資料被儲存於RAID 0 442中。於範例實施例中,硬碟驅動(HDD)446為10兆位元組儲存。其具有比RAID 0 442儲存更慢的存取速度。固態硬碟(SSD)444含有作業系統及用於影像辨識引擎112a之相關檔案。 The file storage subsystem 440 provides persistent storage for program and data files. In an example embodiment, the storage subsystem 440 includes four 120 Gigabyte (GB) solid state drives (SSDs) in a RAID 0 (redundant array of independent disks) arrangement, identified by the numeral 442. In the example embodiment in which a CNN is used to identify joints of subjects, the RAID 0 442 is used to store training data. During training, the training data which is not in the RAM 434 is read from the RAID 0 442. Similarly, when images are being recorded for training purposes, the data which is not in the RAM 434 is stored in the RAID 0 442. In the example embodiment, the hard disk drive (HDD) 446 is a 10 Terabyte storage. It is slower in access speed than the RAID 0 442 storage. The solid state disk (SSD) 444 contains the operating system and related files for the image recognition engine 112a.

於一範例組態中,三個相機412、414、及 416被連接至處理平台101a。各相機具有專屬圖形處理單元GPU 1 462、GPU 2 464、及GPU 3 466,用以處理由相機所傳送的影像。應理解:少於或多於三個相機可被連接至每處理平台。因此,更少或更多的GPU被組態於網路節點中,以致其各相機具有專屬的GPU以處理接收自該相機之影像框。處理器子系統450、儲存子系統430及GPU 462、464、和466係使用匯流排子系統454來通訊。 In an example configuration, three cameras 412, 414, and 416 is connected to the processing platform 101a. Each camera has dedicated graphics processing units GPU 1 462, GPU 2 464, and GPU 3 466 for processing the images transmitted by the cameras. It should be understood that less or more than three cameras may be connected to each processing platform. Therefore, fewer or more GPUs are configured in a network node so that each of its cameras has a dedicated GPU for processing image frames received from that camera. Processor subsystem 450 , storage subsystem 430 , and GPUs 462 , 464 , and 466 communicate using bus subsystem 454 .

數個周邊裝置(諸如網路介面子系統、使用者介面輸出裝置、及使用者介面輸入裝置)亦被連接至匯流排子系統454,其形成處理平台101a之部分。這些子系統及裝置被有意地未顯示於圖4中以增進說明之清晰。雖然匯流排子系統454被概略地顯示為單一匯流排,但匯流排子系統之替代實施例可使用多數匯流排。 Several peripheral devices, such as a network interface subsystem, user interface output devices, and user interface input devices, are also connected to the bus subsystem 454 forming part of the processing platform 101a. These subsystems and devices are intentionally not shown in FIG. 4 to improve the clarity of the description. Although the bus subsystem 454 is shown schematically as a single bus, alternative embodiments of the bus subsystem may use multiple buses.

於一實施例中,相機412可使用Chameleon3 1.3 MP Color USB3 Vision(Sony ICX445)來實施,其具有1288x964之解析度、30 FPS之框率、及以每影像1.3百萬像素(MegaPixels),利用具有300-∞之工作距離(mm)的變焦鏡頭、具有98.2°-23.8°之1/3”感應器的觀看域。 In one embodiment, the camera 412 may be implemented using a Chameleon3 1.3 MP Color USB3 Vision (Sony ICX445) with a resolution of 1288x964, a frame rate of 30 FPS, and at 1.3 megapixels per image (MegaPixels) using 300-∞ working distance (mm) zoom lens with 1/3" sensor viewing field of 98.2°-23.8°.

卷積神經網路 Convolutional Neural Network

處理平台中之影像辨識引擎係以預定速率接收影像之連續串流。於一實施例中,該些影像辨識引擎包含卷積神經網路(縮寫為CNN)。 The image recognition engine in the processing platform receives a continuous stream of images at a predetermined rate. In one embodiment, the image recognition engines include convolutional neural networks (abbreviated as CNNs).

圖5闡明藉由以數字500表示之CNN的影像框之處理。輸入影像510為由以列及行所配置之影像像素所組成的矩陣。於一實施例中,輸入影像510具有1280像素之寬度、720像素之高度及紅、藍、和綠(亦稱為RGB)之3頻道。該些頻道被想像為堆疊在彼此上之三個1280x720的二維影像。因此,輸入影像具有如圖5中所示之1280x720x3的維度。 FIG. 5 illustrates the processing of an image frame by a CNN, referred to by the numeral 500. The input image 510 is a matrix composed of image pixels arranged in rows and columns. In one embodiment, the input image 510 has a width of 1280 pixels, a height of 720 pixels, and 3 channels of red, blue, and green (also referred to as RGB). The channels can be imagined as three 1280x720 two-dimensional images stacked over one another. Therefore, the input image has dimensions of 1280x720x3 as shown in FIG. 5.

2x2過濾器520係與輸入影像510卷積。於此實施例中,當過濾器與輸入卷積時無填補被應用。接續於此,非線性函數被應用至已卷積影像。於本實施例中,已校正的線性單元(ReLU)啟動被使用。非線性函數之其他範例包括S形(sigmoid)、雙曲線正切(tanh)及ReLU之變化,諸如漏ReLU。搜尋被履行以找出超參數值。超參數為C1,C2,.....,CN,其中CN表示卷積層「N」之頻道數。N及C之典型值被顯示於圖5。於CNN中有二十五(25)層,如由N等於25所表示。C之值為層1至25之各卷積層中的頻道數。於其他實施例中,額外特徵被加至CNN 500,諸如殘餘連接、擠壓激發模組、及多重解析度。 A 2x2 filter 520 is convolved with the input image 510. In this embodiment, no padding is applied when the filter is convolved with the input. Following this, a non-linear function is applied to the convolved image. In the present embodiment, rectified linear unit (ReLU) activations are used. Other examples of non-linear functions include sigmoid, hyperbolic tangent (tanh), and variations of ReLU such as leaky ReLU. A search is performed to find hyperparameter values. The hyperparameters are C1, C2, ..., CN, where CN represents the number of channels of convolutional layer "N". Typical values of N and C are shown in FIG. 5. There are twenty five (25) layers in the CNN, as represented by N equal to 25. The values of C are the numbers of channels in each of the convolutional layers, layers 1 through 25. In other embodiments, additional features are added to the CNN 500, such as residual connections, squeeze-and-excitation modules, and multiple resolutions.

在用於影像分類之典型CNN中,影像之大小(寬度及高度維度)係隨著該影像通過卷積層被處理而被減小。其對於特徵識別是有幫助的,因為目標是預測輸入影像之類別。然而,於所示的實施例中,輸入影像之大小(亦即,影像寬度及高度維度)未被減小,因為其目標不僅是識別該影像框中之關節(亦稱為特徵),而是同時亦識別該影像中之其位置,因此其可被映射至真實空間中之座標。因此,如圖5中所示,隨著該處理進行通過CNN之卷積層,該影像之寬度及高度維度維持不變,於此範例中。 In a typical CNN used for image classification, the size of the image (width and height dimensions) is reduced as the image is processed through the convolutional layers. That is helpful in feature identification, as the goal is to predict a class for the input image. However, in the illustrated embodiment, the size of the input image (i.e. the image width and height dimensions) is not reduced, because the goal is not only to identify a joint (also referred to as a feature) in the image frame, but also to identify its location in the image, so that it can be mapped to coordinates in the real space. Therefore, as shown in FIG. 5, the width and height dimensions of the image remain unchanged as the processing proceeds through the convolutional layers of the CNN, in this example.
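A hedged sketch of such a fully convolutional network is shown below, using PyTorch as an illustrative framework. The layer widths, kernel size, and padding are assumptions, not the patent's exact values; in particular, the sketch uses 'same'-style padding so that the 1280x720 resolution is preserved through all 25 layers, which is one way to reconcile the description of unchanged width and height. The final layer has 19 channels, one per joint type, and a per-pixel softmax produces the confidence arrays discussed below.

import torch
import torch.nn as nn

class JointsCNN(nn.Module):
    """Illustrative sketch of a fully convolutional joints CNN.

    Spatial dimensions are preserved through every layer so that each output
    element can be mapped back to image coordinates; the final layer has 19
    channels, one per joint type (including "not a joint").
    """

    def __init__(self, num_layers=25, channels=64, num_joint_types=19):
        super().__init__()
        layers, in_ch = [], 3  # RGB input
        for _ in range(num_layers - 1):
            layers += [nn.Conv2d(in_ch, channels, kernel_size=3, padding=1), nn.ReLU()]
            in_ch = channels
        layers += [nn.Conv2d(in_ch, num_joint_types, kernel_size=3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        logits = self.body(x)                 # shape: (batch, 19, 720, 1280)
        return torch.softmax(logits, dim=1)   # per-pixel confidence arrays

model = JointsCNN()
with torch.no_grad():
    frame = torch.rand(1, 3, 720, 1280)       # one RGB frame
    confidences = model(frame)                # (1, 19, 720, 1280)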

於一實施例中,CNN 500識別符該影像之各元件上的該些主體之19個可能的關節。該些可能的關節可被群集為兩種類:足部關節及非足部關節。第19類型的關節類別是針對該主體之所有非關節特徵(亦即,未被歸類為關節之影像的元件)。 In one embodiment, CNN 500 identifies 19 possible joints of the bodies on elements of the image. The possible joints can be clustered into two categories: foot joints and non-foot joints. Type 19 joint classes are for all non-joint features of the subject (ie, elements not classified as images of joints).

足部關節: Foot joints:

腳踝關節(左及右) Ankle (left and right)

非足部關節: Non-foot joints:

脖子 neck

鼻子 nose

眼睛(左及右) Eyes (left and right)

耳朵(左及右) Ears (left and right)

肩膀(左及右) Shoulders (left and right)

手肘(左及右) Elbow (left and right)

手腕(左及右) Wrist (left and right)

臀部(左及右) Buttocks (left and right)

膝蓋(左及右) Knee (left and right)

不是關節 not a joint

如可看出,為了本說明書之目的,「關節」是真實空間中之主體的可追蹤特徵。關節可相應於該些主體上之生理關節、或其他特徵(諸如眼睛、或鼻子)。 As can be seen, for the purposes of this specification, a "joint" is a traceable feature of a subject in real space. The joints may correspond to physical joints on the bodies, or other features (such as eyes, or noses).

對於輸入影像之串流的第一組分析係識別真 實空間中之主體的可追蹤特徵。於一實施例中,此被稱為「關節分析」。於此一實施例中,用於關節分析之CNN被稱為「關節CNN」。於一實施例中,關節分析被履行每秒三十次,在接收自相應相機之每秒三十框上。該分析被時間上同步化(亦即,一秒的1/30th),來自所有相機114之影像被分析於相應的關節CNN中以識別真實空間中之所有主體的關節。來自複數相機之來自單一時刻的影像之此分析的結果被儲存為「快照」。 The first set of analyses on the stream of input images identifies traceable features of subjects in real space. In one embodiment, this is called "joint analysis." In this embodiment, the CNN used for joint analysis is called "joint CNN". In one embodiment, joint analysis is performed thirty times per second, on thirty frames per second received from the corresponding cameras. The analysis is time-synchronized (ie, 1/ 30th of a second), and images from all cameras 114 are analyzed in corresponding joint CNNs to identify joints of all subjects in real space. The results of this analysis of images from multiple cameras from a single moment in time are stored as "snapshots".

快照可為來自某一時刻之所有相機114的影像之含有關節資料結構的陣列之字典的形式,其代表由該系統所覆蓋之真實空間的區域內之候選關節的群集。於一實施例中,快照被儲存於主體資料庫140中。 A snapshot may be in the form of a dictionary of images from all cameras 114 at a time containing an array of joint data structures representing clusters of candidate joints within the region of real space covered by the system. In one embodiment, the snapshots are stored in the subject database 140 .

於此範例CNN中,softmax函數被應用至卷積層530之最終層中的影像之每一元件。softmax函數係將任意實值之K維向量變換至範圍[0,1](其向上加至1)中的實值之K維向量。於一實施例中,影像之元件為單一像素。softmax函數係將各像素的任意實值之19維陣列(亦稱為19維向量)轉換至範圍[0,1](其向上加至1)中的實值之19維信心陣列。影像框中之像素的19維係相應於CNN之最終層中的19頻道,其進一步相應於該些主體之19個類型的關節。 In this example CNN, the softmax function is applied to each element of the image in the final layer of convolutional layer 530. The softmax function transforms any real-valued K-dimensional vector into a real-valued K-dimensional vector in the range [0,1] (which adds up to 1). In one embodiment, the element of the image is a single pixel. The softmax function converts an arbitrary real-valued 19-dimensional array (also known as a 19-dimensional vector) for each pixel to a real-valued 19-dimensional confidence array in the range [0,1] (which adds up to 1). The 19 dimensions of the pixels in the image frame correspond to the 19 channels in the final layer of the CNN, which further correspond to the 19 types of joints of the subjects.

大量圖片元件可被分類為一影像中之19個類型的關節的各者之一,根據針對該影像之來源相機的觀看域中之主體的數目。 A large number of picture elements can be classified into one of 19 types of joints in an image, according to the number of subjects in the viewing field of the source camera for the image.

影像辨識引擎112a-112n係處理影像以產生針對該影像之元件的信心陣列。影像之特定元件的信心陣列包括該特定元件之複數關節類型的信心值。影像辨識引擎112a-112n之每一者(各別地)產生每影像之信心陣列的輸出矩陣540。最後,各影像辨識引擎產生相應於每影像之信心陣列的各輸出矩陣540之關節資料結構的陣列。相應於特定影像之關節資料結構的陣列係藉由關節類型、特定影像之時間、及特定影像中之元件的座標來分類特定影像之元件。各影像中之特定元件的關節資料結構之關節類型係根據信心陣列之值來選擇。 The image recognition engines 112a-112n process the images to generate confidence arrays for elements of the images. The confidence array for a particular element of an image includes confidence values for a plurality of joint types for the particular element. Each of the image recognition engines 112a-112n, respectively, generates an output matrix 540 of confidence arrays per image. Finally, each image recognition engine generates arrays of joint data structures corresponding to each output matrix 540 of confidence arrays per image. The arrays of joint data structures corresponding to particular images classify elements of the particular images by joint type, time of the particular image, and the coordinates of the elements in the particular image. The joint type for the joint data structure of a particular element in each image is selected based on the values of the confidence array.

該些主體之各關節可被視為分佈於輸出矩陣540中而成為熱映圖。熱映圖可被解析以顯示具有針對各關節類型之最高值(峰值)的影像元件。理想地,針對具有特定關節類型之高值的既定圖片元件,在與該既定圖片元件之某一距離外的周遭圖片元件將具有針對該關節類型之較低的值,以致其具有該關節類型之特定關節的位置可被識別於影像空間座標中。相應地,該影像元件之信心陣列將具有針對該關節之最高信心值以及針對剩餘18個類型的關節之較低的信心值。 The joints of the bodies can be viewed as being distributed in the output matrix 540 as a heatmap. The heat map can be resolved to display image elements with the highest values (peaks) for each joint type. Ideally, for a given picture element with a high value for a particular joint type, surrounding picture elements that are outside a certain distance from the given picture element will have lower values for that joint type, so that they have the same The location of a particular joint can be identified in image space coordinates. Accordingly, the confidence array for the image element will have the highest confidence value for that joint and lower confidence values for the remaining 18 types of joints.
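A minimal sketch of how peaks in such a heat map might be resolved into joint locations is shown below; the minimum-separation distance, the confidence threshold, the greedy non-maximum suppression, and the function name are illustrative assumptions rather than the patent's method.

import numpy as np

def extract_joint_peaks(confidences, min_confidence=0.5, min_separation=10):
    """Pick, per joint type, the image elements with locally maximal confidence.

    confidences: array of shape (19, height, width), one channel per joint type
    (channel 18 is the "not a joint" class and is skipped). Returns a list of
    (joint_type, y, x, confidence) tuples.
    """
    peaks = []
    num_types, height, width = confidences.shape
    for joint_type in range(num_types - 1):          # skip the non-joint class
        channel = confidences[joint_type].copy()
        while True:
            y, x = np.unravel_index(np.argmax(channel), channel.shape)
            if channel[y, x] < min_confidence:
                break
            peaks.append((joint_type, int(y), int(x), float(channel[y, x])))
            # Suppress the neighborhood so nearby elements of the same joint
            # are not reported twice (greedy non-maximum suppression).
            y0, y1 = max(0, y - min_separation), min(height, y + min_separation)
            x0, x1 = max(0, x - min_separation), min(width, x + min_separation)
            channel[y0:y1, x0:x1] = 0.0
    return peaks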

於一實施例中,來自各相機114之影像的批次係由各別影像辨識引擎所處理。例如,六個連續時戳影像被依序地處理於一批次中以善用快取同調性。針對CNN 500之一層的參數被載入記憶體中並應用於六個影像框之該批次。接著針對下一層的參數被載入記憶體中並應用於六個影像之該批次。此被重複於CNN 500中之所有卷積層 530。快取同調性減少了處理時間並增進影像辨識引擎之性能。 In one embodiment, batches of images from each camera 114 are processed by respective image recognition engines. For example, six consecutive time-stamped images are processed sequentially in a batch to take advantage of cache coherence. Parameters for one layer of the CNN 500 are loaded into memory and applied to the batch of six image frames. The parameters for the next layer are then loaded into memory and applied to the batch of six images. This is repeated for all convolutional layers in CNN 500 530. Cache coherence reduces processing time and improves image recognition engine performance.

於一此類實施例中,針對三維(3D)卷積,CNN 500之性能的進一步增進係藉由共用橫跨該批次中之影像框的資訊來達成。如此係協助關節之更精確的識別並減少錯誤肯定。例如,影像框中之特徵(其中橫跨既定批次中之多數影像框的像素值不會改變)很可能是靜態物件(諸如貨架)。針對橫跨既定批次中之影像框的相同像素之值的改變係指示其此像素很可能是關節。因此,CNN 500可更專注於處理該像素以正確地識別其由該像素所識別的關節。 In one such embodiment, for three-dimensional (3D) convolution, a further improvement in the performance of the CNN 500 is achieved by sharing information across the image frames in the batch. This assists in more accurate identification of joints and reduces false positives. For example, features in image frames in which pixel values across most image frames in a given batch do not change are likely to be static objects (such as shelves). A change in value for the same pixel across an image frame in a given batch indicates that this pixel is likely to be a joint. Therefore, the CNN 500 can focus more on processing the pixel to correctly identify its joint identified by the pixel.

關節資料結構 joint data structure

CNN 500之輸出為針對每相機之各影像的信心陣列之矩陣。信心陣列之矩陣被變換為關節資料結構之陣列。如圖6中所示之關節資料結構被用以儲存各關節之資訊。關節資料結構600係識別相機(影像係從該相機所接收)之2D影像空間中的特定影像中的元件之x及y位置。關節數係識別其已識別的關節之類型。例如,於一實施例中,該些值的範圍係從1至19。1之值指示其該關節為左腳踝,2之值指示其該關節為右腳踝,依此類推。關節之類型係使用針對輸出矩陣540中之該元件的信心陣列來選擇。例如,於一實施例中,假如相應於左腳踝關節之值為針對該影像元件之信心陣列中的最高者,則該關節數之值 為「1」。 The output of CNN 500 is a matrix of confidence arrays for each image of each camera. The matrix of confidence arrays is transformed into an array of joint data structures. The joint data structure shown in Figure 6 is used to store the information of each joint. The joint data structure 600 identifies the x and y positions of elements in a particular image in the 2D image space of the camera from which the image is received. The joint number identifies the type of joint it has identified. For example, in one embodiment, the values range from 1 to 19. A value of 1 indicates that the joint is the left ankle, a value of 2 indicates that the joint is the right ankle, and so on. The type of joint is selected using the confidence array for that element in output matrix 540 . For example, in one embodiment, if the value corresponding to the left ankle joint is the highest in the confidence array for that image element, then the value of the number of joints is "1".

信心數係指示於預測該關節時之CNN 500中的信心之程度。假如信心數之值很高,則表示CNN對於其預測是有信心的。整數Id被指派給關節資料結構以獨特地識別它。接續於以上映射後,每影像之信心陣列的輸出矩陣540被轉換為各影像之關節資料結構的陣列。 The confidence number indicates the degree of confidence in the CNN 500 in predicting the joint. If the confidence number is high, it means that the CNN is confident in its prediction. An integer Id is assigned to the joint data structure to uniquely identify it. Following the above mapping, the output matrix 540 of the confidence array for each image is converted into an array of joint data structures for each image.
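A sketch of how one element of the output matrix might be converted into the joint data structure of FIG. 6 is shown below; the field names mirror the description above (x and y position, joint number, confidence number, and an integer id), while the dataclass form and the id counter are illustrative assumptions.

from dataclasses import dataclass
from itertools import count

import numpy as np

_id_counter = count(1)  # assumed source of unique integer ids

@dataclass
class JointDataStructure:
    x: int                    # column of the element in the 2D image plane
    y: int                    # row of the element in the 2D image plane
    joint_number: int         # 1..19, e.g. 1 = left ankle, 2 = right ankle, ...
    confidence_number: float  # confidence of the CNN in this prediction
    integer_id: int           # unique identifier for this joint data structure

def element_to_joint(confidence_array, x, y):
    """Convert one element's 19-value confidence array into a joint record."""
    best_index = int(np.argmax(confidence_array))
    return JointDataStructure(
        x=x,
        y=y,
        joint_number=best_index + 1,                   # joint types are 1-based
        confidence_number=float(confidence_array[best_index]),
        integer_id=next(_id_counter),
    )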

影像辨識引擎112a-112n接收來自相機114之影像的序列並處理影像以產生如上所述之關節資料結構的相應陣列。針對特定影像之關節資料結構的陣列係藉由關節類型、特定影像之時間、及特定影像中之元件的座標來分類特定影像之元件。於一實施例中,影像辨識引擎112a-112n為卷積神經網路CNN 500,該關節類型為該些主體之19個類型的關節之一,特定影像之時間為由來源相機114針對該特定影像所產生的影像之時戳,而座標(x,y)係識別2D影像平面上之該元件的位置。 Image recognition engines 112a-112n receive sequences of images from camera 114 and process the images to generate corresponding arrays of joint data structures as described above. The array of joint data structures for a particular image classifies the elements of a particular image by the joint type, the time of the particular image, and the coordinates of the elements in the particular image. In one embodiment, the image recognition engines 112a-112n are convolutional neural networks CNN 500, the joint type is one of the 19 types of joints of the subjects, and the time of a specific image is determined by the source camera 114 for the specific image The timestamp of the generated image, and the coordinates (x,y) identify the location of the element on the 2D image plane.

於一實施例中,關節分析包括履行k最近鄰居、高斯之混合、各種影像形態變換、及各輸入影像上之關節CNN的組合。該結果包含關節資料結構之陣列,其可被儲存以環緩衝器中之位元遮罩的形式,其係將影像數映射至各時刻的位元遮罩。 In one embodiment, joint analysis includes performing k-nearest neighbors, a mixture of Gaussians, various image morphological transformations, and a combination of joint CNNs on each input image. The result contains an array of joint data structures, which can be stored as bit masks in the ring buffer, which map the number of images to the bit masks at each instant.

追蹤引擎 tracking engine

追蹤引擎110係組態成接收由影像辨識引擎112a-112n所產生之關節資料結構的陣列,相應於來自具有重疊觀看域之相機的影像序列中之影像。每影像之關節資料結構的陣列係由影像辨識引擎112a-112n傳送至追蹤引擎110,經由如圖7中所示之網路181。追蹤引擎110將相應於不同序列中的影像之關節資料結構的陣列中之元件的座標變換為具有真實空間中之座標的候選關節。追蹤引擎110包含邏輯以將具有真實空間中之座標的候選關節之集合(關節之群集)識別為該真實空間中之主體。於一實施例中,追蹤引擎110係累積來自既定時刻之所有相機的影像辨識引擎之關節資料結構的陣列,並將此資訊儲存為主體資料庫140中之字典,以供用於識別候選關節之群集。該字典可被配置以密鑰-值對的形式,其中密鑰為相機id且值為來自該相機之關節資料結構的陣列。於此一實施例中,此字典被用於啟發法為基的分析以判定候選關節及關節之指派給主體。於此一實施例中,追蹤引擎110之高階輸入、處理及輸出被闡明於表1中。 The tracking engine 110 is configured to receive arrays of joint data structures generated by the image recognition engines 112a-112n, corresponding to images in sequences of images from cameras having overlapping fields of view. The arrays of joint data structures per image are sent by the image recognition engines 112a-112n to the tracking engine 110 via the network 181, as shown in FIG. 7. The tracking engine 110 translates the coordinates of the elements in the arrays of joint data structures corresponding to images in different sequences into candidate joints having coordinates in the real space. The tracking engine 110 comprises logic to identify sets of candidate joints (clusters of joints) having coordinates in real space as subjects in the real space. In one embodiment, the tracking engine 110 accumulates the arrays of joint data structures from the image recognition engines for all cameras at a given moment in time and stores this information as a dictionary in the subject database 140, to be used for identifying clusters of candidate joints. The dictionary can be arranged in the form of key-value pairs, where the keys are camera ids and the values are arrays of joint data structures from the camera. In such an embodiment, this dictionary is used in the heuristics-based analysis to determine candidate joints and to assign joints to subjects. In such an embodiment, the high-level input, processing, and output of the tracking engine 110 are illustrated in Table 1.

Figure 107126341-A0305-02-0038-3 (Table 1: high-level input, processing, and output of the tracking engine 110, rendered as an image in the original document)
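The sketch below shows one plausible way, in Python, to accumulate the per-image arrays of joint data structures into the camera-id-keyed dictionary described above. The function name and the incoming message format are assumptions made for illustration.

from collections import defaultdict

def build_moment_dictionary(joint_arrays_by_camera):
    """Accumulate arrays of joint data structures for one moment in time.

    joint_arrays_by_camera: iterable of (camera_id, [joint_data_structure, ...])
    pairs produced by the image recognition engines for images with the same
    timestamp. Returns the key-value dictionary described in the text, with
    camera ids as keys and arrays of joint data structures as values.
    """
    moment = defaultdict(list)
    for camera_id, joint_array in joint_arrays_by_camera:
        moment[camera_id].extend(joint_array)
    return dict(moment)

# Example usage with two cameras at the same timestamp (records are placeholders).
snapshot = build_moment_dictionary([
    ("camera_1", [{"x": 512, "y": 310, "joint_number": 1, "confidence": 0.93}]),
    ("camera_2", [{"x": 421, "y": 655, "joint_number": 1, "confidence": 0.88}]),
])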

將關節群集為候選關節 Cluster joints into candidate joints

追蹤引擎110沿著兩維度以接收關節資料結構的陣列:時間及空間。沿著時間維度,追蹤引擎係依序地接收由每相機之影像辨識引擎112a-112n所處理的關節資料結構的有時戳陣列。關節資料結構包括在一段時間週期期間相同主體之相同關節的多數實例,於來自具有重疊觀看域之相機的影像中。特定影像中之元件的(x,y)座標在關節資料結構的依序有時戳陣列中通常將是不同的,由於特定關節所屬之主體的移動。例如,被歸類為左手腕關節之二十個圖片元件可出現在來自特定相機的許多依序有時戳影像中,各左手腕關節具有其可隨著影像而改變或不變的真實空間中之位置。結果,於許多關節資料結構的依序有時戳陣列中之二十個左手腕關節資料結構600可代表在某期間於真實空間中之相同的二十個關節。 The tracking engine 110 receives an array of joint data structures along two dimensions: time and space. Along the time dimension, the tracking engine sequentially receives the time-stamped array of joint data structures processed by each camera's image recognition engine 112a-112n. The joint data structure includes most instances of the same joint of the same subject during a period of time in images from cameras with overlapping viewing fields. The (x,y) coordinates of elements in a particular image will typically be different in the sequential timestamp array of the joint data structure due to the movement of the body to which the particular joint belongs. For example, twenty picture elements classified as left wrist joints may appear in many sequential time-stamped images from a particular camera, each left wrist joint having its real space that may or may not change with the image the location. As a result, the twenty left wrist joint data structures 600 in the sequential timestamp array of many joint data structures may represent the same twenty joints in real space at a certain period.

因為具有重疊觀看域之多數相機係覆蓋真實空間中之各位置,所以在任何既定時刻,相同關節可出現在相機114之一個以上的影像中。相機114被時間上同步化,因此,追蹤引擎110係接收來自具有重疊觀看域之多數相機的特定關節之關節資料結構,於任何既定時刻。此為空間維度(兩個維度:時間及空間的第二個),追蹤引擎110係沿著該空間維度以接收關節資料結構的陣列中之資料。 Because most cameras with overlapping viewing fields cover locations in real space, the same joint may appear in more than one image of one of the cameras 114 at any given time. The cameras 114 are time-synchronized so that the tracking engine 110 receives joint data structures for a particular joint from a plurality of cameras with overlapping viewing fields, at any given time. This is the spatial dimension (the second of the two dimensions: time and space) along which the tracking engine 110 receives data in the array of joint data structures.

追蹤引擎110使用啟發法資料庫160中所儲存之啟發法的初始集合以識別來自關節資料結構的陣列之候選關節資料結構。其目標是在一段時間週期內將總體量度減至最小。總體量度計算器702係計算總體量度。總體量度為以下所述之多數值的總和。直覺地,總體量度之值在當如下情形下時是最小的:由追蹤引擎110沿著時間及空間維度所接收之關節資料結構的陣列中之關節被正確地指派給各別主體。例如,考量具有在走道中移動之消費者的購物商店之實施例。假如消費者A之左手腕被不正確地指派給消費者B,則總體量度之值將增加。因此,將各消費者之各關節的總體量度減至最小是一個最佳化問題。用以解決此問題之一選項是嘗試關節的所有可能連接。然而,此可能變為難處理的,隨著消費者之數目增加。 The tracking engine 110 uses an initial set of heuristics stored in the heuristics database 160 to identify candidate joint data structures from the arrays of joint data structures. The goal is to minimize an overall metric over a period of time. The overall metric calculator 702 calculates the overall metric. The overall metric is the sum of multiple values described below. Intuitively, the value of the overall metric is minimum when the joints in the arrays of joint data structures received by the tracking engine 110 along the time and space dimensions are correctly assigned to their respective subjects. For example, consider the embodiment of the shopping store with customers moving in the aisles. If the left wrist of customer A is incorrectly assigned to customer B, then the value of the overall metric will increase. Therefore, minimizing the overall metric for each joint of each customer is an optimization problem. One option to solve this problem is to try all possible connections of joints. However, this can become intractable as the number of customers increases.
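A toy sketch of the kind of assignment search this implies is shown below: it scores every pairing of candidate left and right ankle joints by an assumed distance-based metric and keeps the pairing with the smallest total. The metric, the brute-force enumeration over a small candidate set, and the variable names are illustrative simplifications, not the patent's actual optimization.

import math
from itertools import permutations

def overall_metric(pairing):
    """Sum of left-ankle-to-right-ankle distances for a proposed pairing.

    pairing: list of ((lx, ly), (rx, ry)) tuples in real-space floor coordinates.
    A smaller sum means the proposed subjects are more physically plausible.
    """
    return sum(math.dist(left, right) for left, right in pairing)

def best_pairing(left_ankles, right_ankles):
    """Try pairings of candidate left and right ankles, keep the lowest metric."""
    best, best_value = None, float("inf")
    for permuted_right in permutations(right_ankles):
        pairing = list(zip(left_ankles, permuted_right))
        value = overall_metric(pairing)
        if value < best_value:
            best, best_value = pairing, value
    return best, best_value

# Two customers: two candidate left ankles and two candidate right ankles.
lefts = [(1.0, 2.0), (4.0, 5.0)]
rights = [(4.2, 5.1), (1.1, 2.2)]
pairing, value = best_pairing(lefts, rights)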

用以解決此問題之第二種方式是使用啟發法以減少其被識別為單一主體的候選關節之集合的成員之關節的可能組合。例如,左手腕關節不得屬於在空間中遠離一主體之其他關節的該主體,由於關節之相對位置的已知生理學特性。類似地,具有在位置上隨著影像的小改變之左手腕關節不太可能屬於具有來自在時間上遠離的影像之相同位置上的相同節點之主體,因為主體不被預期能以極高的速度移動。這些初始啟發法被用以建立時間及空間上的邊界,針對其可被歸類為特定主體之候選關節的群集。於特定時間及空間邊界內之關節資料結構中的關節被視為「候選關節」,以供指派給如真實空間中所出現之主體的候選關節之集合。這些候選關節包括於來自相同相機之來自多數影像的關節資料結構的陣列中所識別的關節,於一段時間週期(時間維度)並橫跨具有重疊觀看域之不同相機(空間維度)。 A second approach to solve this problem is to use heuristics to reduce the possible combinations of joints identified as members of the set of candidate joints for a single subject. For example, a left wrist joint cannot belong to a subject whose other joints are far away in space, due to known physiological characteristics of the relative positions of joints. Similarly, a left wrist joint that has a small change in position from image to image is unlikely to belong to a subject having the same joint at the same position in an image that is far away in time, because subjects are not expected to move at extremely high speeds. These initial heuristics are used to establish, in time and space, boundaries for clusters of candidate joints that can be classified as a particular subject. Joints in the joint data structures within particular time and space boundaries are considered as "candidate joints" for assignment to sets of candidate joints of subjects present in the real space. These candidate joints comprise joints identified in arrays of joint data structures from multiple images from the same camera over a period of time (the time dimension) and across different cameras with overlapping fields of view (the space dimension).

足部關節 Foot joints

關節可被劃分以供一種將關節分組成為群集(成為如以上關節之列表中所示的足部和非足部關節)之程序的目的。於目前範例中之左及右腳踝關節類型被視為足部關節,以供此程序之目的。追蹤引擎110可開始使用足部關節以識別特定主體之候選關節的集合。於購物商店之實施例中,消費者之足部是在如圖2中所示之地板220上。相機114至地板220之距離是已知的。因此,當結合其來自相應於具有重疊觀看域之相機的影像之資料關節資料結構的陣列之足部關節的關節資料結構時,追蹤引擎110可假設一已知深度(沿著z軸之距離)。足部關節之值深度為零,亦即,真實空間之(x,y,z)座標系統中的(x,y,0)。使用此資訊,影像追蹤引擎110係應用單應性映射以結合來自具有重疊觀看域之相機的足部關節之關節資料結構,以識別候選足部關節。使用此映射,於影像空間中之(x,y)座標中的關節之位置被轉換至真實空間中之(x,y,z)座標中的位置,導致候選足部關節。此程序被分離地履行以使用各別關節資料結構來識別候選左及右足部關節。 The joints may be partitioned for the purpose of a procedure of grouping the joints into clusters (being foot and non-foot joints as shown in the list of joints above). The left and right ankle joint types in the current example are considered foot joints for the purposes of this procedure. The tracking engine 110 may begin using the foot joints to identify a set of candidate joints for a particular subject. In the shopping store embodiment, the consumer's feet are on the floor 220 as shown in FIG. 2 . The distance from camera 114 to floor 220 is known. Thus, the tracking engine 110 can assume a known depth (distance along the z-axis) when combining its joint data structures for the foot joints from an array of data joint data structures corresponding to images of cameras with overlapping viewing fields . The value depth of the foot joint is zero, that is, (x, y, 0) in the (x, y, z) coordinate system of real space. Using this information, the image tracking engine 110 applies a homography map to combine the joint data structures of the foot joints from cameras with overlapping viewing fields to identify candidate foot joints. Using this mapping, the positions of joints in (x,y) coordinates in image space are transformed to positions in (x,y,z) coordinates in real space, resulting in candidate foot joints. This procedure is performed separately to identify candidate left and right foot joints using respective joint data structures.
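A hedged sketch of this homography step is shown below; the 3x3 homography matrices built from the stored coefficients, the use of OpenCV, and the fusing of per-camera estimates by simple averaging are assumptions made for illustration.

import numpy as np
import cv2

def ankle_to_floor(image_xy, homography_floor):
    """Map an ankle joint's (x, y) image position to floor coordinates.

    homography_floor: 3x3 matrix built from the eight stored homography
    coefficients for this camera, mapping the image plane to the floor plane.
    Returns (x, y, 0.0) in the real-space coordinate system, since foot joints
    are assumed to lie on the floor.
    """
    point = np.array([[image_xy]], dtype=np.float32)          # shape (1, 1, 2)
    floor_xy = cv2.perspectiveTransform(point, homography_floor)[0, 0]
    return float(floor_xy[0]), float(floor_xy[1]), 0.0

def candidate_foot_joint(observations):
    """Combine observations of the same ankle from cameras with overlapping views.

    observations: list of ((x, y), homography_floor) pairs, one per camera.
    Averaging the per-camera floor positions is an illustrative choice.
    """
    positions = np.array([ankle_to_floor(xy, H) for xy, H in observations])
    return positions.mean(axis=0)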

接續於此,追蹤引擎110可結合候選左足部關節與候選右足部關節(將其指派給候選關節之集合)以產生主體。來自候選關節之星系的其他關節可被鏈結至該主體以建立該產生的主體之部分或所有關節類型的群集。 Following this, the tracking engine 110 can combine a candidate left foot joint and a candidate right foot joint (assigning them to a set of candidate joints) to create a subject. Other joints from the galaxy of candidate joints can be linked to the subject to build up clusters of some or all of the joint types for the created subject.

假如僅有一左候選足部關節及一右候選足部關節,則表示在該特定時間僅有一主體於該特定空間中。追蹤引擎110產生具有屬於其關節集合之左及右候選足部關節的新主體。該主體被存在主體資料庫140中。假如有多數候選左及右足部關節,則總體量度計算器702嘗試將各候選左足部關節結合至各候選右足部關節以產生主體以致其總體量度之值被減至最小。 If there is only one left candidate foot joint and one right candidate foot joint, it means that there is only one subject in the specific space at the specific time. The tracking engine 110 generates a new body with left and right candidate foot joints belonging to its joint set. The subject is stored in subject database 140 . If there are a majority of candidate left and right foot joints, the overall metric calculator 702 attempts to combine each candidate left foot joint to each candidate right foot joint to generate a body such that the value of its overall metric is minimized.

非足部關節 non-foot joints

為了識別來自特定時間及空間邊界內之關節資料結構的陣列之候選非足部關節,追蹤引擎110係使用從任何既定相機A至其相鄰相機B(具有重疊觀看域)之非線性變換(亦稱為基礎矩陣)。非線性變換係使用單一多關節主體來計算且被儲存於如上所述之調校資料庫170中。例如,針對具有重疊觀看域之兩個相機A及B,候選非足部關節被識別如下。在相應於來自相機A之影像框中的元件之關節資料結構的陣列中的非足部關節被映射至來自相機B之同步化影像框中的核線。由相機A的特定影像之關節資料結構的陣列中之關節資料結構所識別的關節(亦稱為機器視覺文獻中之特徵)將出現在相應的核線上,假如其出現在相機B之影像中的話。例如,假如來自相機A之關節資料結構中的關節為左手腕關節,則相機B之影像中的核線上之左手腕關節係代表來自相機B之觀點的相同左手腕關節。相機A及B之影像中的這兩個點為真實空間中之3D場景中的相同點之投影且被稱為「共軛對」。 To identify candidate non-foot joints from the arrays of joint data structures within particular time and space boundaries, the tracking engine 110 uses the non-linear transformation (also referred to as a fundamental matrix) from any given camera A to its neighboring camera B with overlapping fields of view. The non-linear transformations are calculated using a single multi-joint subject and stored in the calibration database 170 as described above. For example, for two cameras A and B with overlapping fields of view, candidate non-foot joints are identified as follows. The non-foot joints in the arrays of joint data structures corresponding to elements in an image frame from camera A are mapped to epipolar lines in the synchronized image frame from camera B. A joint (also referred to as a feature in machine vision literature) identified by a joint data structure in the array of joint data structures of a particular image of camera A will appear on the corresponding epipolar line if it appears in the image of camera B. For example, if the joint in the joint data structure from camera A is a left wrist joint, then a left wrist joint on the epipolar line in the image of camera B represents the same left wrist joint from the point of view of camera B. These two points in the images of cameras A and B are projections of the same point in the 3D scene in real space and are referred to as a "conjugate pair".

機器視覺技術(諸如由Longuet-Higgins發佈於論文中之技術,名稱為「用以重建來自兩個投影之場景的電腦演算法」,於Nature,Volume 293,1981年九月10日)被應用至相應點之共軛對以判定於真實空間中距離地板220之關節的高度。上述方法之應用需要介於具有重疊觀看域之相機之間的預定映射。該資料被儲存在調校資料庫170中而成為於上述相機114之調校期間所判定的非線性函數。 Machine vision techniques, such as the technique presented by Longuet-Higgins in the paper entitled "A computer algorithm for reconstructing a scene from two projections", Nature, Volume 293, 10 September 1981, are applied to the conjugate pairs of corresponding points to determine the heights of joints from the floor 220 in real space. Application of the above method requires predetermined mappings between cameras with overlapping fields of view. That data is stored in the calibration database 170 as non-linear functions determined during the calibration of the cameras 114 described above.
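A sketch of how a conjugate pair might be triangulated into a real-space position, and hence a height above the floor, is shown below, using OpenCV as an illustrative library. The projection matrices stand in for the per-camera-pair P matrices described earlier; the function and variable names are assumptions.

import numpy as np
import cv2

def triangulate_joint(P_a, P_b, xy_a, xy_b):
    """Triangulate a conjugate pair of joint observations into real space.

    P_a, P_b: 3x4 projection matrices for cameras A and B (from the stored
    calibration data). xy_a, xy_b: the joint's pixel coordinates in the
    synchronized frames from A and B. Returns (x, y, z); z is the height of
    the joint above the floor when the floor is the z = 0 plane.
    """
    pts_a = np.array(xy_a, dtype=np.float64).reshape(2, 1)
    pts_b = np.array(xy_b, dtype=np.float64).reshape(2, 1)
    homogeneous = cv2.triangulatePoints(P_a, P_b, pts_a, pts_b)  # shape (4, 1)
    x, y, z, w = homogeneous[:, 0]
    return x / w, y / w, z / w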

追蹤引擎110接收相應於來自具有重疊觀看域之相機的影像之序列中的影像之關節資料結構的陣列,並將相應於不同序列中的影像之關節資料結構的陣列中之元件的座標變換為具有真實空間中之座標的候選非足部關節。已識別的候選非足部關節係使用總體量度計算器702而被群集為具有真實空間中之座標的主體之集合。總體量度計算器702計算總體量度值並嘗試藉由檢查非足部關節之不同組合以將該值減至最小。於一實施例中,該總體量度為組織於四個種類中之啟發法的總和。用以識別候選關節之集合的該邏輯包含根據真實空間中之主體的關節之間的物理關係之啟發函數,用以將候選關節之集合識別為主體。介於關節之間的物理關係之範例被考量於如下所述之啟發法中。 The tracking engine 110 receives the arrays of joint data structures corresponding to images in sequences of images from cameras with overlapping fields of view, and translates the coordinates of elements in the arrays of joint data structures corresponding to images in different sequences into candidate non-foot joints having coordinates in the real space. The identified candidate non-foot joints are grouped into sets of subjects having coordinates in real space using the overall metric calculator 702. The overall metric calculator 702 calculates the overall metric value and attempts to minimize the value by checking different combinations of non-foot joints. In one embodiment, the overall metric is the sum of heuristics organized in four categories. The logic to identify sets of candidate joints comprises heuristic functions, based on physical relationships between joints of subjects in real space, to identify sets of candidate joints as subjects. Examples of physical relationships between joints are considered in the heuristics as described below.

第一種類的啟發法 first kind of heuristics

第一種類的啟發法包括量度,用以確定在相同或不同時刻於相同相機視角中介於兩個提議的主體-關節位置之間的相似度。於一實施例中,這些量度為浮點值,其中較高的值表示關節之兩列表極可能屬於相同主體。考量購物商店之範例實施例,該些量度係判定沿著時間維度從一影像至下一影像之於一相機中介於消費者的相同關節之間的距離。給定相機412之觀看域中的人A,第一組量度係判定從來自相機412之一影像至來自相機412之下一影像的介於人A之關節的各者之間的距離。該些量度被應用至來自相機114的每影像之關節資料結構的陣列中之關節資料結構600。 The first category of heuristics includes metrics to ascertain the similarity between two proposed subject-joint locations in the same camera view at the same or different moments in time. In one embodiment, these metrics are floating point values, where higher values mean the two lists of joints are more likely to belong to the same subject. Considering the example embodiment of the shopping store, the metrics determine the distance between a customer's same joints in one camera from one image to the next image along the time dimension. Given a person A in the field of view of the camera 412, the first set of metrics determines the distance between each of person A's joints from one image from the camera 412 to the next image from the camera 412. The metrics are applied to the joint data structures 600 in the arrays of joint data structures per image from the cameras 114.

於一實施例中,第一種類的啟發法中之兩個範例量度被列出於下: In one embodiment, two example metrics of the first type of heuristic are listed below:

1.介於地板上之兩個主體的左腳踝關節與地板上之兩個主體的右腳踝關節之間的歐幾里德2D座標距離之倒數(使用針對來自特定相機之特定影像的x,y座標值)係加總在一起。 1. The inverse of the Euclidean 2D coordinate distance between the left ankle joints of the two subjects on the floor and the right ankle joints of the two subjects on the floor (using x,y for a specific image from a specific camera coordinate values) are added together.

2.介於影像框中之主體的每一對非足部關節之間的歐幾里德2D座標距離之總和。 2. The sum of the Euclidean 2D coordinate distances between each pair of non-foot joints of the subject in the image frame.

第二種類的啟發法 The second kind of heuristic

第二種類的啟發法包括量度,用以確定在相 同時刻介於來自多數相機之觀看域的兩個提議的主體-關節位置之間的相似度。於一實施例中,這些量度為浮點值,其中較高的值表示關節之兩列表極可能屬於相同主體。考量購物商店之範例實施例,第二組量度係判定在相同時刻之來自二或更多相機(具有重疊觀看域)的影像框中介於消費者的相同關節之間的距離。 A second type of heuristic involves metrics, which are used to determine Similarity between two proposed subject-joint positions from the viewing field of most cameras at the same time. In one embodiment, these metrics are floating point values, where a higher value indicates that the two lists of joints are most likely to belong to the same subject. Considering the example embodiment of a shopping store, a second set of metrics determines the distance between the same joints of the consumer between image frames from two or more cameras (with overlapping viewing fields) at the same time.

於一實施例中,第二種類的啟發法中之兩個範例量度被列出於下: In one embodiment, two example metrics of the second type of heuristic are listed below:

1.介於地板上之兩個主體的左腳踝關節與地板上之兩個主體的右腳踝關節之間的歐幾里德2D座標距離之倒數(使用針對來自特定相機之特定影像的x,y座標值)係加總在一起。第一主體之腳踝關節位置被投影至相機,其中第二主體通過單應性映射為可見的。 1. The inverse of the Euclidean 2D coordinate distance between the left ankle joints of the two subjects on the floor and the right ankle joints of the two subjects on the floor (using x,y for a specific image from a specific camera coordinate values) are added together. The ankle joint position of the first subject is projected to the camera, where the second subject is visible through the homography mapping.

2.介於一線與一點之間的歐幾里德2D座標之倒數的所有對關節之總和,其中該線為從具有第一主體於其觀看域中之第一相機至具有第二主體於其觀看域中之第二相機的影像之關節的核線,而該點為來自第二相機之影像中的第二主體之關節。 2. The sum over all pairs of joints of the inverse of the Euclidean 2D coordinate distance between a line and a point, where the line is the epipolar line of a joint of the first subject, mapped from the first camera (having the first subject in its field of view) into the image of the second camera (having the second subject in its field of view), and the point is the joint of the second subject in the image from the second camera.

第三種類的啟發法 The third kind of heuristic

第三種類的啟發法包括量度,用以確定在相同時刻於相同相機視角中介於提議的主體-關節位置的所有關節之間的相似度。考量購物商店之範例實施例,此種 類的量度係判定在來自一相機之一框中介於消費者的關節之間的距離。 A third class of heuristics includes metrics to determine the similarity between all joints at the same time and in the same camera view at the proposed body-joint position. Considering the example embodiment of a shopping store, such The class metric determines the distance between the consumer's joints in a frame from a camera.

第四種類的啟發法 The fourth kind of heuristic

第四種類的啟發法包括量度,用以確定介於提議的主體-關節位置之間的相異度。於一實施例中,這些量度為浮點值。較高的值表示關節之兩列表更可能不是相同的主體。於一實施例中,此種類中之兩範例量度包括: A fourth class of heuristics includes metrics to determine the degree of dissimilarity between proposed body-joint positions. In one embodiment, these metrics are floating point values. A higher value indicates that the two lists of joints are more likely not to be the same body. In one embodiment, two example metrics in this category include:

1.介於兩個提議的主體的脖子關節之間的距離。 1. The distance between the neck joints of the two proposed subjects.

2.介於兩主體之間的介於多對關節之間的距離之總和。 2. The sum of the distances between pairs of joints between two bodies.

於一實施例中,其可憑經驗地被判定之各個臨限值被應用至以上列出的量度如下所述: In one embodiment, various threshold values, which can be determined empirically, are applied to the metrics listed above as follows:

1.臨限值,用以決定量度值何時夠小以考量其一關節屬於一已知主體。 1. Threshold value to determine when the metric is small enough to consider one of its joints belonging to a known subject.

2.臨限值,用以判定何時有太多潛在的候選主體,其一關節可屬於具有太好的量度相似度分數。 2. Threshold value to determine when there are too many potential candidates, one of which may belong to a metric similarity score that is too good.

3.臨限值,用以判定關節之集合何時具有夠高的量度相似度以被視為新主體,先前未出現在真實空間中。 3. Threshold value to determine when a set of joints has a high enough metric similarity to be considered a new subject, not previously present in real space.

4.臨限值,用以判定主體何時不再位於真實空間中。 4. Threshold value to determine when the subject is no longer in the real space.

5.臨限值,用以判定追蹤引擎110何時已產生錯誤並已混淆兩主體。 5. Threshold value to determine when the tracking engine 110 has generated an error and confused the two agents.

追蹤引擎110包括用以儲存其被識別為主體之關節的集合之邏輯。用以識別候選關節之集合的邏輯包括邏輯,用以判定在特定時間所取得之影像中所識別的候選關節是否符合其被識別為先前影像中之主體的候選關節之該些集合之一的成員。於一實施例中,追蹤引擎110係於規律的間隔比較主體之目前的關節位置與該相同主體之先前記錄的關節位置。此比較容許追蹤引擎110更新該真實空間中之主體的關節位置。此外,使用此方式,追蹤引擎110識別錯誤肯定(亦即,錯誤識別的主體)並移除其不再出現於該真實空間中之主體。 Tracking engine 110 includes logic to store the set of joints it identifies as a subject. The logic to identify the set of candidate joints includes logic to determine whether a candidate joint identified in an image taken at a particular time is a member of one of the sets of candidate joints that were identified as a subject in a previous image . In one embodiment, the tracking engine 110 compares a subject's current joint positions with previously recorded joint positions for the same subject at regular intervals. This comparison allows the tracking engine 110 to update the body's joint positions in the real space. Furthermore, using this approach, the tracking engine 110 identifies false positives (ie, misidentified subjects) and removes subjects that no longer appear in the real space.

Consider the example of a shopping store in which the tracking engine 110 created a customer (subject) at an earlier moment in time; however, after some time, the tracking engine 110 does not have current joint locations for that particular customer. This indicates that the customer was incorrectly created. The tracking engine 110 deletes incorrectly created subjects from the subject database 140. In one embodiment, the tracking engine 110 also uses the above process to remove positively identified subjects from the real space. Considering the example of the shopping store, when a customer leaves the shopping store, the tracking engine 110 deletes the corresponding customer record from the subject database 140. In one such embodiment, the tracking engine 110 updates the record of that customer in the subject database 140 to indicate that the customer has left the store.

In one embodiment, the tracking engine 110 attempts to identify subjects by applying the foot and non-foot heuristics simultaneously. This results in "islands" of connected joints of the subjects. As the tracking engine 110 processes further arrays of joint data structures along the time and space dimensions, the size of the islands increases. Eventually, islands of joints merge with other islands of joints to form subjects, which are then stored in the subject database 140. In one embodiment, the tracking engine 110 maintains a record of unassigned joints for a predetermined period of time. During this time, the tracking engine attempts to assign the unassigned joints to existing subjects or to create new multi-joint entities from these unassigned joints. The tracking engine 110 discards the unassigned joints after the predetermined period of time. It is understood that, in other embodiments, heuristics different from those listed above can be used to identify and track subjects.

In one embodiment, a user interface output device connected to the node 102 hosting the tracking engine 110 displays the position of each subject in the real space. In one such embodiment, the display of the output device is refreshed with the new locations of the subjects at regular intervals.

Subject Data Structure

The joints of subjects are connected to each other using the metrics described above. In doing so, the tracking engine 110 creates new subjects and updates the locations of existing subjects by updating the locations of their respective joints. Figure 8 shows a subject data structure 800 used to store a subject. The data structure 800 stores the subject-related data as a key-value dictionary. The key is a frame_number and the value is another key-value dictionary, in which the key is a camera_id and the value is a list of the 18 joints (of the subject) with their locations in the real space. The subject data is stored in the subject database 140. Each new subject is also assigned a unique identifier that is used to access the subject's data in the subject database 140. A minimal sketch of this nested dictionary is given below.
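A minimal sketch of the nested key-value structure described for the subject data structure 800. The field names and literal values are illustrative assumptions and are not taken from Figure 8.

```python
# Hypothetical subject entry following the description of data structure 800:
# frame_number -> { camera_id -> list of 18 joints with real space positions }
subject_800 = {
    "subject_id": 23,                      # unique identifier assigned to the subject
    "frames": {
        4521: {                            # frame_number
            "camera_3": [                  # camera_id
                {"joint_number": 1, "name": "neck", "position": (3.2, 7.1, 1.6)},
                # ... remaining joints, 18 entries in total
            ],
        },
    },
}
```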

In one embodiment, the system identifies the joints of a subject and creates a skeleton of the subject. The skeleton is projected into the real space to indicate the position and orientation of the subject in the real space. This is also referred to as "pose estimation" in the field of machine vision. In one embodiment, the system displays the orientation and position of the subject in the real space on a graphical user interface (GUI). In one embodiment, the image analysis is anonymous, i.e., the unique identifier assigned to a subject as a result of the joint analysis does not identify personal identification details (such as name, email address, mailing address, credit card number, bank account number, driver's license number, etc.) of any specific subject in the real space.

Process Flow for Subject Tracking

A number of flowcharts illustrating the logic are described herein. The logic can be implemented using processors configured as described above, programmed using computer programs stored in memory accessible to and executable by the processors, using dedicated logic hardware (including field programmable integrated circuits), or using combinations of dedicated logic hardware and computer programs. With all flowcharts herein, it will be appreciated that many of the steps can be combined, performed in parallel, or performed in a different sequence without affecting the functions achieved. In some cases, as the reader will appreciate, a rearrangement of steps will achieve the same results only if certain other changes are made as well. In other cases, as the reader will appreciate, a rearrangement of steps will achieve the same results only if certain conditions are satisfied. Furthermore, it will be appreciated that the flowcharts herein show only steps pertinent to an understanding of the embodiments, and it will be understood that numerous additional steps for accomplishing other functions can be performed before, after, and between those shown.

Figure 9 is a flowchart illustrating process steps for tracking subjects. The process starts at step 902. The cameras 114 having fields of view of an area of the real space are calibrated in process step 904. Video processes are performed at step 906 by the image recognition engines 112a-112n. In one embodiment, a video process is performed per camera to process batches of image frames received from the respective camera. The outputs of all video processes from the respective image recognition engines 112a-112n are given as input to a scene process performed by the tracking engine 110 at step 908. The scene process identifies new subjects and updates the joint locations of existing subjects. At step 910, it is checked whether there are more image frames to be processed. If there are more image frames, the process continues at step 906; otherwise, the process ends at step 914.

More detailed process steps of the process step 904 "calibrate cameras in real space" are presented in the flowchart of Figure 10. The calibration process starts at step 1002 by identifying a (0, 0, 0) point for the (x, y, z) coordinates of the real space. At step 1004, a first camera with the location (0, 0, 0) in its field of view is calibrated. Further details of camera calibration are presented earlier in this application. At step 1006, a next camera with a field of view overlapping that of the first camera is calibrated. At step 1008, it is checked whether there are more cameras to calibrate. The process is repeated at step 1006 until all cameras 114 are calibrated.

In the next process step 1010, a subject is introduced into the real space to identify conjugate pairs of corresponding points between cameras with overlapping fields of view. Some details of this process are described above. The process is repeated at step 1012 for every pair of overlapping cameras. If there are no more cameras, the process ends (step 1014).

The flowchart in Figure 11 shows more detailed steps of the "video process" step 906. At step 1102, k contiguously timestamped images per camera are selected as a batch for further processing. In one embodiment, the value k = 6 is calculated based on the available memory for the video processes in the network nodes 101a-101n, which respectively host the image recognition engines 112a-112n. In a next step 1104, the images are resized to an appropriate size. In one embodiment, an image has a width of 1280 pixels, a height of 702 pixels, and three channels RGB (representing the red, green, and blue colors). At step 1106, a plurality of trained convolutional neural networks (CNN) process the images and generate arrays of joint data structures per image. The output of the CNNs is an array of joint data structures per image (step 1108). This output is sent to the scene process at step 1110. A sketch of this per-camera loop is given below.
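The per-camera loop of steps 1102 to 1110 can be sketched as follows. The callable parameters stand in for the circular buffer, the resize routine, the trained joints CNN, and the hand-off to the scene process; all of these are assumed interfaces introduced for illustration only.

```python
from typing import Callable, List, Sequence

def video_process_step(
    take_batch: Callable[[int], Sequence],                # step 1102: returns k contiguous timestamped frames
    resize: Callable[[object, int, int], object],         # step 1104: resizes a frame to width x height
    joints_cnn: Callable[[List[object]], List[object]],   # step 1106: one array of joint data structures per image
    send_to_scene: Callable[[List[object]], None],        # step 1110: hands the arrays to the scene process
    k: int = 6,
    width: int = 1280,
    height: int = 702,
) -> None:
    batch = [resize(frame, width, height) for frame in take_batch(k)]
    send_to_scene(joints_cnn(batch))
```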

Figure 12A is a flowchart presenting a first part of more detailed steps for the "scene process" step 908 in Figure 9. The scene process combines the outputs from multiple video processes at step 1202. At step 1204, it is checked whether a joint data structure identifies a foot joint or a non-foot joint. If the joint data structure is of a foot joint, a homographic mapping is applied at step 1206 to combine the joint data structures corresponding to images from cameras with overlapping fields of view. This process identifies candidate foot joints (left and right foot joints). At step 1208, heuristics are applied to the candidate foot joints identified in step 1206 to identify sets of candidate foot joints as subjects. At step 1210, it is checked whether a set of candidate foot joints belongs to an existing subject. If not, a new subject is created at step 1212. Otherwise, the existing subject is updated at step 1214.

The flowchart in Figure 12B presents a second part of more detailed steps for the "scene process" step 908. At step 1240, the data structures of non-foot joints are combined from multiple arrays of joint data structures corresponding to images in the sequences of images from cameras with overlapping fields of view. This is performed by mapping corresponding points from a first image from a first camera to a second image from a second camera with an overlapping field of view. Some details of this process are described above. At step 1242, heuristics are applied to the candidate non-foot joints. At step 1246, it is determined whether a candidate non-foot joint belongs to an existing subject. If so, the existing subject is updated at step 1248. Otherwise, the candidate non-foot joint is processed again at step 1250 (after a predetermined time) to match it with an existing subject. At step 1252, it is checked whether the non-foot joint belongs to an existing subject. If so, the subject is updated at step 1256. Otherwise, the joint is discarded at step 1254.

In an example embodiment, the processes to identify new subjects, track subjects, and eliminate subjects (who have left the real space or were incorrectly created) are implemented as part of an "entity cohesion algorithm" performed by the runtime system (also referred to as the inference system). An entity is a constellation of joints, referred to above as a subject. The entity cohesion algorithm identifies entities in the real space and updates the locations of their joints in the real space to track the movement of the entities.

Figure 14 presents an illustration of the video processes 1411 and the scene process 1415. In the illustrated embodiment, four video processes are shown, each processing images from one or more cameras 114. The video processes process the images as described above and identify joints per frame. In one embodiment, each video process identifies the 2D coordinates, a confidence number, a joint number, and a unique ID for each joint in each frame. The outputs 1452 of all video processes are given as input 1453 to the scene process 1415. In one embodiment, the scene process creates a key-value dictionary of joints per moment in time, in which the key is the camera identifier and the value is the array of joints. The joints are re-projected into the perspectives of cameras with overlapping fields of view. The re-projected joints are stored as a key-value dictionary and can be used to create foreground subject masks for each image in each camera, as discussed below. The keys in this dictionary are combinations of joint id and camera id. The values in this dictionary are the 2D coordinates of the joint re-projected into the perspective of the target camera. A minimal sketch of these dictionaries is given below.
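A minimal sketch of the two dictionaries described above, with illustrative identifiers and coordinate values; the literal values and key formats are assumptions introduced here.

```python
# Hypothetical per-moment dictionary produced by the scene process:
# key = camera identifier, value = array of joints seen at this moment
joints_by_camera = {
    "camera_3": [
        {"joint_id": "j_901", "joint_number": 10, "x": 412, "y": 230, "confidence": 0.87},
    ],
}

# Hypothetical re-projection dictionary:
# key = combination of joint id and target camera id,
# value = 2D coordinates of the joint re-projected into that camera's perspective
reprojected_joints = {
    ("j_901", "camera_4"): (388, 245),
}
```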

The scene process 1415 produces an output 1457 containing a list of all subjects in the real space at a moment in time. The list includes a key-value dictionary per subject. The key is the unique identifier of the subject and the value is another key-value dictionary, with the key as the frame number and the value as a camera-subject joint key-value dictionary. The camera-subject joint key-value dictionary is a per-subject dictionary in which the key is the camera identifier and the value is a list of joints.

Image Analysis to Identify and Track Inventory Items per Subject

A system and various implementations for tracking puts and takes of inventory items by subjects in an area of real space are described with reference to Figures 15A to 25. The system and processes are described with reference to Figure 15A, an architectural level schematic of a system in accordance with an implementation. Because Figure 15A is an architectural diagram, certain details are omitted to improve the clarity of the description.

Architecture of Multi-CNN Pipelines

Figure 15A is a high-level architecture of pipelines of convolutional neural networks (also referred to as multi-CNN pipelines) processing image frames received from the cameras 114 to generate shopping cart data structures for the subjects in the real space. The system described herein includes per-camera image recognition engines as described above for identifying and tracking multi-joint subjects. Alternative image recognition engines can be used, including examples in which only one "joint" is recognized and tracked per individual, or in which other features covering space and time, or other types of image data, are utilized to recognize and track the subjects in the real space being processed.

The multi-CNN pipelines run in parallel per camera, moving images from the respective camera to the image recognition engines 112a-112n via a circular buffer 1502 per camera. In one embodiment, the system is comprised of three subsystems: a first image processor subsystem 2602, a second image processor subsystem 2604, and a third image processor subsystem 2606. In one embodiment, the first image processor subsystem 2602 includes the image recognition engines 112a-112n, which are implemented as convolutional neural networks (CNNs) and referred to as joints CNNs 112a-112n. As described in relation to Figure 1, the cameras 114 can be synchronized in time with each other so that images are captured at the same time (or close in time) and at the same image capture rate. The images captured in all cameras covering an area of the real space at the same time (or close in time) are synchronized, in the sense that the synchronized images can be identified in the processing engines as representing different views at a moment in time of subjects having fixed positions in the real space.

In one embodiment, the cameras 114 are installed in a shopping store (such as a supermarket) such that sets of cameras (two or more) with overlapping fields of view are positioned over each aisle to capture images of the real space in the store. There are N cameras in the real space; however, for simplicity, only one camera is shown in Figure 17A as camera(i), where the value of i ranges from 1 to N. Each camera produces a sequence of images of the real space corresponding to its respective field of view.

In one embodiment, the image frames corresponding to the sequences of images from each camera are sent to the respective image recognition engines 112a-112n at the rate of 30 frames per second (fps). Each image frame has a timestamp, an identity of the camera (abbreviated as "camera_id"), and a frame identity (abbreviated as "frame_id"), along with the image data. The image frames are stored in a circular buffer 1502 (also referred to as a ring buffer) per camera 114. The circular buffer 1502 stores a set of consecutively timestamped image frames from the respective camera 114. A sketch of such a buffer is given below.
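A minimal sketch of such a per-camera circular buffer, assuming a capacity of 110 frames (roughly 3.5 seconds at 30 fps); the class and field names are illustrative assumptions.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Frame:
    camera_id: str
    frame_id: int
    timestamp: float
    image: object          # pixel data, e.g., a numpy array

class CircularBuffer:
    """Ring buffer holding the most recent consecutively timestamped frames for one camera."""
    def __init__(self, capacity: int = 110):   # e.g., 3.5 seconds at ~30 fps
        self.frames = deque(maxlen=capacity)   # older frames are dropped automatically

    def push(self, frame: Frame) -> None:
        self.frames.append(frame)

    def latest(self, k: int):
        """Return the k most recent frames, e.g., a batch for the joints CNN."""
        return list(self.frames)[-k:]
```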

The joints CNNs process the sequences of image frames per camera and identify 18 different types of joints of each subject present in their respective fields of view. The outputs of the joints CNNs 112a-112n corresponding to cameras with overlapping fields of view are combined to map the locations of joints from the 2D image coordinates of each camera to the 3D coordinates of the real space. The joints data structure 800 per subject (j), where j equals 1 to x, identifies the locations of the joints of subject (j) in the real space. Details of the subject data structure 800 are presented in Figure 8. In an example embodiment, the joints data structure 800 is a two-level key-value dictionary of the joints of each subject. The first key is the frame_number and the value is a second key-value dictionary, with the key as the camera_id and the value as the list of joints assigned to the subject.

Data sets comprising the subjects identified by the joints data structure 800 and the corresponding image frames from the sequences of image frames per camera are given as input to a bounding box generator 1504 in the third image processor subsystem 2606. The third image processor subsystem further comprises foreground image recognition engines. In one embodiment, the foreground image recognition engines semantically recognize significant objects in the foreground (i.e., shoppers, their hands, and inventory items) as they relate, for example, to puts and takes of inventory items over time in the images from the respective cameras. In the example implementation shown in Figure 15A, the foreground image recognition engines are implemented as a WhatCNN 1506 and a WhenCNN 1508. The bounding box generator 1504 implements the logic to process the data sets to specify bounding boxes that include images of hands of the identified subjects in the images in the sequences of images. The bounding box generator 1504 identifies the locations of hand joints in each source image frame per camera using the locations of the hand joints in the multi-joints data structures 800 corresponding to the respective source image frame. In one embodiment, in which the coordinates of the joints in the subject data structure indicate the locations of joints in 3D real space coordinates, the bounding box generator maps the joint locations from the 3D real space coordinates to the 2D coordinates in the image frames of the respective source images.

The bounding box generator 1504 creates bounding boxes for the hand joints in the image frames in the circular buffer per camera 114. In one embodiment, a bounding box is a 128 pixels (width) by 128 pixels (height) portion of the image frame, with the hand joint located in the center of the bounding box. In other embodiments, the size of the bounding box is 64 pixels x 64 pixels or 32 pixels x 32 pixels. For m subjects in an image frame from a camera, there can be a maximum of 2m hand joints, and thus 2m bounding boxes. In practice, however, fewer than 2m hands are visible in an image frame because of occlusions caused by other subjects or other objects. In an example embodiment, the location of a hand of a subject is inferred from the locations of the elbow and wrist joints. For example, the right hand location of a subject is extrapolated using the locations of the right elbow (identified as p1) and the right wrist (identified as p2) as extrapolation_amount * (p2 - p1) + p2, where the extrapolation_amount equals 0.4. In another embodiment, the joints CNNs 112a-112n are trained using left and right hand images. Therefore, in such an embodiment, the joints CNNs 112a-112n directly identify the locations of the hand joints in the image frames per camera. The hand locations per image frame are used by the bounding box generator 1504 to create a bounding box per identified hand joint. A sketch of this extrapolation and bounding box construction is given below.
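A sketch of the hand-location extrapolation and bounding box construction described above. Clamping the box to the frame borders is an assumption added here for completeness; the function names are illustrative.

```python
import numpy as np

def hand_location(elbow_xy, wrist_xy, extrapolation_amount=0.4):
    """Extrapolate the hand position from the elbow (p1) and wrist (p2) joints:
    extrapolation_amount * (p2 - p1) + p2."""
    p1 = np.asarray(elbow_xy, dtype=float)
    p2 = np.asarray(wrist_xy, dtype=float)
    return extrapolation_amount * (p2 - p1) + p2

def bounding_box(hand_xy, frame_width, frame_height, size=128):
    """Axis-aligned box of size x size pixels centered on the hand joint, clamped to the frame."""
    half = size // 2
    cx, cy = int(round(hand_xy[0])), int(round(hand_xy[1]))
    left = min(max(cx - half, 0), frame_width - size)
    top = min(max(cy - half, 0), frame_height - size)
    return left, top, left + size, top + size
```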

The WhatCNN 1506 is a convolutional neural network trained to process the specified bounding boxes in the images to generate classifications of the hands of the identified subjects. One trained WhatCNN 1506 processes image frames from one camera. In the example embodiment of the shopping store, for each hand joint in each image frame, the WhatCNN 1506 identifies whether the hand joint is empty. The WhatCNN 1506 also identifies the SKU (stock keeping unit) number of the inventory item in the hand joint, a confidence value indicating that the item in the hand joint is a non-SKU item (i.e., it does not belong to the shopping store inventory), and a context of the hand joint location in the image frame.

The outputs of the WhatCNN models 1506 for all cameras 114 are processed by a single WhenCNN model 1508 for a predetermined window of time. In the example of the shopping store, the WhenCNN 1508 performs a time series analysis over both hands of a subject to identify whether the subject took a store inventory item from a shelf or put a store inventory item on a shelf. A shopping cart data structure 1510 (also referred to as a log data structure including a list of inventory items) is created per subject to keep a record of the store inventory items in the shopping cart (or basket) associated with that subject.

The second image processor subsystem 2604 receives the same data sets as those given as input to the third image processor, comprising the subjects identified by the joints data structure 800 and the corresponding image frames from the sequences of image frames per camera. The subsystem 2604 includes background image recognition engines, which semantically identify significant differences in the background (i.e., inventory display structures such as shelves) as they relate, for example, to puts and takes of inventory items over time in the images from the respective cameras. Selection logic components (not shown in Figure 15A) use confidence scores to select the output from either the second image processor or the third image processor to generate the shopping cart data structure 1510.

Figure 15B shows a coordination logic module 1522 that combines the results of multiple WhatCNN models and gives the combined result as input to a single WhenCNN model. As described above, two or more cameras with overlapping fields of view capture images of subjects in the real space. The joints of a single subject can appear in the image frames of multiple cameras in the respective image channels 1520. A separate WhatCNN model identifies the SKUs of inventory items in the hands (represented by the hand joints) of the subject. The coordination logic module 1522 combines the outputs of the WhatCNN models into a single consolidated input for the WhenCNN model. The WhenCNN model 1508 operates on the consolidated input to generate the shopping cart of the subject.

Detailed implementations of a system comprising the multi-CNN pipelines of Figure 15A are presented in Figures 16, 17, and 18. In the example of the shopping store, the system tracks puts and takes of inventory items by subjects in an area of real space. The area of real space is the shopping store, with inventory items placed in shelves organized in aisles as shown in Figures 2 and 3. It is understood that shelves containing inventory items can be organized in a variety of different arrangements. For example, shelves can be arranged in a line with their back sides against a wall of the shopping store and their front sides facing an open area in the real space. A plurality of cameras 114 with overlapping fields of view in the real space produce sequences of images of their corresponding fields of view. The field of view of one camera overlaps the field of view of at least one other camera, as shown in Figures 2 and 3.

Joints CNN - Identification and Update of Subjects

Figure 16 is a flowchart of the processing steps performed by the joints CNNs 112a-112n to identify subjects in the real space. In the example of the shopping store, the subjects are customers moving in the store in the aisles between shelves and in other open spaces. The process starts at step 1602. Note that, as described above, the cameras are calibrated before the sequences of images from the cameras are processed to identify subjects. Details of camera calibration are presented above. The cameras 114 with overlapping fields of view capture images of the real space in which subjects are present (step 1604). In one embodiment, the cameras are configured to generate synchronized sequences of images. The sequence of images from each camera is stored in the respective circular buffer 1502 per camera. A circular buffer (also referred to as a ring buffer) stores a sequence of images in a sliding window of time. In one embodiment, a circular buffer stores 110 image frames from the corresponding camera. In another embodiment, each circular buffer 1502 stores image frames for a time period of 3.5 seconds. It is understood that, in other embodiments, the number of image frames (or the time period) can be greater than or less than the example values listed above.

The joints CNNs 112a-112n receive the sequences of image frames from the corresponding cameras 114 (step 1606). Each joints CNN processes batches of images from the corresponding camera through multiple convolutional network layers to identify the joints of subjects in the image frames from that camera. The architecture and processing of images by an example convolutional neural network are presented in Figure 5. Because the cameras 114 have overlapping fields of view, the joints of a subject are identified by more than one joints CNN. The two-dimensional (2D) coordinates of the joints data structures 600 produced by the joints CNNs are mapped to the three-dimensional (3D) coordinates of the real space to identify the joint locations in the real space. Details of this mapping are presented in the discussion of Figure 7, in which the tracking engine 110 translates the coordinates of the elements in the arrays of joints data structures corresponding to images in different sequences of images into candidate joints having coordinates in the real space.

The joints of subjects are organized in two categories (foot joints and non-foot joints) for the purpose of grouping the joints into constellations, as discussed above. The left and right ankle joint types in the current example are considered foot joints for the purpose of this procedure. At step 1608, heuristics are applied to assign a candidate left foot joint and a candidate right foot joint to a set of candidate joints to create a subject. Following this, at step 1610, it is determined whether the newly identified subject already exists in the real space. If not, a new subject is created at step 1614; otherwise, the existing subject is updated at step 1612.

Other joints from the galaxy of candidate joints can be linked to the subject to build up a constellation of some or all of the joint types for the created subject. At step 1616, heuristics are applied to the non-foot joints to assign them to the identified subjects. The global metric calculator 702 calculates the global metric value and attempts to minimize this value by checking different combinations of non-foot joints. In one embodiment, the global metric is the sum of the heuristics organized in the four categories, as described above.

The logic to identify sets of candidate joints comprises heuristic functions, based on the physical relationships among the joints of subjects in the real space, to identify sets of candidate joints as subjects. At step 1618, the existing subjects are updated using the corresponding non-foot joints. If there are more images to process (step 1620), steps 1606 to 1618 are repeated; otherwise, the process ends at step 1622. A first data set is produced at the end of the process described above. The first data set identifies the subjects and the locations of the identified subjects in the real space. In one embodiment, the first data set is the joints data structure 800 per subject, presented above in relation to Figure 15A.

WhatCNN - Classification of Hand Joints

Figure 17 is a flowchart illustrating the processing steps for identifying inventory items in the hands of subjects identified in the real space. In the example of the shopping store, the subjects are customers in the shopping store. As the customers move in the aisles and open spaces, they pick up inventory items stocked in the shelves and put the items in their shopping carts or baskets. The image recognition engines identify subjects in the sets of images in the sequences of images received from the plurality of cameras. The system includes logic to process the sets of images in the sequences of images that include the identified subjects to detect takes of inventory items by the identified subjects and puts of inventory items on the shelves by the identified subjects.

In one embodiment, the logic to process the sets of images includes, for the identified subjects, logic to process the images to generate classifications of the images of the identified subjects. The classifications include whether the identified subject is holding an inventory item. The classifications include a first nearness classification indicating the location of a hand of the identified subject relative to a shelf. The classifications include a second nearness classification indicating the location of a hand of the identified subject relative to the body of the identified subject. The classifications further include a third nearness classification indicating the location of a hand of the identified subject relative to a basket associated with the identified subject. Finally, the classifications include an identifier of a likely inventory item.

In another embodiment, the logic to process the sets of images includes, for the identified subjects, logic to identify bounding boxes of data representing hands in the images in the sets of images of the identified subjects. The data in the bounding boxes is processed to generate classifications of the data within the bounding boxes for the identified subjects. In this embodiment, the classifications identify whether the identified subject is holding an inventory item. The classifications include a first nearness classification indicating the location of a hand of the identified subject relative to a shelf. The classifications include a second nearness classification indicating the location of a hand of the identified subject relative to the body of the identified subject. The classifications include a third nearness classification indicating the location of a hand of the identified subject relative to a basket associated with the identified subject. Finally, the classifications include an identifier of a likely inventory item.

The process starts at step 1702. At step 1704, the locations of the hands (represented by hand joints) of subjects in the image frames are identified. The bounding box generator 1504 identifies the hand locations of subjects per frame from each camera, using the joint locations identified in the first data set produced by the joints CNNs 112a-112n as described in Figure 18. Following this, at step 1706, the bounding box generator 1504 processes the data sets to specify bounding boxes that include images of the hands of the identified multi-joint subjects in the images in the sequences of images. Details of the bounding box generator are presented above in the discussion of Figure 15A.

A second image recognition engine receives the sequences of images from the plurality of cameras and processes the specified bounding boxes in the images to generate classifications of the hands of the identified subjects (step 1708). In one embodiment, each of the image recognition engines used to classify the subjects based on images of their hands comprises a trained convolutional neural network referred to as a WhatCNN 1506. The WhatCNNs are arranged in multi-CNN pipelines as described above in relation to Figure 15A. In one embodiment, the input to a WhatCNN is a multi-dimensional array B x W x H x C (also referred to as a B x W x H x C tensor). "B" is the batch size, indicating the number of image frames in a batch of images processed by the WhatCNN. "W" and "H" indicate the width and height of the bounding boxes in pixels, and "C" is the number of channels. In one embodiment, there are 30 images in a batch (B = 30), and the size of the bounding boxes is 32 pixels (width) by 32 pixels (height). There can be six channels, representing, respectively, red, green, blue, a foreground mask, a forearm mask, and an upper arm mask. The foreground mask, forearm mask, and upper arm mask are additional and optional input data sources for the WhatCNN in this example, which the CNN can include in the processing to classify information in the RGB image data. The foreground mask can be generated using, for example, a mixture of Gaussians algorithm. The forearm mask can be a line between the wrist and the elbow, providing context produced using information in the joints data structure. Likewise, the upper arm mask can be a line between the elbow and the shoulder, produced using information in the joints data structure. Different values of the B, W, H, and C parameters can be used in other embodiments. For example, in another embodiment, the size of the bounding boxes is larger, e.g., 64 pixels (width) by 64 pixels (height) or 128 pixels (width) by 128 pixels (height).

Each WhatCNN 1506 processes batches of images to generate classifications of the hands of the identified subjects. The classifications include whether the identified subject is holding an inventory item. The classifications include one or more classifications indicating the locations of the hands relative to the shelf and relative to the subject, which are used in detecting puts and takes. In this example, a first nearness classification indicates the location of the hand of the identified subject relative to the shelf. The classifications in this example include a second nearness classification indicating the location of the hand of the identified subject relative to the body of the identified subject, where a subject may hold an inventory item during shopping. The classifications in this example further include a third nearness classification indicating the location of the hand of the identified subject relative to a basket associated with the identified subject, where a "basket" in this context is a bag, basket, cart, or other object used by the subject to hold inventory items during shopping. Finally, the classifications include an identifier of a likely inventory item. The final layer of the WhatCNN 1506 produces logits, which are the raw values of the predictions. The logits are represented as floating point values and are further processed, as described below, to produce the classification results. In one embodiment, the output of the WhatCNN model includes a multi-dimensional array B x L (also referred to as a B x L tensor). "B" is the batch size, and "L = N + 5" is the number of logits output per image frame. "N" is the number of SKUs, representing the "N" unique inventory items for sale in the shopping store.

The output "L" per image frame is the raw activations from the WhatCNN 1506. The logits "L" are processed at step 1710 to identify the inventory item and the context. The first "N" logits represent the confidence that the subject is holding one of the "N" inventory items. The logits "L" include an additional five (5) logits, which are explained below. The first of these represents the confidence that the image of the item in the subject's hand is not one of the store SKU items (also referred to as a non-SKU item). The second logit indicates the confidence of whether the subject is holding an item. A large positive value indicates that the WhatCNN model has a high level of confidence that the subject is holding an item. A large negative value indicates that the model is confident that the subject is not holding any item. A value close to zero for the second logit indicates that the WhatCNN model is not confident in predicting whether the subject is holding an item.

The next three logits represent the first, second, and third nearness classifications, including: a first nearness classification indicating the location of the hand of the identified subject relative to the shelf, a second nearness classification indicating the location of the hand of the identified subject relative to the body of the identified subject, and a third nearness classification indicating the location of the hand of the identified subject relative to a basket associated with the identified subject. Thus, the three logits represent the context of the hand location, with one logit each indicating the confidence that the context of the hand is near to a shelf, near to a basket (or shopping cart), or near to the body of the subject. In one embodiment, the WhatCNN is trained using a training data set containing hand images in the three contexts: near to a shelf, near to a basket (or shopping cart), and near to the body of a subject. In another embodiment, a "nearness" parameter is used by the system to classify the context of the hand. In such an embodiment, the system determines the distances of the identified subject's hand to the shelf, the basket (or shopping cart), and the subject's body to classify the context.

The output of the WhatCNN is the "L" logits, comprising N SKU logits, 1 non-SKU logit, 1 holding logit, and 3 context logits, as described above. The SKU logits (the first N logits) and the non-SKU logit (the first logit following the N logits) are processed by a softmax function. As described above with reference to Figure 5, the softmax function transforms an arbitrary K-dimensional vector of real values into a K-dimensional vector of real values in the range [0, 1] that add up to 1. The softmax function calculates the probability distribution of the item over the N + 1 items. The output values are between 0 and 1, and the sum of all the probabilities equals one. The softmax function (used for multi-class classification) returns the probabilities of each class. The class that has the highest probability is the predicted class (also referred to as the target class).

The holding logit is processed by a sigmoid function. The sigmoid function takes a real number as input and produces an output value in the range of 0 to 1. The output of the sigmoid function identifies whether the hand is empty or holding an item. The three context logits are processed by a softmax function to identify the context of the hand joint location. At step 1712, it is checked whether there are more images to process. If so, steps 1704-1710 are repeated; otherwise, the process ends at step 1714. A sketch of this logit post-processing is given below.
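A sketch of this post-processing of the L = N + 5 logits, assuming the ordering described above (N SKU logits, then the non-SKU logit, the holding logit, and the three context logits); the function and field names are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))        # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def interpret_whatcnn_logits(logits, n_skus):
    """Split the L = N + 5 logits into item, holding, and context predictions."""
    item_probs = softmax(logits[:n_skus + 1])                # N SKU logits + 1 non-SKU logit
    holding_prob = sigmoid(logits[n_skus + 1])               # holding logit: empty hand vs. item in hand
    context_probs = softmax(logits[n_skus + 2:n_skus + 5])   # near shelf / near body / near basket
    return {
        "item_class": int(np.argmax(item_probs)),            # index n_skus means "non-SKU item"
        "item_probs": item_probs,
        "holding": float(holding_prob),
        "context_probs": context_probs,
    }
```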

WhenCNN - Time Series Analysis to Identify Puts and Takes of Items

In one embodiment, the system implements logic to perform a time series analysis over the classifications of subjects to detect takes and puts by the identified subjects based on the foreground image processing of the subjects. The time series analysis identifies gestures of the subjects and the inventory items associated with the gestures represented in the sequences of images.

The outputs of the WhatCNNs 1506 in the multi-CNN pipelines are given as input to the WhenCNN 1508, which processes these inputs to detect takes and puts by the identified subjects. Finally, the system includes logic, responsive to the detected takes and puts, to generate a log data structure including a list of inventory items for each identified subject. In the example of the shopping store, the log data structure is also referred to as the shopping cart data structure 1510 per subject.

Figure 18 presents a process implementing the logic to generate the shopping cart data structure per subject. The process starts at step 1802. The input to the WhenCNN 1508 is prepared at step 1804. The input to the WhenCNN is a multi-dimensional array B x C x T x Cams, where B is the batch size, C is the number of channels, T is the number of frames considered for a window of time, and Cams is the number of cameras 114. In one embodiment, the batch size "B" is 64 and the value of "T" is 110 image frames, or the number of image frames in a period of 3.5 seconds.

For each subject identified per image frame per camera, a list of 10 logits per hand joint (20 logits for both hands) is produced, as referenced below. The holding and context logits are part of the "L" logits produced by the WhatCNN 1506, as described above.

Figure 107126341-A0305-02-0067-4

The data structure described above is produced for each hand in an image frame and also includes data about the other hand of the same subject. For example, if the data is for the left hand joint of a subject, the corresponding values for the right hand are included as the "other" logits. The fifth logit (item number 3 in the list above, referred to as log_sku) is the log of the SKU logit in the "L" logits described above. The sixth logit is the log of the SKU logit of the other hand. A "roll" function produces the same information before and after the current frame. For example, the seventh logit (referred to as roll(log_sku, -30)) is the log of the SKU logit 30 frames earlier than the current frame. The eighth logit is the log of the SKU logit for that hand 30 frames later than the current frame. The ninth and tenth entries in the list are similar data for the other hand, 30 frames earlier and 30 frames later than the current frame. A similar data structure is also produced for the other hand, resulting in a total of 20 logits per subject per image frame per camera. Therefore, the number of channels in the input to the WhenCNN is 20 (i.e., C = 20 in the multi-dimensional array B x C x T x Cams).

For all image frames in the batch of image frames from each camera (e.g., B = 64), similar data structures of 20 hand logits per subject identified in the image frame are produced. A window of time (T = 3.5 seconds, or 110 image frames) is used to search forward and backward over image frames in the sequence of image frames for the hand joints of the subjects. At step 1806, the 20 hand logits per subject per frame are consolidated from the multi-CNN pipelines. In one embodiment, the batch of image frames (64) can be imagined as a smaller window of image frames placed in the middle of the larger window of 110 image frames, with additional image frames on both sides for forward and backward searching. The input B x C x T x Cams to the WhenCNN 1508 is composed of the 20 logits for both hands of the subjects identified in the batch "B" of image frames from all cameras 114 (referred to as "Cams"). The consolidated input is given to a single trained convolutional neural network, referred to as the WhenCNN model 1508. A sketch of this input assembly, including the roll operation, is given below.
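One plausible composition of the 10 values per hand, reconstructed from the description above; the exact list appears in the figure referenced earlier and is not reproduced here. The helper names and the clamped roll behavior at the ends of the sequence are assumptions introduced for illustration.

```python
def roll_logit(series, frame_index, offset):
    """Value of a per-frame logit series shifted by `offset` frames, clamped at the ends."""
    i = min(max(frame_index + offset, 0), len(series) - 1)
    return series[i]

def hand_channels(holding, context, log_sku, other_log_sku, frame_index):
    """A plausible 10-value vector per hand per frame fed to the WhenCNN (20 for both hands)."""
    return [
        holding[frame_index],
        *context[frame_index],                         # 3 context logits for this hand
        log_sku[frame_index],                          # log of the SKU logit for this hand
        other_log_sku[frame_index],                    # log of the SKU logit for the other hand
        roll_logit(log_sku, frame_index, -30),         # this hand, 30 frames earlier
        roll_logit(log_sku, frame_index, +30),         # this hand, 30 frames later
        roll_logit(other_log_sku, frame_index, -30),   # other hand, 30 frames earlier
        roll_logit(other_log_sku, frame_index, +30),   # other hand, 30 frames later
    ]
```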

The output of the WhenCNN model comprises 3 logits, representing the confidence in three possible actions of an identified subject: taking an inventory item from a shelf, putting an inventory item back on a shelf, and no action. The three output logits are processed by a softmax function to predict the action performed. The three classification logits are generated at regular intervals for each subject, and the results are stored per person along with a timestamp. In one embodiment, the three logits are generated every twenty frames per subject. In such an embodiment, at an interval of every twenty image frames per subject, a window of 110 image frames is formed around the current image frame.

A time series analysis of these three logits per subject over a period of time is performed (step 1808) to identify gestures corresponding to true events and the times of their occurrence. A non-maximum suppression (NMS) algorithm is used for this purpose. Because an event (i.e., the put or take of an item by a subject) is detected by the WhenCNN 1508 multiple times (from the same camera and from multiple cameras), the NMS removes the redundant events for a subject. NMS is a rescoring technique comprising two main tasks: a "matching loss" that penalizes redundant detections, and "joint processing" of neighbors to determine whether there is a better detection nearby.

The true take and put events for each subject are further processed by calculating an average of the SKU logits over the 30 image frames prior to the image frame with the true event. Finally, the arguments of the maxima (abbreviated arg max or argmax) are used to determine the largest value. The inventory item classified by the argmax value is used to identify the inventory item put on or taken from the shelf. The inventory item is added to the log of SKUs (also referred to as the shopping cart or basket) of the respective subject at step 1810. The process steps 1804 to 1810 are repeated if there is more classification data (checked at step 1812). Over a period of time, this processing results in updates to the shopping cart or basket of each subject. The process ends at step 1814. A sketch of this event classification is given below.
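A sketch of this event classification step. The handling of puts in apply_event is an assumption, since this passage only states that items are added to the log of SKUs; the function names are illustrative.

```python
import numpy as np

def classify_event_item(sku_logits_by_frame, event_frame, lookback=30):
    """Average the per-frame SKU logits over the 30 frames before the event frame and
    take the argmax as the inventory item involved in the detected put or take."""
    start = max(event_frame - lookback, 0)
    window = np.asarray(sku_logits_by_frame[start:event_frame])
    if window.size == 0:
        return None                      # no frames before the event; nothing to classify
    return int(np.argmax(window.mean(axis=0)))

def apply_event(cart, action, sku):
    """Illustrative update of a subject's log of SKUs (shopping cart) for a detected event."""
    if action == "take":
        cart.append(sku)
    elif action == "put" and sku in cart:  # assumed behavior for puts, not stated in this passage
        cart.remove(sku)
    return cart
```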

WhatCNN with Scene and Video Processes

Figure 19 presents an embodiment of the system in which data from the scene process 1415 and the video processes 1411 are provided as input to the WhatCNN model 1506 to generate hand image classifications. Note that the output of each video process is provided to a separate WhatCNN model. The output from the scene process 1415 is a dictionary of joints. In this dictionary, the key is a unique joint identifier and the value is the unique subject identifier with which the joint is associated. If no subject is associated with a joint, the joint is not included in the dictionary. Each video process 1411 receives the dictionary of joints from the scene process and stores it into a ring buffer that maps frame numbers to the returned dictionary. Using the returned key-value dictionary, the video process selects, at each moment in time, the subsets of the image that are close to the hands associated with the identified subjects. These portions of the image frames around the hand joints can be referred to as region proposals.
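A minimal sketch of the dictionary of joints and of a ring buffer keyed by frame number is shown below; the identifier values, the RingBuffer class, and the buffer length are illustrative assumptions.

    from collections import deque

    # Dictionary of joints from the scene process: joint identifier -> subject identifier.
    joints_dict = {901: 17, 902: 17, 905: 23}

    class RingBuffer:
        def __init__(self, size=110):
            self.entries = deque(maxlen=size)      # (frame_number, joints_dict) pairs

        def push(self, frame_number, joints):
            self.entries.append((frame_number, joints))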

In the example of a shopping store, the region proposals are the image frames of hand locations from one or more cameras having the subject in their corresponding fields of view. The region proposals are generated by every camera in the system. They include empty hands as well as hands carrying inventory items of the shopping store and items that do not belong to the shopping store inventory. The video process selects the portions of the image frames containing the hand joints at each moment in time. Corresponding slices of the foreground masks are generated. The above (the image portions around the hand joints and the foreground masks) are concatenated with the dictionary of joints (indicating the subject to which each hand joint belongs) to produce a multi-dimensional array. This output of the video processes is provided as input to the WhatCNN model.

The classification results of the WhatCNN model are stored in the region proposal data structures (produced by the video processes). All regions for a moment in time are then provided as input to the scene process. The scene process stores the results in a key-value dictionary, where the key is a subject identifier and the value is a key-value dictionary, where the key is a camera identifier and the value is the logits of a region. This aggregated data structure is then stored in a ring buffer that maps frame numbers to the aggregated structure at each moment in time.

WhenCNN with Scene and Video Processes

Figure 20 presents an embodiment of the system in which the WhenCNN 1508 receives the output of the scene process, following the hand image classification performed by the WhatCNN model of each video process, as explained in Figure 19. The region proposal data structures for a period of time (e.g., for one second) are provided as input to the scene process. In an embodiment in which the cameras capture images at a rate of 30 frames per second, the input includes 30 time periods and the corresponding region proposals. The scene process reduces the 30 region proposals (per hand) to a single integer representing the SKU of an inventory item. The output of the scene process is a key-value dictionary in which the key is a subject identifier and the value is the SKU integer.

The WhenCNN model 1508 performs a time series analysis to determine the evolution of this dictionary over time. This results in the identification of items taken from the shelves and items placed on the shelves in the shopping store. The output of the WhenCNN model is a key-value dictionary in which the key is a subject identifier and the value is the logits produced by the WhenCNN. In one embodiment, a set of heuristics 2002 is used to determine the shopping cart data structure 1510 per subject. The heuristics are applied to the output of the WhenCNN, the locations of the joints of the subjects indicated by their respective joints data structures, and the planogram. The planogram is a precomputed map of the inventory items on the shelves. The heuristics 2002 determine, for each take or put, whether the inventory item was put on or taken from a shelf, whether the inventory item was put in or taken from a shopping cart (or basket), or whether the inventory item is close to the body of the identified subject.

Example Architecture of the WhatCNN Model

Figure 21 presents an example architecture of the WhatCNN model 1506. In this example architecture, there are a total of 26 convolutional layers. The dimensions of the different layers are also presented in terms of their respective width (pixels), height (pixels), and number of channels. The first convolutional layer 2113 receives the input 2111 and has a width of 64 pixels, a height of 64 pixels, and 64 channels (written as 64x64x64). The details of the input to the WhatCNN are presented above. The direction of the arrows indicates the flow of data from one layer to the following layers. The second convolutional layer 2115 has dimensions of 32x32x64. Following the second layer, there are eight convolutional layers (shown in box 2117), each with dimensions of 32x32x64. Only two of these layers, 2119 and 2121, are shown in box 2117 for the purpose of clarity. These are followed by another eight convolutional layers 2123 with dimensions of 16x16x128. Two such convolutional layers, 2125 and 2127, are shown in Figure 21. Finally, the last eight convolutional layers 2129 have dimensions of 8x8x256 each. Two convolutional layers, 2131 and 2133, are shown in box 2129 for illustration.

There is a fully connected layer 2135 with 256 inputs from the last convolutional layer 2133, which produces N+5 outputs. As described above, "N" is the number of SKUs, representing the "N" unique inventory items for sale in the shopping store. The five additional logits include a first logit representing the confidence that the item in the image is a non-SKU item, and a second logit representing the confidence that the subject is holding an item. The next three logits represent the first, second, and third proximity classifications, as described above. The final output of the WhatCNN is shown at 2137. The example architecture uses batch normalization (BN). The distribution of each layer in a convolutional neural network (CNN) changes during training, and it varies from one layer to another. This reduces the convergence speed of the optimization algorithm. Batch normalization (Ioffe and Szegedy 2015) is a technique to overcome this problem. ReLU (rectified linear unit) activation is used for the non-linearity in each layer, except for the final output where softmax is used.
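A minimal sketch of the example architecture, written with a Keras-style API, is shown below. The function name make_what_cnn, the use of strided convolutions for the spatial downsampling, and the global average pooling that reduces the 8x8x256 map to the 256 inputs of the fully connected layer are assumptions for illustration, not details taken from the figure.

    import tensorflow as tf

    def make_what_cnn(num_sku):
        # Input 2111: a 64x64 hand region with 64 channels.
        inputs = tf.keras.Input(shape=(64, 64, 64))

        def conv_bn_relu(x, filters, stride=1):
            # Each layer uses batch normalization followed by a ReLU non-linearity.
            x = tf.keras.layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
            x = tf.keras.layers.BatchNormalization()(x)
            return tf.keras.layers.ReLU()(x)

        x = conv_bn_relu(inputs, 64)            # layer 2113: 64x64x64
        x = conv_bn_relu(x, 64, stride=2)       # layer 2115: 32x32x64
        for _ in range(8):                      # box 2117: eight layers at 32x32x64
            x = conv_bn_relu(x, 64)
        for i in range(8):                      # box 2123: eight layers at 16x16x128
            x = conv_bn_relu(x, 128, stride=2 if i == 0 else 1)
        for i in range(8):                      # box 2129: eight layers at 8x8x256
            x = conv_bn_relu(x, 256, stride=2 if i == 0 else 1)

        x = tf.keras.layers.GlobalAveragePooling2D()(x)   # 256 inputs to the FC layer 2135
        logits = tf.keras.layers.Dense(num_sku + 5)(x)    # N SKU logits plus 5 additional logits
        return tf.keras.Model(inputs, logits)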

Figures 22, 23, and 24 are graphical visualizations of different parts of an implementation of the WhatCNN 1506. The graphs are adapted from the graph visualization of the WhatCNN model produced by TensorBoard™. TensorBoard™ is a suite of visualization tools for inspecting and understanding deep learning models, e.g., convolutional neural networks.

Figure 22 shows the high level architecture of the convolutional neural network model that detects a single hand (the "single hand" model 2210). The WhatCNN model 1506 comprises two such convolutional neural networks for detecting the left and right hands, respectively. In the illustrated embodiment, the architecture includes four blocks, referred to as block 0 2216, block 1 2218, block 2 2220, and block 3 2222. A block is a higher level abstraction and comprises multiple nodes representing convolutional layers. The blocks are arranged in a sequence from lower to higher, such that the output of one block is the input to the following block. The architecture also includes a pooling layer 2214 and a convolutional layer 2212. Between the blocks, different non-linearities can be used. In the illustrated embodiment, the ReLU non-linearity is used, as described above.

In the illustrated embodiment, the input to the single hand model 2210 is a BxWxHxC tensor, defined above in the description of the WhatCNN 1506. "B" is the batch size, "W" and "H" indicate the width and height of the input image, and "C" is the number of channels. The output of the single hand model 2210 is combined with the output of the second single hand model and passed to a fully connected network.

During training, the output of the single hand model 2210 is compared with the ground truth. The prediction error calculated between the output and the ground truth is used to update the weights of the convolutional layers. In the illustrated embodiment, stochastic gradient descent (SGD) is used for training the WhatCNN 1506.

Figure 23 presents further details of block 0 2216 of the single hand convolutional neural network model of Figure 22. It comprises four convolutional layers, labelled conv0 (in box 2310), conv1 2318, conv2 2320, and conv3 2322. Further details of the convolutional layer conv0 are presented in box 2310. The input is processed by a convolutional layer 2312. The output of the convolutional layer is processed by a batch normalization layer 2314. A ReLU non-linearity 2316 is applied to the output of the batch normalization layer 2314. The output of the convolutional layer conv0 is passed to the next layer, conv1 2318. The output of the last convolutional layer conv3 is processed by an addition operation 2324. This operation sums the output of the layer conv3 2322 with its unmodified input received through a skip connection 2326. It has been shown by He et al. in their paper entitled "Identity Mappings in Deep Residual Networks" (available at https://arxiv.org/pdf/1603.05027.pdf, published July 25, 2016) that forward and backward signals can be directly propagated from one block to any other block. The signal propagates unchanged through the convolutional neural network. This technique improves the training and test performance of deep convolutional neural networks.
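A minimal sketch of such a block with a skip connection, using a Keras-style API, is shown below. The function names are illustrative, and it assumes the block input already has the same number of channels as the conv3 output so that the addition is well defined.

    import tensorflow as tf

    def conv_bn_relu(x, filters):
        x = tf.keras.layers.Conv2D(filters, 3, padding="same")(x)
        x = tf.keras.layers.BatchNormalization()(x)
        return tf.keras.layers.ReLU()(x)

    def block0(x, filters=64):
        # Four convolutional layers conv0..conv3; the unmodified block input is
        # added to the conv3 output through the skip connection 2326.
        shortcut = x
        for _ in range(4):
            x = conv_bn_relu(x, filters)
        return tf.keras.layers.Add()([x, shortcut])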

As described with reference to Figure 21, the output of the convolutional layers of the WhatCNN is processed by a fully connected layer. The outputs of the two single hand models 2210 are combined and passed as input to the fully connected layer. Figure 24 is an example implementation of the fully connected (FC) layer 2410. The input to the FC layer is processed by a reshape operator 2412. The reshape operator changes the shape of a tensor before passing it to the next layer 2420. Reshaping includes flattening the output of the convolutional layers, i.e., reshaping the output from a multi-dimensional matrix to a one-dimensional matrix or vector. The output of the reshape operator 2412 is passed to a matrix multiplication operator labelled MatMul 2422. The output of MatMul 2422 is passed to a matrix addition operator labelled xw_plus_b 2424. For each input "x", the operator 2424 multiplies the input by a matrix "w" and adds a vector "b" to produce the output. "w" is a trainable parameter associated with the input "x", and "b" is another trainable parameter referred to as the bias or intercept. The output 2426 of the fully connected layer 2410 is a BxL tensor, as explained above in the description of the WhatCNN 1506. "B" is the batch size, and "L=N+5" is the number of logits output per image frame. "N" is the number of SKUs, representing the "N" unique inventory items for sale in the shopping store.
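A minimal sketch of the reshape and affine steps is shown below, assuming the two single hand feature maps are concatenated before flattening; the function and variable names are illustrative.

    import tensorflow as tf

    def fully_connected(left_features, right_features, w, b):
        # Combine the outputs of the two single hand models, flatten them, and
        # apply x*w + b to produce the B x L logits, where L = N + 5.
        x = tf.concat([left_features, right_features], axis=-1)
        x = tf.reshape(x, [tf.shape(x)[0], -1])   # flatten to one vector per example
        return tf.matmul(x, w) + b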

Training of the WhatCNN Model

A training data set of images of hands holding different inventory items in different contexts, as well as of empty hands in different contexts, is generated. To achieve this, human actors hold each unique SKU inventory item in many different ways, at different locations of a test environment. The contexts of their hands range from being close to the actor's body, close to a store shelf, and close to the actor's shopping cart or basket. The actors also perform the above actions with empty hands. This procedure is completed for both the left and right hands. Multiple actors perform these actions simultaneously in the same test environment to simulate the natural occlusion that occurs in a real shopping store.

The cameras 114 take images of the actors performing the above actions. In one embodiment, twenty cameras are used in this procedure. The joints CNNs 112a-112n and the tracking engine 110 process the images to identify joints. The bounding box generator 1504 creates bounding boxes of hand regions, as in production or inference. Instead of classifying these hand regions via the WhatCNN 1506, the images are saved to a storage disk. The stored images are reviewed and labelled. Three labels are assigned to the images: the inventory item SKU, the context, and whether the hand is holding something. This procedure is performed for a large number of images (up to millions of images).

The image files are organized according to data collection scenes. The naming convention for the image files identifies the content and context of the images. Figure 25 shows an image file name in an example embodiment. The first part of the file name (referred to by the numeral 2502) identifies the data collection scene and also includes the timestamp of the image. The second part 2504 of the file name identifies the source camera. In the example shown in Figure 25, the image is captured by "camera 4". The third part 2506 of the file name identifies the frame number from the source camera. In the example shown, the file name indicates that it is the 94,600th image frame from camera 4. The fourth part 2508 of the file name identifies the range of x and y coordinates of the region in the source image frame from which this hand region image is taken. In the example shown, the region is defined between x coordinate values from pixel 117 to pixel 370 and y coordinate values from pixel 370 to pixel 498. The fifth part 2510 of the file name identifies the person id of the actor in the scene. In the example shown, the person in the scene has id "3". Finally, the sixth part 2512 of the file name identifies the SKU number of the inventory item (item=68) identified in the image.

In the training mode of the WhatCNN 1506, a forward pass and backpropagation are performed, in contrast to the production mode in which only the forward pass is performed. During training, the WhatCNN generates the classifications of the hands of the identified subjects in the forward pass. The output of the WhatCNN is compared with the ground truth. In backpropagation, the gradients of one or more cost functions are calculated. The gradients are then propagated through the convolutional neural network (CNN) and the fully connected (FC) neural network so that the prediction error is reduced, causing the output to move closer to the ground truth. In one embodiment, stochastic gradient descent (SGD) is used for training the WhatCNN 1506.

In one embodiment, 64 images are randomly selected from the training data and augmented. The purpose of the image augmentation is to diversify the training data, which results in better performance of the model. The image augmentation includes random flipping of images, random rotations, random hue shifts, random Gaussian noise, random contrast changes, and random cropping. The amount of augmentation is a hyperparameter and is tuned through a hyperparameter search. The augmented images are classified by the WhatCNN 1506 during training. The classification is compared with the ground truth, and the coefficients or weights of the WhatCNN 1506 are updated by calculating the gradient of a loss function and multiplying the gradient by a learning rate. The above procedure is repeated many times (e.g., about 1000 times) to form an epoch. Between 50 and 200 epochs are performed. During each epoch, the learning rate is decreased slightly, following a cosine annealing schedule.
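A minimal sketch of a cosine annealing learning rate schedule is shown below; the function name and the choice to index the schedule by epoch are illustrative assumptions.

    import math

    def cosine_annealed_lr(base_lr, epoch, total_epochs):
        # The learning rate is decreased slightly during each epoch, following a
        # cosine annealing schedule from base_lr down toward zero.
        return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))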

Training of the WhenCNN Model

The training of the WhenCNN 1508 is similar to that of the WhatCNN 1506 described above, using backpropagation to reduce the prediction error. Actors perform a variety of actions in a training environment. In an example embodiment, the training is performed in a shopping store with its shelves stocked with inventory items. Examples of actions performed by the actors include taking an inventory item from a shelf, putting an inventory item back on a shelf, putting an inventory item into a shopping cart (or basket), taking an inventory item back out of the shopping cart, swapping an item between the left and right hands, and putting an inventory item into a concealed location on the actor's body. A concealed location refers to a place on the actor's body, other than the left and right hands, where an inventory item can be held. Some examples of concealed locations include an inventory item squeezed between the forearm and the upper arm, squeezed between the forearm and the chest, or squeezed between the neck and a shoulder.

The cameras 114 record videos of all the actions described above during training. The videos are reviewed, and all image frames are labelled to indicate the timestamp and the action performed. These labels are referred to as action labels for the respective image frames. The image frames are processed through the multi-CNN pipelines, up to and including the WhatCNN 1506, as described above for production or inference. The output of the WhatCNN (along with the corresponding action labels) is then used to train the WhenCNN 1508, with the action labels acting as the ground truth. Stochastic gradient descent (SGD) with a cosine annealing schedule is used for this training, as described above for the training of the WhatCNN 1506.

In addition to the image augmentation (used in the training of the WhatCNN), temporal augmentation is also applied to the image frames during the training of the WhenCNN. Some examples include mirroring, adding Gaussian noise, swapping the logits associated with the left and right hands, compressing time, shortening the time series by dropping image frames, lengthening the time series by duplicating frames, and dropping data points in the time series to simulate flaws in the base models that generate the input to the WhenCNN. Mirroring includes reversing the time series and the corresponding labels; for example, a put action becomes a take action when reversed.

Using Background Image Processing to Predict Inventory Events

A system and various implementations for tracking changes by subjects in an area of real space are described with reference to Figures 26 to 28B.

System Architecture

Figure 26 presents a high level schematic of a system in accordance with an implementation. Because Figure 26 is an architectural diagram, certain details are omitted to improve the clarity of the description.

The system presented in Figure 26 receives image frames from a plurality of cameras 114. As described above, in one embodiment, the cameras 114 can be synchronized in time with each other, so that images are captured simultaneously (or close in time) and at the same image capture rate. Images captured in all of the cameras covering an area of real space at the same time (or close in time) are synchronized, in the sense that the synchronized images can be identified in the processing engines as representing different views at a moment in time of subjects having fixed positions in the real space.

In one embodiment, the cameras 114 are installed in a shopping store (such as a supermarket) such that sets of cameras (two or more) with overlapping fields of view are positioned over each aisle to capture images of the real space in the store. There are "n" cameras in the real space. Each camera produces a sequence of images of the real space corresponding to its respective field of view.

The subject identification subsystem 2602 (also referred to as the first image processors) processes the image frames received from the cameras 114 to identify and track subjects in the real space. The first image processors include subject image recognition engines. The subject image recognition engines receive corresponding sequences of images from the plurality of cameras and process the images to identify subjects represented in the images in the corresponding sequences of images. In one embodiment, the system includes per-camera image recognition engines as described above for identifying and tracking multi-joint subjects. Alternative image recognition engines can be used, including examples in which only one "joint" is recognized and tracked per individual, or in which other features over space and time or other types of image data are utilized to recognize and track the subjects in the real space being processed.

The "semantic diffing" subsystem 2604 (also referred to as the second image processors) includes background image recognition engines that receive corresponding sequences of images from the plurality of cameras and recognize semantically significant differences in the background (i.e., in inventory display structures such as shelves) as they relate, for example, to puts and takes of inventory items over time in the images from each camera. The second image processors receive the output of the subject identification subsystem 2602 and the image frames from the cameras 114 as input. The second image processors mask the identified subjects in the foreground to generate masked images. The masked images are generated by replacing the bounding boxes corresponding to foreground subjects with background image data. Following this, the background image recognition engines process the masked images to identify and classify background changes represented in the images in the corresponding sequences of images. In one embodiment, the background image recognition engines comprise convolutional neural networks.

Finally, the second image processors process the identified background changes to make a first set of detections of takes of inventory items by identified subjects and of puts of inventory items on inventory display structures by identified subjects. The first set of detections is also referred to as the background detections of puts and takes of inventory items. In the example of a shopping store, the first detections identify inventory items taken from the shelves or put on the shelves by customers or employees of the store. The semantic diffing subsystem includes logic to associate the identified background changes with identified subjects.

The region proposals subsystem 2606 (also referred to as the third image processors) includes foreground image recognition engines that receive corresponding sequences of images from the plurality of cameras 114 and recognize semantically significant objects in the foreground (i.e., shoppers, their hands, and inventory items) as they relate, for example, to puts and takes of inventory items over time in the images from each camera. The subsystem 2606 also receives the output of the subject identification subsystem 2602. The third image processors process the sequences of images from the cameras 114 to identify and classify foreground changes represented in the images in the corresponding sequences of images. The third image processors process the identified foreground changes to make a second set of detections of takes of inventory items by identified subjects and of puts of inventory items on inventory display structures by identified subjects. The second set of detections is also referred to as the foreground detections of puts and takes of inventory items. In the example of a shopping store, the second set of detections identifies takes of inventory items and puts of inventory items on inventory display structures by customers and employees of the store.

The system described in Figure 26 includes a selection logic component 2608 to process the first and second sets of detections to generate log data structures including lists of inventory items of the identified subjects. For a take or put in the real space, the selection logic 2608 selects the output of either the semantic diffing subsystem 2604 or the region proposals subsystem 2606. In one embodiment, the selection logic 2608 makes this selection using the confidence scores generated by the semantic diffing subsystem for the first set of detections and the confidence scores generated by the region proposals subsystem for the second set of detections. The output of the subsystem with the higher confidence score for a particular detection is selected and used to generate the log data structure 1510 (also referred to as the shopping cart data structure), which includes the list of inventory items associated with the identified foreground subject.
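A minimal sketch of the selection step is shown below, assuming each subsystem reports its detection as a dictionary carrying a confidence score; the field names are illustrative.

    def select_detection(background_detection, foreground_detection):
        # Each detection of a put or take carries the confidence score produced
        # by its subsystem; the detection with the higher score is used to update
        # the log data structure 1510 of the identified subject.
        if background_detection["confidence"] >= foreground_detection["confidence"]:
            return background_detection
        return foreground_detection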

Subsystem Components

Figure 27 presents the subsystem components implementing the system for tracking changes by subjects in an area of real space. The system comprises a plurality of cameras 114 producing respective sequences of images of corresponding fields of view in the real space. The field of view of each camera overlaps with the field of view of at least one other camera in the plurality of cameras, as described above. In one embodiment, the sequences of image frames corresponding to the images produced by the plurality of cameras 114 are stored in a circular buffer 1502 (also referred to as a ring buffer) per camera 114. Each image frame has a timestamp, an identity of the camera (abbreviated as "camera_id"), and a frame identity (abbreviated as "frame_id"), along with the image data. The circular buffer 1502 stores a set of consecutively timestamped image frames from the respective camera 114. In one embodiment, the cameras 114 are configured to generate synchronized sequences of images.

The same cameras and the same sequences of images are used by both the foreground and background image processors in a preferred implementation. As a result, redundant detections of puts and takes of inventory items are performed using the same input data, allowing high confidence (and high accuracy) in the resulting data.

The subject identification subsystem 2602 (also referred to as the first image processors) includes subject image recognition engines receiving corresponding sequences of images from the plurality of cameras 114. The subject image recognition engines process the images to identify subjects represented in the images in the corresponding sequences of images. In one embodiment, the subject image recognition engines are implemented as convolutional neural networks (CNNs) referred to as joints CNNs 112a-112n. The outputs of the joints CNNs 112a-112n corresponding to cameras with overlapping fields of view are combined to map the locations of joints from the 2D image coordinates of each camera to the 3D coordinates of the real space. The joints data structure 800 per subject (j), where j equals 1 to x, identifies the locations of the joints of the subject (j) in the real space and in the 2D space of each image. Some details of the subject data structure 800 are presented in Figure 8.

The background image store 2704 in the semantic diffing subsystem 2604 stores masked images (also referred to as background images, in which the foreground subjects have been removed by masking) for the corresponding sequences of images from the cameras 114. The background image store 2704 is also referred to as the background buffer. In one embodiment, the size of the masked images is the same as the size of the image frames in the circular buffer 1502. In one embodiment, a masked image is stored in the background image store 2704 corresponding to each image frame in the sequence of image frames per camera.

The semantic diffing subsystem 2604 (or the second image processors) includes mask generators 2724 producing masks of the foreground subjects represented in the images in the corresponding sequences of images from the cameras. In one embodiment, one mask generator processes the sequence of images per camera. In the example of the shopping store, the foreground subjects are customers or employees of the store in front of the background shelves containing items for sale.

In one embodiment, the joints data structures 800 and the image frames from the circular buffer 1502 are given as input to the mask generator 2724. The joints data structures identify the locations of the foreground subjects in each image frame. The mask generator 2724 generates a bounding box per foreground subject identified in the image frame. In this embodiment, the mask generator 2724 uses the x and y coordinates of the joint locations in the 2D image frame to determine the four boundaries of the bounding box. The minimum value of x (from all of the x values of the joints of a subject) defines the left vertical boundary of the bounding box of the subject. The minimum value of y (from all of the y values of the joints of a subject) defines the bottom horizontal boundary of the bounding box. Likewise, the maximum values of the x and y coordinates identify the right vertical and top horizontal boundaries of the bounding box. In a second embodiment, the mask generator 2724 uses a convolutional neural network-based person detection and localization algorithm to generate the bounding boxes of the foreground subjects. In this embodiment, the mask generator 2724 does not use the joints data structures 800 to generate the bounding boxes of the foreground subjects.
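A minimal sketch of deriving a bounding box from a subject's 2D joint locations is shown below. The function name is illustrative, and whether the minimum y value is the bottom or the top boundary depends on the image coordinate convention assumed.

    def bounding_box_from_joints(joints_2d):
        # joints_2d: list of (x, y) joint locations of one subject in an image frame.
        xs = [x for x, _ in joints_2d]
        ys = [y for _, y in joints_2d]
        # (left, bottom, right, top) boundaries of the bounding box
        return min(xs), min(ys), max(xs), max(ys)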

The semantic diffing subsystem 2604 (or the second image processors) includes mask logic to process the images in the sequences of images to replace the foreground image data representing the identified subjects with background image data from the background images of the corresponding sequences of images, to provide masked images, resulting in new background images for processing. As the circular buffers receive image frames from the cameras 114, the mask logic processes the images in the sequences of images to replace the foreground image data defined by the image masks with background image data. The background image data is taken from the background images of the corresponding sequences of images to generate the corresponding masked images.

Consider the example of the shopping store. Initially, at time t=0, when there are no customers in the store, the background images in the background image store 2704 are the same as the corresponding image frames in the sequences of images per camera. Now consider that at time t=1 a customer moves in front of a shelf to buy an item from the shelf. The mask generator 2724 creates a bounding box of the customer and sends it to the mask logic component 2702. The mask logic component 2702 replaces the pixels inside the bounding box in the image frame at t=1 with the corresponding pixels in the background image frame at t=0. This results in a masked image at t=1 corresponding to the image frame at t=1 in the circular buffer 1502. The masked image does not include pixels for the foreground subject (or customer); those pixels are now replaced by pixels from the background image frame at t=0. The masked image at t=1 is stored in the background image store 2704 and acts as the background image for the next image frame at t=2 in the sequence of images from the corresponding camera.
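A minimal sketch of this pixel replacement is shown below, assuming NumPy image arrays indexed as [row, column] and bounding boxes given as (x_min, y_min, x_max, y_max); the function name is illustrative.

    import numpy as np

    def mask_frame(frame, background, bounding_boxes):
        # Replace the pixels inside each foreground bounding box with the pixels
        # at the same locations in the stored background image.
        masked = frame.copy()
        for (x_min, y_min, x_max, y_max) in bounding_boxes:
            masked[y_min:y_max, x_min:x_max] = background[y_min:y_max, x_min:x_max]
        return masked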

In one embodiment, the mask logic component 2702 combines (such as by averaging or summing pixel by pixel) sets of N masked images in the sequences of images to generate a sequence of factored images for each camera. In this embodiment, the second image processors identify and classify background changes by processing the sequences of factored images. A factored image can be generated, for example, by taking the average value of the pixels in the N masked images in the sequence of masked images per camera. In one embodiment, the value of N is equal to the frame rate of the cameras 114; for example, if the frame rate is 30 FPS (frames per second), the value of N is 30. In this embodiment, the masked images for a time period of one second are combined to generate a factored image. Taking the average pixel values minimizes the pixel fluctuations due to sensor noise and luminosity changes in the area of real space.

The second image processors identify and classify background changes by processing the sequences of factored images. A factored image in a sequence of factored images is compared with the preceding factored image of the same camera by a bit mask calculator 2710. The pair of factored images 2706 is provided as input to the bit mask calculator 2710 to generate a bit mask identifying the changes in corresponding pixels of the two factored images. The bit mask has 1s at the pixel locations where the difference between the RGB (red, green and blue channel) values of the corresponding pixels (in the current and previous factored images) is greater than a "difference threshold". The value of the difference threshold is adjustable. In one embodiment, the value of the difference threshold is set at 0.1.
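A minimal sketch of the factored image and bit mask computation is shown below, assuming pixel values scaled to the range 0 to 1 so that the 0.1 difference threshold applies directly; the function names are illustrative.

    import numpy as np

    def factored_image(masked_frames):
        # Average N masked images (e.g., one second of frames) pixel by pixel.
        return np.mean(np.stack(masked_frames).astype(np.float32), axis=0)

    def bit_mask(fi_prev, fi_curr, threshold=0.1):
        # 1 where the difference between the RGB values of corresponding pixels
        # in the previous and current factored images exceeds the threshold.
        diff = np.sqrt(np.sum((fi_curr - fi_prev) ** 2, axis=2))
        return (diff > threshold).astype(np.uint8)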

The bit mask and the pair of factored images (current and previous) from the sequence of factored images per camera are given as input to the background image recognition engines. In one embodiment, the background image recognition engines comprise convolutional neural networks and are referred to as ChangeCNNs 2714a-2714n. A single ChangeCNN processes the sequence of factored images per camera. In another embodiment, the masked images from the corresponding sequences of images are not combined. The bit masks are calculated from pairs of masked images. In this embodiment, the pairs of masked images and the bit masks are then given as input to the ChangeCNN.

The input to the ChangeCNN model in this example consists of seven (7) channels, including three image channels (red, green and blue) per factored image and one channel for the bit mask. The ChangeCNN comprises multiple convolutional layers and one or more fully connected (FC) layers. In one embodiment, the ChangeCNN comprises the same number of convolutional and FC layers as the joints CNNs 112a-112n shown in Figure 5.
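A minimal sketch of assembling the seven channel input is shown below; the channel ordering is an assumption for illustration.

    import numpy as np

    def change_cnn_input(fi_prev, fi_curr, mask):
        # Two factored images (3 RGB channels each) plus the bit mask (1 channel)
        # form the 7 channel input of the ChangeCNN.
        return np.concatenate(
            [fi_prev, fi_curr, mask[..., np.newaxis]], axis=2).astype(np.float32)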

The background image recognition engines (the ChangeCNNs 2714a-2714n) identify and classify the changes in the factored images and produce change data structures for the corresponding sequences of images. The change data structures include the coordinates in the masked images of the identified background changes, the identifiers of the inventory items subject to the identified background changes, and the classifications of the identified background changes. The classifications of the identified background changes in the change data structures classify whether the identified inventory item has been added or removed relative to the background image.

Because items can be taken from or put on a shelf by one or more subjects simultaneously, the ChangeCNN generates a number "B" of overlapping bounding box predictions per output location. A bounding box prediction corresponds to a change in the factored image. Consider that the shopping store has a number "C" of unique inventory items, each identified by a unique SKU. The ChangeCNN predicts the SKU of the inventory item subject to the change. Finally, the ChangeCNN identifies a change (or inventory event type) for each location (pixel) in the output, indicating whether the identified item was taken from the shelf or put on the shelf. These three parts of the output of the ChangeCNN are described by the expression "5*B+C+1". Each bounding box "B" prediction comprises five (5) numbers, therefore "B" is multiplied by 5. These five numbers represent the "x" and "y" coordinates of the center of the bounding box, the width and the height of the bounding box, and the fifth number represents the confidence score of the ChangeCNN model for the prediction of the bounding box. "B" is a hyperparameter that can be adjusted to improve the performance of the ChangeCNN model. In one embodiment, the value of "B" equals 4. Consider that the width and height (in pixels) of the output of the ChangeCNN are represented by W and H, respectively. The output of the ChangeCNN is then expressed as "W*H*(5*B+C+1)". The bounding box output model is based on the object detection system proposed by Redmon and Farhadi in their paper, "YOLO9000: Better, Faster, Stronger" (published December 25, 2016). The paper is available at https://arxiv.org/pdf/1612.08242.pdf.
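A minimal sketch of splitting a W x H x (5*B + C + 1) output map into its parts is shown below; the ordering of the bounding box, SKU, and event channels is an assumption and is not taken from the figure.

    import numpy as np

    def split_change_cnn_output(output, B, C):
        # output: array of shape (W, H, 5*B + C + 1) produced by the ChangeCNN.
        W, H, _ = output.shape
        boxes = output[..., :5 * B].reshape(W, H, B, 5)   # x, y, width, height, confidence
        sku_logits = output[..., 5 * B:5 * B + C]         # per-location SKU logits
        event_type = output[..., -1]                      # take versus put indicator
        return boxes, sku_logits, event_type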

The outputs of the ChangeCNNs 2714a-2714n corresponding to sequences of images from cameras with overlapping fields of view are combined by a coordination logic component 2718. The coordination logic component processes the change data structures from sets of cameras with overlapping fields of view to locate the identified background changes in the real space. The coordination logic component 2718 selects bounding boxes representing inventory items with the same SKU and the same inventory event type (take or put) from multiple cameras with overlapping fields of view. The selected bounding boxes are then triangulated in the 3D real space (using the triangulation techniques described above) to identify the location of the inventory item in the 3D real space. The locations of the shelves in the real space are compared with the triangulated locations of the inventory items in the 3D real space. False positive predictions are discarded. For example, if the triangulated location of a bounding box does not map to a location of a shelf in the real space, the output is discarded. A triangulated location of a bounding box that maps to a shelf in the 3D real space is considered a true prediction of an inventory event.

In one embodiment, the classifications of the identified background changes in the change data structures produced by the second image processors classify whether the identified inventory item has been added or removed relative to the background image. In another embodiment, the classifications of the identified background changes in the change data structures indicate whether the identified inventory item has been added or removed relative to the background image, and the system includes logic to associate the background changes with identified subjects. The system makes detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects.

The log generator 2720 implements logic to associate the changes identified by true predictions of changes with identified subjects near the location of the change. In embodiments utilizing the joints identification engines to identify subjects, the log generator 2720 uses the joints data structures 800 to determine the locations of the hand joints of the subjects in the 3D real space. A subject is identified whose hand joints are within a threshold distance of the location of the change at the moment of the change. The log generator associates the change with the identified subject.
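A minimal sketch of this association is shown below, assuming 3D positions are given as (x, y, z) tuples; the function names and the data layout are illustrative.

    def euclidean_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    def associate_change_with_subject(change_location, hand_joints_by_subject, threshold):
        # hand_joints_by_subject: subject identifier -> list of 3D hand joint
        # positions at the moment of the change.
        candidates = [
            subject_id
            for subject_id, hands in hand_joints_by_subject.items()
            if any(euclidean_distance(h, change_location) <= threshold for h in hands)
        ]
        if len(candidates) == 1:
            return candidates[0]
        return None   # ambiguous or no match; defer to the region proposals subsystem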

In one embodiment, as described above, N masked images are combined to generate a factored image, which is provided as input to the ChangeCNN. Consider that N equals the frame rate (frames per second) of the cameras 114. Therefore, in this embodiment, the positions of the hands of the subjects during a one second time period are compared with the location of the change to associate the changes with identified subjects. If the hand joint locations of more than one subject fall within the threshold distance of the location of the change, the association of the change with a subject is deferred to the output of the foreground image processing subsystem 2606.

The foreground image processing (region proposals) subsystem 2606 (also referred to as the third image processors) includes foreground image recognition engines receiving the sequences of images from the plurality of cameras. The third image processors include logic to identify and classify foreground changes represented in the images in the corresponding sequences of images. The region proposals subsystem 2606 produces the second set of detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects. As shown in Figure 27, the subsystem 2606 includes the bounding box generator 1504, the WhatCNN 1506, and the WhenCNN 1508. The joints data structures 800 and the image frames per camera from the circular buffer 1502 are given as input to the bounding box generator 1504. Details of the bounding box generator 1504, the WhatCNN 1506, and the WhenCNN 1508 are presented earlier.

The system described in Figure 27 includes selection logic to process the first and second sets of detections to generate log data structures including lists of inventory items of the identified subjects. The first set of detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects is produced by the log generator 2720. The first set of detections is determined using the output of the second image processors and the joints data structures 800 (as described above). The second set of detections of takes of inventory items by the identified subjects and of puts of inventory items on inventory display structures by the identified subjects is determined using the output of the third image processors. For each true inventory event (take or put), the selection logic controller 2608 selects the output of either the second image processors (the semantic diffing subsystem 2604) or the third image processors (the region proposals subsystem 2606). In one embodiment, the selection logic selects the output of the image processors having the higher confidence score for that inventory event.

Process Flow for Semantic Diffing of Background Images

Figures 28A and 28B present the detailed steps performed by the semantic diffing subsystem 2604 to track changes by subjects in an area of real space. In the example of a shopping store, the subjects are customers and employees of the store moving in the aisles between the shelves and in other open spaces. The process starts at step 2802. As described above, the cameras 114 are calibrated before the sequences of images from the cameras are processed to identify subjects. Details of the camera calibration are presented above. The cameras 114 with overlapping fields of view capture images of the real space in which the subjects are present. In one embodiment, the cameras are configured to generate synchronized sequences of images at a rate of N frames per second. The sequence of images of each camera is stored in the respective circular buffer 1502 per camera, at step 2804. A circular buffer (also referred to as a ring buffer) stores the sequence of images in a sliding window of time. The background image store 2704 is initialized with initial image frames in the sequences of image frames per camera that do not contain foreground subjects (step 2806).

As the subjects move in front of the shelves, bounding boxes per subject are generated using their corresponding joints data structures 800 (as described above), at step 2808. At step 2810, masked images are generated by replacing the pixels inside the bounding boxes of each image frame with the pixels at the same locations from the background image in the background image store 2704. A masked image corresponding to each image in the sequence of images per camera is stored in the background image store 2704. The i-th masked image is used as the background image for replacing pixels in the following (i+1)-th image frame in the sequence of image frames per camera.

At step 2812, N masked images are combined to generate a factored image. At step 2814, a difference heat map is generated by comparing the pixel values of pairs of factored images. In one embodiment, the difference between the pixels at a location (x, y) in the 2D space of two factored images (fi1 and fi2) is calculated as shown in Equation 1 below:

diff(x, y) = \sqrt{(R_{fi1}(x, y) - R_{fi2}(x, y))^2 + (G_{fi1}(x, y) - G_{fi2}(x, y))^2 + (B_{fi1}(x, y) - B_{fi2}(x, y))^2}    (Equation 1)

The difference between the pixels at the same x and y locations in the 2D space is determined using the respective intensity values of the red, green and blue (RGB) channels, as shown in the equation. The above equation gives the magnitude (also referred to as the Euclidean norm) of the difference between the corresponding pixels in the two factored images.

The difference heat map can contain noise due to sensor noise and luminosity changes in the area of real space. In Figure 28B, at step 2816, a bit mask is generated from the difference heat map. Semantically meaningful changes are identified by clusters of 1s (ones) in the bit mask. These clusters correspond to changes identifying inventory items taken from the shelf or put on the shelf. However, noise in the difference heat map can introduce random 1s into the bit mask. Additionally, multiple changes (multiple items taken from or put on the shelf) can introduce overlapping clusters of 1s. At the next step in the process flow (2818), image morphological operations are applied to the bit mask. The image morphological operations remove the noise (unwanted 1s) and also attempt to separate overlapping clusters of 1s. This results in a cleaner bit mask containing clusters of 1s corresponding to semantically meaningful changes.

兩輸入被提供至形態操作。第一輸入為該位元遮罩而第二輸入被稱為結構元件或內核。兩個基本形態操作為「侵蝕」及「膨脹」。內核係由以多種大小之矩形矩陣所配置的1所組成。不同形狀(例如,圓形、橢圓形或十字形狀)的內核係藉由相加該矩陣中之特定位置上的0來產生。不同形狀的內核被用於影像形態操作以獲得於清潔位元遮罩時之所欲結果。於侵蝕操作時,內核係滑動(或移動)於該位元遮罩之上。該位元遮罩中之像素(1或0之任一者)被視為1,假如於該內核之下的所有像素均為1的話。否則,其被侵蝕(改變至0)。侵蝕操作可用於移除該位元遮罩中之隔離的1。然而,侵蝕亦藉由侵蝕邊緣而縮小了1的該些叢集。 Two inputs are provided to the morphological operations. The first input is the bit mask and the second input is called the structural element or kernel. The two basic morphological operations are "erosion" and "dilation". The kernel consists of 1s arranged in a rectangular matrix of various sizes. Kernels of different shapes (eg, circular, oval, or cross-shaped) are generated by adding 0s at specific positions in the matrix. Kernels of different shapes are used for image morphing operations to obtain the desired result when cleaning the bit mask. During the erosion operation, the kernel slides (or moves) over the bit mask. Pixels (either 1 or 0) in the bitmask are considered 1 if all pixels under the kernel are 1s. Otherwise, it is eroded (changed to 0). The erosion operation can be used to remove isolated ones in the bitmask. However, erosion also shrinks the clusters of 1 by eroding the edges.

膨脹操作為侵蝕的相反。於此操作中,當內核滑動於該位元遮罩之上時,由該內核所重疊之位元遮罩區域中的所有像素之值均被改變至1,假如該內核之下的至少一像素之值為1的話。膨脹被應用至侵蝕後之該位元遮罩以增加1之叢集的大小。因為雜訊在侵蝕時被移除,所以膨脹不會將隨機雜訊引入至該位元遮罩。侵蝕與膨脹操作之組合被應用以獲得較乾淨的位元遮罩。例如,電腦程式碼之後續行係將1之3x3過濾器應用至該位元遮罩以履行「開放」操作,其係應用侵蝕操作接著膨脹操作以移除雜訊並復原該位元遮罩中之1的叢集之大小,如上所述。上述電腦程式碼係使用針對即時電腦視覺應用程式之編程功能的OpenCV(開放式來源電腦視覺)庫。該庫可取得於https://opencv.org/。 The dilation operation is the opposite of erosion. In this operation, as the kernel slides over the bit mask, the value of every pixel in the region of the bit mask overlapped by the kernel is changed to 1 if at least one pixel under the kernel has the value 1. Dilation is applied to the bit mask after erosion to grow the clusters of 1s back in size. Because the noise has already been removed by erosion, dilation does not reintroduce random noise into the bit mask. A combination of erosion and dilation operations is applied to obtain a cleaner bit mask. For example, the following line of computer code applies a 3x3 filter of 1s to the bit mask to perform an "open" operation, which applies an erosion operation followed by a dilation operation to remove noise and restore the size of the clusters of 1s in the bit mask, as described above. The computer code uses the OpenCV (Open Source Computer Vision) library of programming functions for real-time computer vision applications. The library is available at https://opencv.org/.

_bit_mask = cv2.morphologyEx(bit_mask, cv2.MORPH_OPEN, self.kernel_3x3, dst=_bit_mask)

「關閉」操作係應用膨脹操作接著侵蝕操作。其可用於關閉1之叢集內部的小洞。以下程式碼係使用30x30之十字形狀的過濾器以應用關閉操作至該位元遮罩。 The "close" operation applies the dilation operation followed by the erosion operation. It can be used to close small holes inside a cluster of 1s. The following code uses a 30x30 cross-shaped filter to apply a close operation to the bitmask.

Figure 107126341-A0305-02-0094-6
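
The code listing referenced above is embedded as an image in the source. A sketch of what it likely looks like, modeled on the OpenCV call shown for the "open" operation; the kernel construction and variable names are assumptions:

import cv2
# 30x30 cross-shaped structuring element for the "close" operation
kernel_cross_30x30 = cv2.getStructuringElement(cv2.MORPH_CROSS, (30, 30))
_bit_mask = cv2.morphologyEx(bit_mask, cv2.MORPH_CLOSE, kernel_cross_30x30, dst=_bit_mask)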

該位元遮罩及兩個因數化影像(之前及之後)被提供為輸入至每相機之卷積神經網路(稱之為如上的改變CNN)。改變CNN之輸出為改變資料結構。於步驟2822,來自具有重疊觀看域之改變CNN的輸出係使用較早所述之三角測量技術而被結合。3D真實空間中之改變的位置係與貨架之位置匹配。假如存貨事件之位置映射至貨架上之位置,則該改變被視為真實事件(步驟2824)。否則,該改變為錯誤肯定且被丟棄。真實事件係與前台主體相關。於步驟2826,前台主體被識別。於一實施例中,關節資料結構800被用以判定該改變之臨限值距離內的手關節之位置。假如前台主體被識別於步驟2828,則該改變被關聯至該已識別主體,於步驟2830。假如無前台主體被識別於步驟2828(例如,由於該改變之臨限值距離內有多數主體之手關節位置),則藉由區提議子系統之該改變的冗餘檢測被選擇,於步驟2832。該程序結束於步驟2834。 The bit mask and the two factorized images (before and after) are provided as input to a per-camera convolutional neural network (referred to above as the change CNN). The output of the change CNN is a change data structure. At step 2822, the outputs from change CNNs for cameras with overlapping fields of view are combined using the triangulation techniques described earlier. The locations of changes in 3D real space are matched against shelf locations. If the location of an inventory event maps to a location on a shelf, the change is considered a true event (step 2824). Otherwise, the change is a false positive and is discarded. True events are then associated with foreground subjects. At step 2826, the foreground subject is identified. In one embodiment, the joint data structures 800 are used to determine the positions of hand joints within a threshold distance of the change. If a foreground subject is identified at step 2828, the change is associated with that identified subject at step 2830. If no single foreground subject can be identified at step 2828, for example because hand joints of multiple subjects lie within the threshold distance of the change, then the redundant detection of the change by the region proposal subsystem is selected at step 2832. The process ends at step 2834.
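
A simplified sketch of the association logic at steps 2826-2830, assuming the change location and hand joint positions are 3D points in real-space coordinates; the threshold value and names are illustrative:

import numpy as np

def associate_change_with_subject(change_position, hand_joints_by_subject, threshold=0.5):
    # hand_joints_by_subject: mapping of subject_id -> list of 3D hand joint positions
    nearby_subjects = set()
    for subject_id, hand_positions in hand_joints_by_subject.items():
        for hand in hand_positions:
            if np.linalg.norm(np.asarray(hand) - np.asarray(change_position)) < threshold:
                nearby_subjects.add(subject_id)
    if len(nearby_subjects) == 1:
        return nearby_subjects.pop()   # unambiguous: associate the change with this subject
    return None                        # zero or multiple subjects: defer to the region proposal subsystem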

訓練改變CNN Training Change CNN

七個頻道輸入之訓練資料集被產生以訓練該改變CNN。當作消費者之一或更多主體係藉由假裝在購物商店中購物以履行取走及放下動作。主體係於走道中移動,從貨架取走存貨項目以及將項目放回貨架上。履行取走及放下動作的演員之影像被收集於循環緩衝器1502中。該些影像被處理以產生因數化影像,如上所述。多對因數化影像2706以及由位元遮罩計算器2710所輸出之相應位元遮罩被手動地檢視以視覺地識別介於兩因數化影像之間的改變。針對具有改變之因數化影像,定界框被手動地繪製於該改變周圍。此為最小的定界框,其含有相應於該位元遮罩中之該改變的1之叢集。該改變中之存貨項目的SKU數被識別且被包括於針對該影像(連同該定界框)之標籤中。識別存貨項目之取走或放下的事件類型亦被包括於該定界框之標籤中。因此各定界框之標籤係識別(在該因數化影像上之其位置)該項目之SKU以及該事件類型。因數化影像可具有多於一個定界框。上述程序被重複於該訓練資料集中之所有已收集因數化影像中的每一改變。一對因數化影像(連同該位元遮罩)係形成對於該改變CNN之七個頻道輸入。 A training data set of seven-channel inputs is generated to train the change CNN. One or more subjects acting as consumers perform take and put actions by pretending to shop in the shopping store. The subjects move in the aisles, taking inventory items from the shelves and putting items back on the shelves. Images of the actors performing the take and put actions are collected in the circular buffers 1502. The images are processed to generate factorized images, as described above. Pairs of factorized images 2706 and the corresponding bit masks output by the bit mask calculator 2710 are manually reviewed to visually identify the changes between the two factorized images. For a factorized image with a change, a bounding box is manually drawn around the change. This is the smallest bounding box that contains the cluster of 1s corresponding to the change in the bit mask. The SKU of the inventory item involved in the change is identified and included in the label for the image, along with the bounding box. An event type identifying the take or put of the inventory item is also included in the label of the bounding box. The label of each bounding box therefore identifies, at its position on the factorized image, the SKU of the item and the event type. A factorized image can have more than one bounding box. The above procedure is repeated for every change in all the collected factorized images in the training data set. A pair of factorized images, along with the bit mask, forms the seven-channel input to the change CNN.
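
The seven-channel input can be pictured as the two RGB factorized images (before and after, three channels each) stacked with the single-channel bit mask; a sketch assuming numpy arrays of matching height and width:

import numpy as np

def make_change_cnn_input(fi_before, fi_after, bit_mask):
    # fi_before, fi_after: H x W x 3 factorized images; bit_mask: H x W array of 0/1 values
    return np.concatenate(
        [fi_before, fi_after, bit_mask[..., np.newaxis]], axis=-1
    )   # H x W x 7 input for the change CNN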

於該改變CNN之訓練期間,前向傳遞及後向傳播被履行。於前向傳遞中,該改變CNN係識別並分類背景改變,其被表示於該訓練資料集中的影像之該些相應序列中的因數化影像中。該改變CNN係處理已識別背景改變以進行由已識別主體取走存貨項目的檢測及由已識別主體放下存貨項目於存貨展示結構上的檢測之第一集合。於後向傳播期間,該改變CNN之輸出係與該地面真相(如訓練資料集之標籤中所指示)進行比較。一或更多成本函數之梯度被計算。梯度被接著傳播至卷積神經網路(CNN)及完全連接(FC)神經網路以致其預測誤差被減少,造成輸出更接近於地面真相。於一實施例中,softmax函數及交叉熵損失函數被用於針對該輸出之類別預測部分的改變CNN之訓練。該輸出之類別預測部分包括該存貨項目及該事件類型(亦即,取走或放下)之SKU識別符。 During the training of the modified CNN, forward pass and backward pass are performed. In the forward pass, the change CNN identifies and classifies background changes, which are represented in the factorized images in the corresponding sequences of images in the training data set. The change CNN is a first set of processing identified context changes for detection of removal of an inventory item by an identified subject and detection of an inventory item placed by an identified subject on an inventory display structure. During backpropagation, the output of the altered CNN is compared to the ground truth (as indicated in the labels of the training dataset). The gradients of one or more cost functions are computed. The gradients are then propagated to Convolutional Neural Networks (CNN) and Fully Connected (FC) Neural Networks so that their prediction errors are reduced, resulting in outputs that are closer to the ground truth. In one embodiment, a softmax function and a cross-entropy loss function are used for the training of the altered CNN for the class prediction portion of the output. The category forecast portion of the output includes the SKU identifier for the inventory item and the event type (ie, take or put down).

第二損失函數被用以訓練針對定界框之預測的改變CNN。此損失函數係計算介於預測框與地面真相框之間的intersection over union(IOU)。由具有真實定界框標籤之改變CNN所預測的定界框之交點的區域被除以相同定界框之聯集的區域。假如介於預測框與地面真相框之間的重疊很大,則IOU之值是高的。假如多於一預測定界框重疊地面真相定界框,則具有最高IOU值之一者被選擇以計算損失函數。損失函數之細節係由Redmon等人所提出於其論文中「You Only Look Once:Unified,Real-Time Object Detection」發佈於2016年五月9日。該論文可取得於https://arxiv.org/pdf/1506.02640.pdf。 A second loss function is used to train the change CNN for the prediction of bounding boxes. This loss function computes the intersection over union (IOU) between the predicted box and the ground truth box. The area of the intersection between a bounding box predicted by the change CNN and the ground truth bounding box label is divided by the area of the union of the same bounding boxes. The value of the IOU is high if the overlap between the predicted box and the ground truth box is large. If more than one predicted bounding box overlaps the ground truth bounding box, the one with the highest IOU value is selected to compute the loss function. The details of the loss function are presented by Redmon et al. in their paper "You Only Look Once: Unified, Real-Time Object Detection," published May 9, 2016. The paper is available at https://arxiv.org/pdf/1506.02640.pdf.
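
A sketch of the intersection-over-union computation used by this loss term, for axis-aligned boxes given as (x_min, y_min, x_max, y_max):

def intersection_over_union(box_a, box_b):
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # overlapping rectangle (zero area if the boxes do not overlap)
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - intersection
    return intersection / union if union > 0 else 0.0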

特定實施方式 specific implementation

於各個實施例中,用以追蹤真實空間之區域中藉由主體之存貨項目的放下及取走之系統(如上所述)亦包括以下特徵之一或更多者。 In various embodiments, the system (as described above) for tracking the placement and removal of inventory items by subject in an area of real space also includes one or more of the following features.

1.區提議 1. District Proposal

區提議為來自其涵蓋人之所有不同相機的手位置之框影像。區提議係由系統中之每一相機所產生。其包括空手以及攜帶商店項目的手。 The district proposes a framed image of hand positions from all the different cameras it covers the person. Zone proposals are generated by each camera in the system. It includes empty hands as well as hands carrying store items.

1.1 WhatCNN模型 1.1 WhatCNN model

區提議可被使用為針對使用深學習演算法之影像分類的輸入。此分類引擎被稱為「WhatCNN」模型。其為手中分類模型。其係分類手中的事物。即使物件之部分被手所阻擋,手中影像分類仍可操作。較小的項目可被手阻擋高達90%。藉由WhatCNN模型之影像分析的區被有意地保持為小(於某些實施例中),因為其是計算上昂貴的。各相機可具有專屬GPU。此係針對每一框而被履行於來自每一相機之每一手影像。除了藉由WhatCNN模型之上述影像分析以外,信心加權亦被指派給該影像(一相機、一時點)。分類演算法係輸出涵蓋庫存保持單元(SKU)之完整列表的羅吉特以產生針對n個項目之該商店的產品和服務識別碼列表及針對空手(n+1)之一額外者。 Region proposals can be used as input for image classification using deep learning algorithms. This classification engine is called the "WhatCNN" model. It is an in-hand classification model: it classifies what is in the hand. In-hand image classification still works even when part of the object is occluded by the hand. Smaller items can be occluded by the hand by up to 90%. The region analyzed by the WhatCNN model is intentionally kept small (in some embodiments) because the analysis is computationally expensive. Each camera can have a dedicated GPU. This is performed for every frame, for every hand image from every camera. In addition to the image analysis by the WhatCNN model, a confidence weight is also assigned to the image (one camera, one point in time). The classification algorithm outputs logits covering the full list of stock keeping units (SKUs), producing outputs for the store's product identifiers for the n items plus one extra for the empty hand (n+1).

場景程序現在藉由傳送密鑰-值字典至各視頻以將其結果傳回至各視頻程序。於此,密鑰為獨特關節ID而值為該關節所關聯的獨特個人ID。假如無任何人與該關節相關聯,則其不被包括於該字典中。 The scene program now sends its results back to each video program by sending a key-value dictionary to each video. Here, the key is the unique joint ID and the value is the unique personal ID associated with that joint. If no one is associated with the joint, it is not included in the dictionary.

各視頻程序從場景程序接收密鑰-值字典,並將其儲存入環緩衝器,其係將框數目映射至返回的字典。 Each video program receives a key-value dictionary from the scene program and stores it in a ring buffer, which maps frame numbers to the returned dictionary.

使用返回的密鑰-值字典,該視頻在各時刻選擇其接近與已知的人關聯的手之影像的子集。這些區為numpy片段。吾人亦取得類似的片段於前台遮罩周圍以及關節CNN之原始特徵陣列。這些結合的區被序連在一起而成為單一多維numpy陣列且被儲存於資料結構中,該資料結構係保存:與該區關聯的該numpy陣列和該個人ID、以及該區係來自該個人的哪隻手。 Using the returned key-value dictionary, the video selects at each moment its subset that approximates the image of the hand associated with the known person. These regions are numpy fragments. We also obtained similar segments around the foreground mask and the original feature array of the articulated CNN. The combined regions are concatenated together into a single multidimensional numpy array and stored in a data structure that holds: the numpy array associated with the region and the person ID, and the region from the individual which hand.
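
As a rough sketch of the per-region record described here (field names are illustrative, not from the specification):

import numpy as np
from dataclasses import dataclass

@dataclass
class RegionProposal:
    data: np.ndarray    # concatenated image, mask and feature slices around the hand
    person_id: int      # unique person ID assigned by the scene program
    hand: str           # which hand of that person, e.g. "left" or "right"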

所有提議區被接著饋送入FIFO佇列。此佇列係接受數區且將其numpy陣列推入GPU上之記憶體。 All proposed regions are then fed into a FIFO queue. This queue accepts regions and pushes their numpy arrays into memory on the GPU.

當陣列到達GPU時,其被饋送入一專用於分類之CNN,稱之為WhatCNN。此CNN之輸出為大小N+1之浮點的平坦陣列,其中N為該商店中之獨特SKU的數目,而最後類別係代表無類別(或空手)。此陣列中之該些浮點被稱為羅吉特。 When the arrays reach the GPU, they are fed into a CNN dedicated to classification, called the WhatCNN. The output of this CNN is a flat array of floats of size N+1, where N is the number of unique SKUs in the store and the last class represents no class (or an empty hand). The floats in this array are called logits.

WhatCNN之結果被儲存回入區資料結構。 WhatCNN results are stored back into the zone data structure.

針對一時刻之所有區被接著從各視頻程序傳回至場景程序。 All regions for a moment are then passed back from each video program to the scene program.

該場景程序在某一時刻接收來自所有視頻之所有區並將結果儲存於密鑰-值字典中,其中該密鑰為個人ID而該值為密鑰-值字典,其中該密鑰為相機ID而該值為區之羅吉特。 The scene program receives all regions from all videos for a moment in time and stores the results in a key-value dictionary, where the key is the person ID and the value is a key-value dictionary, where the key is the camera ID and the value is the logits of the region.
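
The aggregated structure for one moment in time can be pictured as a nested dictionary; a small sketch with illustrative values:

# person ID -> {camera ID -> logits over the store's N SKUs plus one extra "empty hand" class}
aggregate = {
    17: {                                        # person ID
        "camera_03": [0.02, 0.91, 0.04, 0.03],   # WhatCNN logits (here N = 3 SKUs + empty hand)
        "camera_07": [0.05, 0.84, 0.06, 0.05],
    },
}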

此聚合資料結構被接著儲存於環緩衝器,其係將框數目映射至聚合結構於各時刻。 This aggregated data structure is then stored in a ring buffer, which maps the frame number to the aggregated structure at each instant.

1.2 WhenCNN模型 1.2 WhenCNN model

由WhatCNN模型所處理之來自不同相機的影像在一段時間週期期間被結合(在一段時間週期期間之多數相機)。對於此模型之額外輸入為3D空間中之手位置,三角測量自多數相機。對於此演算法之另一輸入為手與該商店之貨架圖的距離。於某些實施例中,貨架圖可被用以識別該手是否接近一含有特定項目(例如,cheerios盒子)的貨架。對於此演算法之另一輸入為在該商店上之足部位置。 The images from different cameras processed by the WhatCNN model are combined over a period of time (multiple cameras over a period of time). An additional input to this model is the hand position in 3D space, triangulated from multiple cameras. Another input to this algorithm is the distance of the hand from the store's planogram. In some embodiments, the planogram can be used to identify whether the hand is near a shelf containing a particular item (for example, a box of cheerios). Another input to this algorithm is the position of the feet in the store.

除了使用SKU之物件分類以外,第二分類模型係使用時間序列分析以判定該物件是被拾起自該貨架或者是被放在該貨架上。該些影像在一段時間週期期間被分析以判定其在先前影像框中位於該手中的該物件是已被放回該貨架中或者是已被拾起自該貨架。 In addition to item sorting using SKUs, a second sorting model uses time series analysis to determine whether the item was picked up from the shelf or placed on the shelf. The images are analyzed over a period of time to determine whether the item, which was in the hand in the previous image frame, has been placed back into the rack or picked up from the rack.

針對一第二時間(每秒30框)週期及三個相機,系統將具有90個類別輸出,針對相同手加信心。此結合影像分析顯著地增加了正確地識別該手中之物件的機率。涵蓋時間的分析係增進了輸出之品質,儘管是各別框之某些極低信心位準的輸出。此步驟可具有(例如)從80%準確度至95%準確度之輸出信心。 For a one-second time period (at 30 frames per second) and three cameras, the system will have 90 classification outputs for the same hand, each with a confidence. This combined image analysis significantly increases the probability of correctly identifying the object in the hand. The analysis over time improves the quality of the output, even though some individual frames produce outputs with very low confidence levels. This step can have an output confidence of, for example, 80% accuracy to 95% accuracy.
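
One simple way to picture the combination over a one-second window; the averaging here is only an illustrative placeholder for the time series analysis, not the WhenCNN itself:

import numpy as np

def combine_hand_logits(logits_window):
    # logits_window: array of shape (frames, cameras, N + 1),
    # e.g. (30, 3, N + 1) for one second at 30 fps and three cameras = 90 outputs
    mean_logits = logits_window.mean(axis=(0, 1))
    return int(np.argmax(mean_logits))   # index of the most likely SKU (or the empty-hand class)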

此模型亦包括來自貨架模型之輸出以當作其輸入,用來識別此人已拾起什麼物件。 The model also includes the output from the shelf model as its input to identify what the person has picked up.

場景程序等待30或更多聚合結構累積(其代表真實時間之至少一秒),並接著履行進一步分析以向下減少聚合結構至針對每一個人ID-手對之單一整數,其中該整數為代表該商店中之SKU的獨特ID。針對一時刻,此資訊被儲存於密鑰-值字典中,其中密鑰為個人ID-手對,而值為SKU整數。此字典係隨著時間經過而被儲存於環緩衝器,其係將框數目映射至針對該時刻之各字典。 The scenario program waits for 30 or more aggregate structures to accumulate (which represent at least one second of real time), and then performs further analysis to reduce aggregate structures down to a single integer for each person ID-hand pair, where the integer is the Unique ID of the SKU in the store. For a moment, this information is stored in a key-value dictionary, where the key is a personal ID-hand pair and the value is a SKU integer. This dictionary is stored in the ring buffer over time, which maps the number of boxes to each dictionary for that time.

可接著履行額外分析以觀察此字典如何隨著時間經過而改變以識別個人在什麼時刻取走某物以及其取走什麼東西。此模型(WhenCNN)係發出SKU羅吉特以及針對各布林問題:某物被取走?某物被放置?之羅吉特。 Additional analysis can then be performed to observe how this dictionary changes over time, in order to identify when a person took something and what they took. This model (the WhenCNN) emits SKU logits as well as logits for the Boolean questions: was something taken? was something put down?

WhenCNN之輸出被儲存於環緩衝器,其係將框數目映射至密鑰-值字典,其中密鑰為個人ID而值為由WhenCNN所發出之延伸羅吉特。 The output of WhenCNN is stored in a ring buffer, which maps the number of boxes to a key-value dictionary, where the key is the personal ID and the value is the extended logit sent by WhenCNN.

啟發法之另一集合被接著運行於WhenCNN及人之已儲存關節位置兩者之儲存結果上、以及於商店貨架上之項目的預先計算映圖上。啟發法之此集合係判定其取走及放下係導致項目被加至或移除自何處。針對各取走/放下,該些啟發法係判定該取走或放下係自或至貨架、自或至籃子、或者自或至個人。該輸出為針對每個人的存貨,其被儲存為一陣列,其中在SKU之索引上的陣列值為個人所擁有的那些SKU之數目。 Another set of heuristics was then run on both the WhenCNN and the stored results of the person's stored joint positions, as well as on the precomputed map of items on store shelves. This set of heuristics determines where its taking and dropping caused items to be added or removed from. For each take/drop, the heuristics determine that the take or drop is from or to a shelf, from or to a basket, or from or to an individual. The output is the inventory for each person, which is stored as an array, where the array value on the SKU's index is the number of those SKUs that the individual has.
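
A sketch of the per-person inventory array update driven by these take/put decisions; the names and event encoding are illustrative:

import numpy as np

def update_inventory(inventory, sku_index, event):
    # inventory: array in which the value at a SKU's index is the count that person holds
    if event == "take":
        inventory[sku_index] += 1
    elif event == "put" and inventory[sku_index] > 0:
        inventory[sku_index] -= 1
    return inventory

cart = np.zeros(1000, dtype=int)              # a store with 1000 SKUs
cart = update_inventory(cart, 42, "take")     # the person picks up the item at SKU index 42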

當購物者接近商店之出口時,該系統可傳送存貨列表至該購物者的手機。該手機接著顯示該使用者的存貨並要求確認從其所儲存的信用卡資訊收費。假如使用者接受,則其信用卡將被收費。假如其不具有該系統中所已知的信用卡,則其將被要求提供信用卡資訊。 When a shopper approaches the store's exit, the system can transmit an inventory list to the shopper's cell phone. The handset then displays the user's inventory and asks for confirmation to charge from its stored credit card information. If the user accepts, their credit card will be charged. If it does not have a credit card known in the system, it will be asked to provide credit card information.

替代地,購物者亦可靠近商店內的服務台(kiosk)。該系統係識別出該購物者於何時接近該服務台且將傳送訊息至該服務台以顯示該購物者的存貨。該服務台要求該購物者接受該存貨之收費。假如購物者接受,則其可接著刷他們的信用卡或者插入現金來付款。圖16提出針對區提議之WhenCNN模型的圖示。 Alternatively, shoppers may also approach kiosks within the store. The system recognizes when the shopper is approaching the service desk and will send a message to the service desk to display the shopper's inventory. The service desk requires the shopper to accept a charge for the inventory. If the shopper accepts, they can then swipe their credit card or insert cash to pay. Figure 16 presents an illustration of the WhenCNN model for region proposals.

2.錯置的項目 2. Misplaced items

此特徵係識別錯置的項目,當該些項目被個人放回隨機的貨架上時。如此造成物件識別的問題,因為相對於貨架圖之足部及手位置將是不正確的。因此,該系統隨著時間經過而建立修改的貨架圖。根據先前的時間序列分析,該系統能夠判定個人是否已將項目放回該貨架中。下一次,當物件從該貨架位置被拾起時,該系統便得知有至少一錯置的項目在該手位置上。相應地,演算法將具有其該個人可能從該貨架拾起錯置的項目之一些信心。假如該錯置的項目被拾起自該貨架,則該系統便從該位置減去該項目,該貨架不再具有該項目。該系統亦可經由應用程式以告知店員有關錯置的項目以致該店員可將該項目移至其正確的貨架。 This feature identifies misplaced items when they are individually placed back on a random shelf. This creates problems with object identification, as the foot and hand positions relative to the planogram will be incorrect. Thus, the system builds a modified planogram over time. Based on previous time-series analysis, the system was able to determine whether the individual had put the item back on the shelf. The next time the item is picked up from the shelf position, the system knows that there is at least one misplaced item in the hand position. Accordingly, the algorithm will have some confidence that the individual may pick up the misplaced item from the shelf. If the misplaced item is picked up from the shelf, the system subtracts the item from the location and the shelf no longer has the item. The system can also notify the clerk about the misplaced item via the app so that the clerk can move the item to its correct shelf.

3.語意差異(貨架模型) 3. Semantic Differences (Shelf Model)

用於背景影像處理之替代技術包含背景減去演算法,用以識別對於該些貨架上之項目的改變(項目被移除或放置)。此係根據像素位準上之改變。假如有人在該貨架前方,則該演算法便停止以致其不會將由於人的存在所致之像素改變列入考量。背景減去為一種雜訊程序。因此,跨相機分析被執行。假如有足夠的相機同意其該貨架上有「語意上有意義的」改變,則該系統便記錄其在該貨架之該部分中有改變。 Alternative techniques for background image processing include background subtraction algorithms to identify changes to items on the shelves (items removed or placed). This is based on changes in pixel level. If a person is in front of the shelf, the algorithm stops so that it does not take into account pixel changes due to the presence of a person. Background subtraction is a noise procedure. Therefore, a cross-camera analysis is performed. If enough cameras agree that there is a "semantically meaningful" change in that shelf, the system records that there is a change in that part of the shelf.

下一步驟係用以識別該改變是「放下」或是「取走」改變。對此,第二分類模型之時間序列分析被使用。針對該貨架之該特定部分的區提議被產生並通過深學習演算法。此比手中影像分析更為容易,因為該物件不會被阻擋在手內部。第四輸入被提供至該演算法,除了三個典型的RGB輸入以外。該第四頻道為背景資訊。該貨架或語意差異之輸出被再次輸入至第二分類模型(時間序列分析模型)。 The next step is to identify whether the change is a "put" or a "take" change. For this, the time series analysis of the second classification model is used. A region proposal for that particular portion of the shelf is generated and passed through a deep learning algorithm. This is easier than in-hand image analysis because the object is not occluded inside a hand. A fourth input is provided to the algorithm, in addition to the three typical RGB inputs. The fourth channel is the background information. The output of this shelf or semantic difference analysis is input again to the second classification model (the time series analysis model).

1.來自相機之影像係與來自相同相機之較早影像進行比較。 1. The image from the camera is compared to an earlier image from the same camera.

2.介於兩影像之間的各相應像素係經由RGB空間中之歐幾里德距離而被比較。 2. Each corresponding pixel between the two images is compared via Euclidean distance in RGB space.

3.在某臨限值之上的距離被標記,其導致剛標記的像素之新影像。 3. The distance above some threshold is marked, which results in a new image of the pixel just marked.

4.影像形態過濾器之集合被用以從該已標記影像移除雜訊。 4. A set of image shape filters is used to remove noise from the marked image.

5.吾人接著搜尋已標記像素之大型集合並形成於其周圍之定界框。 5. We then search for a large set of marked pixels and form a bounding box around it.

6.針對各定界框,吾人接著觀察兩影像中之原始像素以獲得兩個影像快照。 6. For each bounding box, we then observe the raw pixels in the two images to obtain two image snapshots.

7.這兩個影像快照被接著推入CNN,其被訓練以分類該影像區是代表被取走的項目或者是代表被放置的項目以及該項目是什麼。 7. The two image snapshots are then pushed into a CNN, which is trained to classify whether the image region represents a removed item or a placed item and what the item is.

3.商店稽查 3. Store inspection

各貨架之存貨係由該系統所維持。當項目被消費者所拾起時其便被更新。於任何時點,該系統能夠產生商店存貨之稽查。 Inventory of each rack is maintained by the system. It is updated when the item is picked up by the consumer. At any point in time, the system can generate audits of store inventory.

4.手中之多數項目 4. Most projects in hand

不同影像被用於多數項目。手中的兩個項目與一個項目相較之下係被不同地處置。某些演算法僅可預測一個項目而非一項目之數個。因此,CNN被訓練以致其針對「二」數量的項目可不同於手中之單一項目來執行。 Different images are used for most projects. Two items in hand are treated differently than one item. Some algorithms can only predict one item rather than a number of items. Thus, the CNN is trained so that it can perform on a "two" number of items different from the single item in hand.

5.資料收集系統 5. Data collection system

預先定義的購物腳本被用以收集影像之良好品質的資料。這些影像被用於演算法之訓練。 Pre-defined shopping scripts are used to collect information on the good quality of the images. These images are used to train the algorithm.

5.1 購物腳本 5.1 Shopping Script

資料收集包括以下步驟: Data collection includes the following steps:

1.腳本被自動地產生以告知人類演員應採取哪些動作。 1. Scripts are automatically generated to inform human actors what actions to take.

2.這些動作被隨機地取樣自包括以下之動作集合:取走項目X、放下項目X、持有項目X達Y秒。 2. These actions are randomly sampled from the set of actions that include: take item X, drop item X, hold item X for Y seconds.

3.當履行這些動作時,演員係移動並使其本身盡可能多方式地定向,而同時在該既定動作上仍成功。 3. When performing these actions, the actor moves and orients itself in as many ways as possible while still being successful at the given action.

4.於動作之序列期間,相機之集合係從許多觀點記錄 該些演員。 4. During the sequence of actions, the set of cameras is recorded from many viewpoints these actors.

5.在該些演員已完成該腳本後,相機視頻被捆在一起並連同原始腳本而被儲存。 5. After the actors have completed the script, the camera video is bundled and stored with the original script.

6.該腳本係作用為對於在演員之視頻上所訓練的機器學習模型(諸如CNN)之輸入標籤。 6. The script acts as an input label to a machine learning model (such as a CNN) trained on the actor's video.

6.產品線 6. Product Line

該系統及其部分可被用於無出納員結帳,其係由以下應用程式所支援。 The system and parts of it can be used for cashierless checkout, which is supported by the following applications.

6.1 商店應用程式 6.1 Store Apps

商店應用程式具有數個主要可能性:提供資料分析視覺化、支援損失預防、及提供平台以輔助消費者,藉由顯示零售商有關人在商店中的何處以及他們已收集了什麼商品。對於員工之許可位準以及應用程式存取權可由零售商所決定。 Store applications have several major possibilities: providing data analysis visualization, supporting loss prevention, and providing a platform to assist consumers by showing retailers where people are in the store and what items they have collected. Permission levels for employees and access to applications can be determined by the retailer.

6.1.1 標準分析 6.1.1 Standard Analysis

資料係由平台所收集且可被使用以多種方式。 Data is collected by the platform and can be used in a variety of ways.

1.衍生資料被用以履行對於以下各者之多種分析:商店、其所提供的購物經驗、以及消費者與產品、環境、及其他人的互動。 1. Derivative data is used to perform various analyses of the store, the shopping experience it offers, and consumer interactions with products, the environment, and others.

a.該資料被儲存並使用於背景中以履行商店與 消費者互動之分析。商店應用程式將顯示此資料之某些視覺化給零售商。其他資料被儲存並詢問(當想要該資料點時)。 a. The data is stored and used in the background to fulfill store and Analysis of consumer interaction. The store application will display some visualizations of this data to the retailer. Other data is stored and queried (when the data point is desired).

2.熱映圖: 2. Heat map:

平台將以下視覺化:零售商的平面圖、貨架佈局、及其他商店環境,具有顯示多種活動之位準的重疊圖。 The platform visualizes the retailer's floor plan, shelf layout, and other store environments, with overlays showing levels of various activities.

1.範例: 1. Example:

1.針對人走過、但並未觸摸任何產品的地點之地圖。 1. A map of places where people walk by without touching any products.

2.針對當與產品互動時人所站立的處所之地圖。 2. A map of where people stand when interacting with the product.

3.錯置的項目: 3. Misplaced items:

該平台係追蹤商店之SKU的所有者。當項目被放在不正確的位置時,該平台將知道該項目在哪裡並建立日誌。於某臨限值,或立即地,商店員工可被警示有關錯置的項目。替代地,員工可存取商店應用程式中之錯置的項目映圖。當方便時,員工可接著快速地找出並校正錯置的項目。 The platform tracks the owner of the store's SKU. When an item is placed in an incorrect location, the platform will know where the item is and build a log. At a certain threshold, or immediately, store employees may be alerted about misplaced items. Alternatively, employees can access misplaced item maps in the store application. Staff can then quickly locate and correct misplaced items when convenient.

6.1.2 標準輔助 6.1.2 Standard assistance

‧商店應用程式將顯示商店的平面圖。 ‧The store app will display the floor plan of the store.

‧其將顯示圖形以表示該商店中的每個人。 ‧It will display graphics to represent everyone in the store.

‧當該圖形被選擇(經由接觸、按壓、或其他手段)時,針對商店員工之相關資訊將被顯示。例如:購 物車項目(其已收集之項目)將出現在列表中。 • When the graphic is selected (via touch, press, or other means), relevant information for store employees will be displayed. For example: buy The cart item (the one it has collected) will appear in the list.

‧假如該平台具有低於針對特定項目及針對一段時間週期之預定臨限值的信心位準(有關其係為某人所擁有(購物車)),則其圖形(目前為一個點)將指示該差異。該應用程式系使用顏色改變。綠色指示高信心而黃色/橘色係指示較低的信心。 ‧If the platform has a confidence level below a predetermined threshold for a particular item and for a period of time (as to whether it is owned by someone (shopping cart)), its graph (currently a dot) will indicate the difference. The app uses color changing. Green indicates high confidence and yellow/orange shades indicate lower confidence.

‧具有商店應用程式之商店員工可被告知有關該較低的信心。他們可以去確認消費者的購物車是正確的。 ‧Store employees with store apps can be informed about this lower confidence. They can go and confirm that the customer's shopping cart is correct.

‧透過商店應用程式,零售商之員工將能夠調整消費者的購物車項目(加入或刪除)。 ‧Through the store app, the retailer's staff will be able to adjust the customer's shopping cart items (add or delete).

6.1.3 標準LP 6.1.3 Standard LP

‧假如購物者正在使用商店應用程式,則其僅需離開商店且被收費。然而,假如其不是的話,則其將必須使用訪客應用程式以針對其購物車中的項目付款。 ‧If the shopper is using the store app, they only need to leave the store and be charged. However, if it is not, it will have to use the guest app to pay for the items in its shopping cart.

‧假如購物者在其離開商店的途中繞過訪客應用程式,則其圖形係指示其必須在離開前被靠近。該應用程式係使用顏色之改變至紅色。人員亦接收潛在損失之通知。 • If a shopper bypasses the guest app on his way out of the store, its graphics indicate that he must be approached before leaving. The app uses a color change to red. Personnel also receive notifications of potential damages.

‧透過商店應用程式,零售商之員工將能夠調整消費者的購物車項目(加入或刪除)。 ‧Through the store app, the retailer's staff will be able to adjust the customer's shopping cart items (add or delete).

6.2 非商店應用程式 6.2 Non-Store Apps

以下分析特徵係表示該平台之額外能力。 The following analytical features represent additional capabilities of the platform.

6.2.1 標準分析 6.2.1 Standard Analysis 1.產品互動: 1. Product interaction:

產品互動之粒度分解,諸如: Granular breakdown of product interactions, such as:

a.針對各產品之互動時間與轉換比。 a. Interaction time and conversion ratio for each product.

b.A/B比較(顏色、式樣,等等)。展示架上之某些較小產品具有多數選項,如顏色、口味,等等。 b. A/B comparison (color, style, etc.). Some of the smaller products on the display stand have multiple options, such as colors, flavors, and more.

‧玫瑰金是否比銀被操作更多? ‧Is rose gold manipulated more than silver?

‧藍色罐子是否比紅色罐子吸引更多互動? ‧Do blue jars attract more interaction than red jars?

2.方向性印象: 2. Directional impression:

得知介於位置為基的印象與購物者的關注在何處之間的差異。假如其正觀看其在15英尺遠的產品(20秒),則該印象不應考量其位於何處,而應考量其正在觀看何處。 Know the difference between location-based impressions and where shoppers are focused. If he is looking at his product at 15 feet away (20 seconds), the impression should not consider where he is, but where he is looking.

3.消費者辨識: 3. Consumer identification:

記住重複購物者及其相關的電子郵件地址(由零售商以多種方式來收集)和購物輪廓。 Remember repeat shoppers and their associated email addresses (collected in a variety of ways by retailers) and shopping profiles.

4.群組動態: 4. Group dynamics:

決定購物者何時在觀看其他人與產品互動。 Determine when shoppers are watching others interact with products.

‧回答該個人之後是否與該產品互動? ‧ Did the person interact with the product after answering?

‧那些人是否一起進入商店、或者其可能是陌生人? • Did those people enter the store together, or could they be strangers?

‧個人還是人群花比較多時間在商店中? ‧Individuals or groups of people spend more time in the store?

5.消費者回陣: 5. Consumers return to the array:

提供消費者目標資訊、公布商店經驗。此特徵可依各零售商而具有稍微不同的實施方式,取決於特定習慣及策略。其可能需要來自零售商之整合及/或開發以採取該特徵。 Provide consumer target information and publish store experience. This feature may have slightly different implementations from retailer to retailer, depending on particular habits and strategies. It may require integration and/or development from retailers to take advantage of this feature.

‧購物者將被詢問其是否希望接收有關其可能有興趣的產品之通知。該步驟可被整合與收集電子郵件之商店的方法。 ‧Shoppers will be asked if they would like to receive notifications about products they may be interested in. This step can be integrated with the store's method of collecting emails.

‧在離開商店後,消費者可接收一封具有其在該商店中花了時間的產品之電子郵件。針對歷時、接觸、及目睹(方向印象)之互動臨限值將被決定。當該臨限值被滿足時,該些產品將進入她的列表且在她離開商店後不久被傳送給她。 • After leaving the store, the consumer can receive an email with the products they spent time in the store. Interaction thresholds for duration, contact, and sighting (directional impressions) will be determined. When the threshold is met, the products will enter her list and be delivered to her shortly after she leaves the store.

此外,或替代地,購物者可在一段時間週期後被傳送一封電子郵件,其係提供折扣產品或其他特殊資訊。這些產品將是他們表達有興趣(但並未購買)的項目。 Additionally, or alternatively, shoppers may be sent an email after a period of time offering discounted products or other special information. These products will be items in which they expressed interest (but did not purchase).

6.3 訪客應用程式 6.3 Guest Applications

購物者應用程式自動地幫人們結帳,當他們離開商店時。然而,平台並未要求購物者需具有或使用購物者應用程式才能使用該商店。 The shopper app automatically checks people out when they leave the store. However, the platform does not require a shopper to have or use the shopper app in order to use the store.

當購物者/人不具有或使用該購物者應用程式時,他們便走向服務台(iPad/平板或其他螢幕)或者他們走向預先安裝的自行結帳機器。該顯示(與該平台整合)將自動地顯示消費者的購物車。 When a shopper/person does not have or use the shopper app, they go to a help desk (iPad/tablet or other screen) or they go to a pre-installed self-checkout machine. The display (integrated with the platform) will automatically display the consumer's shopping cart.

購物者將有機會檢視其顯示了什麼。假如他們同意該顯示上之資訊,則他們可以將現金投入該機器(假如該能力被建入硬體(例如,自行結帳機器)的話)或者他們刷信用卡或轉帳卡。他們可接著離開商店。 Shoppers will have the opportunity to view what it shows. If they agree to the information on the display, they can put cash into the machine (if the capability is built into the hardware (eg, a self-checkout machine)) or they swipe a credit or debit card. They can then leave the store.

假如他們不同意該顯示,則商店人員被告知,藉由他們的選擇來透過觸控螢幕、按鈕、或其他手段提出質疑。(參見商店應用程式之下的商店輔助) If they disagree with the display, store personnel are told to challenge their choice by touch screen, button, or other means. (See Store Assist under Store Apps)

6.4 購物者應用程式 6.4 Shopper App

透過應用程式(購物者應用程式)之使用,消費者可帶著商品離開商店且自動地被收費並提供數位收據。購物者必須在當位於商店的購物區域內時之任何時刻開啟他們的應用程式。該平台將辨識其被顯示於購物者的裝置上之獨特影像。該平台將把他們綁定至他們的帳戶(消費者協會),且無論他們是否保持該應用程式為開啟,將能夠在他們位於商店的購物區域內的所有時間記得他們是誰。 Through the use of the app (shopper app), the consumer can leave the store with the item and be automatically charged and provided with a digital receipt. Shoppers must open their app at any time while in the shopping area of the store. The platform will recognize its unique image displayed on the shopper's device. The platform will bind them to their account (Consumer Association) and will be able to remember who they are all the time they are in the shopping area of the store, whether or not they keep the app open.

當購物者收集項目時,購物者應用程式將顯示該些項目於購物者的購物車中。假如購物者想要,他們 可以觀看有關他們所拾起(亦即,加入到他們的購物車)之各項目的產品資訊。產品資訊被儲存以該商店的系統或者被加至平台。用以更新該資訊之能力(諸如提供產品折扣或顯示價錢)為一種零售商可請求/購買或開發的選項。 When a shopper collects items, the shopper app will display those items in the shopper's cart. If shoppers want, they Product information can be viewed about each item they have picked up (ie, added to their shopping cart). Product information is stored in the store's system or added to the platform. The ability to update this information, such as offering product discounts or displaying prices, is an option that retailers can request/purchase or develop.

當購物者把項目放下時,則其被移除自後端上以及購物者應用程式上之其購物車。 When a shopper drops an item, it is removed from his cart on the backend and on the shopper app.

假如購物者應用程式被開啟,且接著在消費者協會完成後被關閉,則該平台將維持購物者的購物車並正確地向他們收費(一旦他們離開該商店)。 If the shopper app is opened, and then closed after the consumer association is complete, the platform will maintain the shopper's cart and charge them correctly (once they leave the store).

購物者應用程式亦具有關於其開發準則之映射資訊。其可告知消費者去何處找該商店中之項目,假如該消費者藉由鍵入搜尋項目以請求該資訊的話。在稍後的日子,吾人將取得購物者的購物列表(手動地鍵入該應用程式或者透過其他智慧型系統)並顯示通過該商店以收集所有想要的項目之最快速路由。其他過濾器(諸如「裝袋偏好」)可被加入。裝袋偏好過濾器係容許購物者不依循最快速路由,而是先收集最強韌的項目,接著稍後收集較易碎的項目。 The shopper app also has mapping information about its development criteria. It can inform the consumer where to find the item in the store if the consumer requests the information by typing in the search item. At a later date, we will take the shopper's shopping list (either manually typed into the app or through some other intelligent system) and display the quickest route through the store to collect all the desired items. Other filters (such as "Bagging Preferences") can be added. The bagging preference filter allows shoppers not to follow the fastest route, but to collect the toughest items first, followed by the more fragile items later.

7.消費者的類型 7. Types of consumers

會員消費者-第一類型的消費者係使用應用程式以登入該系統。該消費者被提示以一圖片且當她/他按壓時,該系統會將其鏈結至該消費者的內部id。假如該消費者具有帳戶,則該帳戶被自動地收費(當該消費者 走出該商店時)。此為會員為基的商店。 Member Consumers - The first type of consumers use an app to log into the system. The consumer is prompted with a picture and when she/he presses, the system will link it to the consumer's internal id. If the consumer has an account, the account is automatically charged (when the consumer when leaving the store). This is a member based store.

訪客消費者-不是每個商店將具有會員制度,或者消費者可能沒有智慧型手機或信用卡。此類型的消費者將向服務台。該服務台將顯示該消費者所具有的項目且將要求該消費者放入金錢。該服務台將已得知有關該消費者已購買的所有項目。針對此類型的消費者,該系統能夠識別該消費者是否尚未針對購物車中的項目付款,並提示在門上的收銀機(在該消費者到達那裡之前)以讓收銀機得知有關未付款的項目。該系統亦可針對一個尚未被付款的項目提示,或者該系統具有關一個項目的低信心。此被稱為預測路徑找尋。 Guest consumer - Not every store will have a membership system, or the consumer may not have a smartphone or credit card. This type of consumer walks up to the kiosk. The kiosk will display the items the consumer has and will ask the consumer to put in money. The kiosk will already know about all the items the consumer has picked up. For this type of consumer, the system can identify whether the consumer has not yet paid for the items in the shopping cart and prompt the cash register at the door (before the consumer gets there) so that the cash register knows about the unpaid items. The system can also flag an item that has not been paid for, or an item about which the system has low confidence. This is called predictive path finding.

該系統係根據信心位準以將顏色碼(綠或黃)指派給行走在該商店中的消費者。綠色編碼的消費者是已登入該系統或者是該系統具有關於他們的高信心。黃色編碼的消費者具有其尚未被預測以高信心的一或更多項目。店員可觀看黃色點並按壓它們以識別問題項目,走向該消費者並解決問題。 The system assigns color codes (green or yellow) to consumers walking through the store based on confidence levels. Green-coded consumers are either logged into the system or have high confidence in the system about them. Consumers coded in yellow have one or more items for which they have not been predicted with high confidence. The clerk can look at the yellow dots and press on them to identify the problem item, walk up to the customer and fix the problem.

8.分析 8. Analysis

關於該消費者收集了一大群分析資訊,諸如消費者在特定貨架前方花費了多少時間。此外,該系統係追蹤消費者正觀看的位置(關於該系統之印象),以及消費者拾起並放回貨架的項目。此等分析目前可用於電子商務但尚未可用於零售商店。 A large group of analytical information is collected about this consumer, such as how much time the consumer spends in front of a particular shelf. In addition, the system tracks where the consumer is looking (impressions about the system), as well as the items that the consumer picks up and puts back on the shelf. Such analytics are currently available for e-commerce but not yet for retail stores.

9.功能性模組 9. Functional modules

以下為功能性模組之列表: The following is a list of functional modules:

1.使用同步化相機之商店中影像的系統擷取陣列。 1. Use a systematic capture array of images in a store that synchronizes cameras.

2.用以識別影像中之關節、及各別人的關節之集合的系統。 2. A system for identifying joints in images, and sets of joints of individuals.

3.用以使用關節集合來產生新人的系統。 3. A system for generating new people using joint sets.

4.用以使用關節集合來刪除幽靈人的系統。 4. A system to delete ghost people using joint sets.

5.用以藉由追蹤關節集合來隨著時間推移追蹤各別人的系統。 5. A system to track individuals over time by tracking sets of joints.

6.用以針對該商店中所存在的各人產生區建議的系統,其係指示手中之項目的SKU數(WhatCNN)。 6. A system to generate zone suggestions for each person present in the store, which indicates the SKU number of the item in hand (WhatCNN).

7.用以履行針對區提議之獲取/放下分析的系統,其係指示手中的項目是被拾起或是被放在貨架上(WhenCNN)。 7. A system to perform a get/put analysis for district proposals that indicates whether the item in hand is picked up or put on the shelf (WhenCNN).

8.用以使用區提議及獲取/放下分析來產生每人之存貨陣列的系統(與啟發法、人的已儲存關節位置、及商店貨架上之項目的預先計算映圖結合之WhenCNN的輸出)。 8. System to generate per-person inventory arrays using zone proposal and get/put analysis (output of WhenCNN combined with heuristics, person's stored joint positions, and precomputed maps of items on store shelves) .

9.用以識別、追蹤及更新貨架上之錯置的項目之位置的系統。 9. A system for identifying, tracking and updating the location of misplaced items on shelves.

10.用以使用像素為基的分析來追蹤對於貨架上之項目的改變(獲取/放下)之系統。 10. A system to track changes (gets/puts) to items on the shelf using pixel-based analysis.

11.用以履行商店之存貨稽查的系統。 11. A system for performing inventory audits of stores.

12.用以識別手中之多數項目的系統。 12. A system for identifying the majority of items in hand.

13.用以使用購物腳本來收集來自商店之項目影像資料的系統。 13. A system for collecting item image data from stores using shopping scripts.

14.用以履行結帳並從會員消費者收款的系統。 14. A system to perform checkout and collect payments from member consumers.

15.用以履行結帳並從訪客消費者收款的系統。 15. A system to perform checkouts and collect payments from guest consumers.

16.用以藉由識別購物車中之未付款項目來履行損失預防的系統。 16. A system for performing loss prevention by identifying unpaid items in a shopping cart.

17.用以使用顏色碼來追蹤消費者以協助店員識別消費者的購物車中之不正確識別的項目之系統。 17. A system for tracking consumers using color codes to assist store associates in identifying incorrectly identified items in a consumer's shopping cart.

18.用以產生消費者購物分析之系統,其包括位置為基的印象、方向性印象、A/B分析、消費者辨識、群組動態,等等。 18. A system for generating consumer shopping analytics including location-based impressions, directional impressions, A/B analysis, consumer identification, group dynamics, and the like.

19.用以使用購物分析來產生針對性的消費者回陣之系統。 19. A system for generating targeted consumer responses using shopping analytics.

20.用以產生商店之熱映圖重疊圖來視覺化不同活動的系統。 20. A system for generating heatmap overlays of stores to visualize different activities.

文中所述之技術可支援無出納員結帳。去商店。取走東西。離開。 The techniques described in this article can support cashierless checkout. go to the shop. take something away. leave.

無出納員結帳是一種純機器視覺及深學習為基的系統。購物者跳過排隊而更快速且更輕易地獲得他們想要的。無RFID標籤。對於商店的後端系統無改變。可與銷售及存貨管理系統之第三方點(3rd party Point of Sale and Inventory Management systems)整合。 Cashierless checkout is a pure machine vision and deep learning based system. Shoppers skip the line to get what they want faster and easier. No RFID tags. No changes to the store's backend systems. Can be integrated with 3rd party Point of Sale and Inventory Management systems.

每一視頻饋送之即時30 FPS分析。 Real-time 30 FPS analysis of each video feed.

預置的、尖端的GPU叢集。 Pre-built, cutting-edge GPU clusters.

辨識購物者以及與他們互動的項目。 Identify shoppers and the items they interact with.

於範例實施例中並無網際網路依存性。 There is no internet dependency in the example embodiment.

多數最先進深學習模型(包括專屬訂製演算法),用以首次解決機器視覺技術中的間隙。 Most state-of-the-art deep learning models (including proprietary custom algorithms) to address gaps in machine vision technology for the first time.

技術及能力(Techniques & Capabilities)包括以下: Techniques & Capabilities include the following:

1.標準認知的機器學習管線係解決: 1. Standard cognitive machine learning pipeline system to solve:

a)人檢測。 a) Human detection.

b)單體追蹤。 b) Entity tracking.

c)多數相機個人同意。 c) Multi-camera person agreement.

d)手檢測。 d) Hand detection.

e)項目分類。 e) Item classification.

f)項目所有權解決。 f) Project ownership resolution.

結合這些技術,吾人可: Combining these techniques, we can:

1.遍及其即時的購物經驗以追蹤所有人。 1. Track everyone throughout their instant shopping experience.

2.得知購物者手中有什麼、他們站在哪裡、以及他們放回了什麼。 2. Know what shoppers have in their hands, where they stand, and what they put back.

3.得知購物者面對什麼方向以及多久。 3. Know what direction shoppers are facing and for how long.

4.辨識錯置的項目並履行24/7視覺推銷稽查。 4. Identify misplaced items and perform 24/7 visual merchandising audits.

可檢測購物者手中以及其籃子中確實有什麼。 Detects what a shopper actually has in his or her basket.

學習你的商店: Learn about your store:

對於特定商店及項目所訓練的訂製神經網路。訓練資料可再使用橫跨所有商店位置。 Custom neural networks trained for specific stores and items. Training data can be reused across all store locations.

標準部署: Standard deployment:

天花板相機必須被安裝以該商店之所有區域的雙重覆蓋。針對一般走道需要介於2與6之間的相機。 Ceiling cameras must be installed with double coverage of all areas of the store. Between 2 and 6 cameras are required for general walkways.

預置GPU叢集可配適入後端辦公室中的一或兩個伺服器架。 Pre-built GPU clusters can fit into one or two server racks in the back office.

範例系統可整合與或者包括銷售及存貨管理系統之點。 Example systems may integrate with or include points of sales and inventory management systems.

使用同步化相機以擷取商店中之影像的陣列之第一系統、方法及電腦程式產品。 A first system, method and computer program product for capturing an array of images in a store using synchronized cameras.

用以識別影像中之關節、及各別人的關節之集合的第二系統、方法及電腦程式產品。 A second system, method, and computer program product for identifying joints in an image, and sets of joints of individual people.

使用關節集合以產生新人的第三系統、方法及電腦程式產品。 A third system, method, and computer program product for generating a new person using joint assembly.

使用關節集合以刪除幽靈人的第四系統、方法及電腦程式產品。 A fourth system, method and computer program product for removing ghost people using joint sets.

藉由追蹤關節集合以隨著時間推移追蹤各別人的第五系統、方法及電腦程式產品。 A fifth system, method and computer program product for tracking individuals over time by tracking sets of joints.

用以針對該商店中所存在的各人產生區建議的第六系統、方法及電腦程式產品,其係指示手中之項目的SKU數(WhatCNN)。 A sixth system, method and computer program product for generating zone recommendations for each person present in the store, indicating the number of SKUs (WhatCNN) for the item in hand.

用以履行針對區提議之獲取/放下分析的第七系統、方法及電腦程式產品,其係指示手中的項目是被拾起或是被放在貨架上(WhenCNN)。 A seventh system, method and computer program product for performing take/put analysis on region proposals, indicating whether the item in hand was picked up from or placed on a shelf (WhenCNN).

用以識別、追蹤及更新貨架上之錯置的項目之位置的第九系統、方法及電腦程式產品。 Ninth system, method and computer program product for identifying, tracking and updating the location of misplaced items on a shelf.

使用像素為基的分析以追蹤對於貨架上之項目的改變(獲取/放下)之第十系統、方法及電腦程式產品。 A tenth system, method, and computer program product using pixel-based analysis to track changes (gets/puts) to items on a shelf.

用以履行商店之存貨稽查的第十一系統、方法及電腦程式產品。 Eleventh system, method and computer program product for performing an inventory audit of a store.

用以識別手中之多數項目的第十二系統、方法及電腦程式產品。 Twelfth system, method and computer program product for identifying a majority of items in hand.

使用購物腳本以收集來自商店之項目影像資料的第十三系統、方法及電腦程式產品。 Thirteenth system, method and computer program product for using a shopping script to collect item image data from a store.

用以履行結帳並從會員消費者收款的第十四系統、方法及電腦程式產品。 Fourteenth system, method and computer program product for performing checkout and collecting payment from member consumers.

用以履行結帳並從訪客消費者收款的第十五系統、方法及電腦程式產品。 Fifteenth system, method, and computer program product for performing checkout and collecting payments from guest consumers.

藉由識別購物車中之未付款項目以履行損失預防的第十六系統、方法及電腦程式產品。 A sixteenth system, method and computer program product for performing loss prevention by identifying unpaid items in a shopping cart.

使用(例如)顏色碼來追蹤消費者以協助店員識別消費者的購物車中之不正確識別的項目之第十七系統、方法及電腦程式產品。 A seventeenth system, method and computer program product for tracking consumers using, for example, color codes to assist store employees in identifying incorrectly identified items in a consumer's shopping cart.

用以產生消費者購物分析之第十八系統、方法及電腦程式產品,該些分析包括一或更多位置為基的印象、方向性印象、A/B分析、消費者辨識、群組動態,等等。 Eighteenth system, method and computer program product for generating consumer shopping analytics including one or more location-based impressions, directional impressions, A/B analysis, consumer identification, group dynamics, and many more.

使用購物分析以產生針對性的消費者回陣之第十九系統、方法及電腦程式產品。 A nineteenth system, method and computer program product for generating targeted consumer responses using shopping analytics.

用以產生商店之熱映圖重疊圖來視覺化不同活動的第二十系統、方法及電腦程式產品。 A twentieth system, method and computer program product for generating a heat map overlay of a store to visualize different activities.

用於手檢測之第二十一系統、方法及電腦程式。 Twenty-first system, method and computer program for hand detection.

用於項目分類之第二十二系統、方法及電腦程式。 Twenty-second system, method and computer program for item classification.

用於項目所有權解決之第二十三系統、方法及電腦程式。 Twenty-third system, method and computer program for project title resolution.

用於項目人檢測之第二十四系統、方法及電腦程式。 Twenty-fourth system, method and computer program for subject person detection.

用於項目單體追蹤之第二十五系統、方法及電腦程式。 A twenty-fifth system, method and computer program for entity tracking.

用於項目多數相機個人同意之第二十六方法及電腦程式。 A twenty-sixth method and computer program for multi-camera person agreement.

實質上如文中所述之用於無出納員結帳的第二十七系統、方法及電腦程式產品。 A twenty-seventh system, method and computer program product substantially as herein described for cashierless checkout.

系統1-26之任一者與任何其他系統或以上列出的系統1-26中之系統的組合。 A combination of any one of systems 1-26 with any other system, or with any of the systems among systems 1-26 listed above.

文中所述者為一種用以追蹤真實空間之區域中藉由主體之存貨項目的放下及取走之方法,包含:使用複數相機以產生該真實空間中之相應觀看域的影像之各別序列,各相機之該觀看域係與該些複數相機中之至少一其他相機的該觀看域重疊;接收來自該些複數相機之影像的該些序列,並使用第一影像辨識引擎以處理影像來產生其識別該真實空間中之該些已識別主體的主體及位置之第一資料集;處理第一資料集以指明其包括影像之該些序列中的影像中之已識別主體的手之影像的定界框;接收來自該些複數相機之影像的該些序列,並處理該些影像中之該些已指明定界框以使用第二影像辨識引擎來產生該些已識別主體的手之分類,該分類包括該已識別主體是否正持有存貨項目、第一接近度類別,其係指示該已識別主體的手相對於貨架之位置、第二接近度類別,其係指示該已識別主體的手相對於該已識別主體的身體之位置、第三接近度類別,其係指示該已識別主體的手相對於與已識別主體關聯的籃子之位置、及可能存貨項目之識別符;以及處理已識別主體的影像之該些序列中的影像之集合的手之類別以檢測由已識別主體取走存貨項目以及由已識別主體放下存貨項目於存貨展示結構上。 Described herein is a method for tracking the placement and removal of inventory items by a subject in an area of real space, comprising: using a plurality of cameras to generate respective sequences of images of corresponding viewing areas in the real space, The viewing field of each camera overlaps the viewing field of at least one other camera of the plurality of cameras; receiving the sequences of images from the plurality of cameras and using a first image recognition engine to process the images to generate its identifying a first data set of subjects and positions of the identified subjects in the real space; processing the first data set to specify that it includes a delimitation of images of the identified subjects' hands in the images in the sequence of images boxes; receiving the sequences of images from the plurality of cameras and processing the specified bounding boxes in the images to use a second image recognition engine to generate a classification of the identified subject's hands, the classification Includes whether the identified subject is holding an inventory item, a first proximity category indicating the position of the identified subject's hand relative to the shelf, and a second proximity category indicating the identified subject's hand relative to the identified subject's hand. the position of the identified subject's body, a third proximity class indicating the position of the identified subject's hand relative to the basket associated with the identified subject, and an identifier for possible inventory items; and the processing of the identified subject's image The hand class of a collection of images in these sequences to detect the taking of inventory items by the identified subject and the placing of inventory items by the identified subject on the inventory display structure.

於此描述的方法中,該些第一資料集可針對 各已識別主體包含具有真實空間中之座標的候選關節之集合。 In the methods described herein, the first data sets may be for Each identified subject contains a set of candidate joints with coordinates in real space.

此描述的方法可包括處理該些第一資料集以指明定界框包括根據針對各主體的候選關節之該些集合中的關節之位置以指明定界框。 The method of this description can include processing the first sets of data to specify bounding boxes including specifying bounding boxes based on positions of joints in the sets of candidate joints for each subject.

於此描述的方法中,該些第一及第二影像辨識引擎之一者或兩者可包含卷積神經網路。 In the methods described herein, one or both of the first and second image recognition engines may include convolutional neural networks.

此描述的方法可包括使用卷積神經網路以處理定界框之該些類別。 The methods of this description may include using a convolutional neural network to process the classes of bounding boxes.

描述一種電腦程式產品(及產品),其包括電腦可讀取記憶體,其包含非暫態資料儲存媒體,儲存於該記憶體中之電腦指令可由電腦所執行以追蹤真實空間之區域中藉由主體的存貨項目之放下及取走,藉由文中所述之程序的任一者。 Describes a computer program product (and product) comprising computer readable memory comprising a non-transitory data storage medium in which computer instructions stored can be executed by a computer to track an area of real space by Items of inventory of the entity are put down and removed by any of the procedures described herein.

描述一種系統,包含複數相機,其係產生包括主體的手之影像的序列;及處理系統,其係耦合至該些複數相機,該處理系統包括手影像辨識引擎,其係接收影像的該些序列以產生該手的類別於時間序列中、及邏輯,用以處理來自影像的該些序列之該手的該些類別來識別藉由該主體的動作,其中該動作為存貨項目的放下及取走之一。 A system is described that includes a plurality of cameras that generate sequences of images including a subject's hands; and a processing system coupled to the plurality of cameras, the processing system including a hand image recognition engine that receives the sequences of images Actions by the subject are identified by generating the class of the hand in the time series, and logic to process the classes of the hand from the sequences of images, where the action is the placing and taking of inventory items one.

該系統可包括邏輯,用以識別影像的該些序列中之該些影像中的該主體的關節之位置,及用以根據該些已識別關節來識別其包括該主體的該些手之相應影像中 的定界框。 The system may include logic to identify the positions of the joints of the subject in the images of the sequences of images, and to identify corresponding images that include the hands of the subject based on the identified joints middle the bounding box.

電腦程式列表附錄係依附於本說明書,且包括用以實施本申請案中所提供之系統的某些部分之電腦程式的範例之部分。該附錄包括啟發法之範例,用以識別主體之關節及存貨項目。該附錄提出電腦程式碼,用以更新主體的購物車資料結構。該附錄亦包括電腦程式常式,用以計算於卷積神經網路之訓練期間的學習率。該附錄包括電腦程式常式,用以將來自卷積神經網路之主體的手之分類結果儲存於來自各相機之每影像框的每主體之每手的資料結構中。 The Computer Program Listing Appendix is attached to this specification and includes portions of examples of computer programs used to implement portions of the systems provided in this application. This appendix includes examples of heuristics for identifying body joints and inventory items. This appendix presents computer code for updating a subject's shopping cart data structure. This appendix also includes computer programming routines for calculating the learning rate during the training of convolutional neural networks. This appendix includes computer program routines to store the results of the subject's hand classification from the convolutional neural network in a per-subject per-hand data structure per frame from each camera.

112a-112n:影像辨識引擎 112a-112n: Image recognition engine

114:相機 114: Camera

800:資料結構 800: Data Structure

1502:循環緩衝器 1502: Circular buffer

1504:定界框產生器 1504: Bounding Box Generator

1506:WhatCNN 1506: WhatCNN

1508:WhenCNN 1508:WhenCNN

1510:購物車資料結構 1510: Shopping Cart Data Structure

2602:第一影像處理器子系統 2602: First Image Processor Subsystem

2604:第二影像處理器子系統 2604: Second Image Processor Subsystem

2606:第三影像處理器子系統 2606: Third Image Processor Subsystem

2608:選擇邏輯組件 2608: Select Logic Components

2702:遮罩邏輯組件 2702: Mask Logic Components

2704:背景影像儲存 2704: Background image storage

2706:因數化影像 2706: Factorized Imagery

2710:位元遮罩計算器 2710: Bit Mask Calculator

2714a-2714n:改變CNN 2714a-2714n: Changes to CNN

2718:協調邏輯組件 2718: Coordination Logic Components

2720:日誌產生器 2720: log generator

2724:遮罩產生器 2724: Mask Generator

Claims (30)

一種用以追蹤真實空間之區域中的多關節主體之系統,包含:複數相機,該些複數相機中之相機係產生該真實空間中之相應觀看域的影像之各別序列,各相機之該觀看域係與該些複數相機中之至少一其他相機的該觀看域重疊;及處理系統,其係耦合至該些複數相機,該處理系統包括:影像辨識引擎,接收來自該些複數相機之影像的該些序列,其係處理影像以產生關節資料結構的相應陣列,相應於特定影像之關節資料結構的該些陣列係藉由關節類型、該特定影像之時間、及該特定影像中之元件的座標來分類該些特定影像之該些元件;追蹤引擎,組態成接收相應於來自具有重疊觀看域之相機的影像之序列中的影像之關節資料結構的該些陣列,並將相應於不同序列中的影像之關節資料結構的該些陣列中之該些元件的該些座標變換為具有該真實空間中之座標的候選關節;及邏輯,用以將具有該真實空間中之座標的候選關節之集合識別為該真實空間中之多關節主體。 A system for tracking a multi-joint subject in an area of real space, comprising: a plurality of cameras, the cameras of the plurality of cameras produce respective sequences of images of corresponding viewing fields in the real space, the viewing a field overlapping the viewing field of at least one other camera of the plurality of cameras; and a processing system coupled to the plurality of cameras, the processing system comprising: an image recognition engine receiving images from the plurality of cameras The sequences, which process the images to generate corresponding arrays of joint data structures, the arrays of joint data structures corresponding to a particular image by the joint type, the time of the particular image, and the coordinates of the elements in the particular image to classify the elements of the particular images; a tracking engine configured to receive the arrays of joint data structures corresponding to images in a sequence of images from cameras having overlapping viewing fields, and to correspond to the arrays of joint data structures in a different sequence transform the coordinates of the elements in the arrays of the joint data structure of the image into candidate joints with coordinates in the real space; and logic to convert the set of candidate joints with coordinates in the real space Recognized as a multi-joint body in this real space. 如申請專利範圍第1項之系統,其中該些影像辨識引擎包含卷積神經網路。 The system of claim 1, wherein the image recognition engines comprise convolutional neural networks. 如申請專利範圍第1項之系統,其中該些影像辨識引擎係處理影像以產生針對該些影像之元件的信心陣列,其中影像之特定元件的信心陣列包括該特定元件之複數關節類型的信心值,及根據該信心陣列以選擇針對該特定元件之該關節資料結構的關節類型。 The system of claim 1, wherein the image recognition engines process images to generate a confidence array for elements of the images, wherein the confidence array for a particular element of the image includes confidence values for a plurality of joint types of the particular element , and selects the joint type of the joint data structure for the particular element according to the confidence array. 如申請專利範圍第1項之系統,其中用以識別候選關節之集合的該邏輯包含根據該真實空間中之主體的關節之間的物理關係之啟發函數,用以將候選關節之集合識別為多關節主體。 The system of claim 1, wherein the logic for identifying the set of candidate joints includes a heuristic function for identifying the set of candidate joints as multiple based on physical relationships between joints of subjects in the real space Joint body. 如申請專利範圍第4項之系統,包括邏輯,用以儲存其被識別為該些多關節主體之關節的該些集合,及其中用以識別候選關節之集合的該邏輯包括邏輯,用以判定在特定時間所取得之影像中所識別的候選關節是否符合其被識別為先前影像中之多關節主體的候選關節之該些集合之一的成員。 The system of claim 4 includes logic for storing the sets of joints identified as the multi-joint bodies, and wherein the logic for identifying the set of candidate joints includes logic for determining Whether a candidate joint identified in an image taken at a particular time corresponds to a member of one of the sets of candidate joints that were identified as candidate joints of a multi-joint subject in a previous image. 
6. The system of claim 1, further including logic to process the sets of candidate joints identified as multi-joint subjects over time to detect interaction events between the identified multi-joint subjects and inventory items in the area of real space. 7. The system of claim 1, wherein the plurality of cameras comprises cameras disposed overhead with fields of view covering respective portions of the area of real space, and the coordinates in real space of members of a set of candidate joints identified as a multi-joint subject identify locations of the multi-joint subject in the area of real space. 8. The system of claim 1, including logic to track locations of a plurality of multi-joint subjects in the area of real space. 9. The system of claim 8, including logic to determine when a multi-joint subject in the plurality of multi-joint subjects leaves the area of real space. 10. The system of claim 1, including logic to track locations in the area of real space of a plurality of candidate joints that are members of a set of candidate joints identified as a particular multi-joint subject. 11. A method for tracking multi-joint subjects in an area of real space, comprising: using a plurality of cameras to produce respective sequences of images of corresponding fields of view in the real space, the field of view of each camera overlapping with the field of view of at least one other camera in the plurality of cameras; processing images in the sequences of images to generate corresponding arrays of joint data structures, the arrays of joint data structures corresponding to a particular image classifying elements of the particular image by joint type, time of the particular image, and coordinates of the elements in the particular image; translating the coordinates of the elements in the arrays of joint data structures corresponding to images in different sequences into candidate joints having coordinates in the real space; and identifying sets of candidate joints having coordinates in the real space as multi-joint subjects in the real space. 12. The method of claim 11, wherein processing the images includes using convolutional neural networks. 13. The method of claim 11, wherein processing the images includes generating confidence arrays for elements of the images, wherein a confidence array for a particular element of an image includes confidence values for a plurality of joint types for the particular element, and selecting the joint type for the joint data structure of the particular element based on the confidence array.
14. The method of claim 11, wherein identifying sets of candidate joints comprises applying heuristic functions, based on physical relationships among joints of subjects in the real space, to identify sets of candidate joints as multi-joint subjects. 15. The method of claim 14, including storing the sets of joints identified as multi-joint subjects, and wherein identifying sets of candidate joints includes determining whether a candidate joint identified in an image taken at a particular time corresponds to a member of one of the sets of candidate joints identified as multi-joint subjects in preceding images. 16. The method of claim 11, wherein the sequences of images are synchronized. 17. The method of claim 11, wherein the plurality of cameras comprises cameras disposed overhead with fields of view covering respective portions of the area of real space, and the coordinates in real space of members of a set of candidate joints identified as a multi-joint subject identify locations of the multi-joint subject in the area of real space. 18. The method of claim 11, including tracking locations of a plurality of multi-joint subjects in the area of real space. 19. The method of claim 18, including determining when a multi-joint subject in the plurality of multi-joint subjects leaves the area of real space. 20. The method of claim 11, including tracking locations in the area of real space of a plurality of candidate joints that are members of a set of candidate joints identified as a particular multi-joint subject.
21. A computer program product, comprising: a computer readable memory comprising a non-transitory data storage medium; and computer instructions stored in the memory, executable by a computer to track multi-joint subjects in an area of real space by a process including: using sequences of images from a plurality of cameras having corresponding fields of view in the real space, the field of view of each camera overlapping with the field of view of at least one other camera in the plurality of cameras; processing images in the sequences of images to generate corresponding arrays of joint data structures, the arrays of joint data structures corresponding to a particular image classifying elements of the particular image by joint type, time of the particular image, and coordinates of the elements in the particular image; translating the coordinates of the elements in the arrays of joint data structures corresponding to images in different sequences into candidate joints having coordinates in the real space; and identifying sets of candidate joints having coordinates in the real space as multi-joint subjects in the real space. 22. The product of claim 21, wherein processing the images includes using convolutional neural networks. 23. The product of claim 21, wherein processing the images includes generating confidence arrays for elements of the images, wherein a confidence array for a particular element of an image includes confidence values for a plurality of joint types for the particular element, and selecting the joint type for the joint data structure of the particular element based on the confidence array. 24. The product of claim 21, wherein identifying sets of candidate joints comprises applying heuristic functions, based on physical relationships among joints of subjects in the real space, to identify sets of candidate joints as multi-joint subjects. 25. The product of claim 24, including storing the sets of joints identified as multi-joint subjects, and wherein identifying sets of candidate joints includes determining whether a candidate joint identified in an image taken at a particular time corresponds to a member of one of the sets of candidate joints identified as multi-joint subjects in preceding images. 26. The product of claim 21, wherein the sequences of images are synchronized.
27. The product of claim 21, wherein the plurality of cameras comprises cameras disposed overhead with fields of view covering respective portions of the area of real space, and the coordinates in real space of members of a set of candidate joints identified as a multi-joint subject identify locations of the multi-joint subject in the area of real space. 28. The product of claim 21, including tracking locations of a plurality of multi-joint subjects in the area of real space. 29. The product of claim 28, including determining when a multi-joint subject in the plurality of multi-joint subjects leaves the area of real space. 30. The product of claim 21, including tracking locations in the area of real space of a plurality of candidate joints that are members of a set of candidate joints identified as a particular multi-joint subject.
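Claims 1, 3, 11, 13, 21 and 23 recite arrays of joint data structures that classify image elements by joint type, time of the image and coordinates of the element, with the joint type selected from a confidence array over a plurality of joint types. The following is a minimal illustrative sketch in Python of how such a per-element structure and selection step could look; the names (JOINT_TYPES, JointDataStructure, classify_element) and the joint-type list are assumptions for illustration, not the patented implementation.

# Hypothetical sketch of the per-element joint data structure and
# confidence-based joint-type selection described in the claims above.
from dataclasses import dataclass
from typing import List, Tuple

JOINT_TYPES = ["ankle", "knee", "hip", "wrist", "elbow", "shoulder", "neck", "head"]

@dataclass
class JointDataStructure:
    joint_type: str          # joint type selected for this image element
    frame_time: float        # time of the particular image
    coords: Tuple[int, int]  # (x, y) coordinates of the element in the image
    confidence: List[float]  # confidence value per joint type for this element

def classify_element(confidences: List[float], frame_time: float,
                     coords: Tuple[int, int]) -> JointDataStructure:
    """Select the joint type with the highest confidence value for one element."""
    best = max(range(len(JOINT_TYPES)), key=lambda i: confidences[i])
    return JointDataStructure(JOINT_TYPES[best], frame_time, coords, confidences)

# Example: an element at pixel (412, 230) in a frame taken at t = 17.2 s
element = classify_element([0.01, 0.02, 0.05, 0.80, 0.07, 0.03, 0.01, 0.01],
                           17.2, (412, 230))
print(element.joint_type)  # "wrist"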
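Claims 1, 11 and 21 also recite translating the coordinates of elements seen by cameras with overlapping fields of view into candidate joints having coordinates in the real space. A minimal sketch follows, assuming calibrated cameras with known 3x4 projection matrices and standard linear (DLT) triangulation; the specification may use a different procedure, so this is only an illustrative stand-in, and the calibration values below are synthetic.

# Hypothetical sketch: triangulating the same joint seen by two overlapping
# cameras into a candidate joint with real-space (X, Y, Z) coordinates.
import numpy as np

def triangulate(p1, p2, P1, P2):
    """p1, p2: (x, y) pixel coordinates of one joint in two cameras.
    P1, P2: 3x4 projection matrices from camera calibration."""
    A = np.vstack([
        p1[0] * P1[2] - P1[0],
        p1[1] * P1[2] - P1[1],
        p2[0] * P2[2] - P2[0],
        p2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)      # null space of A holds the homogeneous point
    X = Vt[-1]
    return X[:3] / X[3]

# Example with two synthetic cameras (made-up calibration values)
P1 = np.hstack([np.eye(3), np.array([[0.0], [0.0], [5.0]])])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [5.0]])])
point = np.array([1.0, 2.0, 3.0, 1.0])
p1 = P1 @ point; p1 = p1[:2] / p1[2]
p2 = P2 @ point; p2 = p2[:2] / p2[2]
print(triangulate(p1, p2, P1, P2))   # approximately [1. 2. 3.]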
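Claims 4-5, 14-15 and 24-25 recite heuristic functions based on physical relationships among joints of subjects, plus a check of whether a candidate joint in a new image belongs to a set already identified as a multi-joint subject in preceding images. The sketch below illustrates the general idea with made-up distance thresholds (MAX_NECK_SHOULDER_M, MAX_FRAME_MOVE_M) and a deliberately simplified neck/shoulder grouping; it is not the heuristic set actually described in the specification.

# Hypothetical sketch of joint-grouping heuristics and frame-to-frame matching.
import math

MAX_NECK_SHOULDER_M = 0.35   # assumed plausible neck-to-shoulder distance, metres
MAX_FRAME_MOVE_M = 0.25      # assumed plausible joint movement between frames, metres

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def group_subjects(necks, shoulders):
    """Start one subject per neck joint; attach each shoulder joint to the
    nearest neck if the physical-relationship heuristic is satisfied."""
    subjects = [{"neck": n, "shoulders": []} for n in necks]
    for s in shoulders:
        best = min(subjects, key=lambda subj: dist(subj["neck"], s), default=None)
        if best is not None and dist(best["neck"], s) <= MAX_NECK_SHOULDER_M:
            best["shoulders"].append(s)
    return subjects

def matches_previous(candidate, previous_subject_joints):
    """Does a candidate joint in the current frame plausibly belong to a
    subject identified in a preceding frame?"""
    return any(dist(candidate, j) <= MAX_FRAME_MOVE_M for j in previous_subject_joints)

subjects = group_subjects(necks=[(1.0, 2.0, 1.6)], shoulders=[(1.2, 2.1, 1.5)])
print(matches_previous((1.05, 2.0, 1.6), [(1.0, 2.0, 1.6)]))  # True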
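Claims 9, 19 and 29 recite determining when a tracked multi-joint subject leaves the area of real space. One simple way to express such a test, assuming an axis-aligned footprint for the monitored area (AREA_BOUNDS is hypothetical and not taken from the specification), is sketched below.

# Hypothetical sketch: a subject is treated as having left the area when none
# of its tracked candidate joints remains inside the area's footprint.
AREA_BOUNDS = ((0.0, 20.0), (0.0, 12.0))   # (x_min, x_max), (y_min, y_max) in metres

def subject_left_area(joint_positions):
    def inside(p):
        (x0, x1), (y0, y1) = AREA_BOUNDS
        return x0 <= p[0] <= x1 and y0 <= p[1] <= y1
    return not any(inside(p) for p in joint_positions)

print(subject_left_area([(21.3, 5.0, 1.6), (21.5, 5.1, 1.4)]))  # True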
TW107126341A 2017-08-07 2018-07-30 System, method and computer program product for tracking multi-joint subjects in an area of real space TWI773797B (en)

Applications Claiming Priority (10)

Application Number Priority Date Filing Date Title
US201762542077P 2017-08-07 2017-08-07
US62/542,077 2017-08-07
US15/847,796 US10055853B1 (en) 2017-08-07 2017-12-19 Subject identification and tracking using image recognition
US15/847,796 2017-12-19
US15/907,112 US10133933B1 (en) 2017-08-07 2018-02-27 Item put and take detection using image recognition
US15/907,112 2018-02-27
US15/945,473 2018-04-04
US15/945,466 US10127438B1 (en) 2017-08-07 2018-04-04 Predicting inventory events using semantic diffing
US15/945,466 2018-04-04
US15/945,473 US10474988B2 (en) 2017-08-07 2018-04-04 Predicting inventory events using foreground/background processing

Publications (2)

Publication Number Publication Date
TW201911119A TW201911119A (en) 2019-03-16
TWI773797B true TWI773797B (en) 2022-08-11

Family

ID=65271305

Family Applications (1)

Application Number Title Priority Date Filing Date
TW107126341A TWI773797B (en) 2017-08-07 2018-07-30 System, method and computer program product for tracking multi-joint subjects in an area of real space

Country Status (5)

Country Link
EP (4) EP3665648A4 (en)
JP (4) JP7208974B2 (en)
CA (4) CA3072062A1 (en)
TW (1) TWI773797B (en)
WO (4) WO2019032304A1 (en)

Families Citing this family (103)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114040153B (en) 2016-05-09 2024-04-12 格拉班谷公司 System for computer vision driven applications within an environment
WO2018013439A1 (en) 2016-07-09 2018-01-18 Grabango Co. Remote state following devices
CN110462669B (en) 2017-02-10 2023-08-11 格拉班谷公司 Dynamic customer checkout experience within an automated shopping environment
GB2560177A (en) 2017-03-01 2018-09-05 Thirdeye Labs Ltd Training a computational neural network
GB2560387B (en) 2017-03-10 2022-03-09 Standard Cognition Corp Action identification using neural networks
US10778906B2 (en) 2017-05-10 2020-09-15 Grabango Co. Series-configured camera array for efficient deployment
AU2018289552B2 (en) 2017-06-21 2023-01-05 Grabango Co. Linking observed human activity on video to a user account
US11023850B2 (en) 2017-08-07 2021-06-01 Standard Cognition, Corp. Realtime inventory location management using deep learning
US10474991B2 (en) 2017-08-07 2019-11-12 Standard Cognition, Corp. Deep learning-based store realograms
US10474988B2 (en) 2017-08-07 2019-11-12 Standard Cognition, Corp. Predicting inventory events using foreground/background processing
US10853965B2 (en) 2017-08-07 2020-12-01 Standard Cognition, Corp Directional impression analysis using deep learning
US10650545B2 (en) 2017-08-07 2020-05-12 Standard Cognition, Corp. Systems and methods to check-in shoppers in a cashier-less store
US11250376B2 (en) 2017-08-07 2022-02-15 Standard Cognition, Corp Product correlation analysis using deep learning
US11232687B2 (en) 2017-08-07 2022-01-25 Standard Cognition, Corp Deep learning-based shopper statuses in a cashier-less store
US11200692B2 (en) 2017-08-07 2021-12-14 Standard Cognition, Corp Systems and methods to check-in shoppers in a cashier-less store
US20190079591A1 (en) 2017-09-14 2019-03-14 Grabango Co. System and method for human gesture processing from video input
US10963704B2 (en) 2017-10-16 2021-03-30 Grabango Co. Multiple-factor verification for vision-based systems
US11481805B2 (en) 2018-01-03 2022-10-25 Grabango Co. Marketing and couponing in a retail environment using computer vision
US10956777B1 (en) 2019-10-25 2021-03-23 7-Eleven, Inc. Shelf position calibration in a global coordinate system using a sensor array
US11004219B1 (en) 2019-10-25 2021-05-11 7-Eleven, Inc. Vector-based object re-identification during image tracking
US11188763B2 (en) 2019-10-25 2021-11-30 7-Eleven, Inc. Topview object tracking using a sensor array
US11367124B2 (en) 2019-10-25 2022-06-21 7-Eleven, Inc. Detecting and identifying misplaced items using a sensor array
US11176686B2 (en) 2019-10-25 2021-11-16 7-Eleven, Inc. Image-based action detection using contour dilation
US11288518B2 (en) 2019-10-25 2022-03-29 7-Eleven, Inc. Tracking positions using a scalable position tracking system
US10789720B1 (en) 2019-10-25 2020-09-29 7-Eleven, Inc. Multi-camera image tracking on a global plane
US10878585B1 (en) 2019-10-25 2020-12-29 7-Eleven, Inc. Sensor array for scalable position tracking system
US11080529B2 (en) 2019-10-25 2021-08-03 7-Eleven, Inc. Determining candidate object identities during image tracking
US11062147B2 (en) 2019-10-25 2021-07-13 7-Eleven, Inc. Object assignment during image tracking
US10943287B1 (en) 2019-10-25 2021-03-09 7-Eleven, Inc. Topview item tracking using a sensor array
US10614318B1 (en) 2019-10-25 2020-04-07 7-Eleven, Inc. Sensor mapping to a global coordinate system using a marker grid
US11113837B2 (en) 2019-10-25 2021-09-07 7-Eleven, Inc. Sensor mapping to a global coordinate system
US11107226B2 (en) 2019-10-25 2021-08-31 7-Eleven, Inc. Object re-identification during image tracking
US11430046B2 (en) 2019-10-25 2022-08-30 7-Eleven, Inc. Identifying non-uniform weight objects using a sensor array
US11568554B2 (en) 2019-10-25 2023-01-31 7-Eleven, Inc. Contour-based detection of closely spaced objects
US10783762B1 (en) 2019-10-25 2020-09-22 7-Eleven, Inc. Custom rack for scalable position tracking system
US11308630B2 (en) 2019-10-25 2022-04-19 7-Eleven, Inc. Auto-exclusion zone for contour-based object detection
US10621444B1 (en) 2019-10-25 2020-04-14 7-Eleven, Inc. Action detection during image tracking
US11132550B2 (en) 2019-10-25 2021-09-28 7-Eleven, Inc. Detecting shelf interactions using a sensor array
US11030756B2 (en) 2018-10-26 2021-06-08 7-Eleven, Inc. System and method for position tracking using edge computing
US11257225B2 (en) 2019-10-25 2022-02-22 7-Eleven, Inc. Sensor mapping to a global coordinate system using homography
US10769450B1 (en) 2019-10-25 2020-09-08 7-Eleven, Inc. Tracking positions using a scalable position tracking system
US10885642B1 (en) 2019-10-25 2021-01-05 7-Eleven, Inc. Scalable position tracking system for tracking position in large spaces
US11288648B2 (en) 2018-10-29 2022-03-29 Grabango Co. Commerce automation for a fueling station
US11507933B2 (en) 2019-03-01 2022-11-22 Grabango Co. Cashier interface for linking customers to virtual data
US11232575B2 (en) 2019-04-18 2022-01-25 Standard Cognition, Corp Systems and methods for deep learning-based subject persistence
DE102019209463A1 (en) * 2019-06-27 2020-12-31 Robert Bosch Gmbh Method for determining the trust value of an object of a class
CN110503092B (en) * 2019-07-22 2023-07-14 天津科技大学 Improved SSD monitoring video target detection method based on field adaptation
CN110705274B (en) * 2019-09-06 2023-03-24 电子科技大学 Fusion type word meaning embedding method based on real-time learning
JP6982259B2 (en) * 2019-09-19 2021-12-17 キヤノンマーケティングジャパン株式会社 Information processing equipment, information processing methods, programs
CN112560834B (en) * 2019-09-26 2024-05-10 武汉金山办公软件有限公司 Coordinate prediction model generation method and device and pattern recognition method and device
US11893759B2 (en) 2019-10-24 2024-02-06 7-Eleven, Inc. Homography error correction using a disparity mapping
WO2021081297A1 (en) 2019-10-25 2021-04-29 7-Eleven, Inc. Action detection during image tracking
US11017229B2 (en) 2019-10-25 2021-05-25 7-Eleven, Inc. System and method for selectively verifying algorithmically populated shopping carts
US11887372B2 (en) 2019-10-25 2024-01-30 7-Eleven, Inc. Image-based self-serve beverage detection and assignment
US11587243B2 (en) 2019-10-25 2023-02-21 7-Eleven, Inc. System and method for position tracking using edge computing
US11023728B1 (en) 2019-10-25 2021-06-01 7-Eleven, Inc. Machine learning algorithm trained to identify algorithmically populated shopping carts as candidates for verification
US11341569B2 (en) 2019-10-25 2022-05-24 7-Eleven, Inc. System and method for populating a virtual shopping cart based on video of a customer's shopping session at a physical store
US12062191B2 (en) 2019-10-25 2024-08-13 7-Eleven, Inc. Food detection using a sensor array
US11386647B2 (en) 2019-10-25 2022-07-12 7-Eleven, Inc. System and method for processing a refund request arising from a shopping session in a cashierless store
US10607080B1 (en) 2019-10-25 2020-03-31 7-Eleven, Inc. Feedback and training for a machine learning algorithm configured to determine customer purchases during a shopping session at a physical store
US10922555B1 (en) 2019-10-25 2021-02-16 7-Eleven, Inc. Customer-based video feed
US11551454B2 (en) 2019-10-25 2023-01-10 7-Eleven, Inc. Homography error correction using marker locations
US11501454B2 (en) 2019-10-25 2022-11-15 7-Eleven, Inc. Mapping wireless weight sensor array for item detection and identification
US11674792B2 (en) 2019-10-25 2023-06-13 7-Eleven, Inc. Sensor array with adjustable camera positions
US11113541B2 (en) 2019-10-25 2021-09-07 7-Eleven, Inc. Detection of object removal and replacement from a shelf
US11003918B1 (en) 2019-10-25 2021-05-11 7-Eleven, Inc. Event trigger based on region-of-interest near hand-shelf interaction
US11798065B2 (en) 2019-10-25 2023-10-24 7-Eleven, Inc. Tool for generating a virtual store that emulates a physical store
US11887337B2 (en) 2019-10-25 2024-01-30 7-Eleven, Inc. Reconfigurable sensor array
US11403852B2 (en) 2019-10-25 2022-08-02 7-Eleven, Inc. Object detection based on wrist-area region-of-interest
US11380091B2 (en) 2019-10-25 2022-07-05 7-Eleven, Inc. System and method for populating a virtual shopping cart based on a verification of algorithmic determinations of items selected during a shopping session in a physical store
US11023741B1 (en) 2019-10-25 2021-06-01 7-Eleven, Inc. Draw wire encoder based homography
US10861085B1 (en) 2019-10-25 2020-12-08 7-Eleven, Inc. Apparatus, system and method for populating a virtual shopping cart based on video of a customers shopping session at a physical store
US11100717B2 (en) 2019-10-25 2021-08-24 7-Eleven, Inc. System and method for presenting a virtual store shelf that emulates a physical store shelf
US11450011B2 (en) 2019-10-25 2022-09-20 7-Eleven, Inc. Adaptive item counting algorithm for weight sensor using sensitivity analysis of the weight sensor
US11557124B2 (en) 2019-10-25 2023-01-17 7-Eleven, Inc. Homography error correction
US11023740B2 (en) 2019-10-25 2021-06-01 7-Eleven, Inc. System and method for providing machine-generated tickets to facilitate tracking
US11893757B2 (en) 2019-10-25 2024-02-06 7-Eleven, Inc. Self-serve beverage detection and assignment
TWI703837B (en) * 2019-11-29 2020-09-01 中華電信股份有限公司 Management system and management method for managing cloud data center
TWI769420B (en) * 2019-12-11 2022-07-01 財團法人工業技術研究院 Intelligent planogram producing method and system
JP7370845B2 (en) * 2019-12-17 2023-10-30 東芝テック株式会社 Sales management device and its control program
WO2021150976A1 (en) * 2020-01-24 2021-07-29 Synchrony Bank Systems and methods for machine vision based object recognition
US11651516B2 (en) 2020-02-20 2023-05-16 Sony Group Corporation Multiple view triangulation with improved robustness to observation errors
JP7400533B2 (en) * 2020-02-26 2023-12-19 株式会社村田製作所 Composite carbon material and its manufacturing method, slurry for electrode production, electrode coating film, and lithium ion secondary battery
JP7400532B2 (en) * 2020-02-26 2023-12-19 株式会社村田製作所 Composite carbon material and its manufacturing method, negative electrode active material for lithium ion secondary batteries, and lithium ion secondary batteries
EP3882857A1 (en) 2020-03-19 2021-09-22 Sony Group Corporation Extrinsic calibration of multi-camera system
CN111461141B (en) * 2020-03-30 2023-08-29 歌尔科技有限公司 Equipment pose calculating method and device
US20230169506A1 (en) * 2020-05-12 2023-06-01 Nec Corporation Store system, information processing apparatus, and information processing method
US11361468B2 (en) 2020-06-26 2022-06-14 Standard Cognition, Corp. Systems and methods for automated recalibration of sensors for autonomous checkout
US11303853B2 (en) 2020-06-26 2022-04-12 Standard Cognition, Corp. Systems and methods for automated design of camera placement and cameras arrangements for autonomous checkout
CN114120083A (en) * 2020-08-12 2022-03-01 东芝泰格有限公司 Image recognition device, storage medium, and image recognition method
CN112508962A (en) * 2020-09-27 2021-03-16 绍兴文理学院 Target image region subsequence separation method based on time correlation image sequence
CN112767252B (en) * 2021-01-26 2022-08-02 电子科技大学 Image super-resolution reconstruction method based on convolutional neural network
EP4309120A4 (en) * 2021-03-17 2024-07-24 Vispera Bilgi Teknolojileri Sanayi Ic Ve Dis Ticar A S System and method of monitoring units in a cabinet
CN117916742A (en) * 2021-06-17 2024-04-19 Abb瑞士股份有限公司 Robot system and method for updating training of neural networks based on neural network output
JP2023031618A (en) * 2021-08-25 2023-03-09 株式会社東芝 Apparatus, method and system for managing article, and program
FR3128048A1 (en) * 2021-10-13 2023-04-14 Mo-Ka Intelligent automatic payment terminal
JP2023064427A (en) * 2021-10-26 2023-05-11 富士通株式会社 Inference program, learning program, inference method, and learning method
CN114444895A (en) * 2021-12-31 2022-05-06 深圳云天励飞技术股份有限公司 Cleaning quality evaluation method and related equipment
JP7276535B1 (en) * 2022-02-22 2023-05-18 富士通株式会社 Information processing program, information processing method, and information processing apparatus
CN114842045B (en) * 2022-04-01 2024-04-16 深圳市九天睿芯科技有限公司 Target tracking method and device
WO2024054784A1 (en) * 2022-09-08 2024-03-14 Digimarc Corporation Image analysis methods and arrangements
CN116403269B (en) * 2023-05-17 2024-03-26 智慧眼科技股份有限公司 Method, system, equipment and computer storage medium for analyzing occlusion human face
CN118397568B (en) * 2024-06-26 2024-09-03 汉朔科技股份有限公司 Abnormal shopping behavior detection method, device and system of shopping cart and shopping cart

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080159634A1 (en) * 2006-12-30 2008-07-03 Rajeev Sharma Method and system for automatically analyzing categories in a physical space based on the visual characterization of people
US20120159290A1 (en) * 2010-12-17 2012-06-21 Microsoft Corporation Validation analysis of human target
TW201228380A (en) * 2010-12-16 2012-07-01 Microsoft Corp Comprehension and intent-based content for augmented reality displays
US20130156260A1 (en) * 2011-12-15 2013-06-20 Microsoft Corporation Problem states for pose tracking pipeline

Family Cites Families (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3584334B2 (en) * 1997-12-05 2004-11-04 オムロン株式会社 Human detection tracking system and human detection tracking method
US7058204B2 (en) 2000-10-03 2006-06-06 Gesturetek, Inc. Multiple camera control system
US6678413B1 (en) * 2000-11-24 2004-01-13 Yiqing Liang System and method for object identification and behavior characterization using video analysis
US7688349B2 (en) * 2001-12-07 2010-03-30 International Business Machines Corporation Method of detecting and tracking groups of people
JP2004171240A (en) * 2002-11-20 2004-06-17 Casio Comput Co Ltd Illegality monitoring system and program
JP2004171241A (en) * 2002-11-20 2004-06-17 Casio Comput Co Ltd Illegality monitoring system and program
JP4125634B2 (en) 2003-05-26 2008-07-30 Necソフト株式会社 Customer information collection management method and system
KR100519782B1 (en) * 2004-03-04 2005-10-07 삼성전자주식회사 Method and apparatus for detecting people using a stereo camera
JP2006126926A (en) * 2004-10-26 2006-05-18 Canon Inc Customer data generation system, program and method
US20090041297A1 (en) * 2005-05-31 2009-02-12 Objectvideo, Inc. Human detection and tracking for security applications
GB0617604D0 (en) * 2006-09-07 2006-10-18 Zeroshift Ltd Inventory control system
JP2008176504A (en) * 2007-01-17 2008-07-31 Toshiba Corp Object detector and method therefor
US7961946B2 (en) 2007-05-15 2011-06-14 Digisensory Technologies Pty Ltd Method and system for background estimation in localization and tracking of objects in a smart video camera
US20090217315A1 (en) * 2008-02-26 2009-08-27 Cognovision Solutions Inc. Method and system for audience measurement and targeting media
JP4970195B2 (en) 2007-08-23 2012-07-04 株式会社日立国際電気 Person tracking system, person tracking apparatus, and person tracking program
US8098888B1 (en) 2008-01-28 2012-01-17 Videomining Corporation Method and system for automatic analysis of the trip of people in a retail space using multiple cameras
DE102008007199A1 (en) * 2008-02-01 2009-08-06 Robert Bosch Gmbh Masking module for a video surveillance system, method for masking selected objects and computer program
JP4585580B2 (en) * 2008-04-24 2010-11-24 東芝テック株式会社 Human flow tracking system
JP2010002997A (en) 2008-06-18 2010-01-07 Toshiba Tec Corp Personal behavior analysis apparatus and personal behavior analysis program
US8009863B1 (en) * 2008-06-30 2011-08-30 Videomining Corporation Method and system for analyzing shopping behavior using multiple sensor tracking
US8180107B2 (en) * 2009-02-13 2012-05-15 Sri International Active coordinated tracking for multi-camera systems
JP2011209966A (en) * 2010-03-29 2011-10-20 Sony Corp Image processing apparatus and method, and program
US20110320322A1 (en) * 2010-06-25 2011-12-29 Symbol Technologies, Inc. Inventory monitoring using complementary modes for item identification
JP2012123667A (en) 2010-12-09 2012-06-28 Panasonic Corp Attitude estimation device and attitude estimation method
JP2012147083A (en) * 2011-01-07 2012-08-02 Sony Corp Image processor, image processing method, and program
US8890357B2 (en) 2011-01-17 2014-11-18 Balboa Water Group, Inc. Bathing system transformer device with first and second low voltage output power connections
EP2707834B1 (en) 2011-05-13 2020-06-24 Vizrt Ag Silhouette-based pose estimation
US9177195B2 (en) * 2011-09-23 2015-11-03 Shoppertrak Rct Corporation System and method for detecting, tracking and counting human objects of interest using a counting system and a data capture device
US9247211B2 (en) 2012-01-17 2016-01-26 Avigilon Fortress Corporation System and method for video content analysis using depth sensing
US9124778B1 (en) 2012-08-29 2015-09-01 Nomi Corporation Apparatuses and methods for disparity-based tracking and analysis of objects in a region of interest
JP2014154037A (en) 2013-02-12 2014-08-25 Toshiba Tec Corp Image reader
US10268983B2 (en) * 2013-06-26 2019-04-23 Amazon Technologies, Inc. Detecting item interaction and movement
US10176456B2 (en) * 2013-06-26 2019-01-08 Amazon Technologies, Inc. Transitioning items from a materials handling facility
JP5632512B1 (en) * 2013-07-02 2014-11-26 パナソニック株式会社 Human behavior analysis device, human behavior analysis system, human behavior analysis method, and monitoring device
US10290031B2 (en) * 2013-07-24 2019-05-14 Gregorio Reid Method and system for automated retail checkout using context recognition
JP5834196B2 (en) * 2014-02-05 2015-12-16 パナソニックIpマネジメント株式会社 MONITORING DEVICE, MONITORING SYSTEM, AND MONITORING METHOD
US9262681B1 (en) * 2014-11-19 2016-02-16 Amazon Technologies, Inc. Probabilistic registration of interactions, actions or activities from multiple views
JP2016144049A (en) * 2015-02-02 2016-08-08 キヤノン株式会社 Image processing apparatus, image processing method, and program
JP5988225B2 (en) * 2015-02-25 2016-09-07 パナソニックIpマネジメント株式会社 Monitoring device and monitoring method
GB2542118B (en) * 2015-09-04 2021-05-19 Toshiba Europe Ltd A method, apparatus, system, and computer readable medium for detecting change to a structure
JP6590609B2 (en) * 2015-09-15 2019-10-16 キヤノン株式会社 Image analysis apparatus and image analysis method
US10134146B2 (en) * 2016-01-14 2018-11-20 RetailNext, Inc. Detecting, tracking and counting objects in videos

Also Published As

Publication number Publication date
WO2019032305A2 (en) 2019-02-14
EP3665648A2 (en) 2020-06-17
TW201911119A (en) 2019-03-16
JP7208974B2 (en) 2023-01-19
WO2019032307A1 (en) 2019-02-14
CA3072063A1 (en) 2019-02-14
WO2019032306A1 (en) 2019-02-14
WO2019032305A3 (en) 2019-03-21
EP3665649A1 (en) 2020-06-17
EP3665647A4 (en) 2021-01-06
JP2020530168A (en) 2020-10-15
JP7181922B2 (en) 2022-12-01
CA3072062A1 (en) 2019-02-14
JP2020530167A (en) 2020-10-15
EP3665648A4 (en) 2020-12-30
EP3665647A1 (en) 2020-06-17
WO2019032304A1 (en) 2019-02-14
EP3665649A4 (en) 2021-01-06
EP3665615A4 (en) 2020-12-30
WO2019032306A9 (en) 2020-03-19
EP3665615A1 (en) 2020-06-17
CA3072058A1 (en) 2019-02-14
JP2020530170A (en) 2020-10-15
JP2021503636A (en) 2021-02-12
CA3072056A1 (en) 2019-02-14
JP7228569B2 (en) 2023-02-24
JP7191088B2 (en) 2022-12-16

Similar Documents

Publication Publication Date Title
TWI773797B (en) System, method and computer program product for tracking multi-joint subjects in an area of real space
US12026665B2 (en) Identifying inventory items using multiple confidence levels
US10127438B1 (en) Predicting inventory events using semantic diffing
US10133933B1 (en) Item put and take detection using image recognition
US10055853B1 (en) Subject identification and tracking using image recognition
US11544866B2 (en) Directional impression analysis using deep learning
US20220147913A1 (en) Inventory tracking system and method that identifies gestures of subjects holding inventory items
US20200074432A1 (en) Deep learning-based actionable digital receipts for cashier-less checkout
WO2020023926A1 (en) Directional impression analysis using deep learning
US20210350555A1 (en) Systems and methods for detecting proximity events
US20240320622A1 (en) Identification and tracking of inventory items in a shopping store based on multiple confidence levels