WO2022047662A1 - Method and system of neural network object recognition for warpable jerseys with multiple attributes - Google Patents

Method and system of neural network object recognition for warpable jerseys with multiple attributes

Info

Publication number
WO2022047662A1
Authority
WO
WIPO (PCT)
Prior art keywords
images
clothing
warped
identifiers
attribute
Prior art date
Application number
PCT/CN2020/113012
Other languages
French (fr)
Inventor
Hang Zheng
Qiang Li
Longwei FANG
Wenlong Li
Original Assignee
Intel Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corporation filed Critical Intel Corporation
Priority to PCT/CN2020/113012 priority Critical patent/WO2022047662A1/en
Publication of WO2022047662A1 publication Critical patent/WO2022047662A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Definitions

  • FIG. 1 is an image of an athlete wearing an example jersey with a jersey number
  • FIG. 2 is another image of an athlete wearing another example jersey with a jersey number
  • FIG. 3 is a flow chart of a method of neural network object recognition of warpable jerseys with multiple attributes according to at least one of the implementations herein;
  • FIG. 4 is a schematic diagram of an image processing device according to at least one of the implementations herein;
  • FIG. 5 is a detailed flow chart of a method of neural network object recognition of warpable jerseys with multiple attributes according to at least one of the implementations herein;
  • FIG. 6 is a schematic flow chart showing jersey synthesis according to at least one of the implementations herein;
  • FIG. 7A is a schematic flow chart showing postural feature extraction according to at least one of the implementations herein;
  • FIG. 7B is a schematic flow chart showing base jersey or clothing feature extraction according to at least one of the implementations herein;
  • FIG. 8 is a schematic diagram of a neural network used with the feature extraction of FIGS. 7A-7B and according to at least one of the implementations herein;
  • FIG. 9 is a schematic flow chart of the image processing system according to at least one of the implementations herein;
  • FIG. 10 is a flow chart showing a neural network process according to FIG. 9 and according to at least one of the implementations herein;
  • FIG. 11A is a schematic diagram of a warping grid according to at least one of the implementations herein;
  • FIG. 11B is a schematic diagram of a warping grid after warping and according to at least one of the implementations herein;
  • FIG. 12 is an illustrative diagram of a warping algorithm according to at least one of the implementations herein;
  • FIG. 13 is an image of an actual jersey to be recognized and used as input to a neural network
  • FIG. 14 is an image of a warped jersey used in the training dataset to train the neural network to recognize the jersey of FIG. 13;
  • FIG. 15 is an image of an actual jersey to be recognized and used as input to a neural network
  • FIG. 16 is an image of a warped jersey used in the training dataset to train the neural network to recognize the jersey of FIG. 15;
  • FIG. 17 is an image of an actual jersey to be recognized and used as input to a neural network
  • FIG. 18 is an image of a warped jersey used in the training dataset to train the neural network to recognize the jersey of FIG. 17;
  • FIG. 19 is an illustrative diagram of an example system
  • FIG. 20 is an illustrative diagram of another example system.
  • FIG. 21 illustrates another example device, all arranged in accordance with at least some implementations of the present disclosure.
  • various architectures employing, for example, multiple integrated circuit (IC) chips and/or packages, and/or various computing devices, professional electronic devices such as one or more television cameras, video cameras, or camera arrays that surround an event to be recorded by the cameras, servers, laptops, desk tops, computer networks, and/or consumer electronic (CE) devices such as imaging devices, digital cameras, smart phones, webcams, video cameras, video game panels or consoles, televisions, set top boxes, and so forth, may implement the techniques and/or arrangements described herein, and whether part of a single camera or multi-camera system.
  • a machine-readable medium may include any medium and/or mechanism for storing or transmitting information in a form readable by a machine (for example, a computing device) .
  • a machine-readable medium may include read-only memory (ROM) ; random access memory (RAM) ; magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, and so forth) , and others.
  • a non-transitory article such as a non-transitory computer readable medium, may be used with any of the examples mentioned above or other examples except that it does not include a transitory signal per se. It does include those elements other than a signal per se that may hold data temporarily in a “transitory” fashion such as RAM and so forth.
  • One conventional multi-camera system in stadiums captures high-resolution videos and employs segmentation and 3D reconstruction of a scene from images captured by the cameras to create a volumetric model.
  • These systems use athlete and ball tracking technology to extract a 3D position of an athlete and ball so that a virtual camera (vCam) perspective can follow the athlete and ball to provide a compelling user experience.
  • these systems are quite inaccurate when the input, such as the jersey number, jersey color, or jersey pattern, is significantly different from the training dataset, or when the jersey is deformed (or shaped) by gravity, friction, and so forth, and by the athlete's motion, such that the deformed jersey appears sufficiently different in an image that the neural network cannot recognize the jersey number.
  • warp herein is used as both intentional synthetic deforming of clothing as well as the natural shaping of the clothing depending on the context.
  • jersey sample 100 has a stripe style 102 with a red diagonal stripe while jersey sample 200 has a specific digit font 202 that has a thickened boundary on the numbers.
  • the conventional object recognition systems often fail to recognize the jersey numbers. This is especially true when a neural network system is trained with the same data set for multiple sports.
  • the network may be used for NFL, NBA and soccer for example, where each team in the sport has a different jersey number style, and where some of the styles are mixed with the digit itself such as with a stripe style and color configuration of the jersey itself.
  • augmentation is often an effective technique to increase both the amount and diversity of data by randomly adopting some augmentation policies.
  • common augmentation techniques include providing modifications to the training data set such as changing input images in contrast, size, rotation, shearing, flipping horizontally, translation, and so forth to add modified images to an existing training dataset.
  • a search space may be arranged where a policy consists of many sub-policies, one of which is randomly chosen for each image in mini-batches.
  • a sub-policy consists of two operations, each operation being an image processing function such as translation, rotation, or shearing, and the probabilities and magnitudes with which the functions are applied. Some of these techniques use a search algorithm to find the best policy to construct the combinations of those data augmentations.
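  • For illustration only, a minimal Python sketch of such a sub-policy is shown below; the operation names, probabilities, and magnitudes are hypothetical, and torchvision is assumed for the underlying image transforms.

```python
import random
import torchvision.transforms.functional as TF

# One sub-policy = two (operation, probability, magnitude) entries, as described above.
# These particular operations and values are illustrative placeholders only.
SUB_POLICY = [
    ("rotate", 0.7, 15.0),    # rotate by 15 degrees with probability 0.7
    ("shear_x", 0.4, 10.0),   # shear along x by 10 degrees with probability 0.4
]

def apply_sub_policy(img, sub_policy=SUB_POLICY):
    """Applies each operation of a sub-policy to a PIL image or tensor,
    honoring its probability and magnitude."""
    for op, prob, magnitude in sub_policy:
        if random.random() > prob:
            continue                    # skip this operation for this image
        if op == "rotate":
            img = TF.rotate(img, angle=magnitude)
        elif op == "shear_x":
            img = TF.affine(img, angle=0.0, translate=[0, 0], scale=1.0,
                            shear=[magnitude, 0.0])
    return img
```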
  • the disclosed system and method automatically generate synthetically warped synthesized jersey (or clothing) data as part of its training dataset.
  • synthesized images of the jerseys or clothing are created by combining different available combinations of attributes.
  • pose or warp data of many different standard poses that would warp clothing worn by a person are collected as well.
  • the present method disclosed herein both transforms such a target jersey into a most fitting shape of a pose of a person or athlete and preserves spatial alignment details.
  • the present system and method can fully express the characteristics of real-world athletes by tracking visible jersey number images.
  • the synthetic data can be used in jersey number recognition training in an end-to-end manner such that resulting constructed warped jersey or clothing images can be directly used as input for training a jersey number (or clothing identifier) object recognition neural network.
  • model generalization capability may be significantly improved for many sports and other situations that can use synthetic warped clothing or warped object input for example.
  • process 300 characterizes the presently disclosed method and is a computer-implemented method of image processing, and by one specific example, a process of neural network object recognition of warpable jerseys or clothing with multiple attributes.
  • process 300 may include one or more operations, functions or actions as illustrated by one or more of operations 302 to 306 numbered evenly.
  • process 300 may be described herein with reference to example image processing systems 400 and 1900, 2000, or 2100 of FIGS. 4 and 19-21 respectively, and where relevant.
  • Process 300 may include “obtain image data of clothing having varying visible attributes and visible identifiers on the outer surface of the clothing” 302. This may include obtaining images of clothing and clothing attributes, such as athletic jerseys, as well as images of a person in different poses wearing clothing, such as upper-body clothes (shirts or jerseys) or even pants when pants are being analyzed instead, that show the deforming or warping of the clothes in different poses of the person. This operation also separately includes obtaining image data on the attributes of the clothing or jerseys, which may include a clothing or shirt template that shows a shape of the clothing or jersey, a visible pattern, a jersey number font, a jersey color scheme, and so forth.
  • the attributes may be in the form of separate images or simply a database or list depending on the type of attribute.
  • each image may provide one attribute on a jersey with a particular template shape, number font, pattern (such as one or more stripes), or color scheme, although any combination of them may be provided, and other attributes could be provided as well.
  • This may be in the form of a collection or database for each type of attribute.
  • Process 300 may include “synthetically warp a version of the image data to form warped images of synthetically warped clothing” 304. This involves a number of operations including synthesizing the attribute image data to form base or attribute (or standard) images using the obtained image data in all or many possible attribute combinations.
  • the attribute images may have the picture of the jersey or clothing in a flat or forward-facing position, shown directly facing the image plane of a camera, and with or without 3D shape.
  • masks may be formed of the clothing in the pose images. Images also may be formed that show key-point locations that define the pose of the clothing, such as five key-points in one example, including key-points at the neck, shoulders, and hips.
  • the key-points and masks are collectively referred to as target pose (or postural) warped representations (TPWR images or just pose image data or pose data) .
  • features are extracted from the images and that represent the attributes and warped pose combinations.
  • This may involve an attribute neural network that provides features (or feature maps) by using the attribute images as input alone, and another pose plus attribute neural network that receives both the attribute images and the TPWR images and outputs pose- or warp-related feature maps that are now also associated with the attribute images.
  • the TPWR images provide a more realistic or natural shape of the jerseys or clothing. Since the feature extraction networks can leverage semantic features, such as jersey numbers or other identifiers with desired fonts, as well as jerseys or clothing with patterns, color schemes, and so forth from existing datasets in order to form the base images, the generated images can have a natural texture very similar to real images.
  • This network outputs spatial transformation parameters, each of which represents the spatial transformation from a flat or forward-facing attribute image of the clothing or jersey into one of the poses.
  • a warping operation then applies the spatial transformation representations (STRs) or parameters on the attribute images to perform synthetic warping and generate an image of realistically warped clothing or jersey.
  • the resulting warped images can be evaluated by using a loss function to compare the warped image to ground truth.
  • a combined loss function weights both a structural similarity index (SSIM) type of normalization and another normalization, such as L1.
  • Process 300 may include “train a neural network to recognize the identifiers by using the warped images” 306. Once generated, the warped images may be used directly as input to an object detection neural network to train the neural network to recognize jersey numbers or other identifiers on warpable clothing, thereby providing an end-to-end solution that significantly increases the generalization accuracy for such a network.
  • an image processing system or device 400 performs the synthetically warped image training dataset generation described herein.
  • device 400 may be a postural warped synthetic jersey image generation device or system, or simply object detection training dataset generation system, unit, or device.
  • the device 400 may include two functional units separated by the dashed line where the left side is a jersey or clothing synthesis section 402 and the right side is generally and collectively referred to as a postural warping transform network 404 although the network 404 may have a number of separate neural networks.
  • the synthesis section 402 includes a target postural warping representation (TPWR) unit 408 that has a key-point generator 422 to indicate key-points 421 on pose images 423 of a person wearing clothing.
  • the key-points represent the pose (or warp) of the person, and in turn the clothing, in the pose image.
  • a pose masking unit 424 generates a mask 419 showing the boundary of the clothes.
  • the key-points 421 and mask 419 form pose image data (or just pose data) 425 that is provided from the TPWR 408 for feature extraction as explained below.
  • An attribute jersey (or clothing) synthesis unit 406 has an attribute synthesizer unit 410 that generates attribute images 420 where each or individual attribute images 420 have different combinations of attributes.
  • the attributes may include a shirt template 412 showing the clothing shape in addition to a pattern image or data (such as stripes) 414, jersey number font 416 referring to the shape, style, and size of a jersey number, and color scheme 418, but many other attributes could be tracked as well.
  • the pose data 425 and attribute images 420 are provided to the postural warping transform network 404 and specifically a feature extraction unit 426.
  • the feature extraction unit 426 has a pose feature extraction unit 428 that receives both the attribute images and the pose data to generate representative features while an attribute feature extraction unit 430 generates features more directly related to just the attributes including the semantic attribute (such as a jersey number or other identifier) .
  • the features may be provided to a spatial transform regression unit 432 that first combines (or correlates or matches) the pose data in the form of the pose features and the attribute image features by a feature correlation network 434.
  • a transform regression unit 436 then performs a regression to output spatial transform parameters that indicate a transform from a flat or front-facing view of the clothing or jersey, as in the attribute image, to one of the poses.
  • a warping unit 438 then applies the spatial transform parameters to the attribute images using a warping algorithm to generate warped images.
  • a loss evaluation function unit 440 can then evaluate whether the warped images meet a criterion. If met, the warped image can be used directly as an input image to train a clothing or jersey (or jersey number) object recognition neural network. The details of the operation of system 400 are described with process 500 below.
  • process 500 characterizes the presently disclosed method and is a computer-implemented method of image processing, and by one specific example, a process of neural network object recognition of warpable jerseys or clothing with multiple attributes.
  • process 500 may include one or more operations, functions or actions as illustrated by one or more of operations 502 to 512 numbered evenly.
  • process 500 may be described herein with reference to example image processing systems 400 and 1900, 2000, or 2100 of FIGS. 4 and 19-21 respectively, and where relevant.
  • the example objects being detected in the methods provided herein are athletic jerseys with an identifier that is a one or two digit jersey number that identifies the player wearing the jersey, such as 0 to 9 or 00 to 99.
  • the identifiers could be any other one or more numbers, letters, or symbols on any clothing, whether upper-body clothing (pull-over, button-down, or zippered, for example) or lower-body clothing such as pants or shorts, and so forth, that shifts or warps (also referred to as deforming or distorting) such that the shape of the clothing, and in turn the identifier, changes depending on the pose and/or motion of the person wearing the clothing.
  • the identifiers could include pictures or logos specific to a person wearing the identifier.
  • a player number is used as the identifier when the event being captured is a team sport such as soccer, American football, baseball, basketball, hockey, rugby, cricket, or any individual sport or event where numbers are worn by participants on warpable clothing or materials such as racing, running whether track or cross-country running, marathons, triathlons, biking, skiing, organized walks, and so forth.
  • the player number is usually large relative to the player and other objects on the event field, making it relatively easy for an object detection program to detect and recognize.
  • the pose and attribute data may be obtained from pre-formed databases of customized collections and/or standard known collections.
  • cameras may be used to capture the images of poses or attributes in clothing or jerseys mentioned herein, and in this case, may be from a single camera or a multi-camera system used to record events, such as at athletic events, and may capture athletes in images from a number of different perspectives. The images then can be used to create images from virtual camera viewpoints permitting the athletes or events on the field to be viewed from just about any desired angle.
  • 30, 36, or 38 cameras have been known to be mounted around a field at a stadium, arena, or other venue that holds an event that is, or can be, viewed on still photo or video displays, such as live television or later on recorded videos.
  • the cameras may provide 5K resolution images.
  • while image data from camera arrays may be used to match images and form 3D models or 3D spaces to generate virtual views, such 3D reconstruction is not always necessary for the jersey or clothing object detection itself described herein (although such jersey detection subsequently may be used for such 3D reconstruction).
  • 2D images of the clothing, attributes, identifier, and poses are adequate for the generation of the training dataset as well as the actual identifier object detection.
  • process 500 may include “generate jersey image data by synthesizing jersey attributes and poses” 502.
  • this involves having the jersey synthesis section 402, or here 600, obtaining images or descriptions of clothing attributes such as a jersey or other clothing template 412 to obtain the shape of the jersey.
  • This may be an image of 2D data that shows a flat 2D jersey or may be a 3D picture or model of the jersey in a full frontal pose albeit still provided in 2D image data.
  • the images may be obtained from a database of “clean” templates that only show available jersey (or clothing) style shapes without any other texture details and may be provided in a single uniform color such as white or some gray-scale value.
  • the template images may be obtained from anywhere including the internet, actual athletic events, and so forth, where only the boundary of the clothing or jersey is kept and any other details in the obtained image are ignored.
  • the template images could be collected manually by a person viewing the images or by known automatic object detection algorithms, or otherwise, known libraries or databases of such images may be used, and may be collected from known sports organizations.
  • a collection of pattern images 414 for the jerseys or clothing may be obtained in the same way, where checks, stripes, and plain (or no) patterns are shown here.
  • the patterns may be formed by the fabric or texture of the clothing or jersey or may be a design overlaid onto the clothing or jersey. These pattern images 414 may or may not include the shape of the jersey on the images.
  • a collection of font images 416 also may be obtained and include image data of images of fonts showing pictures of different available identifiers including numbers (or alternatively letters, symbols, and so forth when desired) . Each number or identifier may be provided in a different font or style.
  • Another attribute here is a color scheme which may or may not be provided as images. Such color data could be provided simply as a list of available RGB color combination values and which parts of the jersey or clothing are to have which color, such as blue for the fabric of the jersey and gray for the identifier number on the jersey. Many different color schemes on the clothing are possible as desired.
  • the attribute synthesizer unit 410 then combines all of the attributes to generate attribute images 420 of synthesized jerseys or clothing, also referred to as feature extraction inputs.
  • the attribute synthesizer unit 410 creates a different combination of attributes for each attribute image, and the images are 2D flat or full forward views (also referred to as standard) images of the jersey or clothing.
  • the attribute images 420 therefore, provide a base or start position for the warping described below as well as provide the input for feature extraction.
  • 69 stylized digit fonts, eight different stripe styles, and 18 pre-determined color schemes are used to form the attribute images.
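  • As a rough illustration of this combinatorial synthesis, the sketch below enumerates every available attribute combination; the function signature and the render callback are assumptions for illustration, not part of the disclosure.

```python
from itertools import product
from typing import Any, Callable, Iterable, List

def synthesize_attribute_images(
    templates: Iterable[Any],       # clean jersey-shape templates
    patterns: Iterable[Any],        # stripes, checks, plain, ...
    fonts: Iterable[Any],           # stylized digit fonts
    color_schemes: Iterable[Any],   # e.g., (fabric color, identifier color) pairs
    numbers: Iterable[int],         # e.g., jersey numbers 0-99
    render: Callable[..., Any],     # assumed renderer producing one flat 2D image
) -> List[Any]:
    """Renders one flat, forward-facing 'standard' attribute image per
    combination of template, pattern, font, color scheme, and number."""
    images = []
    for template, pattern, font, colors, number in product(
            templates, patterns, fonts, color_schemes, numbers):
        images.append(render(template, pattern, font, colors, number))
    return images
```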
  • the present process 500 factors in the natural and realistic look of the clothing or jerseys when a person wearing the jersey or clothing is in different poses.
  • a change from a flat, full forward view of the clothing to a natural or realistic look of the jersey as the clothing is worn in different poses is a visible and measurable warping (or deformation or distortion) of the clothing forming pose data.
  • the warped positions of the clothing can be factored into the training dataset by generating features (or feature maps) that partly depend on both the pose data and the attribute images. This results in the pose data acting as a target teaching template that guides the standard attribute image synthesized jersey with corresponding deformation changes, which can be represented in resulting pose plus attribute features (or feature maps, or vectors or other output structure) .
  • the TPWR (Target Postural Warping upper-body clothes Representation) unit 408 obtains pose images 423 to represent the target postural warped arrangement of clothing or jerseys.
  • the images 423 are from a known dataset including 16,253 different poses collected by Han, X., et al., “an image-based virtual try-on network”, arXiv preprint arXiv:1711.08447 (2017).
  • the TPWR unit 408 also has a key-point generator 422 that generates key-points 421 to represent each pose.
  • five key-points (neck, left shoulder, right shoulder, left hip, right hip) 421 of the person wearing the clothes adequately represent the pose (or warp) of the clothing and are indicated as pixel locations, each on a separate image map (or channel) for feature extraction neural network input.
  • the key-points may be provided by the known pose collection or may be detected by manual labeling or automatic algorithms such as Openpose. More or fewer key-points, or key-points in different locations, could be used instead as long as the number of key-points, and in turn the number of channels, does not require an unnecessarily large computational load that makes the system inefficient.
  • the TPWR unit 408 also provides a clothing shape mask unit 424 that generates a 1-channel (single image) binary feature map or mask 419 of the posed (or warped) clothing where, for example, 1 indicates a pixel location of the clothing and 0 indicates background (not clothing) instead.
  • the mask may be formed by manual labeling or algorithms such as known parsing or segmentation algorithms.
  • the system 400 learns the key characteristics (e.g. shape, spatial deformation) that indicate texture diversity of real-world clothes along with posture (or warp) changes, thereby providing the attribute images with more realistic warped (or posed) arrangements.
  • this operation also may include any preliminary pre-processing operation to generate any of the images mentioned above for jersey synthesis and pose data generation and in sufficient form and format to be used for the training dataset generation, such as mosaicing, denoising, color correction, and so forth.
  • the attribute images 420 and the pose data 425 are provided in the form of input channels for feature extraction that are uniformly cropped to provide regions of interest (ROIs) with the clothing and identifiers, key-points, and masks to about 192 x 256 pixel images, as one example.
  • process 500 may include “extract features representing jersey images with and without poses” 504.
  • the feature extraction unit 426 of the postural warping transform network section 404 has a pose plus attribute neural network 700 for the pose plus attribute feature extraction network unit 428 and an attribute neural network 702 for the attribute feature extraction network unit 430.
  • the pose plus attribute neural network 700 may receive nine channels of input data including three attribute (RGB) channels from the attribute images 420, one channel for the mask 419, and five channels with one channel each for one of the key-points.
  • each key-point channel is an image with all 0s except for a 1 for the location of the key-point.
  • the nine channels are collectively referred to as pose data p.
  • the attribute neural network 702 receives the three RGB channels of the attribute images 420 alone and is referred to as attribute data j. It also should be noted that since the attribute image channels indicate an identifier, such as a jersey or clothing number, letter, symbol, logo, to name a few examples, this is considered at least partially semantic object recognition of a semantic feature. In order to efficiently leverage the semantic feature, the neural networks 700 and 702 are customized to respectively use the pair-wise inputs (p, j) to extract representative features of the TPWR (the pose data 425) and the standard synthetic jersey or clothing images (the attribute images 420) .
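  • A minimal PyTorch sketch of how the pair-wise inputs (p, j) might be assembled from an attribute image, its clothing mask, and the five key-points; the tensor shapes and helper name are assumptions.

```python
import torch

def build_pairwise_inputs(attribute_img: torch.Tensor,
                          mask: torch.Tensor,
                          keypoints_xy: torch.Tensor):
    """attribute_img: (3, H, W) RGB standard attribute image (e.g., 256 x 192).
    mask:          (1, H, W) binary clothing mask.
    keypoints_xy:  (5, 2) pixel (x, y) locations of neck, shoulders, and hips.
    Returns p (9 channels: 3 RGB + 1 mask + 5 key-point maps) and j (3 channels)."""
    _, h, w = attribute_img.shape
    keypoint_maps = torch.zeros(5, h, w)
    for idx, (x, y) in enumerate(keypoints_xy.long()):
        # each key-point channel is all zeros except a 1 at the key-point location
        keypoint_maps[idx, y.clamp(0, h - 1), x.clamp(0, w - 1)] = 1.0
    p = torch.cat([attribute_img, mask, keypoint_maps], dim=0)  # pose data p
    j = attribute_img                                           # attribute data j
    return p, j
```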
  • the two feature extraction networks 700 and 702 have similar or the same structure. Both of them may have four down-sampling convolutional blocks C0, C1, C2, and C3 with a similar or the same structure.
  • the down-sampling block C0 may have two convolution layers each with 3 x 3 kernels but with different stride sizes where a first convolutional layer of the two uses a stride of one and the other uses a stride of two, or vice-versa.
  • the total filter number (or number of stacked kernels or channels) for sampling block C0 is 64 and forms 96 x 128 resolution of intermediate feature maps for each channel.
  • the down-sampling blocks C1, C2, and C3 respectively form 48 x 64, 24 x 32, and 12 x 16 resolution intermediate feature maps for each channel.
  • At least one, but here each, of the down-sampling convolutional blocks C1, C2, C3 of both neural networks 700 and 702 also have a series 704 of residual sub-block modules 706.
  • These blocks have deformable convolutional layers and a squeeze and excitation sub-series (or side sequence of layers) to leverage both the spatial distribution (or geometric variations) of object features and the relative influence of channel-wise arrangements of the convolutions handling the spatial representations.
  • the down-sampling blocks C1 and C2 each have two residual sub-blocks 706, and C3 may have three residual sub-blocks 706.
  • Each series 704 ends with a convolution layer with a 3 x 3 kernel of stride two to down-scale the feature size.
  • the number of filters (or channels) is 128, 256, 512 respectively for the down-sampling blocks C1, C2, and C3.
  • each residual sub-block 706 comprises a series 800 of neural network layers including a combination of convolutional layers with 1 x 1 and 3 x 3 kernels with stride one and a shortcut.
  • each convolutional kernel may be succeeded by a batch-normalization and RELU.
  • deformable convolutions may replace the 3 x 3 kernels with stride one to provide sampling over a broader range of feature levels.
  • the series 800 may have, in order, a first convolutional layer 802 with 1 x 1 kernels, a first deformable convolutional layer 804 with 3 x 3 kernels, a second convolutional layer 806 with 1 x 1 kernels, and a second deformable convolutional layer 808 with 3 x 3 kernels.
  • These layers provide output to a scaling layer 820 that changes the channels to 12 x 16 by sampling.
  • the output of the feature extraction neural networks will be a final feature map.
  • a skip connection strategy can be used to add a squeeze and excitation mechanism. This involves also providing the output of the second convolutional 1 x 1 layer to a side series 810 with, in order, a global pooling layer 812, two fully connected (FC) layers 814 and 816, and a Sigmoid layer 818.
  • the squeeze and excitation of the skip connection series 810 provides efficient modeling of the channel-wise relationships among the spatial convolutions.
  • the side path or series 810 with the skip connection works with the main pipeline by providing two different data outputs to the scale unit 820, one from deformable layer 808 and one from the side sigmoid layer 818.
  • the side path 810 uses lower layer features and computes global information.
  • the output of the side path 810 is then used as weights on the output from the deformable layer 808 on the main pipeline.
  • the outputs of both of the neural networks 700 and 702 are features in the form of extracted feature maps of 12 x 16 with 512 channels. It will be understood that different output arrangements could be formed instead.
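  • A minimal PyTorch sketch of one such residual sub-block, assuming torchvision's DeformConv2d for the deformable 3 x 3 layers; the layer names, offset-prediction convolutions, and squeeze-and-excitation reduction ratio are illustrative choices rather than the exact configuration described above.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformSEResidualBlock(nn.Module):
    """1x1 conv -> deformable 3x3 -> 1x1 conv -> deformable 3x3 main path,
    with a squeeze-and-excitation side path (global pool, two FC layers,
    sigmoid) whose output re-weights the main path before the shortcut add."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                   nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.offset1 = nn.Conv2d(channels, 18, 3, padding=1)  # 2*3*3 sampling offsets
        self.deform1 = DeformConv2d(channels, channels, 3, padding=1)
        self.act1 = nn.Sequential(nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                   nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.offset2 = nn.Conv2d(channels, 18, 3, padding=1)
        self.deform2 = DeformConv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.se = nn.Sequential(                # side path: squeeze and excitation
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        y = self.conv1(x)
        y = self.act1(self.deform1(y, self.offset1(y)))
        y = self.conv2(y)
        weights = self.se(y).unsqueeze(-1).unsqueeze(-1)   # channel-wise weights
        y = self.bn2(self.deform2(y, self.offset2(y)))
        y = y * weights                                    # scale the main path
        return torch.relu(y + x)                           # residual shortcut
```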
  • Process 500 may include “generate representative spatial transform parameters of individual combinations of jersey attributes and poses” 506.
  • the output features from the feature extraction unit 426 are input to the spatial transform regression unit 432.
  • the two feature maps, one representing TPWR synthetic poses (or warps) and synthesized attributes (including the semantics), and the other representing the attributes alone, are combined into a single feature representation, such as a single feature map or tensor, and then regressed to learn or generate spatial transformation parameters θ that indicate an estimated transform from a flat or forward view of the clothing or jersey warped into a particular pose.
  • the feature correlation network unit 434 may operate a correlation layer block C4 to combine the two feature maps of a particular attribute combination and pose into a single high-level abstract feature.
  • the similarities of the pair-wise input features (p, j) are found by correlation instead, and by block C4 to combine independent features into one feature tensor.
  • the output of correlation layer C4, here C_AB, comes from a matrix multiply of the pair-wise feature maps F_A, F_B as:
  • C_AB(i, j, k) = f_B(i, j)^T · f_A(i_k, j_k)     (1)
  • where C_AB ∈ R^(w x h x (w·h)), and (i, j) and (i_k, j_k) indicate the individual feature positions in the w x h feature maps
  • k = h(j_k − 1) + i_k is an auxiliary indexing variable for (i_k, j_k)
  • a particular position (i, j) of the correlation layer output C_AB has all of the similarities between f_B(i, j)^T and all of the features of F_A.
  • the size of the feature extraction input is 12 x 16 x 512 (x 2) for both feature extraction networks 700 and 702, and the output dimensions of the correlation layer C4 are 12 x 16 x 192.
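  • A short sketch of such a correlation layer (block C4) implementing equation (1) with a batched matrix multiply; the function name and channel-first tensor layout are assumptions.

```python
import torch

def feature_correlation(f_a: torch.Tensor, f_b: torch.Tensor) -> torch.Tensor:
    """f_a, f_b: (N, C, H, W) feature maps, e.g., N x 512 x 12 x 16.
    Returns (N, H*W, H, W): each position (i, j) holds the similarities between
    f_B(i, j) and every feature of F_A, per equation (1)."""
    n, c, h, w = f_a.shape
    a = f_a.view(n, c, h * w)                    # (N, C, H*W)
    b = f_b.view(n, c, h * w).transpose(1, 2)    # (N, H*W, C)
    corr = torch.bmm(b, a)                       # (N, H*W, H*W) pairwise dot products
    return corr.view(n, h, w, h * w).permute(0, 3, 1, 2)
```

  • With 12 x 16 x 512 inputs this yields the 12 x 16 x 192 output noted above, since 12 x 16 = 192.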
  • a regression neural network 902 may be used to learn the spatial transform parameters.
  • the regression neural network may have blocks C5, C6, C7, and C8 to predict the spatial transformation parameters θ.
  • the output channel resolutions or dimensions of blocks C5, C6, C7, and C8 are 6 x 8, 3 x 4, 3 x 4, and 1 x 1 respectively.
  • the composition of blocks C5, C6, and C7 may have a similar structure as that of the feature extraction network blocks C1, C2, and C3. Here, however, the number of filters (or channels) for blocks C5, C6, and C7 are 512, 256, and 128, respectively.
  • the block C7 ends with an additional convolutional layer with 3 x 3 kernels of stride one, and the down-sampling is doubled in blocks C5 and C6 so that C5 and C6 end with 3 x 3 kernels with stride two and with two convolutional layers.
  • Block C8 is an output block, and includes a series 1000 of layers including, in order, a BatchNorm layer 1002, a dropout layer 1004, an FC layer 1006, and another BatchNorm layer 1008.
  • the output block C8 predicts θ, which has x and y coordinate offsets of warping grids as explained below.
  • the output dimension of the FC layer 1006 in block C8 is 98.
  • One grid of offsets is provided for each coordinate, x and y, for the center point of the grid (x, y) 1104 described below.
  • the BatchNorm layer 1008 does not change the size of the output.
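  • A minimal sketch of output block C8 as a small sequential module; the input feature size and dropout rate are assumptions.

```python
import torch.nn as nn

def make_output_block_c8(in_features: int, dropout: float = 0.5) -> nn.Sequential:
    """BatchNorm -> Dropout -> FC (98 outputs: x and y offsets of the warping
    grid control points) -> BatchNorm, matching the order described above."""
    return nn.Sequential(
        nn.BatchNorm1d(in_features),   # BatchNorm layer 1002
        nn.Dropout(dropout),           # dropout layer 1004
        nn.Linear(in_features, 98),    # FC layer 1006 predicting theta
        nn.BatchNorm1d(98),            # BatchNorm layer 1008 (size unchanged)
    )
```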
  • process 500 may include “generate warped jersey images using the parameters” 508.
  • the warping unit 438 may have a grid generation unit 904 and a grid sample unit 906 (FIG. 9) .
  • the grid generation unit 904 performs grid generation operations 1202 to use transform parameters θ to form a warped grid 1102, and a grid sampling unit 906 performs grid sampling operations 1204 to apply the warped grid (and in turn, transform parameters θ) to a sample of the jersey or clothing, such as in one of the attribute images j 420, to generate a final synthetically warped or postural jersey or clothing image 442.
  • a thin-plate spline (TPS) warping technique may be used to apply the transformation parameters θ from the transform regression network upon the synthetic jersey or clothing data of the attribute images.
  • TPS is a parametric technique that uses 2D interpolation based on a set of known corresponding control points. It interpolates a surface that passes through each control point.
  • a grid A_a 1100 (FIG. 11A) is a 2 x 2 block grid of 3 x 3 vertices or contour points, and is defined or indexed by its center point 1104 on a jersey or clothing image shown in an original non-warped shape, where ‘a’ indicates a certain pixel area of an image covered by the grids.
  • Grid B_a 1102 corresponds to the original grid 1100 after being warped by transform parameters θ, and shows the transform, or warping, of at least one contour point 1106.
  • Let Z_a = B_a − A_a be the difference between the two grids, representing the transform parameters θ.
  • a TPS is fit over points (a_ix, a_iy, z_i) to get an interpolation function F for translation of points for B, where (a_ix, a_iy) ∈ A_a, z_i ∈ Z_a, i ∈ [1, 2, ..., k], and k is the number of contour points.
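  • For reference, the standard thin-plate spline interpolation produced by such a fit (stated here in its usual textbook form rather than as quoted from the patent) is

$$F(x, y) = a_0 + a_x x + a_y y + \sum_{i=1}^{k} w_i \, U\!\left(\lVert (x, y) - (a_{ix}, a_{iy}) \rVert\right), \qquad U(r) = r^2 \log r^2,$$

  • where the affine coefficients and radial weights w_i are solved so that F(a_ix, a_iy) = z_i at every control point, giving a smooth displacement field between the control points.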
  • a 7 x 7 input grid of control points is used for the computations but other sizes could be used, and the computation grid is a different size than the output warping or transform grids A and B (1100 and 1102 of FIGS. 11A-11B).
  • the grid generation unit 1202 will generate a TPS grid F_θ(G) from the θ.
  • the grid sampling unit 1204 applies the TPS grid F_θ(G) to one of the sample flat synthesized clothing or jersey attribute images j to generate the output three RGB channels of a resulting final warped image F_θ(j).
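  • A minimal sketch of the grid sampling step, assuming the TPS interpolation F_θ has already been evaluated into a dense, normalized sampling grid; torch.nn.functional.grid_sample performs the bilinear resampling.

```python
import torch
import torch.nn.functional as F

def warp_attribute_image(attribute_img: torch.Tensor,
                         tps_grid: torch.Tensor) -> torch.Tensor:
    """attribute_img: (N, 3, H, W) flat, forward-facing RGB attribute image j.
    tps_grid: (N, H, W, 2) normalized (x, y) sampling coordinates in [-1, 1],
              assumed to come from evaluating the fitted TPS over the output grid.
    Returns the warped RGB image corresponding to F_theta(j)."""
    return F.grid_sample(attribute_img, tps_grid, mode='bilinear',
                         padding_mode='border', align_corners=True)
```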
  • Process 500 then may include “evaluate warped jersey images using a loss function” 510.
  • model training may be performed with a particular loss function.
  • the network generation of the training dataset here should be supervised and predictable.
  • the process 500 provides a neural network training dataset generator that has the capability of learning from the pair-wise input data to predict a warped image for which the corresponding real warped clothing provides the ground truth J. Therefore, a loss function may be gauged to compare the discrepancy between the ground truth J and the synthetic warped output F_θ(j). Without loss of generality, the network may be trained with a pixel-wise L1 loss:
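  • Using the notation above, such a pixel-wise L1 loss between the warped output F_θ(j) and the ground truth J takes the standard form

$$\mathcal{L}_{1} = \big\lVert F_{\theta}(j) - J \big\rVert_{1} = \sum_{x, y, c} \big| F_{\theta}(j)_{x, y, c} - J_{x, y, c} \big|.$$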
  • the output warped image may be an RGB image, and pixel-wise L1 loss tends to incur a blurry effect in the final output, which will lead the network regression to an over-smooth blur. Therefore, a mean structural similarity index (MSSIM) loss method may be adopted as another component of the loss function to obtain more structural similarity and perceptual motivation.
  • MSSIM loss is described here as:
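  • For reference, the standard SSIM/MSSIM formulation on which such a loss is based (stated here in its usual form rather than as quoted from the patent) is

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}, \qquad \mathrm{MSSIM} = \frac{1}{M} \sum_{m=1}^{M} \mathrm{SSIM}(x_m, y_m),$$

  • where μ, σ, and σ_xy are local means, standard deviations, and covariance computed over corresponding windows x_m, y_m of the predicted and ground-truth images, c_1 and c_2 are small stabilizing constants, and the loss term is taken as 1 − MSSIM and weighted against the L1 term.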
  • the loss function is not always limited to L1 and should include at least two weighted normalization algorithms, where one of the algorithms is a structural similarity index (SSIM) loss.
  • Process 500 may include “provide warped jersey images for training dataset of object recognition neural network to recognize identifiers on the jerseys” 512.
  • the resulting training dataset of images of synthetically warped clothing or jerseys can then be used directly as input to train jersey number (or other clothing identifier) object recognition neural networks either alone or with other training datasets so that the training dataset generation is an end-to-end solution.
  • the neural network should perform object semantic feature extraction, and may use bounding boxes and label regression (or regional networks and classifications) .
  • a first typical training dataset was generated by finding 125,750 images with visible jersey numbers. These images were collected and cropped from the internet and actual athletic game clips.
  • the second training dataset is a synthetic warped training dataset generated using the proposed method disclosed herein.
  • the present method generated 983,664 different jersey images.
  • FIGS. 13-18 show three samples of the generated jersey images including different stripe styles, digit fonts, and color configurations.
  • Benchmark Part A includes 2,490 real-world images with visible jersey numbers and obtained from the internet and/or actual athletic game clips.
  • Benchmark Part B has 2353 real-world images that are different in at least some significant ways including stripe style, stylized digit font, and/or color configuration compared to images in the typical Benchmark Part A and should be difficult for the neural network to recognize.
  • image samples 1300, 1500, and 1700 to be analyzed for presence of a jersey number are the type of samples provided in the Benchmark Part B image set. These types of styles do not usually exist in the typical training dataset. However, images 1400, 1600, and 1800 are relatively close warped images in the synthetic warped training dataset that permit recognition of the jersey numbers in image samples 1300, 1500, and 1700 respectively, and would most likely not be recognized from use of the typical training dataset.
  • both M1 and M2 achieve very good results for Benchmark Part A.
  • the M2 model maintains similar accuracy (>99% precision and recall), while the M1 model performance drops significantly, to about 50% of the accuracy of the M2 model trained with the synthetically warped training dataset.
  • the postural warped synthetic jersey data generation solution can substantially improve the generalization capabilities of jersey (or clothing) number recognition, and therefore enhance the robustness of player tracking systems as well as athlete or person tracking for many other systems.
  • any one or more of the operations of FIGS. 3 and 5 may be undertaken in response to instructions provided by one or more computer program products.
  • Such program products may include signal bearing media providing instructions that, when executed by, for example, a processor, may provide the functionality described herein.
  • the computer program products may be provided in any form of one or more machine-readable media.
  • a processor including one or more processor core may undertake one or more of the operations of the example processes herein in response to program code and/or instructions or instruction sets conveyed to the processor by one or more computer or machine-readable media.
  • a machine-readable medium may convey software in the form of program code and/or instructions or instruction sets that may cause any of the devices and/or systems to perform as described herein.
  • the machine or computer readable media may be a non-transitory article or medium, such as a non-transitory computer readable medium, and may be used with any of the examples mentioned above or other examples except that it does not include a transitory signal per se. It does include those elements other than a signal per se that may hold data temporarily in a “transitory” fashion such as RAM and so forth.
  • module refers to any combination of software logic, firmware logic and/or hardware logic configured to provide the functionality described herein.
  • the software may be embodied as a software package, code and/or instruction set or instructions
  • “hardware” as used in any implementation described herein, may include, for example, singly or in any combination, hardwired circuitry, programmable circuitry, state machine circuitry, and/or firmware that stores instructions executed by programmable circuitry.
  • the modules may, collectively or individually, be embodied as circuitry that forms part of a larger system, for example, an integrated circuit (IC) , system on-chip (SoC) , and so forth.
  • a module may be embodied in logic circuitry for the implementation via software, firmware, or hardware of the coding systems discussed herein.
  • logic unit refers to any combination of firmware logic and/or hardware logic configured to provide the functionality described herein.
  • the logic units may, collectively or individually, be embodied as circuitry that forms part of a larger system, for example, an integrated circuit (IC) , system on-chip (SoC) , and so forth.
  • a logic unit may be embodied in logic circuitry for the implementation via firmware or hardware of the coding systems discussed herein.
  • operations performed by hardware and/or firmware may alternatively be implemented via software, which may be embodied as a software package, code and/or instruction set or instructions, and also appreciate that logic unit may also utilize a portion of software to implement its functionality.
  • the term “component” may refer to a module or to a logic unit, as these terms are described above. Accordingly, the term “component” may refer to any combination of software logic, firmware logic, and/or hardware logic configured to provide the functionality described herein. For example, one of ordinary skill in the art will appreciate that operations performed by hardware and/or firmware may alternatively be implemented via a software module, which may be embodied as a software package, code and/or instruction set, and also appreciate that a logic unit may also utilize a portion of software to implement its functionality.
  • an example image processing system 1900 is arranged in accordance with at least some implementations of the present disclosure.
  • the example image processing system 1900 only performs warped clothing or jersey training dataset generation as described above, and has logic units 1904 for those purposes.
  • system 1900 may have one or more imaging devices 1902 to form or receive captured image data, and this may include either one or more cameras such as an array of cameras around an athletic field, stage or other such event locations.
  • the image processing system 1900 may be a digital camera or other image capture device that is one of the cameras in an array of the cameras.
  • the imaging device (s) 1902 may be the camera hardware and camera sensor software, module, or component.
  • imaging processing system 1900 may have an imaging device 1902 that includes, or may be, one camera or some or all of the cameras in the array, and logic modules 1904 may communicate remotely with, or otherwise may be communicatively coupled to, the imaging device 1902 for further processing of the image data.
  • the camera (s) may be used to capture images for forming the training dataset or for performing run-time recognition of clothing or jerseys as described above.
  • the part of the image processing system 1900 that holds the logic units 1904 and that processes the images may be on one of the cameras or may be on a separate device included in, or entirely forming, the image processing system 1900.
  • the image processing system 1900 may be a desktop or laptop computer, remote server, or mobile computing device such as a smartphone, tablet, or other device. It also could be or have a fixed function device such as a set top box (cable box or satellite box) , game box, or a television.
  • the camera (s) 1902 may be wirelessly communicating, or wired to communicate, image data to the logic units 1904.
  • such technology may include a camera such as a digital camera system, a dedicated camera device, web cam, or any other device with a camera, a still camera and so forth for the run-time of the system as well as for model learning and/or image collection for generating predetermined personal image data.
  • the cameras may be RGB cameras or RGB-D cameras, but could be YUV cameras.
  • imaging device 1902 may include camera hardware and optics including one or more sensors as well as auto-focus, zoom, aperture, ND-filter, auto-exposure, flash, actuator controls, and so forth.
  • the cameras may be fixed in certain degrees of freedom, or may be free to move in certain or all directions, as long as the position and optical axis from camera to camera is known so that the cameras can be registered to the same coordinate system.
  • the logic modules 1904 of the image processing system 1900 may include, or communicate with, an image unit 1906 that performs at least partial processing.
  • the image unit 1906 may receive raw image data and perform pre-processing, decoding, encoding, and/or even post-processing to prepare the image data for transmission, storage, and/or display.
  • the pre-processing performed by the image unit 1906 could be performed by modules located on one or each of the cameras, on a separate image processing unit 1900, or at another location.
  • the logic modules 1904 also may include an object detection unit 1930 which may have an object detection training dataset generation unit 400 (FIG. 4) to generate the warped clothing or jersey images for training as described above.
  • the object detection unit 1930 alternatively also may have an object detection neural network unit 1932 that is trained by using the warped images and that is used to perform the clothing or jersey object detection during events to be analyzed such as athletic events.
  • the neural network unit 1932 is remote from the system 1900.
  • a library or database may be provided on a memory store (s) 1914 to store any of the data related to the clothing or jersey object detection in a pose images database 1936, an attribute and identifier images database 1938, and a training dataset database, each storing image data related to its label and as described above. Any of these databases or others can be used to store intermediate and final output versions (feature vectors, spatial transform parameters, and so forth) of image data while generating the training dataset.
  • a display controller 1908 may be provided to control a display 1916 to display any of the images mentioned herein.
  • the image processing system 1900 may have one or more processors 1910, such as the Intel Atom, which may include or work with one or more dedicated image signal processors (ISPs).
  • the logic modules 1904 may be communicatively coupled to the components of the imaging device 1902 in order to receive raw image data when desired.
  • the memory stores 1914 also may or may not hold other image data or logic units.
  • An antenna 1920 may be provided to transmit or receive encoded data.
  • the image processing system 1900 may have at least one memory 1914 communicatively coupled to the processor 1910 to perform the operations described herein as explained above.
  • the image unit 1906, which may have an encoder and decoder, and antenna 1920 may be provided to compress and decompress the image data for transmission to and from other devices that may display or store the images. This may refer to transmission of image data among the cameras and the logic units 1904. Otherwise, the processed image 1918 may be displayed on the display 1916 or stored in memory 1914 for further processing as described above. As illustrated, any of these components may be capable of communication with one another and/or communication with portions of logic modules 1904 and/or imaging device 1902. Thus, processors 1910 may be communicatively coupled to both the image devices 1902 and the logic modules 1904 for operating those components.
  • while image processing system 1900 may include one particular set of units or actions associated with particular components or modules, these units or actions may be associated with different components or modules than the particular component or module illustrated here.
  • an example system 2000 in accordance with the present disclosure operates one or more aspects of the image processing system described herein. It will be understood from the nature of the system components described below that such components may be associated with, or used to operate, certain part or parts of the image processing systems described above including performance of a camera system operation described above. In various implementations, system 2000 may be a media system although system 2000 is not limited to this context.
  • system 2000 may be incorporated into a digital video camera, mobile device with camera or video functions such as an imaging phone, webcam, personal computer (PC) , remote server, laptop computer, ultra-laptop computer, tablet, touch pad, portable computer, handheld computer, palmtop computer, personal digital assistant (PDA) , cellular telephone, combination cellular telephone/PDA, television, smart device (e.g., smart phone, smart tablet or smart television) , mobile internet device (MID) , messaging device, data communication device, and so forth.
  • system 2000 includes a platform 2002 coupled to a display 2020.
  • Platform 2002 may receive content from a content device such as content services device (s) 2030 or content delivery device (s) 2040 or other similar content sources.
  • a navigation controller 2050 including one or more navigation features may be used to interact with, for example, platform 2002 and/or display 2020. Each of these components is described in greater detail below.
  • platform 2002 may include any combination of a chipset 2005, processor 2010, memory 2012, storage 2014, graphics subsystem 2015, applications 2016 and/or radio 2018.
  • Chipset 2005 may provide intercommunication among processor 2010, memory 2012, storage 2014, graphics subsystem 2015, applications 2016 and/or radio 2018.
  • chipset 2005 may include a storage adapter (not depicted) capable of providing intercommunication with storage 2014.
  • Processor 2010 may be implemented as Complex Instruction Set Computer (CISC) or Reduced Instruction Set Computer (RISC) processors; x86 instruction set compatible processors; multi-core; or any other microprocessor or central processing unit (CPU).
  • processor 2010 may be dual-core processor (s) , dual-core mobile processor (s) , and so forth.
  • Memory 2012 may be implemented as a volatile memory device such as, but not limited to, a Random Access Memory (RAM) , Dynamic Random Access Memory (DRAM) , or Static RAM (SRAM) .
  • Storage 2014 may be implemented as a non-volatile storage device such as, but not limited to, a magnetic disk drive, optical disk drive, tape drive, an internal storage device, an attached storage device, flash memory, battery backed-up SDRAM (synchronous DRAM) , and/or a network accessible storage device.
  • storage 2014 may include technology to increase the storage performance enhanced protection for valuable digital media when multiple hard drives are included, for example.
  • Graphics subsystem 2015 may perform processing of images such as still or video for display.
  • Graphics subsystem 2015 may be a graphics processing unit (GPU) or a visual processing unit (VPU) , for example, and may or may not include an image signal processor (ISP) .
  • An analog or digital interface may be used to communicatively couple graphics subsystem 2015 and display 2020.
  • the interface may be any of a High-Definition Multimedia Interface, Display Port, wireless HDMI, and/or wireless HD compliant techniques.
  • Graphics subsystem 2015 may be integrated into processor 2010 or chipset 2005.
  • graphics subsystem 2015 may be a stand-alone card communicatively coupled to chipset 2005.
  • graphics and/or video processing techniques described herein may be implemented in various hardware architectures.
  • graphics and/or video functionality may be integrated within a chipset.
  • a discrete graphics and/or video processor may be used.
  • the graphics and/or video functions may be provided by a general purpose processor, including a multi-core processor.
  • the functions may be implemented in a consumer electronics device.
  • Radio 2018 may include one or more radios capable of transmitting and receiving signals using various suitable wireless communications techniques. Such techniques may involve communications across one or more wireless networks.
  • Example wireless networks include (but are not limited to) wireless local area networks (WLANs) , wireless personal area networks (WPANs) , wireless metropolitan area networks (WMANs) , cellular networks, and satellite networks. In communicating across such networks, radio 2018 may operate in accordance with one or more applicable standards in any version.
  • display 2020 may include any television type monitor or display.
  • Display 2020 may include, for example, a computer display screen, touch screen display, video monitor, television-like device, and/or a television.
  • Display 2020 may be digital and/or analog.
  • display 2020 may be a holographic display.
  • display 2020 may be a transparent surface that may receive a visual projection.
  • projections may convey various forms of information, images, and/or objects.
  • content services device (s) 2030 may be hosted by any national, international and/or independent service and thus accessible to platform 2002 via the Internet, for example.
  • Content services device (s) 2030 may be coupled to platform 2002 and/or to display 2020.
  • Platform 2002 and/or content services device (s) 2030 may be coupled to a network 2060 to communicate (e.g., send and/or receive) media information to and from network 2060.
  • Content delivery device (s) 2040 also may be coupled to platform 2002 and/or to display 2020.
  • content services device (s) 2030 may include a cable television box, personal computer, network, telephone, Internet-enabled device or appliance capable of delivering digital information and/or content, and any other similar device capable of unidirectionally or bidirectionally communicating content between content providers and platform 2002 and/or display 2020, via network 2060 or directly. It will be appreciated that the content may be communicated unidirectionally and/or bidirectionally to and from any one of the components in system 2000 and a content provider via network 2060. Examples of content may include any media information including, for example, video, music, medical and gaming information, and so forth.
  • Content services device (s) 2030 may receive content such as cable television programming including media information, digital information, and/or other content.
  • content providers may include any cable or satellite television or radio or Internet content providers. The provided examples are not meant to limit implementations in accordance with the present disclosure in any way.
  • platform 2002 may receive control signals from navigation controller 2050 having one or more navigation features.
  • the navigation features of controller 2050 may be used to interact with user interface 2022, for example.
  • navigation controller 2050 may be a pointing device that may be a computer hardware component (specifically, a human interface device) that allows a user to input spatial (e.g., continuous and multi-dimensional) data into a computer.
  • Systems such as graphical user interfaces (GUI) , televisions, and monitors allow the user to control and provide data to the computer or television using physical gestures.
  • Movements of the navigation features of controller 2050 may be replicated on a display (e.g., display 2020) by movements of a pointer, cursor, focus ring, or other visual indicators displayed on the display.
  • the navigation features located on navigation controller 2050 may be mapped to virtual navigation features displayed on user interface 2022, for example.
  • controller 2050 may not be a separate component but may be integrated into platform 2002 and/or display 2020. The present disclosure, however, is not limited to the elements or in the context shown or described herein.
  • drivers may include technology to enable users to instantly turn on and off platform 2002 like a television with the touch of a button after initial boot-up, when enabled, for example.
  • Program logic may allow platform 2002 to stream content to media adaptors or other content services device (s) 2030 or content delivery device (s) 2040 even when the platform is turned “off. ”
  • chipset 2005 may include hardware and/or software support for 5.1 surround sound audio and/or high definition (7.1) surround sound audio, for example.
  • Drivers may include a graphics driver for integrated graphics platforms.
  • the graphics driver may comprise a peripheral component interconnect (PCI) Express graphics card.
  • any one or more of the components shown in system 2000 may be integrated.
  • platform 2002 and content services device (s) 2030 may be integrated, or platform 2002 and content delivery device (s) 2040 may be integrated, or platform 2002, content services device (s) 2030, and content delivery device (s) 2040 may be integrated, for example.
  • platform 2002 and display 2020 may be an integrated unit.
  • Display 2020 and content service device (s) 2030 may be integrated, or display 2020 and content delivery device (s) 2040 may be integrated, for example.
  • system 2000 may be implemented as a wireless system, a wired system, or a combination of both.
  • system 2000 may include components and interfaces suitable for communicating over a wireless shared media, such as one or more antennas, transmitters, receivers, transceivers, amplifiers, filters, control logic, and so forth.
  • a wireless shared media may include portions of a wireless spectrum, such as the RF spectrum and so forth.
  • system 2000 may include components and interfaces suitable for communicating over wired communications media, such as input/output (I/O) adapters, physical connectors to connect the I/O adapter with a corresponding wired communications medium, a network interface card (NIC) , disc controller, video controller, audio controller, and the like.
  • wired communications media may include a wire, cable, metal leads, printed circuit board (PCB) , backplane, switch fabric, semiconductor material, twisted-pair wire, co-axial cable, fiber optics, and so forth.
  • Platform 2002 may establish one or more logical or physical channels to communicate information.
  • the information may include media information and control information.
  • Media information may refer to any data representing content meant for a user. Examples of content may include, for example, data from a voice conversation, videoconference, streaming video, electronic mail ( “email” ) message, voice mail message, alphanumeric symbols, graphics, image, video, text and so forth. Data from a voice conversation may be, for example, speech information, silence periods, background noise, comfort noise, tones and so forth.
  • Control information may refer to any data representing commands, instructions or control words meant for an automated system. For example, control information may be used to route media information through a system, or instruct a node to process the media information in a predetermined manner. The implementations, however, are not limited to the elements or in the context shown or described in FIG. 20.
  • a small form factor device 2100 is one example of the varying physical styles or form factors in which systems 1900 or 2000 may be embodied.
  • device 1900 may be implemented as a mobile computing device 2100 having wireless capabilities.
  • a mobile computing device may refer to any device having a processing system and a mobile power source or supply, such as one or more batteries, for example.
  • examples of a mobile computing device may include a digital still camera, digital video camera, mobile devices with camera or video functions such as imaging phones, webcam, personal computer (PC) , laptop computer, ultra-laptop computer, tablet, touch pad, portable computer, handheld computer, palmtop computer, personal digital assistant (PDA) , cellular telephone, combination cellular telephone/PDA, television, smart device (e.g., smart phone, smart tablet or smart television) , mobile internet device (MID) , messaging device, data communication device, and so forth.
  • Examples of a mobile computing device also may include computers that are arranged to be worn by a person, such as a wrist computer, finger computer, ring computer, eyeglass computer, belt-clip computer, arm-band computer, shoe computers, clothing computers, and other wearable computers.
  • a mobile computing device may be implemented as a smart phone capable of executing computer applications, as well as voice communications and/or data communications.
  • While voice communications and/or data communications may be described with a mobile computing device implemented as a smart phone by way of example, it may be appreciated that other implementations may be implemented using other wireless mobile computing devices as well. The implementations are not limited in this context.
  • device 2100 may include a housing with a front 2101 and a back 2102.
  • Device 2100 includes a display 2104, an input/output (I/O) device 2106, and an integrated antenna 2108.
  • Device 2100 also may include navigation features 2112.
  • I/O device 2106 may include any suitable I/O device for entering information into a mobile computing device. Examples for I/O device 2106 may include an alphanumeric keyboard, a numeric keypad, a touch pad, input keys, buttons, switches, microphones, speakers, voice recognition device and software, and so forth. Information also may be entered into device 2100 by way of microphone 2114, or may be digitized by a voice recognition device.
  • device 2100 may include a camera 2105 (e.g., including at least one lens, aperture, and imaging sensor) and a flash 2110 integrated into back 2102 (or elsewhere) of device 2100. The implementations are not limited in this context.
  • Various forms of the devices and processes described herein may be implemented using hardware elements, software elements, or a combination of both.
  • hardware elements may include processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth) , integrated circuits, application specific integrated circuits (ASIC) , programmable logic devices (PLD) , digital signal processors (DSP) , field programmable gate array (FPGA) , logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth.
  • Examples of software may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API) , instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Determining whether an implementation is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints.
  • IP cores may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.
  • a computer-implemented method of image processing comprises obtaining image data of clothing having varying visible attributes and visible identifiers on the outer surface of the clothing; synthetically warping a version of the image data to form warped images of synthetically warped clothing; and training a neural network to recognize the identifiers by using the warped images.
  • the clothing is athletic jerseys and the identifiers are jersey numbers on the athletic jerseys.
  • synthetically warping comprises generating spatial transform parameters to be applied to clothing images formed by the image data.
  • the method comprising combining attribute and identifier image data of the varying visible attributes and identifiers to generate attribute images each showing a different combination of the attributes and identifiers.
  • the method comprising combining attribute and identifier image data of the varying visible attributes and identifiers to generate attribute images each showing a different combination of the attributes and identifiers; and generating pose data from images of at least one person wearing clothing in different poses.
  • the method comprising combining attribute and identifier image data of the varying visible attributes and identifiers to generate attribute images each showing a different combination of the attributes and identifiers; generating pose data from images of at least one person wearing clothing in different poses; generating features from a first feature extraction neural network that receives both the attribute images and pose data; generating features from a second feature extraction neural network that receives the attribute images without receiving the pose data; and combining and regressing the features from the first and second feature extraction neural networks to form spatial transform parameters representing warping of the clothing.
  • the method comprising combining attribute and identifier image data of the varying visible attributes and identifiers to generate attribute images each showing a different combination of the attributes and identifiers; generating pose data from images of at least one person wearing clothing in different poses; generating features from a first feature extraction neural network that receives both the attribute images and pose data; generating features from a second feature extraction neural network that receives the attribute images without receiving the pose data; combining and regressing the features from the first and second feature extraction neural networks to form spatial transform parameters representing warping of the clothing; and applying the spatial transform parameters to the attribute images to generate the warped images.
  • the method comprising evaluating whether the warped images meet a loss function criterion wherein the loss function includes at least two weighted normalization algorithms where one of the algorithms is a structural similarity index (SSIM) loss.
  • a computer-implemented system comprising at least one memory; and at least one processor communicatively coupled to the memory and being arranged to operate by: obtaining image data of clothing having varying visible attributes and visible identifiers on the outer surface of the clothing; synthetically warping a version of the image data to form warped images of synthetically warped clothing; and training a neural network to recognize the identifiers by using the warped images.
  • synthetically warping comprises generating spatial transform parameters to be applied to clothing images formed by the image data.
  • synthetically warping comprises generating spatial transform parameters to be applied to clothing images formed by the image data; and the at least one processor to operate by generating features from the image data of the clothing in various combinations of the attributes and identifiers and clothing on at least one person in different poses comprising using a neural network with layer blocks individually using residual sub-blocks each having a squeeze and excitation side path that provides weights to output of a main path; and using the features to form the spatial transform parameters.
  • At least one non-transitory computer readable medium comprising a plurality of instructions that in response to being executed on a computing device, causes the computing device to operate by: obtaining image data of clothing having varying visible attributes and visible identifiers on the outer surface of the clothing; synthetically warping a version of the image data to form warped images of synthetically warped clothing; and training a neural network to recognize the identifiers by using the warped images.
  • the instructions cause the computing device to operate by combining attribute and identifier image data of the varying visible attributes and identifiers to generate attribute images each showing a different combination of the attributes and identifiers; and generating pose data from images of at least one person wearing clothing in different poses.
  • the instructions cause the computing device to operate by combining attribute and identifier image data of the varying visible attributes and identifiers to generate attribute images each showing a different combination of the attributes and identifiers; generating pose data from images of at least one person wearing clothing in different poses; generating features from a first feature extraction neural network that receives both the attribute images and pose data; generating features from a second feature extraction neural network that receives the attribute images without receiving the pose data; and combining and regressing the features from the first and second feature extraction neural networks to form spatial transform parameters representing warping of the clothing.
  • the instructions cause the computing device to operate by combining attribute and identifier image data of the varying visible attributes and identifiers to generate attribute images each showing a different combination of the attributes and identifiers; generating pose data from images of at least one person wearing clothing in different poses; generating features from a first feature extraction neural network that receives both the attribute images and pose data; generating features from a second feature extraction neural network that receives the attribute images without receiving the pose data; combining and regressing the features from the first and second feature extraction neural networks to form spatial transform parameters representing warping of the clothing; and applying the spatial transform parameters to the attribute images to generate the warped images.
  • the instructions cause the computing device to operate by evaluating whether the warped images meet a loss function criterion that uses a weighted structural similarity index normalization and shows the similarity between the warped images and a measure of ground truth; and only using the warped images for training when the warped images meet the criterion.
  • At least one non-transitory computer readable medium comprising a plurality of instructions that in response to being executed on a computing device, causes the computing device to operate by: obtaining attribute images of clothing having varying visible attributes and visible identifiers on the outer surface of the clothing; obtaining pose data from images of clothing on at least one person in a variety of poses; extracting features representing the visible attributes and identifiers on clothing in various poses comprising inputting the attribute images and pose data into at least one neural network; and using the features to form warped images of the identifiers on clothing to use the warped images to train a neural network to recognize the identifiers.
  • the instructions cause the computing device to operate by inputting features from a first feature extraction neural network that receives both the attribute images and pose data.
  • the instructions cause the computing device to operate by inputting features from a first feature extraction neural network that receives both the attribute images and pose data; and inputting the attribute images without the pose data into a second feature extraction neural network; and combining and regressing the features from the first and second feature extraction neural networks to form spatial transform parameters used to synthetically warp the attribute images to generate the warped images.
  • the instructions cause the computing device to operate by inputting features from a first feature extraction neural network that receives both the attribute images and pose data; and inputting the attribute images without the pose data into a second feature extraction neural network; and combining and regressing the features from the first and second feature extraction neural networks to form spatial transform parameters used to synthetically warp the attribute images to generate the warped images, wherein the training comprises using the warped images as input to the neural network.
  • a device or system includes a memory and a processor to perform a method according to any one of the above implementations.
  • At least one machine readable medium includes a plurality of instructions that in response to being executed on a computing device, cause the computing device to perform a method according to any one of the above implementations.
  • an apparatus may include means for performing a method according to any one of the above implementations.
  • the above examples may include specific combination of features. However, the above examples are not limited in this regard and, in various implementations, the above examples may include undertaking only a subset of such features, undertaking a different order of such features, undertaking a different combination of such features, and/or undertaking additional features than those features explicitly listed. For example, all features described with respect to any example methods herein may be implemented with respect to any example apparatus, example systems, and/or example articles, and vice versa.

Abstract

A method and system of neural network object recognition for warpable jerseys with multiple attributes. A computer-implemented method of image processing, comprising: obtaining image data of clothing having varying visible attributes and visible identifiers on the outer surface of the clothing (302); synthetically warping a version of the image data to form warped images of synthetically warped clothing (304); and training a neural network to recognize the identifiers by using the warped images (306).

Description

METHOD AND SYSTEM OF NEURAL NETWORK OBJECT RECOGNITION FOR WARPABLE JERSEYS WITH MULTIPLE ATTRIBUTES

BACKGROUND
With the advancement of multi-camera, three-dimensional, immersive visual displays based on volumetric models, especially of athletic events, it is possible to freeze the action on a field, rotate the scene to a desired perspective of a virtual camera view, and zoom in or out to create a desired proximity to the action including showing an athlete’s view of the athletic field. This can be accomplished by using an array of cameras around an athletic field for example. The athletes can be individually identified and tracked by their jersey number to follow a certain athlete on a display of the event, and this is often performed by using object detection or recognition neural networks. Such jersey recognition is complex, however, because jerseys vary in jersey number value, jersey number font, color combinations, and other fabric patterns such as stripes or check patterns. While an event is being recorded by cameras, a further significant complication is that the jerseys warp such that the jerseys do not remain flat and facing a camera image plane since the jerseys are usually made out of flexible clothing and the athletes are usually moving. These variations make it very difficult to recognize the jersey numbers from camera images so that the conventional neural network object recognition systems are frequently inaccurate.
DESCRIPTION OF THE FIGURES
The material described herein is illustrated by way of example and not by way of limitation in the accompanying figures. For simplicity and clarity of illustration, elements illustrated in the figures are not necessarily drawn to scale. For example, the dimensions of some elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference labels have been repeated among the figures to indicate corresponding or analogous elements. In the figures:
FIG. 1 is an image of an athlete wearing an example jersey with a jersey number;
FIG. 2 is another image of an athlete wearing another example jersey with a jersey number;
FIG. 3 is a flow chart of a method of neural network object recognition of warpable jerseys with multiple attributes according to at least one of the implementations herein;
FIG. 4 is a schematic diagram of an image processing device according to at least one of the implementations herein;
FIG. 5 is a detailed flow chart of a method of neural network object recognition of warpable jerseys with multiple attributes according to at least one of the implementations herein;
FIG. 6 is a schematic flow chart showing jersey synthesis according to at least one of the implementations herein;
FIG. 7A is a schematic flow chart showing postural feature extraction according to at least one of the implementations herein;
FIG. 7B is a schematic flow chart showing base jersey or clothing feature extraction according to at least one of the implementations herein;
FIG. 8 is a schematic diagram of a neural network used with the feature extraction of FIGS. 7A-7B and according to at least one of the implementations herein;
FIG. 9 is a schematic flow chart of the image processing system according to at least one of the implementations herein;
FIG. 10 is a flow chart showing a neural network process according to FIG. 9 and according to at least one of the implementations herein;
FIG. 11A is a schematic diagram of a warping grid according to at least one of the implementations herein;
FIG. 11B is a schematic diagram of a warping grid after warping and according to at least one of the implementations herein;
FIG. 12 is an illustrative diagram of a warping algorithm according to at least one of the implementations herein;
FIG. 13 is an image of an actual jersey to be recognized and used as input to a neural network;
FIG. 14 is an image of a warped jersey used in the training dataset to train the neural network to recognize the jersey of FIG. 13;
FIG. 15 is an image of an actual jersey to be recognized and used as input to a neural network;
FIG. 16 is an image of a warped jersey used in the training dataset to train the neural network to recognize the jersey of FIG. 15;
FIG. 17 is an image of an actual jersey to be recognized and used as input to a neural network;
FIG. 18 is an image of a warped jersey used in the training dataset to train the neural network to recognize the jersey of FIG. 17;
FIG. 19 is an illustrative diagram of an example system;
FIG. 20 is an illustrative diagram of another example system; and
FIG. 21 illustrates another example device, all arranged in accordance with at least some implementations of the present disclosure.
DETAILED DESCRIPTION
One or more implementations are now described with reference to the enclosed figures. While specific configurations and arrangements are discussed, it should be understood that this is done for illustrative purposes only. Persons skilled in the relevant art will recognize that other configurations and arrangements may be employed without departing from the spirit and scope of the description. It will be apparent to those skilled in the relevant art that techniques and/or arrangements described herein also may be employed in a variety of other systems and applications other than what is described herein.
While the following description sets forth various implementations that may be manifested in architectures such as system-on-a-chip (SoC) architectures for example, implementation of the techniques and/or arrangements described herein are not restricted to particular architectures and/or computing systems and may be implemented by any architecture and/or computing system for similar purposes. For instance, various architectures employing, for example, multiple integrated circuit (IC) chips and/or packages, and/or various computing devices, professional electronic devices such as one or more television cameras, video cameras, or camera arrays that surround an event to be recorded by the cameras, servers, laptops, desk tops, computer networks, and/or consumer electronic (CE) devices such as imaging devices, digital cameras, smart phones, webcams, video cameras, video game panels or consoles, televisions, set top boxes, and so forth, may implement the techniques and/or arrangements described herein, and whether part of a single camera or multi-camera system. Further, while the following description may set forth numerous specific details such as logic implementations, types and interrelationships of system components, logic partitioning/integration choices, and so forth, claimed subject matter may be practiced without such specific details. In other instances, some material such as, for example, control structures and full software instruction sequences, may not be shown in detail in order not to obscure the material disclosed herein. The material disclosed herein may be implemented in hardware, firmware, software, or any combination thereof.
The material disclosed herein may also be implemented as instructions stored on a machine-readable medium or memory, which may be read and executed by one or more processors. A machine-readable medium may include any medium and/or mechanism for storing or transmitting information in a form readable by a machine (for example, a computing device) . For example, a machine-readable medium may include read-only memory (ROM) ; random access memory (RAM) ; magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, and so forth) , and others. In another form, a non-transitory article, such as a non-transitory computer readable medium, may be used with any of the examples mentioned above or other examples except that it does not include a transitory signal per se. It does include those elements other than a signal per se that may hold data temporarily in a “transitory” fashion such as RAM and so forth.
References in the specification to "one implementation" , "an implementation" , "an example implementation" , and so forth, indicate that the implementation described may include a particular feature, structure, or characteristic, but every implementation may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same implementation. Further, when a particular feature, structure, or characteristic is described in connection with an implementation, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other implementations whether or not explicitly described herein.
Method and system of neural network object recognition of warpable jerseys with multiple attributes is described as follows.
One conventional multi-camera system in stadiums captures high-resolution videos and employs segmentation and 3D reconstruction of a scene from images captured by the cameras to create a volumetric model. These systems use athlete and ball tracking technology to extract a 3D position of an athlete and ball so that a virtual camera (vCam) perspective can follow the athlete and ball to provide a compelling user experience.
A number of these systems use convolutional neural network (CNN) based jersey number recognition. However, these systems have accuracy limitations since they operate in a computer vision (CV) domain. In particular, while known jersey number recognition models work well with jersey number input that is similar to the data of the training dataset used to train the neural network jersey recognition model in the first place, these systems are quite inaccurate when the input, such as the jersey number, jersey color, or jersey pattern, is significantly different from the training dataset, or when the jersey is deformed (or shaped) by gravity, friction, and so forth, and by the athlete’s movement, such that the deformed jersey appears sufficiently different in an image that the neural network cannot recognize the jersey number. Note that the term ‘warp’ herein refers both to intentional synthetic deforming of clothing and to the natural shaping of the clothing, depending on the context.
Referring to FIGS. 1-2 for example, jersey sample 100 has a stripe style 102 with a red diagonal stripe while jersey sample 200 has a specific digit font 202 with a thickened boundary on the numbers. When existing jersey number recognition training data does not contain these two jersey styles, conventional object recognition systems often fail to recognize the jersey numbers. This is especially true when a neural network system is trained with the same dataset for multiple sports. In some products, the network may be used for NFL, NBA, and soccer, for example, where each team in a sport has a different jersey number style, and where some of the styles mix the digit itself with a stripe style and color configuration of the jersey. Even though large-scale diversified jersey datasets are used to train the known jersey number model, this is not enough to cover all jersey styles due to the huge number of combinations of jersey numbers (e.g., 00, 0, 1 ..., 99) , digit fonts, stripe styles, player poses, etc. Furthermore, it is also impractical and labor intensive to collect enough real jersey data to cover all these combinations.
Conventional CNN-based methods used in other fields often are adequate when trained with massive amounts of diversified data, and have been used by the jersey number recognition systems. For example, data augmentation is often an effective technique to increase both the amount and diversity of data by randomly adopting some augmentation policies. In the image domain, common augmentation techniques include providing modifications to the training data set such as changing input images in contrast, size, rotation, shearing, flipping horizontally, translation, and so forth to add modified images to an existing training dataset. To automate the process of finding an effective data augmentation policy for a target dataset, a search space may be arranged where a policy consists of many sub-policies, one of which is randomly chosen for each image in mini-batches. A sub-policy consists of two operations, each operation being an image processing function such as translation, rotation, or shearing, and the probabilities and magnitudes with which the functions are applied. Some of these techniques use a search algorithm to find the best policy to construct the combinations of those data augmentations.
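For illustration only, the following sketch (in Python, using torchvision; not part of the disclosed method) shows the general sub-policy mechanism just described, where a sub-policy is two image operations, each applied with a probability and a magnitude, and one sub-policy is chosen at random per image. The specific operations, probabilities, and magnitudes are arbitrary example values, not values taken from any particular augmentation policy search.

    import random
    import torchvision.transforms.functional as TF

    def rotate(img, mag):
        return TF.rotate(img, angle=3.0 * mag)

    def shear_x(img, mag):
        return TF.affine(img, angle=0.0, translate=[0, 0], scale=1.0, shear=[2.0 * mag, 0.0])

    def translate_x(img, mag):
        return TF.affine(img, angle=0.0, translate=[int(4 * mag), 0], scale=1.0, shear=[0.0, 0.0])

    # Each sub-policy: two (operation, probability, magnitude) entries.
    SUB_POLICIES = [
        [(rotate, 0.7, 4), (translate_x, 0.3, 6)],
        [(shear_x, 0.5, 5), (rotate, 0.9, 2)],
    ]

    def augment(img):
        # Pick one sub-policy at random, then apply each of its two operations
        # with the stated probability and magnitude.
        for op, prob, mag in random.choice(SUB_POLICIES):
            if random.random() < prob:
                img = op(img, mag)
        return img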
These previous solutions, however, focused on adding different data augmentation techniques based on image processing functions. These augmentation techniques cannot compensate for the inadequate training datasets typically used for jersey number recognition, which only include flat pictures of the jerseys or pictures of the jerseys from previous sporting events. Unknown (that is, not yet encountered) jersey styles and digit fonts are particularly problematic even for data-augmented jersey recognition neural networks. The performance of a model trained from such an existing dataset drops significantly when it encounters a new jersey it has not experienced before, due to a significant lack of model generalization.
To resolve these issues, the disclosed system and method automatically generate synthetically warped, synthesized jersey (or clothing) data as part of the training dataset. Specifically, synthesized images of the jerseys or clothing are created by combining different available combinations of attributes. In addition, pose or warp data of many different standard poses that would warp clothing worn by a person is collected. With this image data, and based on the synthesis of a particular pattern, such as a stripe style as one example, and with specific stylized digit fonts, the present method both transforms such a target jersey into a shape that best fits the pose of a person or athlete and preserves spatial alignment details. With this data, the present system and method can fully express the characteristics of real-world athletes by tracking visible jersey number images. Furthermore, the synthetic data can be used for jersey number recognition training in an end-to-end manner such that the resulting warped jersey or clothing images can be directly used as input for training a jersey number (or clothing identifier) object recognition neural network. By including these synthetic warped images in the training dataset, model generalization capability may be significantly improved for many sports and other situations that can use synthetic warped clothing or warped object input, for example.
Referring now to FIG. 3, by one approach an example process 300 characterizes the presently disclosed method and is a computer-implemented method of image processing, and by one specific example, a process of neural network object recognition of warpable jerseys or clothing with multiple attributes. In the illustrated implementation, process 300 may include one or more operations, functions or actions as illustrated by one or more of operations 302 to 306 numbered evenly. By way of non-limiting example, process 300 may be described herein with reference to example  image processing systems  400 and 1900, 2000, or 2100 of FIGS. 4 and 19-21 respectively, and where relevant.
Process 300 may include “obtain image data of clothing having varying visible attributes and visible identifiers on the outer surface of the clothing” 302. This may include obtaining both images of clothing and clothing attributes, such as athletic jerseys, as well as images of a person wearing clothing in different poses, such as upper-body clothes including shirts and jerseys, or even pants when pants are being analyzed instead, that show the deforming or warping of the clothes in the different poses of the person. This operation also separately includes obtaining image data on the attributes of the clothing or jerseys, which may include a clothing or shirt template that shows the shape of the clothing or jersey, a visible pattern, a jersey number font, a jersey color scheme, and so forth. The attributes may be in the form of separate images or simply a database or list depending on the type of attribute. Thus, each image may provide one attribute of a jersey, such as a particular template shape, number font, pattern (such as one or more stripes) , or color scheme, although any combination of them may be provided, and other attributes could be provided as well. This may be in the form of a collection or database for each type of attribute.
Process 300 may include “synthetically warp a version of the image data to form warped images of synthetically warped clothing” 304. This involves a number of operations including synthesizing the attribute image data to form base or attribute (or standard) images using the obtained image data in all or many possible attribute combinations. The attribute images may show the jersey or clothing in a flat or forward-facing position directly facing the image plane of a camera, with or without 3D shape. Separately, masks may be formed of the clothing in the pose images. Images also may be formed that show key-point locations that define the pose of the clothing, such as five key-points in one example, including key-points at the neck, shoulders, and hips. The key-points and masks are collectively referred to as target pose (or postural) warped representations (TPWR images, or just pose image data or pose data) .
Next, features that represent the attributes and warped pose combinations are extracted from the images. This may involve an attribute neural network that provides features (or feature maps) by using the attribute images alone as input, and another pose plus attribute neural network that receives both the attribute images and the TPWR images and outputs pose- or warp-related feature maps that are now also associated with the attribute images. The TPWR images provide a more realistic or natural shape of the jerseys or clothing. Since the feature extraction networks can leverage semantic features from existing datasets, such as jersey numbers or other identifiers with desired fonts, as well as jerseys or clothing with patterns, color schemes, and so forth, in order to form the base images, the generated images can have a natural texture very similar to real images.
Thereafter, the features are input to a combining or correlating network and then a regression network. This network outputs spatial transformation parameters, each of which represents the spatial transformation from a flat or forward-facing attribute image of the clothing or jersey into one of the poses. A warping operation then applies the spatial transformation representations (STRs) or parameters to the attribute images to perform synthetic warping and generate an image of realistically warped clothing or a jersey.
Optionally, the resulting warped images can be evaluated by using a loss function to compare each warped image to ground truth. A combined loss function weights both a structural similarity index (SSIM) loss and another normalization, such as L1. When the loss is too large, the image may be dropped. This may avoid the over-smoothing blur caused by conventional L1 and/or L2 loss functions alone, resulting in better transform performance than the conventional loss functions.
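The following is a minimal sketch (assuming PyTorch) of such a combined loss, weighting an SSIM term and an L1 term; the window size, stability constants, and weights are illustrative assumptions rather than values specified herein.

    import torch
    import torch.nn.functional as F

    def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2, win=11):
        # Mean structural similarity between two batches of images scaled to [0, 1].
        pad = win // 2
        mu_x = F.avg_pool2d(x, win, 1, pad)
        mu_y = F.avg_pool2d(y, win, 1, pad)
        var_x = F.avg_pool2d(x * x, win, 1, pad) - mu_x ** 2
        var_y = F.avg_pool2d(y * y, win, 1, pad) - mu_y ** 2
        cov_xy = F.avg_pool2d(x * y, win, 1, pad) - mu_x * mu_y
        num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
        den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
        return (num / den).mean()

    def combined_loss(warped, ground_truth, w_ssim=0.5, w_l1=0.5):
        # Weighted sum of an SSIM loss term and an L1 term (weights are assumptions).
        loss_ssim = 1.0 - ssim(warped, ground_truth)
        loss_l1 = F.l1_loss(warped, ground_truth)
        return w_ssim * loss_ssim + w_l1 * loss_l1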
Process 300 may include “train a neural network to recognize the identifiers by using the warped images” 306. Once generated, the warped images may be used directly as input to an object detection neural network to train the neural network to recognize jersey numbers or other identifiers on warpable clothing, thereby providing an end-to-end solution that significantly increases the generalization accuracy for such a network.
Referring to FIG. 4, an image processing system or device 400 performs the synthetically warped image training dataset generation described herein. By one form, device 400 may be a postural warped synthetic jersey image generation device or system, or simply object detection training dataset generation system, unit, or device. The device 400 may include two functional units separated by the dashed line where the left side is a jersey or clothing synthesis section 402 and the right side is generally and collectively referred to as a postural  warping transform network 404 although the network 404 may have a number of separate neural networks.
In more detail, the synthesis section 402 includes a target postural warping representation (TPWR) unit 408 that has a key-point generator 422 to indicate key-points 421 on pose images 423 of a person wearing clothing. The key-points represent the pose (or warp) of the person, and in turn the clothing, in the pose image. A pose masking unit 424 generates a mask 419 showing the boundary of the clothes. The key-points 421 and mask 419 form pose image data (or just pose data) 425 that is provided from the TPWR 408 for feature extraction as explained below.
An attribute jersey (or clothing) synthesis unit 406 has an attribute synthesizer unit 410 that generates attribute images 420, where each individual attribute image 420 has a different combination of attributes. The attributes may include a shirt template 412 showing the clothing shape, in addition to a pattern image or data (such as stripes) 414, a jersey number font 416 referring to the shape, style, and size of a jersey number, and a color scheme 418, but many other attributes could be tracked as well. The pose data 425 and attribute images 420 are provided to the postural warping transform network 404, and specifically a feature extraction unit 426. The feature extraction unit 426 has a pose feature extraction unit 428 that receives both the attribute images and the pose data to generate representative features, while an attribute feature extraction unit 430 generates features more directly related to just the attributes, including the semantic attribute (such as a jersey number or other identifier) .
The features (or feature vectors) may be provided to a spatial transform regression unit 432 that first combines (or correlates or matches) the pose data, in the form of the pose features, with the attribute image features using a feature correlation network 434. A transform regression unit 436 then performs a regression to output spatial transform parameters that indicate a transform from a flat or front-facing view of the clothing or jersey, as in the attribute image, to one of the poses.
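A minimal sketch (assuming PyTorch) of one possible feature correlation step followed by a regression head is shown below; the simple dot-product correlation, the assumed 12 x 16 feature resolution, and the number of output transform parameters (here a 5 x 5 grid of 2-D offsets) are assumptions used only to make the sketch self-contained, not the specific structure of the feature correlation network 434 or transform regression unit 436.

    import torch
    import torch.nn as nn

    class CorrelationRegression(nn.Module):
        # Correlates pose features with attribute features, then regresses
        # a vector of spatial transform parameters.
        def __init__(self, feat_h=12, feat_w=16, n_params=5 * 5 * 2):
            super().__init__()
            self.head = nn.Sequential(
                nn.Conv2d(feat_h * feat_w, 128, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(128, n_params))

        def forward(self, pose_feats, attr_feats):
            # pose_feats, attr_feats: (N, C, feat_h, feat_w) from the two extraction networks.
            n, c, h, w = pose_feats.shape
            a = pose_feats.view(n, c, h * w)
            b = attr_feats.view(n, c, h * w)
            corr = torch.bmm(a.transpose(1, 2), b)   # (N, HW, HW) pairwise correlation
            corr = corr.view(n, h * w, h, w)         # source positions become channels
            return self.head(corr)                   # spatial transform parameters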
A warping unit 438 then applies the spatial transform parameters to the attribute images using a warping algorithm to generate warped images. Optionally, a loss evaluation function unit 440 can then evaluate whether the warped images meet a criterion. If met, the warped image can be used directly as an input image to train a clothing or jersey (or jersey number) object recognition neural network. The details of the operation of system 400 are described with process 500 below.
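The following sketch (assuming PyTorch) illustrates one way a warping unit might apply spatial transform parameters to an attribute image, here treated as coarse control-grid offsets in the spirit of the warping grids of FIGS. 11A-11B that are upsampled to a dense sampling grid; this parameterization is an assumption for illustration and not the specific warping algorithm of FIG. 12.

    import torch
    import torch.nn.functional as F

    def warp_attribute_image(attribute_img, grid_offsets):
        # attribute_img: (N, 3, H, W); grid_offsets: (N, 2, gh, gw) coarse offsets
        # expressed in normalized [-1, 1] image coordinates.
        n, _, h, w = attribute_img.shape
        device = attribute_img.device
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=device),
                                torch.linspace(-1, 1, w, device=device),
                                indexing="ij")
        base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
        # Upsample the coarse control-grid offsets to a dense per-pixel flow.
        flow = F.interpolate(grid_offsets, size=(h, w), mode="bilinear",
                             align_corners=True).permute(0, 2, 3, 1)
        # Sample the flat attribute image at the displaced grid positions.
        return F.grid_sample(attribute_img, base + flow, align_corners=True)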
Referring to FIG. 5, by one approach an example process 500 characterizes the presently disclosed method and is a computer-implemented method of image processing, and by one specific example, a process of neural network object recognition of warpable jerseys or clothing with multiple attributes. In the illustrated implementation, process 500 may include one or more operations, functions or actions as illustrated by one or more of operations 502 to 512 numbered evenly. By way of non-limiting example, process 500 may be described herein with reference to example  image processing systems  400 and 1900, 2000, or 2100 of FIGS. 4 and 19-21 respectively, and where relevant.
As a preliminary matter, the example objects being detected in the methods provided herein are athletic jerseys with an identifier that is a one- or two-digit jersey number that identifies the player wearing the jersey, such as 0 to 9 or 00 to 99. In other alternatives, the identifiers could be any other one or more numbers, letters, or symbols on any clothing, whether upper-body clothing (pull-over, button-down, or zippered, for example) or lower-body clothing such as pants or shorts, and so forth, that shifts or warps (also referred to as deforming or distorting) such that the shape of the clothing, and in turn the identifier, changes depending on the pose and/or motion of the person wearing the clothing. By another option, the identifiers could include pictures or logos specific to the person wearing the identifier. By one form, a player number is used as the identifier when the event being captured is a team sport such as soccer, American football, baseball, basketball, hockey, rugby, or cricket, or any individual sport or event where numbers are worn by participants on warpable clothing or materials, such as racing, track or cross-country running, marathons, triathlons, biking, skiing, organized walks, and so forth. The player number is usually large relative to the player and other objects on the event field, making it relatively easy for an object detection program to detect and recognize.
Also, the pose and attribute data may be obtained from pre-formed databases of customized collections and/or standard known collections. Otherwise, whether for training or to record an actual event being analyzed, cameras may be used to capture the images of poses or  attributes in clothing or jerseys mentioned herein, and in this case, may be from a single camera or a multi-camera system used to record events, such as at athletic events, and may capture athletes in images from a number of different perspectives. The images then can be used to create images from virtual camera viewpoints permitting the athletes or events on the field to be viewed from just about any desired angle. By one form, 30, 36, or 38 cameras have been known to be mounted around a field at a stadium, arena, or other venue that holds an event that is, or can be, viewed on still photo or video displays, such as live television or later on recorded videos. By one form, the cameras may provide 5K resolution images. In the present methods, while image data from camera arrays may be used to match images and form 3D models or 3D spaces to generate virtual views, such 3D reconstruction is not always necessary for the jersey or clothing object detection itself described herein (although such jersey detection subsequently may be used for such 3D reconstruction) . 2D images of the clothing, attributes, identifier, and poses are adequate for the generation of the training dataset as well as the actual identifier object detection.
While referring to FIG. 6, process 500 may include “generate jersey image data by synthesizing jersey attributes and poses” 502. As mentioned, this involves having the jersey synthesis section 402, or here 600, obtain images or descriptions of clothing attributes such as a jersey or other clothing template 412 to obtain the shape of the jersey. This may be an image of 2D data that shows a flat 2D jersey, or may be a 3D picture or model of the jersey in a full frontal pose, albeit still provided as 2D image data. The images may be obtained from a database of “clean” templates that only show available jersey (or clothing) style shapes without any other texture details and may be provided in a single uniform color such as white or some gray-scale value. In other examples, the template images may be obtained from anywhere including the internet, actual athletic events, and so forth, where only the boundary of the clothing or jersey is kept and any other details in the obtained image are ignored. The template images could be collected manually by a person viewing the images or by known automatic object detection algorithms, or otherwise, known libraries or databases of such images may be used, and the images may be collected from known sports organizations. A collection of pattern images 414 for the jerseys or clothing may be obtained in the same way, where checks, stripes, and plain (or no) patterns are shown here.
The patterns may be formed by the fabric or texture of the clothing or jersey, or may be a design overlaid onto the clothing or jersey. These pattern images 414 may or may not include the shape of the jersey. A collection of font images 416 also may be obtained, including images of fonts showing different available identifiers such as numbers (or alternatively letters, symbols, and so forth when desired) . Each number or identifier may be provided in a different font or style. Another attribute here is a color scheme, which may or may not be provided as images. Such color data could be provided simply as a list of available RGB color combination values and an indication of which parts of the jersey or clothing are to have which color, such as blue for the fabric of the jersey and gray for the identifier number on the jersey. Many different color schemes on the clothing are possible as desired.
The attribute synthesizer unit 410 then combines all of the attributes to generate attribute images 420 of synthesized jerseys or clothing, also referred to as feature extraction inputs. The attribute synthesizer unit 410 creates a different combination of attributes for each attribute image, and the images are 2D flat or full forward views (also referred to as standard) images of the jersey or clothing. The attribute images 420, therefore, provide a base or start position for the warping described below as well as provide the input for feature extraction.
In one example, 69 stylized digit fonts, eight different stripe styles, and 18 pre-determined color schemes are used to form the attribute images. With the typical range of 1-99 for jersey numbers, providing all possible attribute combinations results in 99 x 8 x 69 x 18 = 983,664 attribute images 420, each showing a unique flat, forward (or standard) synthetic jersey image to cover the various jersey styles.
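As a simple illustration of the combination count above, the following Python sketch enumerates every attribute combination; the render_fn callable is a hypothetical placeholder for the synthesis step performed by the attribute synthesizer unit 410, not a function defined herein.

    import itertools

    JERSEY_NUMBERS = [str(n) for n in range(1, 100)]  # 99 jersey numbers
    STRIPE_STYLES = range(8)                          # 8 stripe styles
    DIGIT_FONTS = range(69)                           # 69 stylized digit fonts
    COLOR_SCHEMES = range(18)                         # 18 color schemes

    def synthesize_all(render_fn):
        # render_fn(number, stripe, font, colors) -> one standard attribute image.
        # Yields 99 * 8 * 69 * 18 = 983,664 synthetic jersey images in total.
        for combo in itertools.product(JERSEY_NUMBERS, STRIPE_STYLES,
                                       DIGIT_FONTS, COLOR_SCHEMES):
            yield render_fn(*combo)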
In addition to the attribute jersey synthesis, the present process 500 factors in the natural and realistic look of the clothing or jerseys when a person wearing the jersey or clothing is in different poses. A change from a flat, full forward view of the clothing to a natural or realistic look of the jersey as the clothing is worn in different poses is a visible and measurable warping (or deformation or distortion) of the clothing, forming pose data. Thus, the warped positions of the clothing can be factored into the training dataset by generating features (or feature maps) that depend partly on both the pose data and the attribute images. This results in the pose data acting as a target teaching template that guides the standard synthesized jersey of the attribute image through corresponding deformation changes, which can be represented in the resulting pose plus attribute features (or feature maps, vectors, or other output structures) .
This is accomplished by converting the flat or forward-facing attribute images 420 into realistic poses that warp the clothing or jersey by providing features (or feature vectors) that are associated with both the flat attribute images 420 and the pose data 425. Specifically, the TPWR (Target Postural Warping upper-body clothes Representation) unit 408 obtains pose images 423 to represent the target postural warped arrangement of clothing or jerseys. By one form, the images 423 are from a known dataset including 16,253 different poses collected by Han, X., et al., “An Image-Based Virtual Try-on Network” , arXiv preprint arXiv:1711.08447 (2017) .
The TPWR unit 408 also has a key-point generator 422 that generates key-points 421 to represent each pose. By one form, it has been found that five key-points (neck, left shoulder, right shoulder, left hip, right hip) 421 of the person wearing the clothes adequately represent the pose (or warp) of the clothing; they are indicated as pixel locations, each on a separate image map (or channel) for feature extraction neural network input. The key-points may be provided by the known pose collection or may be detected by manual labeling or automatic algorithms such as OpenPose. More or fewer key-points, or key-points in different locations, could be used instead, as long as the number of key-points, and in turn the number of channels, does not require an unnecessarily large computational load that makes the system inefficient.
The TPWR unit 408 also provides a clothing shape mask unit 424 that generates a 1-channel (single image) binary feature map or mask 419 of the posed (or warped) clothing where, for example, 1 indicates a pixel location of the clothing and 0 indicates background (not clothing) instead. The mask may be formed by manual labeling or algorithms such as known parsing or segmentation algorithms.
With both the pose key-points 421 and pose mask 419, the system 400, and specifically the postural warping transform network 404, learns the key characteristics (e.g. shape, spatial deformation) that indicate texture diversity of real-world clothes along with posture (or warp) changes, thereby providing the attribute images with more realistic warped (or posed) arrangements.
When needed, this operation also may include any preliminary pre-processing operations to generate any of the images mentioned above for jersey synthesis and pose data generation, in sufficient form and format to be used for the training dataset generation, such as mosaicing, denoising, color correction, and so forth. By one form, the attribute images 420 and the pose data 425 are provided in the form of input channels for feature extraction that are uniformly cropped to regions of interest (ROIs) containing the clothing and identifiers, key-points, and masks, of about 192 x 256 pixels, as one example.
While referring to FIGS. 4 and 7A-7B now, process 500 may include “extract features representing jersey images with and without poses” 504. Particularly, the feature extraction unit 426 of the postural warping transform network section 404 has a pose plus attribute neural network 700 for the pose plus attribute feature extraction network unit 428 and an attribute neural network 702 for the attribute feature extraction network unit 430. By one example form, the pose plus attribute neural network 700 may receive nine channels of input data including three attribute (RGB) channels from the attribute images 420, one channel for the mask 419, and five channels with one channel each for one of the key-points. By one example, each key-point channel is an image with all 0s except for a 1 for the location of the key-point. The nine channels are collectively referred to as pose data p.
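For illustration only, the nine-channel pose data p might be assembled as in the following sketch; the tensor layout, crop size, and example key-point coordinates are assumptions rather than the implementation of the present system:

```python
import torch

H, W = 256, 192  # example crop size (height x width)

# Assumed inputs: attribute image, clothing mask, and five key-points.
attr_rgb = torch.rand(3, H, W)          # attribute image 420 (RGB channels)
mask = torch.zeros(1, H, W)             # binary clothing mask 419
keypoints = [(20, 96), (30, 60), (30, 132), (180, 70), (180, 122)]  # (y, x)

kp_maps = torch.zeros(5, H, W)          # one channel per key-point
for c, (y, x) in enumerate(keypoints):
    kp_maps[c, y, x] = 1.0              # all zeros except the key-point location

p = torch.cat([attr_rgb, mask, kp_maps], dim=0)   # pose data p: 9 x H x W
j = attr_rgb                                      # attribute data j: 3 x H x W
```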
By one form, the attribute neural network 702 receives the three RGB channels of the attribute images 420 alone, and this input is referred to as attribute data j. It also should be noted that since the attribute image channels indicate an identifier, such as a jersey or clothing number, letter, symbol, or logo, to name a few examples, this is considered at least partially semantic object recognition of a semantic feature. In order to efficiently leverage the semantic feature, the neural networks 700 and 702 are customized to respectively use the pair-wise inputs (p, j) to extract representative features of the TPWR (the pose data 425) and the standard synthetic jersey or clothing images (the attribute images 420) .
The two feature extraction networks 700 and 702 have similar or the same structure. Both may have four down-sampling convolutional blocks C0, C1, C2, and C3 with a similar or the same structure. The down-sampling block C0 may have two convolution layers, each with 3 x 3 kernels but with different stride sizes, where the first convolutional layer of the two uses a stride of one and the other uses a stride of two, or vice-versa. The total filter number (or number of stacked kernels or channels) for down-sampling block C0 is 64, and block C0 forms 96 x 128 resolution intermediate feature maps for each channel. The down-sampling blocks C1, C2, and C3 form 48 x 64, 24 x 32, and 12 x 16 resolution intermediate feature maps, respectively, and for each channel.
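A minimal sketch of such a C0 block follows, assuming PyTorch as the framework, that the stride-one layer comes first, and that each convolution is followed by batch-normalization and ReLU; none of these choices is mandated by the description above:

```python
import torch.nn as nn

# Down-sampling block C0: two 3x3 convolutions, strides 1 and 2, 64 filters.
# Input is 9 channels for network 700 (pose data p) or 3 channels for 702 (j).
def make_c0(in_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
        nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
    )  # a 192 x 256 input becomes 96 x 128 intermediate feature maps
```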
In order to improve the performance of semantic (and pose or warp) feature extraction in the down-sampling convolutional blocks, at least one, and here each, of the down-sampling convolutional blocks C1, C2, and C3 of both neural networks 700 and 702 also has a series 704 of residual sub-block modules 706. These sub-blocks have deformable convolutional layers and a squeeze-and-excitation sub-series (or side sequence of layers) to leverage both the spatial distribution (or geometric variations) of object features and the relative influence of channel-wise arrangements of the convolutions handling the spatial representations. By one example, the down-sampling blocks C1 and C2 each have two residual sub-blocks 706, and C3 may have three residual sub-blocks 706. Each series 704 ends with a convolution layer with a 3 x 3 kernel of stride two to down-scale the feature size. The number of filters (or channels) is 128, 256, and 512, respectively, for the down-sampling blocks C1, C2, and C3.
Referring to FIG. 8, each residual sub-block 706 comprises a series 800 of neural network layers including a combination of convolutional layers with 1 x 1 and 3 x 3 kernels with stride one, and a shortcut. By default, each convolutional kernel may be succeeded by batch-normalization and a ReLU. However, in order to further improve geometric variability of the residual sub-blocks versus that of the kernels of conventional convolutional layers, deformable convolutions may replace the 3 x 3 kernels with stride one to provide sampling over a broader range of feature levels. So arranged, the series 800 may have, in order, a first convolutional layer 802 with 1 x 1 kernels, a first deformable convolutional layer 804 with 3 x 3 kernels, a second convolutional layer 806 with 1 x 1 kernels, and a second deformable convolutional layer 808 with 3 x 3 kernels. These layers provide output to a scaling layer 820 that changes the channels to 12 x 16 by sampling. Thus, the output of the feature extraction neural networks will be a final feature map.
To improve the performance and provide more efficient supplemental information to identify an output feature map, a skip connection strategy can be used to add a squeeze-and-excitation mechanism. This involves also providing the output of the second convolutional 1 x 1 layer to a side series 810 with, in order, a global pooling layer 812, two fully connected (FC) layers 814 and 816, and a Sigmoid layer 818. The squeeze and excitation of the skip connection series 810 provides efficient modeling of the channel-wise relationships among the spatial convolutions. The side path or series 810 with the skip connection works with the main pipeline by providing two different data outputs to the scale unit 820, one from deformable layer 808 and one from the side Sigmoid layer 818. The side path 810 uses lower layer features and computes global information. The output of the side path 810 is then used as weights on the output from the deformable layer 808 on the main pipeline. The output of both of the neural networks 700 and 702 are features in the form of extracted feature maps of 12 x 16 with 512 channels. It will be understood that different output arrangements could be formed instead.
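A hedged sketch of one residual sub-block follows; the channel count, squeeze-and-excitation reduction ratio, and normalization placement are assumptions, and torchvision's DeformConv2d, driven by a plain convolution that predicts its sampling offsets, stands in for the deformable convolutional layers 804 and 808:

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformConv3x3(nn.Module):
    """3x3 deformable convolution whose sampling offsets are predicted
    by an ordinary 3x3 convolution (a common way to drive DeformConv2d)."""
    def __init__(self, channels):
        super().__init__()
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform = DeformConv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return self.deform(x, self.offset(x))


class ResidualSESubBlock(nn.Module):
    """Sketch of residual sub-block 706: 1x1 convolutions (802, 806),
    deformable 3x3 convolutions (804, 808), and a squeeze-and-excitation
    side series (810) whose sigmoid output weights the main-path output."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                   nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.def1 = nn.Sequential(DeformConv3x3(channels),
                                  nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                   nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.def2 = DeformConv3x3(channels)
        # Side series 810: global pooling 812, FC layers 814/816, Sigmoid 818.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        y = self.conv2(self.def1(self.conv1(x)))    # main path up to layer 806
        w = self.se(y).unsqueeze(-1).unsqueeze(-1)  # channel weights from 810
        out = self.def2(y) * w                      # weight output of layer 808
        return out + x                              # residual shortcut
```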
Process 500 may include “generate representative spatial transform parameters of individual combinations of jersey attributes and poses” 506. Here, the output features from the feature extraction unit 426 are input to the spatial transform regression unit 432. The two feature maps, one representing TPWR synthetic poses (or warps) and synthesized attributes (including the semantics) , and the other representing the attributes alone, are combined into a single feature representation, such as a single feature map or tensor, and then regressed to learn or generate spatial transformation parameters θ that indicate an estimated transform from a flat or forward view of the clothing or jersey warped into a particular pose.
Referring to FIGS. 9-10, and to accomplish these operations, the feature correlation network unit 434 (FIG. 4) may operate a correlation layer block C4 to combine the two feature maps of a particular attribute combination and pose into a single high-level abstract feature. Here, the similarities of the pair-wise input features (p, j) are found by correlation, and block C4 combines the independent features into one feature tensor.
Specifically, given two feature maps F_A, F_B ∈ R^(w x h x c), where w, h, and c represent the width, height, and channel number of the extracted features, the output of the correlation layer C4, here C_AB, comes from a matrix multiplication of the pair-wise feature maps F_A, F_B as:

C_AB (i, j, k) = f_B (i, j) ^T f_A (i_k, j_k)     (1)

where C_AB ∈ R^(w x h x (w x h)), (i, j) and (i_k, j_k) indicate the individual feature positions in the w x h feature maps, and k = h (j_k - 1) + i_k is an auxiliary indexing variable for (i_k, j_k). For each individual pair-wise feature f_A ∈ F_A and f_B ∈ F_B, a particular position (i, j) of the correlation layer output C_AB has all of the similarities between f_B (i, j) ^T and all of the features of F_A. The input to the correlation layer from each of the two feature extraction networks 700 and 702 is of size 12 x 16 x 512, and the output dimension of the correlation layer C4 is 12 x 16 x 192.
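For illustration, the correlation of Eq. (1) can be computed with a batched matrix multiplication; the following sketch assumes a particular tensor layout and variable naming that is not specified above:

```python
import torch

def correlation_layer(feat_a, feat_b):
    """Correlation block C4. feat_a, feat_b: (B, C, H, W), e.g. (B, 512, 12, 16).
    Returns (B, H, W, H*W), where each position (i, j) holds the similarities
    between f_B(i, j) and every feature of F_A, per Eq. (1)."""
    b, c, h, w = feat_a.shape
    fa = feat_a.view(b, c, h * w)                    # (B, C, H*W)
    fb = feat_b.view(b, c, h * w).transpose(1, 2)    # (B, H*W, C)
    corr = torch.bmm(fb, fa)                         # (B, H*W, H*W) dot products
    return corr.view(b, h, w, h * w)                 # e.g. 12 x 16 x 192 per sample

# corr = correlation_layer(torch.rand(1, 512, 12, 16), torch.rand(1, 512, 12, 16))
```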
Following the correlation layer block C4, a regression neural network 902 may be used to learn the spatial transform parameters. The regression neural network may have blocks C5, C6, C7, and C8 to predict the spatial transformation parameters θ. The output channel resolutions or dimensions of blocks C5, C6, C7, and C8 are 6 x 8, 3 x 4, 3 x 4, and 1 x 1, respectively. The composition of blocks C5, C6, and C7 may have a structure similar to that of the feature extraction network blocks C1, C2, and C3. Here, however, the number of filters (or channels) for blocks C5, C6, and C7 is 512, 256, and 128, respectively. Also here, block C7 ends with an additional convolutional layer with 3 x 3 kernels of stride one, and the down-sampling is doubled in blocks C5 and C6 so that C5 and C6 end with two convolutional layers having 3 x 3 kernels with stride two.
Block C8 is an output block and includes a series 1000 of layers including, in order, a BatchNorm layer 1002, a dropout layer 1004, an FC layer 1006, and another BatchNorm layer 1008. The output block C8 predicts θ, which holds x and y coordinate offsets of warping grids as explained below. To handle large spatial deformation and preserve more textural details of the target clothing or jersey, a 7 x 7 warping grid is used, so that θ has a size of 2 × 7 × 7 = 98. Thus, the output dimension of the FC layer 1006 in block C8 is 98, with one offset provided for each of the x and y coordinates of each grid point, such as center point (x, y) 1104 described below. The BatchNorm layer 1008 does not change the size of the output.
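A sketch of such an output head is shown below; the flattened input size (128 channels at a 3 x 4 resolution from block C7) and the dropout rate are assumptions:

```python
import torch.nn as nn

class OutputBlockC8(nn.Module):
    """Output block C8: BatchNorm -> Dropout -> FC -> BatchNorm, predicting
    theta as 2 x 7 x 7 = 98 offsets (x and y per warping-grid control point)."""
    def __init__(self, in_channels=128, in_h=3, in_w=4, p_drop=0.5):
        super().__init__()
        self.bn_in = nn.BatchNorm2d(in_channels)             # layer 1002
        self.drop = nn.Dropout(p_drop)                       # layer 1004
        self.fc = nn.Linear(in_channels * in_h * in_w, 98)   # layer 1006
        self.bn_out = nn.BatchNorm1d(98)                     # layer 1008

    def forward(self, x):                                    # x: (B, 128, 3, 4)
        x = self.drop(self.bn_in(x)).flatten(1)
        theta = self.bn_out(self.fc(x))                      # (B, 98)
        return theta.view(-1, 2, 7, 7)                       # x/y offsets, 7x7 grid
```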
Referring to FIGS. 11A-11B and 12, process 500 may include “generate warped jersey images using the parameters” 508. By one example, the warping unit 438 may have a grid generation unit 904 and a grid sampling unit 906 (FIG. 9) . The grid generation unit 904 performs grid generation operations 1202 that use the transform parameters θ to form a warped grid 1102, and the grid sampling unit 906 performs grid sampling operations 1204 that apply the warped grid (and in turn, the transform parameters θ) to a sample of the jersey or clothing, such as in one of the attribute images j 420, to generate a final synthetically warped or postural jersey or clothing image 442.
Specifically, a thin-plate spline (TPS) warping technique may be used to apply the transformation parameters θ from the transform regression network to the synthetic jersey or clothing data of the attribute images. TPS is a parametric technique that uses 2D interpolation based on a set of known corresponding control points. It interpolates a surface that passes through each control point. A grid A_a 1100 (FIG. 11A) is a 2 x 2 block grid of 3 x 3 vertices or contour points, and is defined or indexed by its center point 1104 of a jersey or clothing image shown in an original non-warped shape, where ‘a’ indicates a certain pixel area of the image covered by the grid. A grid B_a 1102 (FIG. 11B) corresponds to the original grid 1100 after being warped by transform parameters θ, and shows the transform, or warping, of at least one contour point 1106. Let Z_a = B_a - A_a be the difference between the two grids, representing the transform parameters θ. Then, a TPS is fit over points (a_ix, a_iy, z_i) to get an interpolation function F for translation of points of B, where (a_ix, a_iy) ∈ A_a, z_i ∈ Z_a, i ∈ [1, 2, ..., k], and k is the number of contour points.
By this example, a 7 × 7 input grid of control points is used for the computations, but other sizes could be used, and the computation grid is a different size than the output warping or transform grids A and B (1100 and 1102 of FIGS. 11A-11B) . After obtaining the spatial transformation parameters θ (parameters of 7 x 7 = 49 contour points) , the grid generation unit 904 will generate a TPS grid F_θ (G) from θ. Then, the grid sampling unit 906 applies the TPS grid F_θ (G) to one of the sample flat synthesized clothing or jersey attribute images j to generate the three RGB channels of a resulting final warped image F_θ (j) .
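The grid-generation and grid-sampling steps can be illustrated as in the following sketch; for brevity it densifies the 7 x 7 control-point offsets by bilinear interpolation rather than fitting a true thin-plate spline, so it is a simplified stand-in for the TPS formulation, and the normalized-coordinate convention is an assumption:

```python
import torch
import torch.nn.functional as F

def warp_attribute_image(j, theta):
    """j: (B, 3, H, W) flat attribute image; theta: (B, 2, 7, 7) control-point
    offsets in normalized [-1, 1] coordinates. Returns the warped image."""
    b, _, h, w = j.shape
    # Identity sampling grid over the image (grid generation step 1202).
    base = F.affine_grid(
        torch.eye(2, 3).unsqueeze(0).repeat(b, 1, 1), (b, 3, h, w),
        align_corners=False)                             # (B, H, W, 2)
    # Densify the sparse offsets (bilinear stand-in for the TPS surface).
    dense = F.interpolate(theta, size=(h, w), mode="bilinear",
                          align_corners=False)           # (B, 2, H, W)
    grid = base + dense.permute(0, 2, 3, 1)              # warped sampling grid
    # Grid sampling step 1204: sample the flat image through the warped grid.
    return F.grid_sample(j, grid, align_corners=False)

# warped = warp_attribute_image(torch.rand(1, 3, 256, 192), torch.zeros(1, 2, 7, 7))
```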
Process 500 then may include “evaluate warped jersey images using a loss function” 510. To better ensure the accuracy of the system, model training may be performed with a particular loss function. Specifically, in order to better ensure the present method and system provide an end-to-end diversified postural synthetically warped jersey or clothing generation solution based on CNN, the network generation of the training dataset here should be supervised and predictable. The process 500, as mentioned above, provides a neural network training dataset generator that has the capability of learning from the pair-wise input data to predict a warped image Ĵ, where J_gt is ground truth. Therefore, a loss function L may be gauged to compare the discrepancy between J_gt and the synthetic warped Ĵ. Without loss of generality, the network may be trained with a pixel-wise L1 loss:

L_L1 = || Ĵ - J_gt ||_1     (2)

The output warped image may be an RGB image, and a pixel-wise L1 loss tends to incur a blurry effect in the final output, which will lead the network regression to an over-smooth blur. Therefore, a mean structural similarity index (MSSIM) loss method may be adopted as another component of the loss function to obtain more structural similarity and perceptual motivation. By one form, the MSSIM loss is described here as:

L_MSSIM = 1 - (1/M) Σ_{j=1...M} SSIM (x_j, y_j)     (3)

where M is the number of sliding windows, and x_j, y_j are the pixel contents of the jth patch of Ĵ and J_gt, respectively. The combination loss function can be written as Eq. 4 below, where λ_1 and 1-λ_1 represent the weights of the corresponding losses, respectively. Experiments show that λ_1 = 0.64 obtains the best training result. Furthermore, the general optimizer Adam has β_1 = 0.5, β_2 = 0.999, and a learning rate of 0.001.

L = λ_1 L_L1 + (1 - λ_1) L_MSSIM     (4)
This combined loss function with SSIM avoids or reduces blur in the output. It will be appreciated that the loss function is not limited to L1 and should include at least two weighted normalization algorithms, where one of the algorithms is a structural similarity index (SSIM) loss.
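As a hedged illustration only, the combined loss of Eq. (4) and the stated Adam settings might be implemented as in the following sketch; the uniform averaging window used for SSIM (rather than a Gaussian window) and the model placeholder are assumptions:

```python
import torch
import torch.nn.functional as F

def ssim(x, y, win=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """Mean SSIM over uniform sliding windows (a common simplification)."""
    pad = win // 2
    mu_x = F.avg_pool2d(x, win, 1, pad)
    mu_y = F.avg_pool2d(y, win, 1, pad)
    var_x = F.avg_pool2d(x * x, win, 1, pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, win, 1, pad) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, win, 1, pad) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).mean()

def combined_loss(pred, gt, lam=0.64):
    """Eq. (4): weighted pixel-wise L1 plus MSSIM loss."""
    l1 = torch.abs(pred - gt).mean()
    mssim = 1.0 - ssim(pred, gt)
    return lam * l1 + (1.0 - lam) * mssim

# optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.5, 0.999))
```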
Process 500 may include “provide warped jersey images for training dataset of object recognition neural network to recognize identifiers on the jerseys” 512. The resulting training dataset of images of synthetically warped clothing or jerseys can then be used directly as input to train jersey number (or other clothing identifier) object recognition neural networks either alone or with other training datasets so that the training dataset generation is an end-to-end solution.
Experimental Results
To demonstrate the effectiveness of the present method with jersey number detection, several experiments were conducted using the same test benchmarks (or input datasets with jersey numbers to be detected) while using a previously developed jersey detection neural network. This neural network should not give an advantage to one of the training datasets over the other based on the neural network structure alone. The neural network should perform object semantic feature extraction, and may use bounding boxes and label regression (or regional networks and classifications) .
Training dataset
Two different jersey number detection training datasets were developed. A first typical training dataset was generated by finding 125,750 images with visible jersey numbers. These images were collected and cropped from the internet and actual athletic game clips. The second training dataset is a synthetic warped training dataset generated using the proposed method disclosed herein. The present method generated 983,664 different jersey images. FIGS. 13-18 (below) show three samples of the generated jersey images including different stripe styles, digit fonts, and color configurations.
Test benchmarks
The two benchmarks were used to evaluate accuracy as affected by the training datasets, and each benchmark includes testing images to be analyzed to detect the jersey number in the images. Typical Benchmark Part A includes 2,490 real-world images with visible jersey numbers obtained from the internet and/or actual athletic game clips. Benchmark Part B has 2,353 real-world images that are different in at least some significant ways, including stripe style, stylized digit font, and/or color configuration, compared to images in the typical Benchmark Part A, and should be difficult for the neural network to recognize.
Referring to FIGS. 13-18, image samples 1300, 1500, and 1700 to be analyzed for the presence of a jersey number are the type of samples provided in the Benchmark Part B image set. These types of styles do not usually exist in the typical training dataset. However, images 1400, 1600, and 1800 are relatively close warped images in the synthetic warped training dataset that permit recognition of the jersey numbers in image samples 1300, 1500, and 1700, respectively, and those jersey numbers would most likely not be recognized from use of the typical training dataset alone.
For the experiment, the two training datasets were used to train models M1 and M2. M1 was trained only from the typical dataset, and M2 was trained using both the typical and synthetic warped datasets. Precision and recall techniques were used to measure performance. Precision is the fraction of correctly predicted jersey images among all predicted results, and recall is the fraction of the total number of relevant jersey images that are predicted correctly.
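For reference, these metrics reduce to the usual counts of true positives, false positives, and false negatives; the counts in the comment below are illustrative placeholders, not experimental values:

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision: correct predictions among all predictions.
    Recall: correct predictions among all relevant (ground-truth) items."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Example: 992 correct detections, 3 spurious detections, 8 missed numbers.
# precision, recall = precision_recall(992, 3, 8)
```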
As shown in Table 1, both M1 and M2 achieve very good results for Benchmark Part A. However, when evaluating for Benchmark Part B (Table 2) , the M2 model maintains similar accuracy (>99% precision and recall) , while the M1 model performance drops significantly, to about 50% of the accuracy of the M2 model trained with the synthetically warped training dataset. Thus, the postural warped synthetic jersey data generation solution can substantially improve the generalization capabilities of jersey (or clothing) number recognition, and therefore enhance the robustness of player tracking systems as well as athlete or person tracking for many other systems.
Table 1. Performance comparison on Benchmark Part A

  Benchmark Part A    Precision    Recall
  M1 model            99.72%       99.24%
  M2 model            99.78%       99.37%

Table 2. Performance comparison on Benchmark Part B

  Benchmark Part B    Precision    Recall
  M1 model            53.76%       48.33%
  M2 model            99.66%       99.19%
It will be appreciated that the  processes  300 and 500 respectively explained with FIGS. 3 and 5 do not necessarily have to be performed in the order shown, nor with all of the operations shown. It will be understood that some operations may be skipped or performed in different orders.
Also, any one or more of the operations of FIGS. 3 and 5 may be undertaken in response to instructions provided by one or more computer program products. Such program products may include signal bearing media providing instructions that, when executed by, for example, a processor, may provide the functionality described herein. The computer program products may be provided in any form of one or more machine-readable media. Thus, for example, a processor including one or more processor core (s) may undertake one or more of the operations of the example processes herein in response to program code and/or instructions or instruction sets conveyed to the processor by one or more computer or machine-readable media. In general, a machine-readable medium may convey software in the form of program code and/or instructions or instruction sets that may cause any of the devices and/or systems to perform as described herein. The machine or computer readable media may be a non-transitory article or medium, such as a non-transitory computer readable medium, and may be used with any of the examples mentioned above or other examples except that it does not include a transitory signal per se. It does include those elements other than a signal per se that may hold data temporarily in a “transitory” fashion such as RAM and so forth.
As used in any implementation described herein, the term “module” refers to any combination of software logic, firmware logic and/or hardware logic configured to provide the functionality described herein. The software may be embodied as a software package, code and/or instruction set or instructions, and “hardware” , as used in any implementation described herein, may include, for example, singly or in any combination, hardwired circuitry, programmable circuitry, state machine circuitry, and/or firmware that stores instructions executed by programmable circuitry. The modules may, collectively or individually, be embodied as circuitry that forms part of a larger system, for example, an integrated circuit (IC) , system on-chip (SoC) , and so forth. For example, a module may be embodied in logic circuitry for the implementation via software, firmware, or hardware of the coding systems discussed herein.
As used in any implementation described herein, the term “logic unit” refers to any combination of firmware logic and/or hardware logic configured to provide the functionality described herein. The logic units may, collectively or individually, be embodied as circuitry that forms part of a larger system, for example, an integrated circuit (IC) , system on-chip (SoC) , and  so forth. For example, a logic unit may be embodied in logic circuitry for the implementation firmware or hardware of the coding systems discussed herein. One of ordinary skill in the art will appreciate that operations performed by hardware and/or firmware may alternatively be implemented via software, which may be embodied as a software package, code and/or instruction set or instructions, and also appreciate that logic unit may also utilize a portion of software to implement its functionality.
As used in any implementation described herein, the term “component” may refer to a module or to a logic unit, as these terms are described above. Accordingly, the term “component” may refer to any combination of software logic, firmware logic, and/or hardware logic configured to provide the functionality described herein. For example, one of ordinary skill in the art will appreciate that operations performed by hardware and/or firmware may alternatively be implemented via a software module, which may be embodied as a software package, code and/or instruction set, and also appreciate that a logic unit may also utilize a portion of software to implement its functionality.
Referring to FIG. 19, an example image processing system 1900 is arranged in accordance with at least some implementations of the present disclosure. In various implementations, the example image processing system 1900 only performs warped clothing or jersey training dataset generation as described above, and has logic units 1904 for those purposes. By other alternatives, system 1900 may have one or more imaging devices 1902 to form or receive captured image data, and this may include either one or more cameras such as an array of cameras around an athletic field, stage or other such event locations. Thus, in one form, the image processing system 1900 may be a digital camera or other image capture device that is one of the cameras in an array of the cameras. In this case, the imaging device (s) 1902 may be the camera hardware and camera sensor software, module, or component. In other examples, imaging processing system 1900 may have an imaging device 1902 that includes, or may be, one camera or some or all of the cameras in the array, and logic modules 1904 may communicate remotely with, or otherwise may be communicatively coupled to, the imaging device 1902 for further processing of the image data. The camera (s) may be used to capture images for forming the training dataset or for performing run-time recognition of clothing or jerseys as described above.
Accordingly, the part of the image processing system 1900 that holds the logic units 1904 and that processes the images may be on one of the cameras or may be on a separate device included in, or entirely forming, the image processing system 1900. Thus, the image processing system 1900 may be a desktop or laptop computer, remote server, or mobile computing device such as a smartphone, tablet, or other device. It also could be or have a fixed function device such as a set top box (cable box or satellite box) , game box, or a television. The camera (s) 1902 may be wirelessly communicating, or wired to communicate, image data to the logic units 1904.
In any of these cases, such technology may include a camera such as a digital camera system, a dedicated camera device, web cam, or any other device with a camera, a still camera and so forth for the run-time of the system as well as for model learning and/or image collection for generating predetermined personal image data. The cameras may be RGB cameras or RGB-D cameras, but could be YUV cameras. Thus, in one form, imaging device 1902 may include camera hardware and optics including one or more sensors as well as auto-focus, zoom, aperture, ND-filter, auto-exposure, flash, actuator controls, and so forth. By one form, the cameras may be fixed in certain degrees of freedom, or may be free to move in certain or all directions, as long as the position and optical axis from camera to camera is known so that the cameras can be registered to the same coordinate system.
The logic modules 1904 of the image processing system 1900 may include, or communicate with, an image unit 1906 that performs at least partial processing. Thus, the image unit 1906 may receive raw image data and perform pre-processing, decoding, encoding, and/or even post-processing to prepare the image data for transmission, storage, and/or display. It will be appreciated that the pre-processing performed by the image unit 1906 could be modules located on one or each of the cameras, a separate image processing unit 1900, or other location.
In the illustrated example, the logic modules 1904 also may include an object detection unit 1930 which may have an object detection training dataset generation unit 400 (FIG. 4) to generate the warped clothing or jersey images for training as described above. The object detection unit 1930 alternatively also may have an object detection neural network unit 1932 that is trained by using the warped images and that is used to perform the clothing or jersey object  detection during events to be analyzed such as athletic events. By other forms, the neural network unit 1932 is remote from the system 1900. A library or database may be provided on a memory store (s) 1914 to store any of the data related to the clothing or jersey object detection in a pose images database 1936, an attribute and identifier images database 1938, and a training dataset database, each storing image data related to its label and as described above. Any of these databases or others can be used to store intermediate and final output versions (feature vectors, spatial transform parameters, and so forth) of image data while generating the training dataset. A display controller 1908 may be provided to control a display 1916 to display any of the images mentioned herein.
These units may be operated by, or even entirely or partially located at, processor (s) 1910, such as the Intel Atom, and which may include a dedicated image signal processor (ISP) 1912, to perform many of the operations mentioned herein. The logic modules 1904 may be communicatively coupled to the components of the imaging device 1902 in order to receive raw image data when desired. The memory stores 1914 also may or may not hold other image data or logic units. An antenna 1920 may be provided to transmit or receive encoded data. In one example implementation, the image processing system 1900 may have at least one memory 1914 communicatively coupled to the processor 1910 to perform the operations described herein as explained above.
The image unit 1906, which may have an encoder and decoder, and antenna 1920 may be provided to compress and decompress the image data for transmission to and from other devices that may display or store the images. This may refer to transmission of image data among the cameras and the logic units 1904. Otherwise, the processed image 1918 may be displayed on the display 1916 or stored in memory 1914 for further processing as described above. As illustrated, any of these components may be capable of communication with one another and/or communication with portions of logic modules 1904 and/or imaging device 1902. Thus, processors 1910 may be communicatively coupled to both the image devices 1902 and the logic modules 1904 for operating those components. By one approach, although image processing system 1900, as shown in FIG. 19, may include one particular set of units or actions associated with particular components or modules, these units or actions may be associated with different components or modules than the particular component or module illustrated here.
Referring to FIG. 20, an example system 2000 in accordance with the present disclosure operates one or more aspects of the image processing system described herein. It will be understood from the nature of the system components described below that such components may be associated with, or used to operate, certain part or parts of the image processing systems described above including performance of a camera system operation described above. In various implementations, system 2000 may be a media system although system 2000 is not limited to this context. For example, system 2000 may be incorporated into a digital video camera, mobile device with camera or video functions such as an imaging phone, webcam, personal computer (PC) , remote server, laptop computer, ultra-laptop computer, tablet, touch pad, portable computer, handheld computer, palmtop computer, personal digital assistant (PDA) , cellular telephone, combination cellular telephone/PDA, television, smart device (e.g., smart phone, smart tablet or smart television) , mobile internet device (MID) , messaging device, data communication device, and so forth.
In various implementations, system 2000 includes a platform 2002 coupled to a display 2020. Platform 2002 may receive content from a content device such as content services device (s) 2030 or content delivery device (s) 2040 or other similar content sources. A navigation controller 2050 including one or more navigation features may be used to interact with, for example, platform 2002 and/or display 2020. Each of these components is described in greater detail below.
In various implementations, platform 2002 may include any combination of a chipset 2005, processor 2010, memory 2012, storage 2014, graphics subsystem 2015, applications 2016 and/or radio 2018. Chipset 2005 may provide intercommunication among processor 2010, memory 2012, storage 2014, graphics subsystem 2015, applications 2016 and/or radio 2018. For example, chipset 2005 may include a storage adapter (not depicted) capable of providing intercommunication with storage 2014.
Processor 2010 may be implemented as a Complex Instruction Set Computer (CISC) or Reduced Instruction Set Computer (RISC) processors; x86 instruction set compatible processors, multi-core, or any other microprocessor or central processing unit (CPU) . In various  implementations, processor 2010 may be dual-core processor (s) , dual-core mobile processor (s) , and so forth.
Memory 2012 may be implemented as a volatile memory device such as, but not limited to, a Random Access Memory (RAM) , Dynamic Random Access Memory (DRAM) , or Static RAM (SRAM) .
Storage 2014 may be implemented as a non-volatile storage device such as, but not limited to, a magnetic disk drive, optical disk drive, tape drive, an internal storage device, an attached storage device, flash memory, battery backed-up SDRAM (synchronous DRAM) , and/or a network accessible storage device. In various implementations, storage 2014 may include technology to increase the storage performance enhanced protection for valuable digital media when multiple hard drives are included, for example.
Graphics subsystem 2015 may perform processing of images such as still or video for display. Graphics subsystem 2015 may be a graphics processing unit (GPU) or a visual processing unit (VPU) , for example, and may or may not include an image signal processor (ISP) . An analog or digital interface may be used to communicatively couple graphics subsystem 2015 and display 2020. For example, the interface may be any of a High-Definition Multimedia Interface, Display Port, wireless HDMI, and/or wireless HD compliant techniques. Graphics subsystem 2015 may be integrated into processor 2010 or chipset 2005. In some implementations, graphics subsystem 2015 may be a stand-alone card communicatively coupled to chipset 2005.
The graphics and/or video processing techniques described herein may be implemented in various hardware architectures. For example, graphics and/or video functionality may be integrated within a chipset. Alternatively, a discrete graphics and/or video processor may be used. As still another implementation, the graphics and/or video functions may be provided by a general purpose processor, including a multi-core processor. In further implementations, the functions may be implemented in a consumer electronics device.
Radio 2018 may include one or more radios capable of transmitting and receiving signals using various suitable wireless communications techniques. Such techniques may involve communications across one or more wireless networks. Example wireless networks include (but  are not limited to) wireless local area networks (WLANs) , wireless personal area networks (WPANs) , wireless metropolitan area network (WMANs) , cellular networks, and satellite networks. In communicating across such networks, radio 2018 may operate in accordance with one or more applicable standards in any version.
In various implementations, display 2020 may include any television type monitor or display. Display 2020 may include, for example, a computer display screen, touch screen display, video monitor, television-like device, and/or a television. Display 2020 may be digital and/or analog. In various implementations, display 2020 may be a holographic display. Also, display 2020 may be a transparent surface that may receive a visual projection. Such projections may convey various forms of information, images, and/or objects. For example, such projections may be a visual overlay for a mobile augmented reality (MAR) application. Under the control of one or more software applications 2016, platform 2002 may display user interface 2022 on display 2020.
In various implementations, content services device (s) 2030 may be hosted by any national, international and/or independent service and thus accessible to platform 2002 via the Internet, for example. Content services device (s) 2030 may be coupled to platform 2002 and/or to display 2020. Platform 2002 and/or content services device (s) 2030 may be coupled to a network 2060 to communicate (e.g., send and/or receive) media information to and from network 2060. Content delivery device (s) 2040 also may be coupled to platform 2002 and/or to display 2020.
In various implementations, content services device (s) 2030 may include a cable television box, personal computer, network, telephone, Internet enabled devices or appliance capable of delivering digital information and/or content, and any other similar device capable of unidirectionally or bidirectionally communicating content between content providers and platform 2002 and/display 2020, via network 2060 or directly. It will be appreciated that the content may be communicated unidirectionally and/or bidirectionally to and from any one of the components in system 2000 and a content provider via network 2060. Examples of content may include any media information including, for example, video, music, medical and gaming information, and so forth.
Content services device (s) 2030 may receive content such as cable television programming including media information, digital information, and/or other content. Examples of content providers may include any cable or satellite television or radio or Internet content providers. The provided examples are not meant to limit implementations in accordance with the present disclosure in any way.
In various implementations, platform 2002 may receive control signals from navigation controller 2050 having one or more navigation features. The navigation features of controller 2050 may be used to interact with user interface 2022, for example. In implementations, navigation controller 2050 may be a pointing device that may be a computer hardware component (specifically, a human interface device) that allows a user to input spatial (e.g., continuous and multi-dimensional) data into a computer. Many systems such as graphical user interfaces (GUI) , and televisions and monitors allow the user to control and provide data to the computer or television using physical gestures.
Movements of the navigation features of controller 2050 may be replicated on a display (e.g., display 2020) by movements of a pointer, cursor, focus ring, or other visual indicators displayed on the display. For example, under the control of software applications 2016, the navigation features located on navigation controller 2050 may be mapped to virtual navigation features displayed on user interface 2022, for example. In implementations, controller 2050 may not be a separate component but may be integrated into platform 2002 and/or display 2020. The present disclosure, however, is not limited to the elements or in the context shown or described herein.
In various implementations, drivers (not shown) may include technology to enable users to instantly turn on and off platform 2002 like a television with the touch of a button after initial boot-up, when enabled, for example. Program logic may allow platform 2002 to stream content to media adaptors or other content services device (s) 2030 or content delivery device (s) 2040 even when the platform is turned “off. ” In addition, chipset 2005 may include hardware and/or software support for 5.1 surround sound audio and/or high definition (7.1) surround sound audio, for example. Drivers may include a graphics driver for integrated graphics platforms. In implementations, the graphics driver may comprise a peripheral component interconnect (PCI) Express graphics card.
In various implementations, any one or more of the components shown in system 2000 may be integrated. For example, platform 2002 and content services device (s) 2030 may be integrated, or platform 2002 and content delivery device (s) 2040 may be integrated, or platform 2002, content services device (s) 2030, and content delivery device (s) 2040 may be integrated, for example. In various implementations, platform 2002 and display 2020 may be an integrated unit. Display 2020 and content service device (s) 2030 may be integrated, or display 2020 and content delivery device (s) 2040 may be integrated, for example. These examples are not meant to limit the present disclosure.
In various implementations, system 2000 may be implemented as a wireless system, a wired system, or a combination of both. When implemented as a wireless system, system 2000 may include components and interfaces suitable for communicating over a wireless shared media, such as one or more antennas, transmitters, receivers, transceivers, amplifiers, filters, control logic, and so forth. An example of wireless shared media may include portions of a wireless spectrum, such as the RF spectrum and so forth. When implemented as a wired system, system 2000 may include components and interfaces suitable for communicating over wired communications media, such as input/output (I/O) adapters, physical connectors to connect the I/O adapter with a corresponding wired communications medium, a network interface card (NIC) , disc controller, video controller, audio controller, and the like. Examples of wired communications media may include a wire, cable, metal leads, printed circuit board (PCB) , backplane, switch fabric, semiconductor material, twisted-pair wire, co-axial cable, fiber optics, and so forth.
Platform 2002 may establish one or more logical or physical channels to communicate information. The information may include media information and control information. Media information may refer to any data representing content meant for a user. Examples of content may include, for example, data from a voice conversation, videoconference, streaming video, electronic mail ( “email” ) message, voice mail message, alphanumeric symbols, graphics, image, video, text and so forth. Data from a voice conversation may be, for example, speech information, silence periods, background noise, comfort noise, tones and so forth. Control  information may refer to any data representing commands, instructions or control words meant for an automated system. For example, control information may be used to route media information through a system, or instruct a node to process the media information in a predetermined manner. The implementations, however, are not limited to the elements or in the context shown or described in FIG. 20.
Referring to FIG. 21, a small form factor device 2100 is one example of the varying physical styles or form factors in which  systems  1900 or 2000 may be embodied. By this approach, device 2100 may be implemented as a mobile computing device having wireless capabilities. A mobile computing device may refer to any device having a processing system and a mobile power source or supply, such as one or more batteries, for example.
As described above, examples of a mobile computing device may include a digital still camera, digital video camera, mobile devices with camera or video functions such as imaging phones, webcam, personal computer (PC) , laptop computer, ultra-laptop computer, tablet, touch pad, portable computer, handheld computer, palmtop computer, personal digital assistant (PDA) , cellular telephone, combination cellular telephone/PDA, television, smart device (e.g., smart phone, smart tablet or smart television) , mobile internet device (MID) , messaging device, data communication device, and so forth.
Examples of a mobile computing device also may include computers that are arranged to be worn by a person, such as a wrist computer, finger computer, ring computer, eyeglass computer, belt-clip computer, arm-band computer, shoe computers, clothing computers, and other wearable computers. In various implementations, for example, a mobile computing device may be implemented as a smart phone capable of executing computer applications, as well as voice communications and/or data communications. Although some implementations may be described with a mobile computing device implemented as a smart phone by way of example, it may be appreciated that other implementations may be implemented using other wireless mobile computing devices as well. The implementations are not limited in this context.
As shown in FIG. 21, device 2100 may include a housing with a front 2101 and a back 2102. Device 2100 includes a display 2104, an input/output (I/O) device 2106, and an integrated antenna 2108. Device 2100 also may include navigation features 2112. I/O device  2106 may include any suitable I/O device for entering information into a mobile computing device. Examples for I/O device 2106 may include an alphanumeric keyboard, a numeric keypad, a touch pad, input keys, buttons, switches, microphones, speakers, voice recognition device and software, and so forth. Information also may be entered into device 2100 by way of microphone 2114, or may be digitized by a voice recognition device. As shown, device 2100 may include a camera 2105 (e.g., including at least one lens, aperture, and imaging sensor) and a flash 2110 integrated into back 2102 (or elsewhere) of device 2100. The implementations are not limited in this context.
Various forms of the devices and processes described herein may be implemented using hardware elements, software elements, or a combination of both. Examples of hardware elements may include processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth) , integrated circuits, application specific integrated circuits (ASIC) , programmable logic devices (PLD) , digital signal processors (DSP) , field programmable gate array (FPGA) , logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth. Examples of software may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API) , instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Determining whether an implementation is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints.
One or more aspects of at least one implementation may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores” may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or  processor.
While certain features set forth herein have been described with reference to various implementations, this description is not intended to be construed in a limiting sense. Hence, various modifications of the implementations described herein, as well as other implementations, which are apparent to persons skilled in the art to which the present disclosure pertains are deemed to lie within the spirit and scope of the present disclosure.
The following examples pertain to additional implementations.
By an example one or more first implementations, a computer-implemented method of image processing, comprises obtaining image data of clothing having varying visible attributes and visible identifiers on the outer surface of the clothing; synthetically warping a version of the image data to form warped images of synthetically warped clothing; and training a neural network to recognize the identifiers by using the warped images.
By one or more second implementation, and further to the first implementation, wherein the clothing is athletic jerseys and the identifiers are jersey numbers on the athletic jerseys.
By one or more third implementations, and further to the first or second implementation, wherein synthetically warping comprises generating spatial transform parameters to be applied to clothing images formed by the image data.
By one or more fourth implementations, and further to any of the first to third implementation, wherein the method comprising combining attribute and identifier image data of the varying visible attributes and identifiers to generate attribute images each showing a different combination of the attributes and identifiers.
By one or more fifth implementations, and further to any of the first to third implementation, wherein the method comprising combining attribute and identifier image data of the varying visible attributes and identifiers to generate attribute images each showing a different combination of the attributes and identifiers; and generating pose data from images of at least one person wearing clothing in different poses.
By one or more sixth implementations, and further to any of the first to third implementation, wherein the method comprising combining attribute and identifier image data of the varying visible attributes and identifiers to generate attribute images each showing a different combination of the attributes and identifiers; generating pose data from images of at least one person wearing clothing in different poses; generating features from a first feature extraction neural network that receives both the attribute images and pose data; generating features from a second feature extraction neural network that receives the attribute images without receiving the pose data; and combining and regressing the features from the first and second feature extraction neural networks to form spatial transform parameters representing warping of the clothing.
By one or more seventh implementations, and further to any of the first to third implementation, wherein the method comprising combining attribute and identifier image data of the varying visible attributes and identifiers to generate attribute images each showing a different combination of the attributes and identifiers; generating pose data from images of at least one person wearing clothing in different poses; generating features from a first feature extraction neural network that receives both the attribute images and pose data; generating features from a second feature extraction neural network that receives the attribute images without receiving the pose data; combining and regressing the features from the first and second feature extraction neural networks to form spatial transform parameters representing warping of the clothing; and applying the spatial transform parameters to the attribute images to generate the warped images.
By one or more eighth implementations, and further to any of the first to seventh implementation, wherein the method comprising, evaluating whether the warped images meet a loss function criteria wherein the loss function includes at least two weighted normalization algorithms where one of the algorithms is a structural similarity index (SSIM) loss.
By one or more ninth implementations, a computer-implemented system comprising at least one memory; and at least one processor communicatively coupled to the memory and being arranged to operate by: obtaining image data of clothing having varying visible attributes and visible identifiers on the outer surface of the clothing; synthetically warping a version of the image data to form warped images of synthetically warped clothing; and training a neural network to recognize the identifiers by using the warped images.
By one or more tenth implementations, and further to the ninth implementation, wherein the clothing is athletic jerseys and the identifiers are numbers.
By one or more eleventh implementations, and further to any of the ninth or tenth implementation, wherein synthetically warping comprises generating spatial transform parameters to be applied to clothing images formed by the image data.
By one or more twelfth implementations, and further to any of the ninth or tenth implementation, wherein synthetically warping comprises generating spatial transform parameters to be applied to clothing images formed by the image data; and the at least one processor to operate by generating features from the image data of the clothing in various combinations of the attributes and identifiers and clothing on at least one person in different poses comprising using a neural network with layer blocks individually using residual sub-blocks each having a squeeze and excitation side path that provides weights to output of a main path; and using the features to form the spatial transform parameters.
By an example thirteenth implementation, at least one non-transitory computer readable medium comprising a plurality of instructions that in response to being executed on a computing device, causes the computing device to operate by: obtaining image data of clothing having varying visible attributes and visible identifiers on the outer surface of the clothing; synthetically warping a version of the image data to form warped images of synthetically warped clothing; and training a neural network to recognize the identifiers by using the warped images.
By one or more fourteenth implementations, and further to the thirteenth implementation, wherein the clothing is athletic jerseys and the identifiers are numbers.
By one or more fifteenth implementations, and further to the thirteenth or fourteenth implementation, wherein the instructions cause the computing device to operate by combining attribute and identifier image data of the varying visible attributes and identifiers to generate attribute images each showing a different combination of the attributes and identifiers; and generating pose data from images of at least one person wearing clothing in different poses.
By one or more sixteenth implementations, and further to any of the thirteenth to fourteenth implementation, wherein the instructions cause the computing device to operate by  combining attribute and identifier image data of the varying visible attributes and identifiers to generate attribute images each showing a different combination of the attributes and identifiers; generating pose data from images of at least one person wearing clothing in different poses; generating features from a first feature extraction neural network that receives both the attribute images and pose data; generating features from a second feature extraction neural network that receives the attribute images without receiving the pose data; and combining and regressing the features from the first and second feature extraction neural networks to form spatial transform parameters representing warping of the clothing.
By one or more seventeenth implementations, and further to any of the thirteenth to fourteenth implementation, wherein the instructions cause the computing device to operate by combining attribute and identifier image data of the varying visible attributes and identifiers to generate attribute images each showing a different combination of the attributes and identifiers; generating pose data from images of at least one person wearing clothing in different poses; generating features from a first feature extraction neural network that receives both the attribute images and pose data; generating features from a second feature extraction neural network that receives the attribute images without receiving the pose data; combining and regressing the features from the first and second feature extraction neural networks to form spatial transform parameters representing warping of the clothing; and applying the spatial transform parameters to the attribute images to generate the warped images.
By one or more eighteenth implementations, and further to any of the thirteenth to seventeenth implementation, wherein the instructions cause the computing device to operate by evaluating whether the warped images meet a loss function criteria that uses a weighted structural similarity index normalization and showing the similarity between the warped images and a measure of ground truth; and only using the warped images for training when the warped images meet the criteria.
By one or more nineteenth implementations, at least one non-transitory computer readable medium comprising a plurality of instructions that in response to being executed on a computing device, causes the computing device to operate by: obtaining attribute images of clothing having varying visible attributes and visible identifiers on the outer surface of the  clothing; obtaining pose data from images of clothing on at least one person in a variety of poses; extracting features representing the visible attributes and identifiers on clothing in various poses comprising inputting the attribute images and pose data into at least one neural network; and using the features to form warped images of the identifiers on clothing to use the warped images to train a neural network to recognize the identifiers.
By one or more twentieth implementations, and further to the nineteenth implementation, wherein the clothing is athletic jerseys and the identifier is a number.
By one or more twenty-first implementations, and further to the nineteenth or twentieth implementation, wherein the instructions cause the computing device to operate by inputting features from a first feature extraction neural network that receives both the attribute images and pose data.
By one or more twenty-second implementations, and further to any of the nineteenth to twentieth implementations, wherein the instructions cause the computing device to operate by inputting features from a first feature extraction neural network that receives both the attribute images and pose data; inputting the attribute images without the pose data into a second feature extraction neural network; and combining and regressing the features from the first and second feature extraction neural networks to form spatial transform parameters used to synthetically warp the attribute images to generate the warped images.
By one or more twenty-third implementations, and further to any of the nineteenth to twentieth implementations, wherein the instructions cause the computing device to operate by inputting features from a first feature extraction neural network that receives both the attribute images and pose data; inputting the attribute images without the pose data into a second feature extraction neural network; and combining and regressing the features from the first and second feature extraction neural networks to form spatial transform parameters used to synthetically warp the attribute images to generate the warped images, wherein the training comprises using the warped images as input to the neural network.
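The following minimal sketch illustrates, under assumed names and an assumed 0-99 jersey-number label space, how the synthetically warped images might serve directly as training input to the recognition network per the twenty-third implementation; the classifier, optimizer, and hyperparameters are placeholders for illustration only.

```python
# Hypothetical use of synthetically warped jersey images as training data for a
# number-recognition network; the class count (0-99) and optimizer settings are assumptions.
import torch
import torch.nn as nn

def train_recognizer(recognizer: nn.Module, warped_images: torch.Tensor,
                     number_labels: torch.Tensor, epochs: int = 10):
    opt = torch.optim.Adam(recognizer.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        logits = recognizer(warped_images)   # warped images are the training input
        loss = loss_fn(logits, number_labels)
        loss.backward()
        opt.step()
```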
In one or more twenty-fourth implementations, a device or system includes a memory and a processor to perform a method according to any one of the above implementations.
In one or more twenty-fifth implementations, at least one machine readable medium includes a plurality of instructions that in response to being executed on a computing device, cause the computing device to perform a method according to any one of the above implementations.
In one or more twenty-sixth implementations, an apparatus may include means for performing a method according to any one of the above implementations.
The above examples may include specific combinations of features. However, the above examples are not limited in this regard and, in various implementations, the above examples may include undertaking only a subset of such features, undertaking a different order of such features, undertaking a different combination of such features, and/or undertaking additional features than those features explicitly listed. For example, all features described with respect to any example methods herein may be implemented with respect to any example apparatus, example systems, and/or example articles, and vice versa.
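For reference, claim 12 below recites a feature-extraction network whose residual sub-blocks each carry a squeeze-and-excitation side path that weights the output of a main path. The sketch below is a minimal, generic squeeze-and-excitation residual block in PyTorch; the channel count and reduction ratio are chosen only for illustration and are not taken from the claims.

```python
# Minimal squeeze-and-excitation residual sub-block: the SE side path produces
# per-channel weights that rescale the output of the main (convolutional) path.
# Channel count and reduction ratio are illustrative assumptions.
import torch
import torch.nn as nn

class SEResidualBlock(nn.Module):
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )
        # Squeeze-and-excitation side path: global pooling, bottleneck, sigmoid gate.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.main(x)
        out = out * self.se(out)   # side-path weights applied to the main-path output
        return self.relu(out + x)  # residual connection
```

The side path compresses global spatial context into per-channel gates that rescale the main path's response before the residual addition.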

Claims (25)

  1. A computer-implemented method of image processing, comprising:
    obtaining image data of clothing having varying visible attributes and visible identifiers on the outer surface of the clothing;
    synthetically warping a version of the image data to form warped images of synthetically warped clothing; and
    training a neural network to recognize the identifiers by using the warped images.
  2. The method of claim 1, wherein the clothing is athletic jerseys and the identifiers are jersey numbers on the athletic jerseys.
  3. The method of claim 1 wherein synthetically warping comprises generating spatial transform parameters to be applied to clothing images formed by the image data.
  4. The method of claim 1 comprising combining attribute and identifier image data of the varying visible attributes and identifiers to generate attribute images each showing a different combination of the attributes and identifiers.
  5. The method of claim 4 comprising generating pose data from images of at least one person wearing clothing in different poses.
  6. The method of claim 5 comprising:
    generating features from a first feature extraction neural network that receives both the attribute images and pose data;
    generating features from a second feature extraction neural network that receives the attribute images without receiving the pose data; and
    combining and regressing the features from the first and second feature extraction neural networks to form spatial transform parameters representing warping of the clothing.
  7. The method of claim 6 comprising applying the spatial transform parameters to the attribute images to generate the warped images.
  8. The method of claim 1 comprising evaluating whether the warped images meet a loss function criterion wherein the loss function includes at least two weighted normalization algorithms where one of the algorithms is a structural similarity index (SSIM) loss.
  9. A computer-implemented system comprising:
    at least one memory; and
    at least one processor communicatively coupled to the memory and being arranged to operate by:
    obtaining image data of clothing having varying visible attributes and visible identifiers on the outer surface of the clothing;
    synthetically warping a version of the image data to form warped images of synthetically warped clothing; and
    training a neural network to recognize the identifiers by using the warped images.
  10. The system of claim 9 wherein the clothing is athletic jerseys and the identifiers are numbers.
  11. The system of claim 9 wherein synthetically warping comprises generating spatial transform parameters to be applied to clothing images formed by the image data.
  12. The system of claim 11, wherein the at least one processor is to operate by: generating features from the image data of the clothing in various combinations of the attributes and identifiers and clothing on at least one person in different poses comprising using a neural network with layer blocks individually using residual sub-blocks each having a squeeze and excitation side path that provides weights to output of a main path; and using the features to form the spatial transform parameters.
  13. At least one non-transitory computer readable medium comprising a plurality of instructions that in response to being executed on a computing device, causes the computing device to operate by:
    obtaining image data of clothing having varying visible attributes and visible identifiers on the outer surface of the clothing;
    synthetically warping a version of the image data to form warped images of synthetically warped clothing; and
    training a neural network to recognize the identifiers by using the warped images.
  14. The medium of claim 13 wherein the clothing is athletic jerseys and the identifiers are numbers.
  15. The medium of claim 13 wherein the instructions cause the computing device to operate by combining attribute and identifier image data of the varying visible attributes and identifiers to generate attribute images each showing a different combination of the attributes and identifiers; and generating pose data from images of at least one person wearing clothing in different poses.
  16. The medium of claim 15 wherein the instructions cause the computing device to operate by:
    generating features from a first feature extraction neural network that receives both the attribute images and pose data;
    generating features from a second feature extraction neural network that receives the attribute images without receiving the pose data; and
    combining and regressing the features from the first and second feature extraction neural networks to form spatial transform parameters representing warping of the clothing.
  17. The medium of claim 16 wherein the instructions cause the computing device to operate by applying the spatial transform parameters to the attribute images to generate the warped images.
  18. The medium of claim 13 wherein the instructions cause the computing device to operate by evaluating whether the warped images meet a loss function criterion that uses a weighted structural similarity index normalization and indicates the similarity between the warped images and a measure of ground truth; and only using the warped images for training when the warped images meet the criterion.
  19. At least one non-transitory computer readable medium comprising a plurality of instructions that in response to being executed on a computing device, causes the computing device to operate by:
    obtaining attribute images of clothing having varying visible attributes and visible identifiers on the outer surface of the clothing;
    obtaining pose data from images of clothing on at least one person in a variety of poses;
    extracting features representing the visible attributes and identifiers on clothing in various poses comprising inputting the attribute images and pose data into at least one neural network; and
    using the features to form warped images of the identifiers on clothing to use the warped images to train a neural network to recognize the identifiers.
  20. The medium of claim 19 wherein the clothing is athletic jerseys and the identifier is a number.
  21. The medium of claim 19 wherein the instructions cause the computing device to operate by inputting features from a first feature extraction neural network that receives both the attribute images and pose data.
  22. The medium of claim 21 wherein the instructions cause the computing device to operate by: inputting the attribute images without the pose data into a second feature extraction neural network; and combining and regressing the features from the first and second feature extraction neural networks to form spatial transform parameters used to synthetically warp the attribute images to generate the warped images.
  23. The medium of claim 22 wherein the training comprises using the warped images as input to the neural network.
  24. At least one machine readable medium comprising a plurality of instructions that in response to being executed on a computing device causes the computing device to perform the method according to any one of claims 1–8.
  25. An apparatus, comprising means for performing the methods according to any one of claims 1–8.
PCT/CN2020/113012 2020-09-02 2020-09-02 Method and system of neural network object recognition for warpable jerseys with multiple attributes WO2022047662A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/113012 WO2022047662A1 (en) 2020-09-02 2020-09-02 Method and system of neural network object recognition for warpable jerseys with multiple attributes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/113012 WO2022047662A1 (en) 2020-09-02 2020-09-02 Method and system of neural network object recognition for warpable jerseys with multiple attributes

Publications (1)

Publication Number Publication Date
WO2022047662A1 true WO2022047662A1 (en) 2022-03-10

Family

ID=80492340

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113012 WO2022047662A1 (en) 2020-09-02 2020-09-02 Method and system of neural network object recognition for warpable jerseys with multiple attributes

Country Status (1)

Country Link
WO (1) WO2022047662A1 (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273897A (en) * 2017-07-04 2017-10-20 华中科技大学 A kind of character recognition method based on deep learning
CN107679533A (en) * 2017-09-27 2018-02-09 北京小米移动软件有限公司 Character recognition method and device
CN109598270A (en) * 2018-12-04 2019-04-09 龙马智芯(珠海横琴)科技有限公司 Distort recognition methods and the device, storage medium and processor of text
US20200234083A1 (en) * 2019-01-17 2020-07-23 Capital One Services, Llc Generating synthetic models or virtual objects for training a deep learning network
CN111597846A (en) * 2019-02-20 2020-08-28 中科院微电子研究所昆山分所 Fold two-dimensional code recovery method, device and equipment and readable storage medium
CN110033007A (en) * 2019-04-19 2019-07-19 福州大学 Attribute recognition approach is worn clothes based on the pedestrian of depth attitude prediction and multiple features fusion
CN111046886A (en) * 2019-12-12 2020-04-21 吉林大学 Automatic identification method, device and equipment for number plate and computer readable storage medium
CN111144498A (en) * 2019-12-26 2020-05-12 深圳集智数字科技有限公司 Image identification method and device
CN111428609A (en) * 2020-03-19 2020-07-17 辽宁石油化工大学 Human body posture recognition method and system based on deep learning
CN111445426A (en) * 2020-05-09 2020-07-24 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Target garment image processing method based on generation countermeasure network model

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117094966A (en) * 2023-08-21 2023-11-21 青岛美迪康数字工程有限公司 Tongue image identification method and device based on image amplification and computer equipment
CN117094966B (en) * 2023-08-21 2024-04-05 青岛美迪康数字工程有限公司 Tongue image identification method and device based on image amplification and computer equipment

Similar Documents

Publication Publication Date Title
EP3734545B1 (en) Method and apparatus for person super resolution from low resolution image
Zhen et al. Smap: Single-shot multi-person absolute 3d pose estimation
US9727775B2 (en) Method and system of curved object recognition using image matching for image processing
US10949960B2 (en) Pose synthesis in unseen human poses
TWI559242B (en) Visual clothing retrieval
US11928753B2 (en) High fidelity interactive segmentation for video data with deep convolutional tessellations and context aware skip connections
US20210112238A1 (en) Method and system of image processing with multi-object multi-view association
CN112561920A (en) Deep learning for dense semantic segmentation in video
CN111971713A (en) 3D face capture and modification using image and time tracking neural networks
US20130272609A1 (en) Scene segmentation using pre-capture image motion
CN113012275A (en) Continuous local 3D reconstruction refinement in video
CN108229559B (en) Clothing detection method, clothing detection device, electronic device, program, and medium
Ma et al. Ppt: token-pruned pose transformer for monocular and multi-view human pose estimation
US20220172429A1 (en) Automatic point cloud validation for immersive media
Zhu et al. Simpose: Effectively learning densepose and surface normals of people from simulated data
Liu et al. Long-range feature propagating for natural image matting
WO2022061726A1 (en) Method and system of multiple facial attributes recognition using highly efficient neural networks
WO2022047662A1 (en) Method and system of neural network object recognition for warpable jerseys with multiple attributes
WO2020056769A1 (en) Method and system of facial resolution upsampling for image processing
Yücel et al. Lra&ldra: Rethinking residual predictions for efficient shadow detection and removal
US20200402243A1 (en) Video background estimation using spatio-temporal models
WO2022226724A1 (en) Method and system of image processing with multi-skeleton tracking
WO2022165620A1 (en) Game focus estimation in team sports for immersive video
Wang et al. Unpaired image-to-image shape translation across fashion data
CN114627488A (en) Image processing method and system, and model training method and system

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20951909

Country of ref document: EP

Kind code of ref document: A1