GB2536060B - Virtual trying-on experience - Google Patents

Virtual trying-on experience

Info

Publication number
GB2536060B
GB2536060B GB1503831.8A
Authority
GB
United Kingdom
Prior art keywords
user
model
face
item
head
Prior art date
Legal status
Active
Application number
GB1503831.8A
Other versions
GB2536060A (en)
GB201503831D0 (en)
Inventor
Mark Groves David
Boisson Jerome
Current Assignee
SPECSAVERS OPTICAL GROUP Ltd
Original Assignee
SPECSAVERS OPTICAL GROUP Ltd
Priority date
Filing date
Publication date
Application filed by SPECSAVERS OPTICAL GROUP Ltd filed Critical SPECSAVERS OPTICAL GROUP Ltd
Priority to GB1503831.8A priority Critical patent/GB2536060B/en
Publication of GB201503831D0 publication Critical patent/GB201503831D0/en
Priority to PCT/GB2016/050596 priority patent/WO2016142668A1/en
Priority to NZ736107A priority patent/NZ736107B2/en
Priority to AU2016230943A priority patent/AU2016230943B2/en
Priority to EP16710283.9A priority patent/EP3266000A1/en
Publication of GB2536060A publication Critical patent/GB2536060A/en
Application granted granted Critical
Publication of GB2536060B publication Critical patent/GB2536060B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T19/006 Mixed reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75 Determining position or orientation of objects or cameras using feature-based methods involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0641 Shopping interfaces
    • G06Q30/0643 Graphical representation of items or shoppers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G06T7/344 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G06T2207/30201 Face
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2210/00 Indexing scheme for image generation or computer graphics
    • G06T2210/16 Cloth
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2219/00 Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T2219/20 Indexing scheme for editing of 3D models
    • G06T2219/2004 Aligning objects, relative positioning of parts

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Processing Or Creating Images (AREA)
  • User Interface Of Digital Computer (AREA)

Description

Virtual Trying-On Experience
Field of the invention
This invention relates to a computer-implemented method for providing a visual representation of an item being tried on a user.
Summary
The present invention provides a method of providing a virtual trying on experience to a user as described in the accompanying claims.
Specific examples of the invention are set forth in the dependent claims.
These and other aspects of the invention will be apparent from and elucidated with reference to the examples described hereinafter.
Brief description of the drawings
Further details, aspects and embodiments of the invention will be described, by way of example only, with reference to the drawings. In the drawings, like reference numbers are used to identify like or functionally similar elements. Elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale.
Figure 1 shows an example method of providing a virtual trying on experience to a user according to an example embodiment of the invention;
Figure 2 shows first and second more detailed portions of the method of Figure 1, according to an example embodiment of the invention;
Figure 3 shows a third more detailed portion of the method of Figure 1, according to an example embodiment of the invention;
Figure 4 shows a high level diagram of the face tracking method, according to an example embodiment of the invention;
Figure 5 shows how the method retrieves faces in video sequences, according to an example embodiment of the invention;
Figure 6 shows a detected face, according to an example embodiment of the invention;
Figure 7 shows detected features of a face, according to an example embodiment of the invention;
Figure 8 shows a pre-processing phase of the method that has the objective to find the most reliable frame containing a face from the video sequence, according to an example embodiment of the invention;
Figure 9 shows an optional face model building phase of the method that serves to construct a suitable face model representation, according to an example embodiment of the invention;
Figure 10 shows a processed video frame along with its corresponding (e.g. generic) 3D model of a head, according to an example embodiment of the invention;
Figure 11 shows a sequential face tracking portion of the disclosed method, according to an example embodiment of the invention;
Figure 12 shows an exemplary embodiment of computer hardware on which the disclosed method may be run;
Figure 13 shows another exemplary embodiment of computer hardware on which the disclosed method may be run.
Detailed description
Because the illustrated embodiments of the present invention may for the most part be implemented using electronic components and circuits known to those skilled in the art, details will not be explained in any greater extent than that considered necessary to illustrate the invention to a person skilled in the relevant art. This is done for the understanding and appreciation of the underlying concepts of the present invention, without unduly obfuscating or distracting from the teachings of the present invention.
Examples provide a method, apparatus and system for generating "a virtual try-on experience" of an item on a user, such as a pair of spectacles/glasses being tried on a user's head. The virtual try-on experience may be displayed on a computer display, for example on a smartphone or tablet screen. Examples also provide a computer program (or "app") comprising instructions, which when executed by one or more processors, carry out the disclosed methods. The disclosed virtual try on experience methods and apparatuses allow a user to see what a selected item would look like on their person, typically their head. Whilst the following has been cast in terms of trying on glasses on a human head, similar methods may also be used to virtually try on any other readily 3D model-able items that may be worn or attached to another object, especially a human object, including, but not limited to: earrings, tattoos, shoes, makeup, and the like.
Examples may use one or more generic 3D models of a human head, together with one or more 3D models of the item(s) to be tried on, for example models of selected pairs of glasses. The one or more generic 3D models of a human head may include a female generic head and a male generic head. In some embodiments, different body shape generic head 3D models may be provided and selected between to be used in the generation of the "virtual try-on experience". For example, the different body shape generic heads may comprise different widths and/or heights of heads, or hat sizes.
According to some examples, the 3D models (of both the generic human heads and/or the items to be placed on the head) may be placed into a 3D space by reference to an origin. The origin of a 3D model may be defined as a location in the 3D space from which the coordinates of that 3D model are referenced, in order to locate any given portion of the 3D model. The origin of each model may correspond to one another, and to a specified nominally universal location, such as the location of a bridge of the nose. Thus, the origins of the 3D models may be readily co-located in the 3D space, together with a corresponding location of the item to be virtually tried on, so that they may be naturally/suitably aligned. There may also be provided one or more attachment points for the item being tried on to the 3D model of a generic human head. In the trying on of glasses example, these may be, for example, where the arms of the glasses rest on a human ear.
The origin is not in itself a point in the model. It is merely a location by which points in the 3D models (both of the generic human head, but also of any item being tried on, such as glasses) may be referenced and suitably aligned. This is to say, examples may place both 3D models (i.e. the selected generic human head + item being tried on) into the same 3D space in a suitable (i.e. realistic) alignment by reference to the respective origins. The 3D model of the head may not be made visible, but only used for occlusion or other calculations of the 3D model of the glasses. The combined generic head (invisible) and glasses 3D models (suitably occluded) can then be placed on a background comprising an extracted image of the user taken from a video, so that the overall combination of the rendered 3D model of the glasses and the extracted video gives the impression of the glasses being worn by the user. This combination process, as well as the occlusion calculations using the "invisible" generic human head, may be repeated for a number of extracted images at different nominal rotations.
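As an illustration of this origin-based placement, the following Python sketch co-locates a generic head model and a glasses model in a shared 3D space by translating each so that its own origin (here taken to be the bridge of the nose) lands on a common scene origin. The Vec3 and Model3D types and the example coordinates are illustrative assumptions, not structures defined by this document.

```python
from dataclasses import dataclass

@dataclass
class Vec3:
    x: float
    y: float
    z: float

    def __add__(self, other):
        return Vec3(self.x + other.x, self.y + other.y, self.z + other.z)

    def __sub__(self, other):
        return Vec3(self.x - other.x, self.y - other.y, self.z - other.z)

@dataclass
class Model3D:
    name: str
    vertices: list   # vertex positions in the model's local coordinates
    origin: Vec3     # reference point, e.g. the bridge of the nose

def place_in_scene(model: Model3D, scene_origin: Vec3) -> list:
    """Return the model's vertices in scene coordinates, with the model's
    own origin placed exactly at scene_origin."""
    offset = scene_origin - model.origin
    return [v + offset for v in model.vertices]

# Both models use the bridge of the nose as their origin, so placing them at
# the same scene origin aligns the glasses naturally on the generic head.
scene_origin = Vec3(0.0, 0.0, 0.0)
head = Model3D("generic_head", [Vec3(0, 0, 0), Vec3(0, 8, 2)], origin=Vec3(0, 0, 0))
glasses = Model3D("glasses", [Vec3(-7, 0.5, 0), Vec3(7, 0.5, 0)], origin=Vec3(0, 0.5, 0))
head_in_scene = place_in_scene(head, scene_origin)
glasses_in_scene = place_in_scene(glasses, scene_origin)
```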
By using pre-defined generic 3D head models, examples do not need to generate a 3D model of a user's head, and therefore reduce the processing overhead requirements. However, the utility of the examples is not materially affected, as key issues pertaining to the virtual try on experience are maintained, such as occlusion of portions of the glasses by head extremities (e.g. eyes, nose, etc.) during rotation, as discussed in more detail below.
Examples map the 3D models in the 3D space onto suitably captured and arranged images of the actual user of the system. This mapping process may include trying to find images of a user's head having angles of view matching predetermined angles. This matching may comprise determining, for a captured head rotation video, a predetermined number of angles of head between the two maximum angles of head rotation contained within the captured head rotation video. In such a way, examples enable use of the specific captured head rotation video, regardless of whether or not a pre-determined preferable maximum of head rotation has occurred (i.e. these examples would not require the user to re-capture a new video because the user had not turned their head sufficiently in the original capturing of their head rotation). Thus, examples are more efficient than the prior art that requires a minimum head rotation.
In examples, by establishing angles of images based on the maximum angle of user head rotation in a captured video (and therefore under the direct control of the user), the viewing angle(s) may be user-determined. This enables the system to portray the generated 3D try on experience in a way particularly desirable to the user, as opposed to only being portrayed in a specific, pre-determined manner that the user must abide by in order for the system to work. Thus, examples are more "natural" to use than the prior art.
There now follows a detailed description of an exemplary embodiment of the present invention, in particular an embodiment in the form of a software application (often simply referred to as an "app") used on a smartphone device. The example software application is in the form of a virtualized method for a human user to try on glasses, including a face tracking portion described in more detail below, where face tracking is used in an application according to examples to 'recognize' a user's face (i.e. compute a user's head/face pose).
Examples of the disclosed method may include extracting a still image(s) of a user (or just the user's head portion) from a captured video of the user. The image of the user may be used as a background image for a 3D space including 3D models of the item, such as glasses, to be virtually tried on, thereby creating the appearance of the item being tried on the user's actual captured head. A 3D model of a generic head, i.e. not of the actual user, may also be placed into the 3D space, overlying the background image of the user. In this way, the generic human head model may be used as a mask, to allow suitable occlusion culling (i.e. hidden surface determination) to be carried out on the 3D model of the item being tried on, in relation to the user's head. Use of a generic human head model provides higher processing efficiency/speed, without significantly reducing efficacy of the end result. In some embodiments a movement and/or orientation of the user's head, i.e. position and viewing direction, may be determined from the extracted still image(s).
An origin of the 3D model of a generic human head may be located at a pre-determined point in the model, for example, corresponding to a bridge of a nose in the model. Other locations and numbers of reference points may be used instead. A position at which the 3D model is located within the 3D space may also be set with reference to the origin of the model, i.e. by specifying the location of the origin of the 3D model within the 3D space. The orientation of the 3D model may correspond to the determined viewing direction of the user.
A 3D model of the selected item to be tried on, for example the selected pair of glasses, may be placed into the 3D space. An orientation of the glasses model may correspond to the viewing direction of the user. An origin of the 3D glasses model may be provided and located at a point corresponding to the same point as the 3D model of the generic human head, for example also being at a bridge of a nose in the glasses model. A position at which the 3D glasses model is located within the 3D space may be set with reference to the origin of the glasses 3D model, i.e. by specifying the location of the origin of the 3D model within the 3D space. The origin of the 3D model of the glasses may be set so that the glasses substantially align to the normal wearing position on the 3D model of the human head.
An image of the glasses located on the user's head may then be generated based on the 3D models of the glasses and generic head (which may be used to mask portions of the glasses model which should not be visible and to generate shadow) and the background image of the user.
The position of the glasses relative to the head may be altered by moving the location of the 3D glasses model in the 3D space, i.e. by setting a different location of an origin of the model, or by moving the origin of the 3D glasses model out of alignment with the origin of the 3D model of a generic human head.
The example application also may include video capture, which may refer to capturing a video of the user's head and splitting that video up into a plurality of video frames. In some examples, the video capture may occur outside of the device displaying the visualization. Each video frame may therefore comprise an image extracted from a video capture device or a video sequence captured by that or another video capture device. Examples may include one or more 3D models, where a 3D model is a 3D representation of an object. In specific examples, the 3D models may be of a generic human head and of an item to be visualized upon the head, such as a pair of glasses. A 3D model as used herein may comprise a data set including one or more of: a set of locations in a 3D space defining the item being modelled; a set of data representing a texture or material of the item (or portion thereof) in the model; a mesh of data points defining the object; an origin, or reference point, for the model; and other data useful in defining the physical item to which the 3D model relates. Examples may also use a scene, where the scene may contain one or more models, including, for example, all the meshes for the 3D models used to visualize the glasses on a user's head. Other data sets that may also be used in some examples include: a material data set describing how a 3D model should be rendered, often based upon textures; a mesh data set that may be the technical 3D representation of the 3D model; and a texture data set that may include a graphic file that may be applied to a 3D model in order to give it a texture and/or a color.
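For concreteness, a minimal sketch of such a data set is given below. The class and field names are illustrative assumptions rather than a schema defined by this document; they simply group the elements listed above (mesh, texture/material, origin) into one structure, together with a scene that can hold the generic head and the glasses.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Texture:
    file_name: str                                # graphic file applied to the mesh
    size_px: Tuple[int, int] = (1024, 1024)

@dataclass
class Mesh:
    vertices: List[Tuple[float, float, float]]    # locations defining the item in 3D space
    faces: List[Tuple[int, int, int]]             # triangles given as vertex indices

@dataclass
class ItemModel:
    name: str                                     # e.g. "generic_head" or a glasses model
    mesh: Mesh                                    # the technical 3D representation
    textures: List[Texture] = field(default_factory=list)
    origin: Tuple[float, float, float] = (0.0, 0.0, 0.0)   # reference point of the model

@dataclass
class Scene:
    models: List[ItemModel] = field(default_factory=list)  # e.g. head model + glasses model
```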
Data sets that may be used in some embodiments include CSV (for Comma Separated Values), which is an exchange format used in software such as Excel™; JSON (for JavaScript Object Notation), which is an exchange format used mainly on the Web; and metrics, which are a way to record, for example, the usage of the application. Other data sets are also envisaged for use in examples, and the invention is not so limited.
Example embodiments may comprise code portions or software modules including, but not limited to: code portions provided by or through a Software Development Kit (SDK) of the target Operating System (OS), operable to enable execution of the application on that target OS, for example portions provided in the iOS SDK environment, XCode (RTM); 3D model rendering, lighting and shadowing code portions (for example, for applying the glasses on the user's face); face tracking code portions; and metric provision code portions.
The software application comprises three core actions: video recording of the user's face with face-tracking; 3D model download and interpretation/representation of the 3D models (of the generic user head and glasses being visualized on the user's head); and display of the combination of the 3D models and recorded video imagery. Examples may also include cloud / web enabled services catalog handling, thereby enabling onward use of the visualization to the user, for example for providing the selected glasses to the user for real-world trying on and/or sale.
Figure 1 shows an example method 100 of providing a virtual try on experience for glasses on a user's head.
The method starts by capturing video 110 of the user's head rotating. However, due to the beneficial aspects of the disclosed examples (in particular, the freedom to use any form/extent of head rotation), in the alternative, a previously captured video may be used instead.
The method then extracts images 120, for later processing, as disclosed in more detail below. From the extracted images, the method determines the object (in this example, the user's head) movement in the extracted images 130. Next, 3D models of the items (i.e. glasses) to be placed, and a 3D model of a generic human head on which to place the item models, are acquired 140, either from local storage (e.g. in the case of the generic human head model) or from a remote data repository (e.g. in the case of the item/glasses, as this may be a new model). A more detailed description of these processes 130 and 140 is given below with reference to Figure 2.
The 3D models are combined with one another and the extracted images (as background) at step 150. Then, an image of the visual representation of the object (user's head) with the item (glasses) thereon can be generated 160. This is described in more detail with respect to Figure 3, below.
Optionally, the location of the items with respect to the object may be adjusted 170, typically according to user input. This step may occur after display of the image, as a result of the user desiring a slightly different output image.
Figure 2 shows a more detailed view 200 of a portion of the method, in particular the object movement determination step 130 and the 3D model acquisition step 140.
The object movement determination step 130 may be broken down into sub-steps in which a maximum rotation of the object (i.e. head) in a first direction (e.g. to the left) is determined 132, and the maximum rotation in the second direction (e.g. to the right) may then be determined 134. Finally, for this portion of the method, output values may be provided 136 indicative of the maximum rotation of the head in both first and second directions, for use in the subsequent processing of the extracted images and/or 3D models for placement within the 3D space relating to each extracted image. In some examples, the different steps noted above in respect of the object movement determination may be optional.
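A minimal sketch of steps 132 to 136, assuming the face tracker has already produced one yaw angle per extracted image (in degrees, negative to the left, positive to the right); the function and variable names are illustrative only.

```python
def max_rotations(yaw_per_frame):
    """Steps 132-136: determine the maximum rotation in each direction and
    output the two values for subsequent processing.
    yaw_per_frame: yaw angles in degrees, negative = left, positive = right."""
    max_left = min(yaw_per_frame)    # most negative value, e.g. -35.0
    max_right = max(yaw_per_frame)   # most positive value, e.g. +45.0
    return max_left, max_right

# Example: a head sweep that reached 35 degrees left and 45 degrees right.
left, right = max_rotations([0.0, -12.5, -35.0, -20.0, 8.0, 30.0, 45.0, 10.0])
print(left, right)   # -35.0 45.0
```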
The 3D model acquisition step 140 may be broken down into sub-steps in which a 3D model of a generic head is acquired 142, optionally including a selection step 144 of one 3D model of a generic human head out of a number of acquired generic 3D models of a human head (e.g. choosing between a male or female generic head 3D model). The choice of generic head model may be under direct user control, or by automated selection, as described in more detail below. Next, the 3D models of the item(s) to be placed on the head, e.g. glasses, may then be acquired 146. Whilst the two acquisition steps 142 and 146 may be carried out either way round, it is advantageous to choose the generic human head first, because this may allow the choice of 3D models of the items to be placed to be filtered so that only applicable models are available for subsequent acquisition. For example, choosing a female generic human head 3D model can filter out all male glasses.
Figure 3 shows a more detailed view 300 of the image generation step 160 of Figure 1.
The image generation step 160 may start by applying an extracted image as the background 162 to the visual representation of the item being tried on the user's head. Then the face tracking data (i.e. detected movement, such as the extent-of-rotation values discussed above, at step 136) may be used to align the 3D models of the generic human head and the 3D model of the glasses to the extracted image used as background 164 (the 3D models may already have been aligned to one another, for example using their origins, or that alignment can be carried out at this point as well, instead).
Hidden surface detection calculations (i.e. occlusion calculations) 166 may be carried out on the 3D model of the glasses, using the 3D model of the generic head, so that any parts of the glasses that should not be visible in the context of the particular extracted image in use at this point in time may be left out of the overall end 3D rendering of the combined scene (comprising the extracted image background, and the 3D model of the glasses "on top"). The combined scene may then be output as a rendered image 168. The process may repeat for a number of different extracted images, each depicting a different rotation of the user's head in space.
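One common way to realise this occlusion with an invisible head is a per-pixel depth comparison: the generic head contributes only depth, not colour, so any glasses fragment lying behind the head is discarded and the background video shows through instead. The plain Python sketch below illustrates the rule; a real implementation would rely on the rendering engine's depth buffer rather than explicit loops, and all names here are illustrative.

```python
def composite_pixel(background_rgb, glasses_rgb, glasses_depth, head_depth):
    """Depth-mask compositing for a single pixel.
    Depths are distances from the camera; use float('inf') where nothing is drawn."""
    if glasses_rgb is None:
        return background_rgb          # no glasses at this pixel, keep the video frame
    if head_depth < glasses_depth:
        return background_rgb          # glasses hidden behind the (invisible) generic head
    return glasses_rgb                 # glasses visible in front of the head

def composite_frame(background, glasses_layer, glasses_z, head_z):
    """Apply the rule to every pixel of one extracted video frame."""
    height, width = len(background), len(background[0])
    return [[composite_pixel(background[y][x], glasses_layer[y][x],
                             glasses_z[y][x], head_z[y][x])
             for x in range(width)] for y in range(height)]
```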
The extracted images used above may be taken from a video recording of the user's face, which may be carried out with a face tracking portion of the example method. This allows the user to record a video of themselves, so that the virtual glasses can be shown as they would look on their actual person. This is achieved in multiple steps. First the application records a video capture of the user's head. Then the application will intelligently split this video into frames and send these to the face tracking library module. The face tracking library module may then return the location results for each frame (i.e. where the user's face is in the frame and/or in the 3D space/world, related to a coordinate system (CS) that is linked to the camera). These results may be used to position the 3D glasses on the user's face virtually.
The face recording may be approximately 8 seconds long, and may be captured in high-resolution video.
There now follows a more detailed description of an exemplary chain of production describing how the virtual glasses are suitably rendered on the captured video of the user's head.
Video recording and face-tracking:
When starting the application, the application may prompt the user to record a video of their head turning in a non-predefined, i.e. user-controllable, substantially horizontal sweep of the user's head. The camera is typically located dead-ahead of the user's face when the user's head is at the central point of the overall sweep, such that the entirety of the user's head is visible in the frame of the video. However, in other examples, the camera may not be so aligned. The user has to move his head left and right to give the best results possible. The location of the head in the sweep may be assessed by the face tracking module prior to capture of the video for use in the method, such that the user may be prompted to re-align their head before capture. In this way, the user may be suitably prompted so that only a single video capture is necessary, which ultimately provides a better user experience. However, in some examples, the method captures the video as is provided by the user, and carries on without requiring a second video capture.
When the video is recorded (and, optionally, the user is happy with it), the video may then be processed through the following steps.
Video split
The captured video is to be interpreted by the face-tracking process carried out by the face-tracking module. However, to aid this, the captured video of the user's head may be sampled, so that only a sub-set of the captured video images are used in the later processing steps. This may result in faster and/or more efficient processing, which in turn may also allow the example application to be performed by lesser processing resources or at greater energy efficiency.
One exemplary way to provide this sampling of the captured video images is to split the video into comprehensible frames. Initially, this splitting action may involve the video being recorded at a higher initial capture rate (e.g. 30 frames per second, at 8 seconds total length, which gives a total of 240 video frames), but only selecting or further processing a pre-determined or user-definable number of those frames. For example, the splitting process may select every third frame of the originally captured video, which in the above example provides 80 output frames for subsequent processing, at a rate of 10 frames per second. Thus the processing load is now approximately 33% of the original processing load.
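A sketch of that sampling arithmetic, with the step size treated as an assumed configurable parameter:

```python
def subsample_frames(frames, step=3):
    """Keep every `step`-th frame of the captured video.
    For a 30 fps, 8 second capture (240 frames) and step=3 this yields
    80 frames, i.e. an effective rate of 10 frames per second and roughly
    a third of the original processing load."""
    return frames[::step]

captured = list(range(240))          # stand-in for 240 decoded video frames
selected = subsample_frames(captured)
print(len(selected))                 # 80
```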
Face-tracking process
The sub-selected 80 video frames (i.e. 80 distinct images) are then sent to the face-tracking module for analysis, as described in more detail below with respect to figures 3 to 10. By the end of the face-tracking process, the application may have 80 sets of data: one for each sub-selected video frame. These sets of data contain, for each video frame, the position and orientation of the face.
Face-tracking data selection
It may be unnecessary for the application to process all 80 sets of data at this point, so the application may include a step of selecting a pre-defined number of best frames offered by the results returned by the face-tracking module. For example, the 9 best frames may be selected, based upon the face orientation, thereby covering all the angles of the face as it turns from left to right (or vice versa).
The selection may be made as follows: for frame 1 (leftmost), the face may be turned 35 degrees to the left; for frame 2, the face may be turned 28 degrees to the left; for frame 3, the face may be turned 20 degrees to the left; for frame 4, the face may be turned 10 degrees to the left; for frame 5, the face may be centered; for frame 6, the face may be turned 10 degrees to the right; for frame 7, the face may be turned 20 degrees to the right; for frame 8, the face may be turned 28 degrees to the right; for frame 9, the face may be turned 35 degrees to the right. Other specific angles for each of the selected best frames may be used, and may also be defined by the user instead.
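One straightforward way to implement this selection is to pick, for each target angle, the processed frame whose tracked yaw is closest to it. The sketch below uses the nine example angles listed above; the per-frame yaw values are assumed to come from the face-tracking data sets, and negative values denote a turn to the left.

```python
TARGET_ANGLES = [-35, -28, -20, -10, 0, 10, 20, 28, 35]   # degrees, negative = left

def select_best_frames(frame_yaws, targets=TARGET_ANGLES):
    """frame_yaws: dict mapping frame index -> tracked yaw in degrees.
    Returns one frame index per target angle (the closest available match)."""
    return [min(frame_yaws, key=lambda i: abs(frame_yaws[i] - t)) for t in targets]

# Example with a handful of tracked frames.
yaws = {0: -34.0, 5: -27.0, 12: -19.5, 20: -9.0, 30: 0.5,
        40: 11.0, 50: 19.0, 60: 27.5, 70: 36.0}
print(select_best_frames(yaws))   # one frame index per target angle
```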
In some examples, non-linear/non-contiguous capture of images/frames of the head in the 3D space may be used. This is to say, in these alternative examples, the user's head may pass through any given target angle more than once during a recording. For example, if one degree left of centre were a target angle and the recording starts from a straight ahead position, then the head being captured passes through this one degree left of centre angle twice: once en route to the left-most position and once more after rebounding from the left-most position. Thus, in these examples, the method has the option to decide which of the different instances is the best version of the angle to use for actual display to the user. Thus, the images actually used to display to the user may not all be contiguous/sequential in time.
In an alternative example, instead of selecting best frames for further processing according to pre-defined angles (which assumes a pre-defined head sweep, e.g. a 180 degree sweep, with a 90 degree (left and right) maximum turn from the central dead-ahead position), the method may instead use any arbitrary user-provided turn of the head, determine the actual maximum turn in each direction, and then split that determined actual head turn into a discrete number of 'best frames'. This process may also take into account a lack of symmetry of the overall head turn (i.e. more turn to the left than to the right, or vice versa). For example, the actual head turn may be, in actual fact, 35 degrees left and 45 degrees right, a total of 80 degrees, which in turn may then be split into 9 frames at approximately 8.9 degrees each, or simply 3 on the left, one central, and 5 on the right.
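A sketch of this adaptive alternative is given below. It derives the target angles from the maximum turn actually observed in each direction rather than from a pre-defined sweep; note that it spaces the nine frames evenly across the observed sweep including both extremes (10 degrees apart in this example), which is one of several reasonable ways to split the turn.

```python
def adaptive_target_angles(max_left, max_right, n_frames=9):
    """Split the observed head turn into n_frames evenly spaced target angles.
    max_left: degrees turned to the left (positive number).
    max_right: degrees turned to the right (positive number)."""
    total = max_left + max_right                 # e.g. 35 + 45 = 80 degrees
    step = total / (n_frames - 1)                # spacing between adjacent frames
    return [-max_left + i * step for i in range(n_frames)]

print(adaptive_target_angles(35, 45))
# [-35.0, -25.0, -15.0, -5.0, 5.0, 15.0, 25.0, 35.0, 45.0]
```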
By the end of the face-tracking data selection portion of the overall method, the application may have selected 9 frames and associated sets of data. In some examples, if the application was not able to select a suitable number of "best" frames, the user's video may be rejected, and the user may be kindly asked to take a new head turning video. For example, if the leftmost frame does not offer a face turned at least 20 degrees to the left, or the rightmost frame does not offer a face turned at least 20 degrees to the right, the user's video will be rejected.
Face-tracking process end
When the application has the requisite number (e.g. 9) of best frames, the respective best frame images and data sets are saved within the application data storage location. These may then be used at a later stage, with the 3D models, which may also be stored in the application data storage location, or another memory location in the device carrying out the example application, or even in a networked location, such as a central cloud storage repository.
3D chain of production and process
All the frames are produced within the application, following 3D modeling techniques known in the art. For example, the application may start from the captured high-definition, high-polygon models (e.g. of the glasses (or other product) to be tried on). Since the application has to run on mobile devices, these 3D models may be reworked in order to adapt to the low calculation power and low memory offered by the mobile devices, for example to reduce the number of polygons in each of the models.
Then, the application can work on the textures. The textures may be images and, if not reworked, may overflow the device memory and lead to application crashes. For this application, there may be two sets of textures generated each time: one for a first type of device (e.g. a mobile device such as a smartphone, using iOS, where the textures used may be smaller, and hence more suited for a 3G connection) and one for a second type of device, such as a portable device like a tablet (i.e. using textures that may be more suited for a physically larger screen and/or a higher-rate wifi connection). Once the number of polygons has been reduced and/or the textures have been adapted to the target execution environment, the final 3D models may be exported, for example in a mesh format. The 3D models may be exported in any suitable 3D model data format, and the invention is not so limited. An example of a suitable data format is the Ogre3D format.
3D models in the cloud
The 3D models may be located in a central data repository, e.g. on a server, and may optionally be compressed, for example archived in a ZIP format. When compression is used to store the 3D model data, in order to reduce data storage and transmission requirements, then the application may include respective decompression modules.
In order to get the 3D models of the glasses (and generic heads), the application may download them from the server and unzip them. When that is done, the application can pass the 3D models to the rendering engine.
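A sketch of that download-and-decompress step using only the Python standard library; the URL and local path are placeholders for illustration, not values taken from this document.

```python
import io
import urllib.request
import zipfile
from pathlib import Path

def fetch_model_archive(url: str, destination: Path) -> list:
    """Download a ZIP archive of 3D model files and extract it locally.
    Returns the list of extracted file names."""
    with urllib.request.urlopen(url) as response:
        archive_bytes = response.read()
    destination.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        archive.extractall(destination)
        return archive.namelist()

# Hypothetical usage; the server URL below is an assumption for illustration only.
# files = fetch_model_archive("https://example.com/models/designer_frame_01.zip",
#                             Path("./models/designer_frame_01"))
```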
3D Rendering Engine
The 3D rendering engine used in this application gets a 3D model, and it will pass all the rendered files along with the face tracking data sets and the respective video frames from the video to the graphics/display engine. The 3D graphics engine may render the end image according to the process as described in relation to Figure 3.
Thus, in the example discussed above, using 9 extracted images, the rendering engine may do the following steps to create an image of the user wearing the virtual glasses: 1) open the 3D files and interpret them to create a 3D representation (e.g. the 3D glasses); 2) for each of the 9 frames used in the app: apply the video frame in the background (so the user's face is in the background) and then display the 3D glasses in front of the background; using the face tracking data set (face position and orientation), the engine will position the 3D models exactly on the user's face; 3) a "screenshot" of the 3D frames placed on the background will be taken; 4) the 9 screenshots are then displayed to the user.
3D Rendering Process end
Using inbuilt swipe gestures of the target OS, the user may now "browse" through the rendered screenshots for each frame, in which the rendered glasses give the illusion of being on the user's face.
Web services, cloud and catalogs
The catalog containing all the frames is downloaded by the application from a static URL on the server. The catalog will allow the application to know where to look for 3D glasses and when to display them. This catalog will for example describe all the frames for the "Designer" category, so the application can fetch the corresponding 3D files. The catalog may use a CSV format for the data storage.
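The column layout of the catalog is not specified here, so the sketch below assumes a simple hypothetical layout (category, frame identifier, model archive URL) purely to illustrate how such a CSV catalog might be read and filtered by category.

```python
import csv
import io

# Hypothetical catalog contents; the real column layout is not defined in this text.
CATALOG_CSV = """category,frame_id,model_url
Designer,frame_001,https://example.com/models/frame_001.zip
Designer,frame_002,https://example.com/models/frame_002.zip
Classic,frame_101,https://example.com/models/frame_101.zip
"""

def frames_for_category(catalog_text, category):
    """Return the catalog rows for one category, e.g. all 'Designer' frames."""
    reader = csv.DictReader(io.StringIO(catalog_text))
    return [row for row in reader if row["category"] == category]

for row in frames_for_category(CATALOG_CSV, "Designer"):
    print(row["frame_id"], row["model_url"])
```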
As described above, example applications include processes to: carry out video recording, processing and face tracking data extraction; and download 3D models from a server, interpreting and adjusting those models according to face-tracking data. The downloading of the 3D models may comprise downloading a catalog of different useable 3D models of the items to be shown (e.g. glasses), or different generic human head 3D models.
Face tracking analysis process
The following describes the face-tracking algorithms, as used in offline (i.e. non real-time) application scenarios, such as when the disclosed example methods, systems and devices detect and track human faces on a pre-recorded video sequence. This example discloses use of the following terms/notations: a frame is an image extracted from a video captured by a video capture device or a previously captured input video sequence; a face model is a 3D mesh that represents a face; a keypoint (also named interest point) is a point that corresponds to an interesting location in the image because of its neighborhood variations; a pose is a vector composed of a position and an orientation, used to describe rigid transformations in space.
Figure 4 shows a high level diagram of the face tracking method. A set of input images 402 are used by the face tracking module 410 to provide an output set of vectors 402, which may be referred to as "pose vectors".
Figure 5 details how the software retrieves faces in video sequences. The face-tracking process may include a face-tracking engine that may be decomposed into three main phases: (1) pre-processing the (pre-recorded) video sequence 510, in order to find the frame containing the most "reliable" face 520; (2) optionally, building a 2.5D face model corresponding to the current user's face, or choosing a generic model of a human head most applicable to the captured user head image 530; and (3) tracking the face model sequentially using the (part or whole) video sequence 540.
(1) Pre-Processing phase
Figure 8 shows a pre-processing phase of the method 800 that has the objective to find the most reliable frame containing a face from the video sequence. This phase is decomposed into 3 main sub-steps:
- (a) Face detection step 810 (and figure 6), which includes detecting the presence of a face in each video frame. When a face is found, its position is calculated.
- (b) Non-rigid face detection step 830 (and figure 7), which includes discovering face feature positions (e.g. eyes, nose, mouth, etc.).
- (c) Retrieving the video frame containing the most reliable face image 870 out of a number of candidates 850.
The face detection step (a) 810 may discover faces in the video frames using a sliding window technique. This technique includes comparing each part of the frame using pyramidal image techniques and finding if a part of the frame is similar to a face signature. The face signature(s) is stored in a file or a data structure and is named a classifier. To learn the classifier, thousands of previously known face images may have been processed. The face detection reiterates 820 until a suitable face is output.
The non-rigid face detection step (b) is more complex since it tries to detect elements of the face (also called face features, or landmarks). This non-rigid face detection step may take advantage of the fact that a face has been correctly detected in step (a). The face detection is then refined to detect face elements, for example using face detection techniques known in the art. As in (a), a signature of face elements has been learnt using hundreds of face representations. This step (b) is then able to compute a 2D shape that corresponds to the face features (see an illustration in figure 7).
Steps (a) and (b) may be repeated on all or on a subset of the captured frames that comprise the video sequence being assessed. The number of frames processed depends on the total number of frames of the video sequence, or the sub-selection of video frames used. This may be based upon, for example, the processing capacity of the system (e.g. processor, memory, etc.), or on the time the user is (or is deemed to be) willing to wait before results appear.
(c) If steps (a) and (b) have succeeded for at least one frame, then step (c) is processed to find the frame in the video sequence that contains the most reliable face. The notion of a reliable face can be defined as follows:
- find the candidate frames with a facing orientation, i.e. faces that look toward the camera, using a threshold value on the angle (e.g. less than a few radians); and
- amongst these candidate frames, find the frame containing a face not too far from and not too close to the camera, using two threshold values as well.
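A sketch of this selection rule is given below; the three threshold values and the field names are assumed for illustration only.

```python
def most_reliable_frame(detections, max_angle=0.2, min_dist=0.3, max_dist=1.0):
    """detections: list of dicts with 'frame', 'yaw' (radians, 0 = facing the
    camera) and 'distance' (metres from the camera), one per frame in which
    a face was detected.  Thresholds are illustrative values only."""
    # Keep faces looking toward the camera (angle below the threshold)...
    candidates = [d for d in detections if abs(d["yaw"]) < max_angle]
    # ...and faces neither too close to nor too far from the camera.
    candidates = [d for d in candidates if min_dist < d["distance"] < max_dist]
    if not candidates:
        return None
    # Among the remaining candidates, prefer the most frontal face.
    return min(candidates, key=lambda d: abs(d["yaw"]))["frame"]
```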
Once the frame(s) with the most reliable face is found in the video sequence, the face-tracking algorithm changes state and tries to construct a face model representation, chooses a most appropriate generic head model for use, or simply uses a standard generic model without any selection thereof 890.
(2) Building 3D face model phase
Figure 9 shows the optional face model building phase of the method 900 that serves to construct a suitable face model representation, i.e. building an approximate geometry of the face along with a textured signature of the face and corresponding keypoints. In some examples, this textured 3D model is referred to as a keyframe. The approximate geometry of the face may instead be taken from a pre-determined generic 3D model of a human face.
The keyframe may be constructed using the most reliable frame of the video sequence. This phase is decomposed into the following steps:
- (a) Creating a 3D model/mesh of the face 910 using the position of the face and the non-rigid face shape built during phase (1).
- (b) Finding keypoints on the face image 920 and re-projecting them on the 3D mesh to find their 3D positions.
- (c) Saving a 2D image of the face by cropping the face available in the most reliable frame.
In respect of step (a), the position of the face elements may be used to create the 3D model of the face. These face elements may give essential information about the deformation of the face. A mean (i.e. average) 3D face model, available statically, is then deformed using these 2D face elements. This face model may then be positioned and oriented according to the camera position. This may be done by optimizing an energy function that is expressed using the image position of face elements and their corresponding 3D position on the model.
In respect of step (b), keypoints (sometimes referred to as interest points or corner points) may be computed on the face image using the most reliable frame. In some examples, a keypoint can be detected at a specific image location if the neighboring pixel intensities are varying substantially in both horizontal and vertical directions.
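As a minimal illustration of that criterion, the sketch below flags a pixel as a keypoint when the intensity gradient is strong in both directions. A production face tracker would use a proper corner detector (for example Harris or FAST); the threshold here is an arbitrary illustrative value.

```python
def is_keypoint(image, x, y, threshold=30):
    """image: 2D list of grey-level intensities.  A pixel qualifies when the
    intensity varies substantially both horizontally and vertically around it."""
    gx = abs(image[y][x + 1] - image[y][x - 1])    # horizontal variation
    gy = abs(image[y + 1][x] - image[y - 1][x])    # vertical variation
    return gx > threshold and gy > threshold

def find_keypoints(image, threshold=30):
    """Scan the whole image, skipping the one-pixel border."""
    height, width = len(image), len(image[0])
    return [(x, y) for y in range(1, height - 1) for x in range(1, width - 1)
            if is_keypoint(image, x, y, threshold)]
```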
In respect of step (c), along with the 3D model and keypoints, the face representation (an image of the face) may also be memorized (i.e. saved) so that the process can match its appearance in the remaining frames of the video capture.
Steps (a), (b) and (c) aim to construct a keyframe of the face. This keyframe is used to track the face of the user in the remaining video frames.
(3) Tracking the face sequentially phase (see Figure 11).
Once the face model of the user has been reconstructed, or a generic model chosen, the remaining video frames may be processed with the objective to track the face sequentially. Assuming that the face's appearance in contiguous video frames is similar helps the described method track the face frame after frame. This is because the portion of the image around each keypoint does not change too much from one frame to another, therefore comparing/matching keypoints (in fact, neighbouring image appearance) is easier. Any suitable technique to track the face sequentially known in the art may be used, for example as described in "Stable Real-Time 3D Tracking using Online and Offline Information" by L. Vacchetti, V. Lepetit and P. Fua, where the keyframe may be used to match keypoints computed in the earlier described face model building phase against keypoints computed in each video frame. The pose of the face (i.e. its position and orientation) may then be computed for each new frame using an optimization technique.
Figure 10 shows a processed video frame along with its corresponding (e.g. generic) 3D model of a head.
When the video sequence is completed, face poses (and, in some examples, the corresponding generic human face model) are sent to the 3D rendering engine, so that the rendering module can use this information to display virtual objects on top of the video sequence. This process is shown in Figure 11, and includes tracking the face model sequentially using the keyframe 1110, and returning face poses when available 1130, via an iterative process 1120 whilst frames are available for processing, until no more frames are available for processing.
The invention may be implemented as a computer program for running on a computer system, said computer system comprising at least one processor, where the computer program includes executable code portions for execution by the said at least one processor, in order for the computer system to perform any method according to the described examples. The computer system may be a programmable apparatus, such as, but not limited to, a personal computer, tablet or smartphone apparatus.
Figure 12 shows an exemplary generic embodiment of such a computer system 1200 comprising one or more processor(s) 1240, system control logic 1220 coupled with at least one of the processor(s) 1240, system memory 1210 coupled with system control logic 1220, non-volatile memory (NVM)/storage 1230 coupled with system control logic 1220, and a network interface 1260 coupled with system control logic 1220. The system control logic 1220 may also be coupled to Input/Output devices 1250.
Processor(s) 1240 may include one or more single-core or multi-core processors. Processor(s) 1240 may include any combination of general-purpose processors and dedicated processors (e.g., graphics processors, application processors, etc.). Processors 1240 may be operable to carry out the above described methods, using suitable instructions or programs (i.e. operate via use of processor, or other logic, instructions). The instructions may be stored in system memory 1210, as glasses visualisation application 1205, or additionally or alternatively may be stored in NVM/storage 1230, as NVM glasses visualisation application portion 1235, to thereby instruct the one or more processors 1240 to carry out the virtual trying on experience methods described herein. The system memory 1210 may also include 3D model data 1215, whilst NVM/storage 1230 may include 3D model data 1237. These may serve to store 3D models of the items to be placed, such as glasses, and one or more generic 3D models of a human head.
System control logic 1220 for one embodiment may include any suitable interface controllers to provide for any suitable interface to at least one of the processor(s) 1240 and/or to any suitable device or component in communication with system control logic 1220.
System control logic 1220 for one embodiment may include one or more memory controller(s) (not shown) to provide an interface to system memory 1210. System memory 1210 may be used to load and store data and/or instructions, for example, for system 1200. System memory 1210 for one embodiment may include any suitable volatile memory, such as suitable dynamic random access memory (DRAM), for example.
NVM/storage 1230 may include one or more tangible, non-transitory computer-readable media used to store data and/or instructions, for example. NVM/storage 1230 may include any suitable non-volatile memory, such as flash memory, for example, and/or may include any suitable non-volatile storage device(s), such as one or more hard disk drive(s) (HDD(s)), one or more compact disk (CD) drive(s), and/or one or more digital versatile disk (DVD) drive(s), for example.
The NVM/storage 1230 may include a storage resource physically part of a device on which the system 1200 is installed or it may be accessible by, but not necessarily a part of, the device. For example, the NVM/storage 1230 may be accessed over a network via the network interface 1260.
System memory 1210 and NVM/storage 1230 may respectively include, in particular, temporary and persistent copies of, for example, the instruction memory portions holding the glasses visualisation application 1205 and 1235, respectively.
Network interface 1260 may provide a radio interface for system 1200 to communicate over one or more network(s) (e.g. a wireless communication network) and/or with any other suitable device.
Figure 13 shows a more specific example of a device to carry out the disclosed virtual trying experience method, in particular a smartphone embodiment 1300, where the method is carried out by an "app" downloaded to the smartphone 1300 via antenna 1310, to be run on a computer system 1200 (as per figure 12) within the smartphone 1300. The smartphone 1300 further includes a display and/or touch screen display 1320 for displaying the virtual try-on experience image formed according to the above described examples. The smartphone 1300 may optionally also include a set of dedicated input devices, such as keyboard 1320, especially when a touchscreen display is not provided.
A computer program may be formed of a list of executable instructions such as a particular application program and/or an operating system. The computer program may for example include one or more of: a subroutine, a function, a procedure, an object method, an object implementation, an executable application ("app"), an applet, a servlet, a source code portion, an object code portion, a shared library/dynamic load library and/or any other sequence of instructions designed for execution on a suitable computer system.
The computer program may be stored internally on a computer readable storage medium or transmitted to the computer system via a computer readable transmission medium. All or some of the computer program may be provided on computer readable media permanently, removably or remotely coupled to the programmable apparatus, such as an information processing system. The computer readable media may include, for example and without limitation, any one or more of the following: magnetic storage media including disk and tape storage media; optical storage media such as compact disk media (e.g., CD-ROM, CD-R, Blu-Ray (RTM), etc.), digital video disk storage media (DVD, DVD-R, DVD-RW, etc.) or high density optical media (e.g. Blu-Ray (RTM), etc.); non-volatile memory storage media including semiconductor-based memory units such as FLASH memory, EEPROM, EPROM, ROM; ferromagnetic digital memories; MRAM; volatile storage media including registers, buffers or caches, main memory, RAM, DRAM, DDR RAM etc.; and data transmission media including computer networks, point-to-point telecommunication equipment, and carrier wave transmission media, and the like. Embodiments of the invention may include tangible and non-tangible embodiments, transitory and non-transitory embodiments, and are not limited to any specific form of computer readable media used.
A computer process typically includes an executing (running) program or portion of a program, current program values and state information, and the resources used by the operating system to manage the execution of the process. An operating system (OS) is the software that manages the sharing of the resources of a computer and provides programmers with an interface used to access those resources. An operating system processes system data and user input, and responds by allocating and managing tasks and internal system resources as a service to users and programs of the system.
The computer system may for instance include at least one processing unit, associated memory and a number of input/output (I/O) devices. When executing the computer program, the computer system processes information according to the computer program and produces resultant output information via I/O devices.
In the foregoing specification, the invention has been described with reference to specific examples of embodiments of the invention. It will, however, be evident that various modifications and changes may be made therein without departing from the broader scope of the invention as set forth in the appended claims.
Those skilled in the art will recognize that the boundaries between logic blocks are merely illustrative and that alternative embodiments may merge logic blocks or circuit elements or impose an alternate decomposition of functionality upon various logic blocks or circuit elements. Thus, it is to be understood that the architectures depicted herein are merely exemplary, and that in fact many other architectures can be implemented which achieve the same functionality.
Any arrangement of components to achieve the same functionality is effectively "associated" such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality can be seen as "associated with" each other such that the desired functionality is achieved, irrespective of architectures or intermedial components. Likewise, any two components so associated can also be viewed as being "operably connected," or "operably coupled," to each other to achieve the desired functionality.
Furthermore, those skilled in the art will recognize that boundaries between the above described operations are merely illustrative. The multiple operations may be combined into a single operation, a single operation may be distributed in additional operations and operations may be executed at least partially overlapping in time. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments.
Also for example, the examples, or portions thereof, may be implemented as soft or code representations of physical circuitry or of logical representations convertible into physical circuitry, such as in a hardware description language of any appropriate type.
Also, the invention is not limited to physical devices or units implemented in non-programmable hardware but can also be applied in programmable devices or units able to perform the desired device functions by operating in accordance with suitable program code, such as mainframes, minicomputers, servers, workstations, personal computers, notepads, personal digital assistants, electronic games, automotive and other embedded systems, cell phones and various other wireless devices, commonly denoted in this application as 'computer systems'.
However, other modifications, variations and alternatives are also possible. The specifications and drawings are, accordingly, to be regarded in an illustrative rather than in a restrictive sense.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word 'comprising' does not exclude the presence of other elements or steps than those listed in a claim. Furthermore, the terms "a" or "an," as used herein, are defined as one or more than one. Also, the use of introductory phrases such as "at least one" and "one or more" in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles "a" or "an" limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases "one or more" or "at least one" and indefinite articles such as "a" or "an." The same holds true for the use of definite articles. Unless stated otherwise, terms such as "first" and "second" are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements. The mere fact that certain measures are recited in mutually different claims does not indicate that a combination of these measures cannot be used to advantage.
Examples provide a method of providing a virtual trying on experience to a user comprising extracting at least one image from a video including a plurality of video frames of a user in different orientations to provide at least one extracted image, determining user movement in the at least one extracted image, acquiring 3D models of an item to be tried on the user and a generic representation of a human, combining the acquired 3D models and the at least one extracted image as the background, and generating an output image representative of the virtual trying-on experience.
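By way of illustration only, the following sketch shows one possible way of extracting still images from the captured video for use as background frames. It is written in Python using OpenCV, neither of which is mandated by the examples, and the frame-sampling step is an arbitrary assumption of this sketch.

import cv2

def extract_frames(video_path, step=5):
    """Return every `step`-th frame of the captured video as a separate image."""
    capture = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)  # extracted image, later used as a background
        index += 1
    capture.release()
    return frames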
In some examples, the determining user movement in the at least one extracted image further comprises determining a maximum angle of rotation of the user in a first direction.
In some examples, the determining user movement in the at least one extracted image further comprises determining a maximum angle of rotation of the user in a second direction.
In some examples, the determining user movement in the at least one extracted image further comprises outputting a value indicative of the determined maximum angle of rotation of the user in the first or second directions.
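As a non-authoritative sketch of how the maximum angles of rotation might be determined, the example below estimates a per-frame head yaw from 2D facial landmarks using OpenCV's solvePnP and reports the largest rotation seen in each direction. The landmark detector, the generic 3D face coordinates and the pinhole camera approximation are assumptions made for illustration, not details taken from the examples.

import cv2
import numpy as np

# Approximate 3D positions (in mm) of the nose tip, chin, eye corners and mouth
# corners on a generic head; these values are illustrative assumptions only.
FACE_MODEL_3D = np.array([
    (0.0, 0.0, 0.0),        # nose tip
    (0.0, -63.6, -12.5),    # chin
    (-43.3, 32.7, -26.0),   # left eye outer corner
    (43.3, 32.7, -26.0),    # right eye outer corner
    (-28.9, -28.9, -24.1),  # left mouth corner
    (28.9, -28.9, -24.1),   # right mouth corner
], dtype=np.float64)

def estimate_yaw(landmarks_2d, frame_width, frame_height):
    """Estimate head yaw (degrees) from six 2D landmarks given as an Nx2 array."""
    focal = float(frame_width)  # crude pinhole approximation
    camera_matrix = np.array([[focal, 0.0, frame_width / 2.0],
                              [0.0, focal, frame_height / 2.0],
                              [0.0, 0.0, 1.0]], dtype=np.float64)
    _, rvec, _ = cv2.solvePnP(FACE_MODEL_3D,
                              np.asarray(landmarks_2d, dtype=np.float64),
                              camera_matrix, np.zeros((4, 1)))
    rotation, _ = cv2.Rodrigues(rvec)
    # Yaw is the rotation about the vertical axis (ZYX Euler decomposition).
    return np.degrees(np.arctan2(-rotation[2, 0],
                                 np.sqrt(rotation[2, 1] ** 2 + rotation[2, 2] ** 2)))

def max_rotation_angles(yaw_per_frame):
    """Maximum rotation in the first (negative yaw) and second (positive yaw) directions."""
    yaws = np.asarray(yaw_per_frame, dtype=np.float64)
    return float(max(0.0, -yaws.min())), float(max(0.0, yaws.max()))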
In some examples, the acquiring 3D models of an item to be tried on the user and a generic representation of a human further comprises selecting one of a plurality of 3D models of available generic humans.
In some examples, the method further comprises determining an origin point in each of the 3D models used, wherein the respective origin point in each 3D model is placed to allow alignment of the 3D models with one another.
In some examples, the method further comprises determining an orientation of the user in the at least one extracted image and orienting the 3D models in a 3D space according to the determined orientation of the user.
In some examples, the method further comprises adjusting an origin of at least one 3D model.
In some examples, the method further comprises aligning the origins of the 3D models.
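The following is a minimal sketch, under the assumption that each 3D model is available as an Nx3 array of vertices with a designated origin point, of how the two models might be placed in a common 3D space with their origins aligned and then oriented to match the determined orientation of the user. The NumPy representation and the single yaw angle are illustrative choices, not details taken from the examples.

import numpy as np

def align_to_origin(vertices, origin_point):
    """Translate a model so that its designated origin point sits at (0, 0, 0)."""
    return np.asarray(vertices, dtype=np.float64) - np.asarray(origin_point, dtype=np.float64)

def rotate_about_vertical(vertices, yaw_degrees):
    """Rotate a model about the vertical (y) axis to match the user's determined orientation."""
    a = np.radians(yaw_degrees)
    rotation = np.array([[np.cos(a), 0.0, np.sin(a)],
                         [0.0, 1.0, 0.0],
                         [-np.sin(a), 0.0, np.cos(a)]])
    return vertices @ rotation.T

def place_models(head_vertices, head_origin, item_vertices, item_origin, user_yaw_degrees):
    """Express both models in one 3D space, origins aligned, oriented like the user."""
    head = rotate_about_vertical(align_to_origin(head_vertices, head_origin), user_yaw_degrees)
    item = rotate_about_vertical(align_to_origin(item_vertices, item_origin), user_yaw_degrees)
    return head, item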
In some examples, the method further comprises dividing the maximum rotation of the user in first and second directions into a predetermined number of set angles, and extracting as many images as the determined number of set angles.
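One way the set angles might be used, sketched below under the assumption that a per-frame yaw estimate is already available, is to space the set angles evenly across the measured rotation range and keep the captured frame whose estimated yaw is closest to each set angle. The default number of set angles here is an arbitrary illustrative value.

import numpy as np

def select_frames_at_set_angles(frames, yaw_per_frame, max_first, max_second, count=9):
    """Pick one frame per set angle spanning -max_first .. +max_second degrees."""
    yaws = np.asarray(yaw_per_frame, dtype=np.float64)
    set_angles = np.linspace(-max_first, max_second, count)
    selected = []
    for angle in set_angles:
        nearest = int(np.argmin(np.abs(yaws - angle)))  # frame closest to this set angle
        selected.append((float(angle), frames[nearest]))
    return selected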
In some examples, the method further comprises adjusting respective positions of the 3D models and the background according to user input.
In some examples, the method further comprises capturing the rotation of the user using a video capture device.
In some examples, the determining user movement comprises determining movement of a user’s head.
There is also provided a method of providing a virtual trying on experience for a user, comprising receiving a plurality of video frames of a user’s head in different orientations to provide captured oriented user images, identifying origin reference points on the captured oriented user images, identifying an origin point on a 3D model of a generic user, identifying an origin reference point on a 3D model of a user-selected item to be tried on, aligning the reference points of the selected captured oriented user images, the 3D model of the generic user and the 3D model of the item to be tried on, combining the captured oriented user images with a generated representation of the user-selected item to be tried on to provide a combined image, and displaying the combined image.
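Purely as a sketch of the combining step, the example below alpha-blends a rendered BGRA image of the selected item over a captured user frame used as the background. The rendered overlay is assumed to have been produced, by any renderer, from the aligned 3D models at the matching orientation; the channel layout and data types follow OpenCV conventions and are assumptions of this illustration.

import numpy as np

def combine_frame_and_item(background_bgr, rendered_item_bgra):
    """Alpha-blend a rendered item image over a captured user frame of the same size."""
    alpha = rendered_item_bgra[:, :, 3:4].astype(np.float64) / 255.0
    item = rendered_item_bgra[:, :, :3].astype(np.float64)
    base = background_bgr.astype(np.float64)
    combined = alpha * item + (1.0 - alpha) * base
    return combined.astype(np.uint8)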
In some examples, the receiving a plurality of video frames of a user’s head in different orientations to provide captured oriented user images further comprises selecting only a subset of all the captured video frames to use in the subsequent processing of the captured oriented user images.
In some examples, the selected subset is a pre-determined subset, or is user-selectable.
In some examples, the method further comprises identifying one or more attachment points of the item to the user.
In some examples, the method further comprises rotating or translating the attachment points in the 3D space to re-align the item to the user in a user-specified way.
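A minimal sketch of such a re-alignment, assuming the attachment points are held as an Nx3 array and that the user-specified adjustment is expressed as yaw and pitch angles plus a translation, is given below; these particular parameters are illustrative assumptions rather than features recited in the examples.

import numpy as np

def adjust_attachment_points(points, yaw_degrees=0.0, pitch_degrees=0.0, offset=(0.0, 0.0, 0.0)):
    """Rotate (yaw about y, pitch about x) and translate attachment points in 3D space."""
    y, p = np.radians(yaw_degrees), np.radians(pitch_degrees)
    yaw = np.array([[np.cos(y), 0.0, np.sin(y)],
                    [0.0, 1.0, 0.0],
                    [-np.sin(y), 0.0, np.cos(y)]])
    pitch = np.array([[1.0, 0.0, 0.0],
                      [0.0, np.cos(p), -np.sin(p)],
                      [0.0, np.sin(p), np.cos(p)]])
    pts = np.asarray(points, dtype=np.float64)
    return pts @ (pitch @ yaw).T + np.asarray(offset, dtype=np.float64)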
In some examples, the providing a virtual trying on experience for a user comprises generating a visual representation of a user trying on an item, and wherein the trying on of an item on a user comprises trying on an item on a user’s head. In some examples, the item being tried on is a pair of glasses.
Unless otherwise stated as incompatible, or unless the physics or other constraints of the embodiments prevent such a combination, the features of the following claims may be integrated together in any suitable and beneficial arrangement. This is to say that the combination of features is not limited by the specific form of the claims, particularly the form of the dependent claims, such as claim numbering and the like.

Claims (21)

1. A method of providing a virtual trying on experience to a user comprising:
extracting at least one image from a video including a plurality of video frames of a user in different orientations to provide at least one extracted image;
acquiring a 3D model of an item to be tried on the user and a 3D model of a generic representation of a human; and
combining the acquired 3D models with at least one extracted image, having the at least one extracted image as a background, to generate an output image representative of the virtual trying-on experience;
wherein each of the 3D models comprises an origin point, and the combining the acquired 3D models with the at least one extracted image comprises aligning the origin points of each of the 3D models in 3D space.
2. The method of claim 1, comprising determining user movement in the at least one extracted image.
3. The method of claim 2, wherein determining user movement in the at least one extracted image further comprises determining a maximum angle of rotation of the user in a first direction.
4. The method of claim 2 or 3, wherein determining user movement in the at least one extracted image further comprises determining a maximum angle of rotation of the user in a second direction.
5. The method of claim 3 or 4, wherein determining user movement in the at least one extracted image further comprises outputting a value indicative of the determined maximum angle of rotation of the user in the first direction when dependent on claim 3 or the second direction when dependent on claim 4.
6. The method of any preceding claim, wherein the acquiring the 3D model of an item to be tried on the user and the 3D model of a generic representation of a human further comprises selecting one of a plurality of 3D models of generic humans.
7. The method of any preceding claim, further comprising determining an orientation of the user in the at least one extracted image and orienting the 3D models in a 3D space according to the determined orientation of the user.
8. The method of claim 1, further comprising adjusting the origin point of at least one of the 3D models.
9. The method of any of claims 3 to 5 or any claim dependent thereon, further comprising dividing the maximum rotation of the user in the first or the second direction into a predetermined number of set angles, and extracting as many images as the determined number of set angles.
10. The method of any preceding claim, further comprising adjusting respective positions of the 3D models and the background according to user input.
11. The method of any preceding claim, further comprising capturing the rotation of the user using a video capture device.
12. The method of any preceding claim, wherein determining user movement comprises determining movement of a user’s head.
13. A method of providing a virtual trying on experience for a user, comprising:
receiving a plurality of video frames of a user’s head in different orientations to provide captured oriented user images;
identifying an origin point on a 3D model of a generic user;
identifying an origin point on a 3D model of a user-selected item to be tried on;
aligning the origin points of the 3D model of the generic user and the 3D model of an item to be tried on;
combining each captured oriented user image as a background with a generated representation of the user-selected item to be tried on based on the aligned 3D model of the generic user and the 3D model of the item to provide a series of combined images representative of the virtual trying on experience; and
displaying the series of combined images.
14. The method of claim 13, wherein receiving a plurality of video frames of a user’s head in different orientations to provide captured oriented user images further comprises selecting only a subset of all the captured video frames to use in the subsequent processing of the captured oriented user images.
15. The method of claim 14, wherein the selected subset is a pre-determined subset, or is user-selectable.
16. The method of any of claims 13 to 15, further comprising identifying one or more attachment points of the item to the user.
17. The method of claim 16 wherein the method further comprises rotating or translating the attachment points in the 3D space to re-align the item to the user in a user-specified way.
18. The method of any of claims 13 to 17, wherein providing a virtual trying on experience for a user comprises generating a visual representation of a user trying on an item, and wherein the trying on of an item on a user comprises trying on an item on a user’s head.
19. The method of claim 18 wherein the item is a pair of glasses.
20. A computer readable medium comprising instructions, which, when executed by one or more processors, result in the one or more processors carrying out the method of any preceding claim.
21. A computer system arranged to carry out any preceding method claim or provide instructions to carry out any preceding method claim.
GB1503831.8A 2015-03-06 2015-03-06 Virtual trying-on experience Active GB2536060B (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
GB1503831.8A GB2536060B (en) 2015-03-06 2015-03-06 Virtual trying-on experience
PCT/GB2016/050596 WO2016142668A1 (en) 2015-03-06 2016-03-07 Virtual trying-on experience
NZ736107A NZ736107B2 (en) 2015-03-06 2016-03-07 Virtual trying-on experience
AU2016230943A AU2016230943B2 (en) 2015-03-06 2016-03-07 Virtual trying-on experience
EP16710283.9A EP3266000A1 (en) 2015-03-06 2016-03-07 Virtual trying-on experience

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1503831.8A GB2536060B (en) 2015-03-06 2015-03-06 Virtual trying-on experience

Publications (3)

Publication Number Publication Date
GB201503831D0 GB201503831D0 (en) 2015-04-22
GB2536060A GB2536060A (en) 2016-09-07
GB2536060B true GB2536060B (en) 2019-10-16

Family

ID=52998515

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1503831.8A Active GB2536060B (en) 2015-03-06 2015-03-06 Virtual trying-on experience

Country Status (4)

Country Link
EP (1) EP3266000A1 (en)
AU (1) AU2016230943B2 (en)
GB (1) GB2536060B (en)
WO (1) WO2016142668A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10685457B2 (en) 2018-11-15 2020-06-16 Vision Service Plan Systems and methods for visualizing eyewear on a user
EP4006628A1 (en) * 2020-11-27 2022-06-01 Fielmann Ventures GmbH Computer implemented method for providing and positioning of spectacles and for centration of the spectacle glasses

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103400119A (en) * 2013-07-31 2013-11-20 南京融图创斯信息科技有限公司 Face recognition technology-based mixed reality spectacle interactive display method
WO2013177456A1 (en) * 2012-05-23 2013-11-28 1-800 Contacts, Inc. Systems and methods for adjusting a virtual try-on
WO2013177464A1 (en) * 2012-05-23 2013-11-28 1-800 Contacts, Inc. Systems and methods for generating a 3-d model of a virtual try-on product
US8708494B1 (en) * 2012-01-30 2014-04-29 Ditto Technologies, Inc. Displaying glasses with recorded images
US20140293220A1 (en) * 2012-01-30 2014-10-02 Ditto Technologies, Inc. Fitting glasses frames to a user
US20140354947A1 (en) * 2013-05-29 2014-12-04 Ming Chuan University Virtual glasses try-on method and apparatus thereof

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6144388A (en) * 1998-03-06 2000-11-07 Bornstein; Raanan Process for displaying articles of clothing on an image of a person
EP2113881A1 (en) * 2008-04-29 2009-11-04 Holiton Limited Image producing method and device
CN108537628B (en) * 2013-08-22 2022-02-01 贝斯普客公司 Method and system for creating customized products


Also Published As

Publication number Publication date
EP3266000A1 (en) 2018-01-10
AU2016230943B2 (en) 2021-03-25
GB2536060A (en) 2016-09-07
WO2016142668A1 (en) 2016-09-15
GB201503831D0 (en) 2015-04-22
NZ736107A (en) 2021-08-27
AU2016230943A1 (en) 2017-10-26

Similar Documents

Publication Publication Date Title
JP7386812B2 (en) lighting estimation
Baggio Mastering OpenCV with practical computer vision projects
Magnenat et al. Live texturing of augmented reality characters from colored drawings
CN108986016B (en) Image beautifying method and device and electronic equipment
US11138306B2 (en) Physics-based CAPTCHA
KR102433857B1 (en) Device and method for creating dynamic virtual content in mixed reality
WO2020192195A1 (en) Image processing method and apparatus, and electronic device
CN111275824A (en) Surface reconstruction for interactive augmented reality
KR20230162107A (en) Facial synthesis for head rotations in augmented reality content
GB2536060B (en) Virtual trying-on experience
US8891857B2 (en) Concave surface modeling in image-based visual hull
CN113178017A (en) AR data display method and device, electronic equipment and storage medium
CN113012031A (en) Image processing method and image processing apparatus
CN115965735B (en) Texture map generation method and device
US11471773B2 (en) Occlusion in mobile client rendered augmented reality environments
CN113223128B (en) Method and apparatus for generating image
KR101630257B1 (en) 3D image providing system and providing method thereof
JP2024508457A (en) Method and system for providing temporary texture applications to enhance 3D modeling
CN115035224A (en) Method and apparatus for image processing and reconstructed image generation
NZ736107B2 (en) Virtual trying-on experience
CN112465692A (en) Image processing method, device, equipment and storage medium
CN111652025A (en) Face processing method, live broadcast method, device, electronic equipment and storage medium
US20230088866A1 (en) Digital garment generation
US20240135501A1 (en) Video generation method and apparatus, device and medium
JP2023542598A (en) Character display methods, devices, electronic devices, and storage media