WO2010022185A1 - Methods and systems for content processing - Google Patents
- Publication number
- WO2010022185A1 (PCT/US2009/054358)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- data
- processing
- user
- mobile phone
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/40—Software arrangements specially adapted for pattern recognition, e.g. user interfaces or toolboxes therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/50—Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
- G06V10/507—Summing image-intensity values; Histogram projection analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/56—Extraction of image or video features relating to colour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/80—Recognising image objects characterised by unique random patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/142—Image acquisition using hand-held instruments; Constructional details of the instruments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04B—TRANSMISSION
- H04B1/00—Details of transmission systems, not covered by a single one of groups H04B3/00 - H04B13/00; Details of transmission systems not characterised by the medium used for transmission
- H04B1/38—Transceivers, i.e. devices in which transmitter and receiver form a structural unit and in which at least one part is used for functions of transmitting and receiving
- H04B1/40—Circuits
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M1/00—Substation equipment, e.g. for use by subscribers
- H04M1/02—Constructional features of telephone sets
- H04M1/0202—Portable telephone sets, e.g. cordless phones, mobile phones or bar type handsets
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M1/00—Substation equipment, e.g. for use by subscribers
- H04M1/02—Constructional features of telephone sets
- H04M1/0202—Portable telephone sets, e.g. cordless phones, mobile phones or bar type handsets
- H04M1/026—Details of the structure or mounting of specific components
- H04M1/0264—Details of the structure or mounting of specific components for a camera module assembly
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N1/00—Scanning, transmission or reproduction of documents or the like, e.g. facsimile transmission; Details thereof
- H04N1/21—Intermediate information storage
- H04N1/2104—Intermediate information storage for one or a few pictures
- H04N1/2112—Intermediate information storage for one or a few pictures using still video cameras
- H04N1/2129—Recording in, or reproducing from, a specific memory area or areas, or recording or reproducing at a specific moment
- H04N1/2133—Recording or reproducing at a specific moment, e.g. time interval or time-lapse
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/60—Control of cameras or camera modules
- H04N23/61—Control of cameras or camera modules based on recognised objects
- H04N23/611—Control of cameras or camera modules based on recognised objects where the recognised objects include parts of the human body
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/60—Control of cameras or camera modules
- H04N23/64—Computer-aided capture of images, e.g. transfer from script file into camera, check of taken image quality, advice or proposal for image composition or decision on when to take image
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W4/00—Services specially adapted for wireless communication networks; Facilities therefor
- H04W4/50—Service provisioning or reconfiguring
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N1/00—Scanning, transmission or reproduction of documents or the like, e.g. facsimile transmission; Details thereof
- H04N1/00127—Connection or combination of a still picture apparatus with another apparatus, e.g. for storage, processing or transmission of still picture signals or of information associated with a still picture
- H04N1/00281—Connection or combination of a still picture apparatus with another apparatus, e.g. for storage, processing or transmission of still picture signals or of information associated with a still picture with a telecommunication apparatus, e.g. a switched network of teleprinters for the distribution of text-based information, a selective call terminal
- H04N1/00307—Connection or combination of a still picture apparatus with another apparatus, e.g. for storage, processing or transmission of still picture signals or of information associated with a still picture with a telecommunication apparatus, e.g. a switched network of teleprinters for the distribution of text-based information, a selective call terminal with a mobile telephone apparatus
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N1/00—Scanning, transmission or reproduction of documents or the like, e.g. facsimile transmission; Details thereof
- H04N1/32—Circuits or arrangements for control or supervision between transmitter and receiver or between image input and image output device, e.g. between a still-image camera and its memory or between a still-image camera and a printer device
- H04N1/32101—Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N2101/00—Still video cameras
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N2201/00—Indexing scheme relating to scanning, transmission or reproduction of documents or the like, and to details thereof
- H04N2201/32—Circuits or arrangements for control or supervision between transmitter and receiver or between image input and image output device, e.g. between a still-image camera and its memory or between a still-image camera and a printer device
- H04N2201/3201—Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title
- H04N2201/3225—Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title of data relating to an image, a page or a document
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N2201/00—Indexing scheme relating to scanning, transmission or reproduction of documents or the like, and to details thereof
- H04N2201/32—Circuits or arrangements for control or supervision between transmitter and receiver or between image input and image output device, e.g. between a still-image camera and its memory or between a still-image camera and a printer device
- H04N2201/3201—Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title
- H04N2201/3274—Storage or retrieval of prestored additional information
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N2201/00—Indexing scheme relating to scanning, transmission or reproduction of documents or the like, and to details thereof
- H04N2201/32—Circuits or arrangements for control or supervision between transmitter and receiver or between image input and image output device, e.g. between a still-image camera and its memory or between a still-image camera and a printer device
- H04N2201/3201—Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title
- H04N2201/3278—Transmission
Definitions
- a user's mobile phone captures imagery (either in response to user command, or autonomously), and objects within the scene are recognized. Information associated with each object is identified, and made available to the user through a scene-registered interactive visual "bauble" that is graphically overlaid on the imagery.
- the bauble may itself present information, or may simply be an indicia that the user can tap at the indicated location to obtain a lengthier listing of related information, or launch a related function/application.
- the camera has recognized the face in the foreground as "Bob" and annotated the image accordingly.
- a billboard promoting the Godzilla movie has been recognized, and a bauble saying "Show Times" has been blitted onto the display - inviting the user to tap for screening information.
- the phone has recognized the user's car from the scene, and has also identified - by make and year - another vehicle in the picture. Both are noted by overlaid text.
- a restaurant has also been identified, and an initial review from a collection of reviews ("Jane's review: Pretty Good!") is shown. Tapping brings up more reviews.
- this scenario is implemented as a cloud-side service assisted by local device object recognition core services. Users may leave notes on both fixed and mobile objects. Tapped baubles can trigger other applications. Social networks can keep track of object relationships - forming a virtual "web of objects."
- Object identification events will primarily fetch and associate public domain information and social-web connections to the baubles.
- Applications employing barcodes, digital watermarks, facial recognition, OCR, etc., can help support initial deployment of the technology.
- the arrangement is expected to evolve into an auction market, in which paying enterprises want to place their own baubles (or associated information) onto highly targeted demographic user screens.
- User profiles, in conjunction with the input visual stimuli (aided, in some cases, by GPS/magnetometer data), are fed into a Google-esque mix-master in the cloud, matching buyers of mobile device-screen real estate to users requesting the baubles.
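The matching just described can be sketched as a toy scoring auction. The names here (`Bid`, `select_bauble`) and the overlap-times-price scoring rule are illustrative assumptions, not details from the disclosure:

```python
from dataclasses import dataclass

@dataclass
class Bid:
    # A paying enterprise's offer to place a bauble on matching screens
    advertiser: str
    keywords: set    # visual/profile keywords the bid targets
    price: float     # amount offered per placement

def select_bauble(bids, user_keywords):
    """Score each bid by targeting overlap times offered price, and
    return the best match - or None if nothing targets this user."""
    scored = [(len(b.keywords & user_keywords) * b.price, b) for b in bids]
    scored = [(value, b) for value, b in scored if value > 0]
    return max(scored, key=lambda pair: pair[0])[1] if scored else None
```

A real system would fold many more signals (location, demographics, history) into the score; the point is only that each captured scene triggers a micro-auction over which bauble reaches the screen.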
- Digimarc's patent 6,947,571 shows a system in which a cell phone camera captures content (e.g., image data), and processes same to derive an identifier related to the imagery. This derived identifier is submitted to a data structure (e.g., a database), which indicates corresponding data or actions. The cell phone then displays responsive information, or takes responsive action. Such sequence of operations is sometimes referred to as "visual search.”
- Fig. 0 is a diagram showing an exemplary embodiment incorporating certain aspects of the technology detailed herein.
- Fig. 1 is a high level view of an embodiment incorporating aspects of the present technology.
- Fig. 2 shows some of the applications that a user may request a camera-equipped cell phone to perform.
- Fig. 3 identifies some of the commercial entities in an embodiment incorporating aspects of the present technology.
- Figs. 4, 4A and 4B conceptually illustrate how pixel data, and derivatives, are applied in different tasks, and packaged into packet form.
- Fig. 5 shows how different tasks may have certain image processing operations in common.
- Fig. 6 is a diagram illustrating how common image processing operations can be identified, and used to configure cell phone processing hardware to perform these operations.
- Fig. 7 is a diagram showing how a cell phone can send certain pixel-related data across an internal bus for local processing, and send other pixel-related data across a communications channel for processing in the cloud.
- Fig. 8 shows how the cloud processing in Fig. 7 allows tremendously more "intelligence" to be applied to a task desired by a user.
- Fig. 9 details how keyvector data is distributed to different external service providers, who perform services in exchange for compensation, which is handled in consolidated fashion for the user.
- Fig. 10 shows an embodiment incorporating aspects of the present technology, noting how cell phone- based processing is suited for simple object identification tasks - such as template matching, whereas cloud- based processing is suited for complex tasks - such as data association.
- Fig. 10A shows an embodiment incorporating aspects of the present technology, noting that the user experience is optimized by performing visual keyvector processing as close to a sensor as possible, and administering traffic to the cloud as low in a communications stack as possible.
- Fig. 11 illustrates that tasks referred for external processing can be routed to a first group of service providers who routinely perform certain tasks for the cell phone, or can be routed to a second group of service providers who compete on a dynamic basis for processing tasks from the cell phone.
- Fig. 12 further expands on concepts of Fig. 11, e.g., showing how a bid filter and broadcast agent software module may oversee a reverse auction process.
- Fig. 13 is a high level block diagram of a processing arrangement incorporating aspects of the present technology.
- Fig. 14 is a high level block diagram of another processing arrangement incorporating aspects of the present technology.
- Fig. 15 shows an illustrative range of image types that may be captured by a cell phone camera.
- Fig. 16 shows a particular hardware implementation incorporating aspects of the present technology.
- Fig. 17 illustrates aspects of a packet used in an exemplary embodiment.
- Fig. 18 is a block diagram illustrating an implementation of the SIFT technique.
- Fig. 19 is a block diagram illustrating, e.g., how packet header data can be changed during processing, through use of a memory.
- Fig. 19A shows a prior art architecture from the robotic Player Project.
- Fig. 19B shows how various factors can influence how different operations may be handled.
- Fig. 20 shows an arrangement by which a cell phone camera and a cell phone projector share a lens.
- Fig. 20A shows a reference platform architecture that can be used in embodiments of the present technology.
- Fig. 21 shows an image of a desktop telephone captured by a cell phone camera.
- Fig. 22 shows a collection of similar images found in a repository of public images, by reference to characteristics discerned from the image of Fig. 21.
- Figs. 23-28A, and 30-34 are flow diagrams detailing methods incorporating aspects of the present technology.
- Fig. 29 is an arty shot of the Eiffel Tower, captured by a cell phone user.
- Fig. 35 is another image captured by a cell phone user.
- Fig. 36 is an image of an underside of a telephone, discovered using methods according to aspects of the present technology.
- Fig. 37 shows part of the physical user interface of one style of cell phone.
- Figs. 37A and 37B illustrate different linking topologies.
- Fig. 38 is an image captured by a cell phone user, depicting an Appalachian Trail trail marker.
- Figs. 39-43 detail methods incorporating aspects of the present technology.
- Fig. 44 shows the user interface of one style of cell phone.
- Figs. 45A and 45B illustrate how different dimensions of commonality may be explored through use of a user interface control of a cell phone.
- Figs. 46A and 46B detail a particular method incorporating aspects of the present technology, by which keywords such as Prometheus and Paul Manship are automatically determined from a cell phone image.
- Fig. 47 shows some of the different data sources that may be consulted in processing imagery according to aspects of the present technology.
- Figs. 48A, 48B and 49 show different processing methods according to aspects of the present technology.
- Fig. 50 identifies some of the different processing that may be performed on image data, in accordance with aspects of the present technology.
- Fig. 51 shows an illustrative tree structure that can be employed in accordance with certain aspects of the present technology.
- Fig. 52 shows a network of wearable computers (e.g., cell phones) that can cooperate with each other, e.g., in a peer-to-peer network.
- Figs. 53-55 detail how a glossary of signs can be identified by a cell phone, and used to trigger different actions.
- Fig. 56 illustrates aspects of prior art digital camera technology.
- Fig. 57 details an embodiment incorporating aspects of the present technology.
- Fig. 58 shows how a cell phone can be used to sense and display affine parameters.
- Fig. 59 illustrates certain state machine aspects of the present technology.
- Fig. 60 illustrates how even "still" imagery can include temporal, or motion, aspects.
- Fig. 61 shows some metadata that may be involved in an implementation incorporating aspects of the present technology.
- Fig. 62 shows an image that may be captured by a cell phone camera user.
- Figs. 63-66 detail how the image of Fig. 62 can be processed to convey semantic metadata.
- Fig. 67 shows another image that may be captured by a cell phone camera user.
- Figs. 68 and 69 detail how the image of Fig. 67 can be processed to convey semantic metadata.
- Fig. 70 shows an image that may be captured by a cell phone camera user.
- Fig. 71 details how the image of Fig. 70 can be processed to convey semantic metadata.
- Fig. 72 is a chart showing aspects of the human visual system.
- Fig. 73 shows different low, mid and high frequency components of an image.
- Fig. 74 shows a newspaper page.
- Fig. 75 shows the layout of the Fig. 74 page, as set by layout software.
- Fig. 76 details how user interaction with imagery captured from printed text may be enhanced.
- Fig. 77 illustrates how semantic conveyance of metadata can have a progressive aspect, akin to JPEG2000 and the like.
- Fig. 78 is a block diagram of a prior art thermostat.
- Fig. 79 is an exterior view of the thermostat of Fig. 78.
- Fig. 80 is a block diagram of a thermostat employing certain aspects of the present technology ("ThingPipe").
- Fig. 81 is a block diagram of a cell phone embodying certain aspects of the present technology.
- Fig. 82 is a block diagram by which certain operations of the thermostat of Fig. 80 are explained.
- Fig. 83 shows a cell phone display depicting an image captured from a thermostat, onto which is overlaid certain touch-screen targets that the user can touch to increment or decrement the thermostat temperature.
- Fig. 84 is similar to Fig. 83, but shows a graphical user interface for use on a phone without a touchscreen.
- Fig. 85 is a block diagram of an alarm clock employing aspects of the present technology.
- Fig. 86 shows a screen of an alarm clock user interface that may be presented on a cell phone, in accordance with one aspect of the technology.
- Fig. 87 shows a screen of a user interface, detailing nearby devices that may be controlled through use of the cell phone.
- a distributed network of pixel processing engines serves such mobile device users and meets most qualitative "human real time interactivity" requirements, generally with feedback in much less than one second.
- Implementation desirably provides certain basic features on the mobile device, including a rather intimate relationship between the image sensor's output pixels and the native communications channel available to the mobile device.
- a session further entails fast responses transmitted back to the mobile device; for some services marketed as "real time" or "interactive," a session essentially represents a duplex, generally packet-based, communication, in which several outgoing "pixel packets" and several incoming response packets (which may be pixel packets updated with the processed data) may occur every second.
- Fig. 1 is but one graphic perspective on some of these plumbing features of what might be called a visually intelligent network. (Conventional details of a cell phone, such as the microphone, A/D converter, modulation and demodulation systems, IF stages, cellular transceiver, etc., are not shown for clarity of illustration.)
- Fig. 2 depicts a non-exhaustive but illustrative list of visual processing applications for mobile devices. Again, it is hard not to see analogies between this list and the fundamentals of how the human visual system and the human brain operate. It is a well studied academic area that deals with how "optimized" the human visual system is relative to any given object recognition task, where a general consensus is that the eye-retina-optic nerve-cortex system is pretty darn wonderful in how efficiently it serves a vast array of cognitive demands.
- This aspect of the technology relates to how similarly efficient and broadly enabling elements can be built into mobile devices, mobile device connections and network services, all with the goal of serving the applications depicted in Fig. 2, as well as those new applications which may show up as the technology dance continues.
- Fig. 4 sprints toward the abstract in the introduction of the technology aspect now being considered.
- Fig. 4A then quickly introduces the intuitively well-known concept that singular bits of visual information aren't worth much outside of their role in both spatial and temporal groupings. This core concept is well exploited in modern video compression standards such as MPEG-4 and H.264.
- the "visual" character of the bits may be pretty far removed from the visual domain by certain of the processing (consider, e.g., the vector strings representing eigenface data).
- keyvector data or “keyvector strings” to refer collectively to raw sensor/stimulus data (e.g., pixel data), and/or to processed information and associated derivatives.
- a keyvector may take the form of a container in which such information is conveyed (e.g., a data structure such as a packet).
- a tag or other data can be included to identify the type of information (e.g., JPEG image data, eigenface data), or the data type may be otherwise evident from the data or from context.
- One or more instructions, or operations may be associated with keyvector data - either expressly detailed in the keyvector, or implied.
- An operation may be implied in default fashion, for keyvector data of certain types (e.g., for JPEG data it may be "store the image;" for eigenface data it may be "match this eigenface template"). Or an implied operation may be dependent on context.
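A minimal sketch of such a container - a type tag, a payload, and an operation that is either express or implied by data type. All names here (`Keyvector`, `DEFAULT_OPS`) are hypothetical illustrations of the concept, not structures from the disclosure:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical defaults: operations implied by keyvector data type
DEFAULT_OPS = {
    "jpeg": "store the image",
    "eigenface": "match this eigenface template",
}

@dataclass
class Keyvector:
    # A container conveying raw or derived stimulus data
    data_type: str                   # e.g., "jpeg", "eigenface"
    payload: bytes
    operation: Optional[str] = None  # express instruction, if any

    def effective_operation(self):
        # An express instruction wins; otherwise fall back to the
        # default implied for this data type (context could refine this)
        return self.operation or DEFAULT_OPS.get(self.data_type)
```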
- Figs. 4A and 4B also introduce a central player in this disclosure: the packaged and address-labeled pixel packet, into a body of which keyvector data is inserted.
- the keyvector data may be a single patch, or a collection of patches, or a time-series of patches/collections.
- a pixel packet may be less than a kilobyte, or its size can be much larger. It may convey information about an isolated patch of pixels excerpted from a larger image, or it may convey a massive Photosynth of Notre Dame cathedral.
- a pixel packet is an application layer construct. When actually pushed around a network, however, it may be broken into smaller portions - as transport layer constraints in a network may require.
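That application-layer/transport-layer split might be sketched as follows; the fragment size and helper names are illustrative assumptions:

```python
def fragment(pixel_packet: bytes, mtu: int = 1400):
    """Split an application-layer pixel packet into transport-sized
    fragments, tagging each with an index so it can be reassembled."""
    return [(i // mtu, pixel_packet[i:i + mtu])
            for i in range(0, len(pixel_packet), mtu)]

def reassemble(fragments):
    """Rebuild the original pixel packet from (index, chunk) fragments,
    which may arrive out of order."""
    return b"".join(chunk for _, chunk in sorted(fragments))
```

The pixel packet itself remains the unit of meaning; fragmentation is purely a transport accommodation, invisible to the services that process the packet.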
- Fig. 5 is a segue diagram - still at an abstract level, but pointing toward the concrete.
- a list of user-defined applications, such as illustrated in Fig. 2, will map to a state-of-the-art inventory of pixel processing methods and approaches which can accomplish each and every application. These pixel processing methods break down into common and not-so-common component sub-tasks.
- Object recognition textbooks are filled with a wide variety of approaches and terminologies which bring a sense of order into what at first glance might appear to be a bewildering array of "unique requirements" relative to the applications shown in Fig. 2.
- Fig. 5 attempts to show that there are indeed a set of common steps and processes shared between visual processing applications.
- the differently shaded pie slices attempt to illustrate that certain pixel operations are of a specific class and may simply have differences in low level variables or optimizations.
- the size of the overall pie (thought of in a logarithmic sense, where a pie twice the size of another may represent 10 times more Flops, for example), and the percentage size of the slice, represent degrees of commonality.
- Fig. 6 takes a major step toward the concrete, sacrificing simplicity in the process.
- the turned on applications negotiate to identify their common component tasks, labeled the "Common Processes Sorter" - first generating an overall common list of pixel processing routines available for on-device processing, chosen from a library of these elemental image processing routines (e.g., FFT, filtering, edge detection, resampling, color histogramming, log-polar transform, etc.).
- Generation of corresponding Flow Gate Configuration/Software Programming information follows, which literally loads library elements into properly ordered places in a field programmable gate array set-up, or otherwise configures a suitable processor to perform the required component tasks.
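The negotiation among enabled applications might be sketched as a tally over each application's required elemental routines, most-shared first; the application names and routine lists below are invented for illustration:

```python
from collections import Counter

# Hypothetical mapping from enabled applications to the elemental
# routines each needs, drawn from the on-device library
APP_REQUIREMENTS = {
    "face_recognition": ["fft", "edge_detect", "log_polar", "resample"],
    "barcode_reading":  ["edge_detect", "resample", "threshold"],
    "watermark_detect": ["fft", "resample", "color_histogram"],
}

def common_processes(enabled_apps):
    """Tally which elemental routines the enabled applications share,
    returning them most-common first - a toy 'Common Processes Sorter'."""
    counts = Counter()
    for app in enabled_apps:
        counts.update(APP_REQUIREMENTS[app])
    return [routine for routine, _ in counts.most_common()]
```

The resulting ordered list is what would drive the Flow Gate Configuration step: the most widely shared routines are the best candidates for loading into the gate array.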
- Fig. 6 also includes depictions of the image sensor, followed by a universal pixel segmenter.
- This pixel segmenter breaks down the massive stream of imagery from the sensor into manageable spatial and/or temporal blobs (e.g., akin to MPEG macroblocks, wavelet transform blocks, 64 x 64 pixel blocks, etc.). After the torrent of pixels has been broken down into chewable chunks, they are fed into the newly programmed gate array (or other hardware), which performs the elemental image processing tasks associated with the selected applications.
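A sketch of that segmentation over a plain 2-D pixel array, using the 64 x 64 block size mentioned in the text (edge blocks are simply smaller; a real segmenter would also form temporal groupings):

```python
def segment_frame(frame, block=64):
    """Break a frame (a 2-D list of pixel values) into block x block
    chunks, scanned left-to-right, top-to-bottom."""
    height, width = len(frame), len(frame[0])
    return [[row[x:x + block] for row in frame[y:y + block]]
            for y in range(0, height, block)
            for x in range(0, width, block)]
```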
- Various output products are sent to a routing engine, which refers the elementally-processed data (e.g., keyvector data) to other resources (internal and/or external) for further processing.
- This further processing typically is more complex than that already performed. Examples include making associations, deriving inferences, pattern and template matching, etc.
- This further processing can be highly application-specific.
- When an image match is found, the cell phone immediately reports same wirelessly to Pepsi. The winner is the user whose cell phone first reports detection of the specially-marked six-pack.
- some of the component tasks in the SIFT pattern matching operation are performed by the elemental image processing in the configured hardware; others are referred for more specialized processing - either internal or external.)
- Fig. 7 up-levels the picture to a generic distributed pixel services network view, where local device pixel services and "cloud based" pixel services have a kind of symmetry in how they operate.
- the router in Fig. 7 takes care of how any given packaged pixel packet gets sent to the appropriate pixel processing location, whether local or remote (with the style of fill pattern denoting different component processing functions; only a few of the processing functions required by the enabled visual processing services are depicted).
- Some of the data shipped to cloud-based pixel services may have been first processed by local device pixel services.
- the circles indicate that the routing functionality may have components in the cloud - nodes that serve to distribute tasks to active service providers, and collect results for transmission back to the device.
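The routing decision can be sketched as a simple dispatch on the component processing function named in a pixel packet's header; the registry below is a hypothetical stand-in for the device's actual capability list:

```python
# Hypothetical registry of component functions the local device handles;
# anything else is referred to cloud-based pixel services
LOCAL_FUNCTIONS = {"fft", "edge_detect", "template_match"}

def route(pixel_packet: dict) -> str:
    """Return 'local' or 'cloud' for a packaged pixel packet, based on
    the component processing function it requests."""
    return "local" if pixel_packet["function"] in LOCAL_FUNCTIONS else "cloud"
```

In practice the decision would also weigh battery, bandwidth, and latency, but the symmetry the figure depicts is just this: the same packet format flows to either destination.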
- Fig. 8 is an expanded view of the lower right portion of Fig. 7, and attempts to illustrate the radical extra dimensionality of pixel processing in the cloud as opposed to the local device. This virtually goes without saying (or without a picture), but Fig. 8 is also a segue figure to Fig. 9, where Dorothy gets back to Kansas and is happy about it.
- Fig. 9 is all about cash, cash flow, and happy humans using cameras on their mobile devices and getting highly meaningful results back from their visual queries, all the while paying one monthly bill. It turns out the Google "AdWords" auction genie is out of the bottle. Behind the scenes of the moment-by-moment visual scans from a mobile user of their immediate visual environment are hundreds and thousands of micro-decisions, pixel routings, results comparisons and micro-auctioned channels back to the mobile device user for the hard good they are "truly" looking for, whether they know it or not. This last point is deliberately cheeky, in that searching of any kind is inherently open ended and magical at some level, and part of the fun of searching in the first place is that surprisingly new associations are part of the results.
- the search user knows after the fact what they were truly looking for.
- the system represented in Fig. 9 as the carrier-based financial tracking server, now sees the addition of our networked pixel services module and its role in facilitating pertinent results being sent back to a user, all the while monitoring the uses of the services in order to populate the monthly bill and send the proceeds to the proper entities.
- money flow may not exclusively be to remote service providers.
- Other money flows can arise, such as to users or other parties, e.g., to induce or reward certain actions.
- Fig. 10 focuses on functional division of processing - illustrating how tasks in the nature of template matching can be performed on the cell phone itself, whereas more sophisticated tasks (in the nature of data association) desirably are referred to the cloud for processing.
- Elements of the foregoing are distilled in Fig. 10A, showing an implementation of aspects of the technology as a physical matter of (usually) software components.
- the two ovals in the figure highlight the symmetric pair of software components which are involved in setting up a "human real-time" visual recognition session between a mobile device and the generic cloud or service providers, data associations and visual query results.
- the oval on the left refers to "keyvectors" and more specifically "visual keyvectors."
- this term can encompass everything from simple JPEG compressed blocks all the way through log-polar transformed facial feature vectors and anything in between and beyond.
- the point of a keyvector is that the essential raw information of some given visual recognition task has been optimally pre-processed and packaged (possibly compressed).
- the oval on the left assembles these packets, and typically inserts some addressing information by which they will be routed. (Final addressing may not be possible, as the packet may ultimately be routed to remote service providers - the details of which may not yet be known.) Desirably, this processing is performed as close to the raw sensor data as possible, such as by processing circuitry integrated on the same substrate as the image sensor, which is responsive to software instructions stored in memory or provided from another stage in packet form.
- the oval on the right administers the remote processing of keyvector data, e.g., attending to arranging appropriate services, directing traffic flow, etc. Desirably, this software process is implemented as low down on a communications stack as possible, generally on a "cloud side" device, access point, or cell tower.
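The keyvector packaging described above might be sketched as follows. This is a minimal Python illustration only; the class name, field names, and routing hint are all hypothetical, since the specification does not prescribe a concrete packet format:

```python
from dataclasses import dataclass, field

@dataclass
class KeyvectorPacket:
    """Hypothetical container for pre-processed visual data (a "keyvector").

    The payload may be anything from a JPEG-compressed block to log-polar
    facial feature vectors; routing info may be partial, since the final
    remote service provider may not yet be known.
    """
    payload: bytes                                    # pre-processed (possibly compressed) data
    kind: str                                         # e.g. "jpeg-block", "log-polar-face"
    route_hints: list = field(default_factory=list)   # partial addressing information

def package_keyvector(raw: bytes, kind: str, hints=()) -> KeyvectorPacket:
    # In practice this would run as close to the raw sensor data as
    # possible; here we simply wrap the bytes with routing metadata.
    return KeyvectorPacket(payload=raw, kind=kind, route_hints=list(hints))

pkt = package_keyvector(b"\x00" * 16, "jpeg-block", ["cell-tower-503"])
```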
- Figs. 11 and 12 illustrate the concept that some providers of some cloud-based pixel processing services may be established in advance, in a pseudo-static fashion, whereas other providers may periodically vie for the privilege of processing a user's keyvector data, through participation in a reverse auction. In many implementations, these latter providers compete each time a packet is available for processing.
- Consider recognition of a photographed car. A startup vendor may offer to perform recognition for free - to build its brand or collect data; imagery submitted to this service returns information simply indicating the car's make and model. Consumer Reports may offer an alternative service - which provides make and model data, but also provides technical specifications for the car. However, it may charge 2 cents for the service (or the cost may be bandwidth based, e.g., 1 cent per megapixel). Edmunds, or JD Powers, may offer still another service, which provides data like Consumer Reports, but pays the user for the privilege of providing data. In exchange, the vendor is given the right to have one of its partners send a text message to the user promoting goods or services. The payment may take the form of a credit on the user's monthly cell phone voice/data service billing.
- a query router and response manager determines whether the packet of data needing processing should be handled by one of the service providers in the stable of static standbys, or whether it should be offered to providers on an auction basis - in which case it arbitrates the outcome of the auction.
- the static standby service providers may be identified when the phone is initially programmed, and only reconfigured when the phone is reprogrammed.
- For example, Verizon may specify that all FFT operations on its phones be routed to a server that it provides for this purpose.
- the user may be able to periodically identify preferred providers for certain tasks, as through a configuration menu, or specify that certain tasks should be referred for auction.
- Some applications may emerge where static service providers are favored; the task may be so mundane, or one provider's services so unparalleled, that competition for the provision of services isn't warranted.
- one input to the query router and response manager may be the user's location, so that a different service provider may be selected when the user is at home in Oregon, than when she is vacationing in Mexico.
- the required turnaround time is specified, which may disqualify some vendors, and make others more competitive.
- the query router and response manager need not decide at all, e.g., if cached results identifying a service provider selected in a previous auction are still available and not beyond a "freshness" threshold.
- Pricing offered by the vendors may change with processing load, bandwidth, time of day, and other considerations.
- the providers may be informed of offers submitted by competitors (using known trust arrangements assuring data integrity), and given the opportunity to make their offers more enticing. Such a bidding war may continue until no bidder is willing to change the offered terms.
- the query router and response manager (or in some implementations, the user) then makes a selection.
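The arbitration the query router and response manager might apply can be sketched in Python. The vendor names and prices echo the car-recognition example above; the selection rule (cheapest qualified bid, with turnaround-based disqualification) is an illustrative assumption, not a prescribed algorithm:

```python
def run_reverse_auction(bids, max_turnaround_ms):
    """Pick the cheapest qualified bid; bids that miss the required
    turnaround time are disqualified (illustrative logic only)."""
    qualified = [b for b in bids if b["turnaround_ms"] <= max_turnaround_ms]
    if not qualified:
        return None
    return min(qualified, key=lambda b: b["price_cents"])

# Hypothetical vendors echoing the car-recognition example in the text.
bids = [
    {"vendor": "startup",          "price_cents": 0.0,  "turnaround_ms": 900},
    {"vendor": "consumer-reports", "price_cents": 2.0,  "turnaround_ms": 400},
    {"vendor": "edmunds",          "price_cents": -1.0, "turnaround_ms": 1500},  # pays the user
]
# With a 1-second limit, the slow (but user-paying) vendor is disqualified.
winner = run_reverse_auction(bids, max_turnaround_ms=1000)
```

With a looser turnaround requirement the vendor that pays the user (negative price) would win instead, illustrating how the required turnaround time "may disqualify some vendors, and make others more competitive."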
- Fig. 12 shows a software module labeled "Bid Filter and Broadcast Agent." In most implementations this forms part of the query router and response manager module.
- the bid filter module decides which vendors - from a universe of possible vendors - should be given a chance to bid on a processing task. (The user's preference data, or historical experience, may indicate that certain service providers be disqualified.)
- the broadcast agent module then communicates with the selected bidders to inform them of a user task for processing, and provides information needed for them to make a bid. Desirably, the bid filter and broadcast agent do at least some of their work in advance of data being available for processing.
- these modules start working to identify a provider to perform a service expected to be required. A few hundred milliseconds later the user keyvector data may actually be available for processing (if the prediction turns out to be accurate).
- the service providers are not consulted at each user transaction. Instead, each provides bidding parameters, which are stored and consulted whenever a transaction is considered, to determine which service provider wins. These stored parameters may be updated occasionally. In some implementations the service provider pushes updated parameters to the bid filter and broadcast agent whenever available.
- the bid filter and broadcast agent may serve a large population of users, such as all Verizon subscribers in area code 503, or all subscribers to an ISP in a community, or all users at the domain well-dot-com, etc.; or more localized agents may be employed, such as one for each cell phone tower. If there is a lull in traffic, a service provider may discount its services for the next minute. The service provider may thus transmit (or post) a message stating that it will perform eigenvector extraction on an image file of up to 10 megabytes for 2 cents until Unix time 1244754176 (UTC), after which the price will return to 3 cents. The bid filter and broadcast agent updates a table of stored bidding parameters accordingly.
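The stored-bidding-parameters scheme, including a time-limited discount like the 2-cent offer above, might look like this minimal sketch (the entry's field names are hypothetical):

```python
def current_price_cents(entry, now_unix):
    """Return the effective price from a stored bid entry, honoring a
    time-limited discount (illustrative parameter names only)."""
    if entry.get("discount_until", 0) > now_unix:
        return entry["discount_price_cents"]
    return entry["price_cents"]

# Mirrors the example: 2 cents until Unix time 1244754176, then 3 cents.
entry = {"op": "eigenvector-extraction", "max_mb": 10,
         "price_cents": 3, "discount_price_cents": 2,
         "discount_until": 1244754176}

before = current_price_cents(entry, now_unix=1244754175)   # discount still active
after = current_price_cents(entry, now_unix=1244754177)    # discount expired
```

Because the parameters are stored locally in such a table, the service providers need not be consulted at each user transaction, as the text notes; they merely push updated entries when available.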
- the broadcast agent polls the bidders - communicating relevant parameters, and soliciting bid responses whenever a transaction is offered for processing.
- the broadcast agent transmits the keyvector data (and other parameters as may be appropriate to a particular task) to the winning bidder.
- the bidder then performs the requested operation, and returns the processed data to the query router and response manager.
- This module logs the processed data, and attends to any necessary accounting (e.g., crediting the service provider with the appropriate fee).
- the response data is then forwarded back to the user device.
- one or more of the competing service providers actually performs some or all of the requested processing, but "teases" the user (or the query router and response manager) by presenting only partial results. With a taste of what's available, the user (or the query router and response manager) may be induced to make a different choice than relevant criteria/heuristics would otherwise indicate.
- the function calls sent to external service providers do not have to provide the ultimate result sought by a consumer (e.g., identifying a car, or translating a menu listing from French to English). They can be component operations, such as calculating an FFT, or performing a SIFT procedure or a log-polar transform, or computing a histogram or eigenvectors, or identifying edges, etc.
- Additional business models can be enabled, involving the subsidization of consumed remote services by the service providers themselves in exchange for user information (e.g., for audience measurement), or in exchange for action taken by the user, such as completing a survey, or visiting specific sites or locations in a store, etc.
- Services may be subsidized by third parties as well, such as a coffee shop that derives value by providing a differentiating service to its customers in the form of free/discounted usage of remote services while they are seated in the shop.
- an economy is enabled wherein a currency of remote processing credits is created and exchanged between users and remote service providers.
- This may be entirely transparent to the user and managed as part of a service plan, e.g., with the user's cell phone or data service provider. Or it can be exposed as a very explicit aspect of certain embodiments of the present technology.
- Service providers and others may award credits to users for taking actions or being part of a frequent-user program to build allegiance with specific providers.
- a service may pay a user for opting-in to an audience measurement panel.
- the Nielsen Company may provide services to the public - such as identification of television programming from audio or video samples submitted by consumers. These services may be provided free to consumers who agree to share some of their media consumption data with Nielsen (such as by serving as an anonymous member for a city's audience measurement panel), and provided on a fee basis to others. Nielsen may offer, for example, 100 units of credit - micropayments or other value - to participating consumers each month, or may provide credit each time the user submits information to Nielsen.
- a consumer may be rewarded for accepting commercials, or commercial impressions, from a company. If a consumer goes into the Pepsi Center in Denver, she may receive a reward for each Pepsi-branded experience she encounters.
- the amount of micropayment may scale with the amount of time that she interacts with the different Pepsi-branded objects (including audio and imagery) in the venue. Not just large brand owners can provide credits to individuals. Credits can be routed to friends and social/business acquaintances.
- a user of Facebook may share credit (redeemable for goods/services, or exchangeable for cash) from his Facebook page - enticing others to visit, or linger. In some cases, the credit can be made available only to people who navigate to the Facebook page in a certain manner - such as by linking to the page from the user's business card, or from another launch page.
- Consider a Facebook user who has earned, paid for, or otherwise received credit that can be applied to certain services - such as downloading songs from iTunes, music recognition services, or identifying clothes that go with particular shoes (for which an image has been submitted), etc.
- These services may be associated with the particular Facebook page, so that friends can invoke the services from that page - essentially spending the host's credit (again, with suitable authorization or invitation by that hosting user).
- friends may submit images to a facial recognition service accessible through an application associated with the user's Facebook page. Images submitted in such fashion are analyzed for faces of the host's friends, and identification information is returned to the submitter, e.g., through a user interface presented on the originating Facebook page. Again, the host may be assessed a fee for each such operation, but may allow authorized friends to avail themselves of such service at no cost.
- a viewer exiting a theatre after a particularly moving movie about poverty in Bangladesh may capture an image of an associated movie poster, which serves as a portal for donations to a charity that serves the poor in Bangladesh.
- the cell phone can present a graphical/touch user interface through which the user spins dials to specify an amount of a charitable donation, which at the conclusion of the transaction is transferred from a financial account associated with the user, to one associated with the charity.
- Cloud can include anything external to the cell phone.
- An example is a nearby cell phone, or plural phones on a distributed network. Unused processing power on such other phone devices can be made available for hire (or for free) to call upon as needed.
- the cell phones of the implementations detailed herein can scavenge processing power from such other cell phones.
- Such a cloud may be ad hoc, e.g., other cell phones within Bluetooth range of the user's phone.
- the ad hoc network can be extended by having such other phones also extend the local cloud to further phones that they can reach by Bluetooth, but the user cannot.
- the “cloud” can also comprise other computational platforms, such as set-top boxes; processors in automobiles, thermostats, HVAC systems, wireless routers, local cell phone towers and other wireless network edges (including the processing hardware for their software-defined radio equipment), etc.
- processors can be used in conjunction with more traditional cloud computing resources - as are offered by Google, Amazon, etc.
- the phone desirably has a user-configurable option indicating whether the phone can refer data to cloud resources for processing.
- this option has a default value of "No," limiting functionality and impairing battery life, but also limiting privacy concerns.
- this option has a default value of "Yes."
- image-responsive techniques should produce a short term "result or answer," which generally requires some level of interactivity with a user - hopefully measured in fractions of a second for truly interactive applications, or a few seconds or fractions of a minute for nearer-term "I'm patient to wait" applications.
- the objects in question can break down into various categories, including (1) generic passive (clues to basic searches), (2) geographic passive (at least you know where you are, and may hook into geographic-specific resources), (3) "cloud supported" passive, as with "identified/enumerated objects" and their associated sites, and (4) active/controllable, a la ThingPipe (a reference to technology detailed below, such as WiFi-equipped thermostats and parking meters).
- An object recognition platform should not, it seems, be conceived in the classic "local device and local resources only" software mentality. However, it may be conceived as a local device optimization problem. That is, the software on the local device, and its processing hardware, should be designed in contemplation of their interaction with off-device software and hardware. Ditto the balance and interplay of control functionality, pixel crunching functionality, and application software/GUI provided on the device, versus off the device. (In many implementations, certain databases useful for object identification/recognition will reside remote from the device.)
- such a processing platform employs image processing near the sensor - optimally on the same chip, with at least some processing tasks desirably performed by dedicated, special purpose hardware.
- FIG. 13 shows an architecture of a cell phone 10 in which an image sensor 12 feeds two processing paths.
- One, 13, is tailored for the human visual system, and includes processing such as JPEG compression.
- Another, 14, is tailored for object recognition. As discussed, some of this processing may be performed by the mobile device, while other processing may be referred to the cloud 16.
- Fig. 14 takes an application-centric view of the object recognition processing path. Some applications reside wholly in the cell phone. Other applications reside wholly outside the cell phone - e.g., simply taking keyvector data as stimulus. More common are hybrids, such as where some processing is done in the cell phone, other processing is done externally, and the application software orchestrating the process resides in the cell phone.
- Fig. 15 shows a range 40 of some of the different types of images 41-46 that may be captured by a particular user's cell phone. A few brief (and incomplete) comments about some of the processing that may be applied to each image are provided in the following paragraphs.
- Image 41 depicts a thermostat.
- a steganographic digital watermark 47 is textured or printed on the thermostat's case. (The watermark is shown as visible in Fig. 15, but is typically imperceptible to the viewer). The watermark conveys information intended for the cell phone, allowing it to present a graphic user interface by which the user can interact with the thermostat. A bar code or other data carrier can alternatively be used. Such technology is further detailed below.
- Image 42 depicts an item including a barcode 48.
- This barcode conveys Universal Product Code (UPC) data.
- Other barcodes may convey other information.
- the barcode payload is not primarily intended for reading by a user cell phone (in contrast to watermark 47), but it nonetheless may be used by the cell phone to help determine an appropriate response for the user.
- Image 43 shows a product that may be identified without reference to any express machine readable information (such as a bar code or watermark).
- a segmentation algorithm may be applied to edge-detected image data to distinguish the apparent image subject from the apparent background.
- the image subject may be identified through its shape, color and texture.
- Image fingerprinting may be used to identify reference images having similar labels, and metadata associated with those other images may be harvested.
- SIFT techniques may be employed for such pattern-based recognition tasks. Specular reflections in low texture regions may tend to indicate the image subject is made of glass. Optical character recognition can be applied for further information (reading the visible text). All of these clues can be employed to identify the depicted item, and help determine an appropriate response for the user. Additionally (or alternatively), similar-image search systems, such as Google Similar Images, may be employed.
- Facial detection and recognition may be employed (i.e., to indicate that there are faces in the image, and to identify particular faces and annotate the image with metadata accordingly, e.g., by reference to user-associated data maintained by Apple's iPhoto service, Google's Picasa service, Facebook, etc.)
- Some facial recognition applications can be trained for non-human faces, e.g., cats, dogs, animated characters including avatars, etc.
- Geolocation and date/time information from the cell phone may also provide useful information.
- While a score of 90 (out of an arbitrary top match score of 100) might ordinarily be required to be considered a match, in searching such a group-constrained set of images a score of 70 or 80 might suffice.
- If, as in image 44, there are two persons depicted without sunglasses, the occurrence of both of these individuals in a photo with one or more other individuals may increase its relevance to such an analysis - implemented, e.g., by increasing a weighting factor in a matching algorithm.
- Image 45 shows part of the statue of Prometheus in Rockefeller Center, NY. Its identification can follow teachings detailed elsewhere in this specification.
- Image 46 is a landscape, depicting the Maroon Bells mountain range in Colorado. This image subject may be recognized by reference to geolocation data from the cell phone, in conjunction with geographic information services such as GeoNames or Yahoo!'s GeoPlanet.
- Fig. 16 gets into the nitty gritty of a particular implementation - incorporating certain of the features earlier discussed. (The other discussed features can be implemented by the artisan within this architecture, based on the provided disclosure.)
- operation of a cell phone camera 32 is dynamically controlled in accordance with packet data sent by a setup module 34, which in turn is controlled by a control processor module 36.
- Control processor module 36 may be the cell phone's primary processor, or an auxiliary processor, or this function may be distributed.
- the packet data further specifies operations to be performed by an ensuing chain of processing stages 38.
- setup module 34 dictates - on a frame by frame basis - the parameters that are to be employed by camera 32 in gathering an exposure.
- Setup module 34 also specifies the type of data the camera is to output. These instructional parameters are conveyed in a first field 55 of a header portion 56 of a data packet 57 corresponding to that frame (Fig. 17).
- the setup module 34 may issue a packet 57 whose first field 55 instructs the camera about, e.g., the length of the exposure, the aperture size, the lens focus, the depth of field, etc.
- Module 34 may further author the field 55 to specify that the sensor is to sum sensor charges to reduce resolution (e.g., producing a frame of 640 x 480 data from a sensor capable of 1280 x 960), output data only from red-filtered sensor cells, output data only from a horizontal line of cells across the middle of the sensor, output data only from a 128 x 128 patch of cells in the center of the pixel array, etc.
- the camera instruction field 55 may further specify the exact time that the camera is to capture data - so as to allow, e.g., desired synchronization with ambient lighting (as detailed later).
- Each packet 57 issued by setup module 34 may include different camera parameters in the first header field 55.
- a first packet may cause camera 32 to capture a full frame image with an exposure time of 1 millisecond.
- a next packet may cause the camera to capture a full frame image with an exposure time of 10 milliseconds, and a third may dictate an exposure time of 100 milliseconds.
- a fourth packet may instruct the camera to down-sample data from the image sensor, and combine signals from differently color-filtered sensor cells, so as to output a 4x3 array of grayscale luminance values.
- a fifth packet may instruct the camera to output data only from an 8x8 patch of pixels at the center of the frame.
- a sixth packet may instruct the camera to output only five lines of image data, from the top, bottom, middle, and mid-upper and mid-lower rows of the sensor.
- a seventh packet may instruct the camera to output only data from blue-filtered sensor cells.
- An eighth packet may instruct the camera to disregard any auto-focus instructions but instead capture a full frame at infinity focus. And so on.
- Each such packet 57 is provided from setup module 34 across a bus or other data channel 60 to a camera controller module associated with the camera.
- Camera 32 captures digital image data in accordance with instructions in the header field 55 of the packet and stuffs the resulting image data into a body 59 of the packet. It also deletes the camera instructions 55 from the packet header (or otherwise marks header field 55 in a manner permitting it to be disregarded by subsequent processing stages).
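The packet structure of Fig. 17, and the camera's consumption of instruction field 55, might be sketched as follows. The class and function names, and the stub sensor, are illustrative assumptions rather than the specified implementation:

```python
class PixelPacket:
    """Sketch of the packet of Fig. 17: a header holding an ordered queue
    of per-stage instruction fields, and a body for image data."""
    def __init__(self, header_fields):
        self.header = list(header_fields)   # field 55 first, then 58a, 58b, ...
        self.body = None

def camera_capture(packet, sensor_read):
    """Hypothetical camera controller: consume camera field 55, capture
    per its parameters, and stuff the result into the packet body."""
    camera_field = packet.header.pop(0)      # delete field 55 from the header
    packet.body = sensor_read(camera_field)  # capture per the instructions
    return packet

pkt = PixelPacket([{"exposure_ms": 1, "frame": "full"},        # field 55
                   {"op": "edge-detect", "substitute": True}])  # field 58a
pkt = camera_capture(pkt, lambda f: [[0] * 4] * 3)  # stub sensor: tiny frame
```

After this step the first header field a downstream stage encounters is 58a, matching the hand-off described in the text.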
- When the packet 57 was authored by setup module 34, it also included a series of further header fields (58a, 58b, etc.), each conveying instructions for one of the ensuing processing stages.
- Camera 32 outputs the image-stuffed packet produced by the camera (a pixel packet) onto a bus or other data channel 61, which conveys it to a first processing stage 38.
- Stage 38 examines the header of the packet. Since the camera deleted the instruction field 55 that conveyed camera instructions (or marked it to be disregarded), the first header field encountered by a control portion of stage 38 is field 58a. This field details parameters of an operation to be applied by stage 38 to data in the body of the packet.
- field 58a may specify parameters of an edge detection algorithm to be applied by stage 38 to the packet's image data (or simply that such an algorithm should be applied). It may further specify that stage 38 is to substitute the resulting edge-detected set of data for the original image data in the body of the packet. (Substituting of data, rather than appending, may be indicated by the value of a single bit flag in the packet header.) Stage 38 performs the requested operation (which may involve configuring programmable hardware in certain implementations). First stage 38 then deletes instructions 58a from the packet header 56 (or marks them to be disregarded) and outputs the processed pixel packet for action by a next processing stage.
- a control portion of a next processing stage (which here comprises stages 38a and 38b, discussed later) examines the header of the packet. Since field 58a was deleted (or marked to be disregarded), the first field encountered is field 58b. In this particular packet, field 58b may instruct the second stage not to perform any processing on the data in the body of the packet, but instead simply delete field 58b from the packet header and pass the pixel packet to the next stage.
- a next field of the packet header may instruct the third stage 38c to perform 2D FFT operations on the image data found in the packet body, based on 16 x 16 blocks. It may further direct the stage to hand-off the processed FFT data to a wireless interface, for internet transmission to address 216.239.32.10, accompanied by specified data (detailing, e.g., the task to be performed on the received FFT data by the computer at that address, such as texture classification).
- the header may further instruct stage 38c to replace the body of the packet with the single 16 x 16 block of FFT data dispatched to the wireless interface.
- the stage also edits the packet header to delete (or mark) the instructions to which it responded, so that a header instruction field for the next processing stage is the first to be encountered.
- the addresses of the remote computers are not hard-coded.
- the packet may include a pointer to a database record or memory location (in the phone or in the cloud), which contains the destination address.
- stage 38c may be directed to hand-off the processed pixel packet to the Query Router and Response Manager (e.g., Fig. 7).
- This module examines the pixel packet to determine what type of processing is next required, and it routes it to an appropriate provider (which may be in the cell phone if resources permit, or in the cloud - among the stable of static providers, or to a provider identified through an auction).
- the provider returns the requested output data (e.g., texture classification information, and information about any matching FFT in the archive), and processing continues per the next item of instruction in the pixel packet header.
- each processing stage 38 strips-out, from the packet header, the instructions on which it acted.
- the instructions are ordered in the header in the sequence of processing stages, so this removal allows each stage to look to the first instructions remaining in the header for direction.
- Other arrangements can alternatively be employed. (For example, a module may insert new information into the header - at the front, tail, or elsewhere in the sequence - based on processing results. This amended header then controls packet flow and therefore processing.)
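The header-stripping behavior of the stages might be sketched as below. The dictionary-based packet, the `substitute` flag, and the toy edge-detect operation are illustrative assumptions, not a prescribed format:

```python
def run_stage(packet, ops):
    """One processing stage (in the nature of stages 38): act on the first
    remaining header field, strip it, and pass the packet along."""
    instr = packet["header"].pop(0)          # this stage's instruction field
    op = ops.get(instr["op"])
    if op is not None:                       # unknown/"pass" fields: just strip
        result = op(packet["body"], instr)
        if instr.get("substitute", False):
            packet["body"] = result          # replace rather than append
    return packet

# Toy "edge detection": absolute differences of adjacent samples.
ops = {"edge-detect": lambda body, i: [abs(a - b) for a, b in zip(body, body[1:])]}

packet = {"header": [{"op": "edge-detect", "substitute": True},
                     {"op": "pass"}],        # second stage does no processing
          "body": [1, 4, 2, 2]}
packet = run_stage(packet, ops)   # first stage: body becomes [3, 2, 0]
packet = run_stage(packet, ops)   # second stage: field stripped, body untouched
```

Because each stage strips its own field, each stage simply looks to the first remaining field for direction, as the text describes.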
- each stage 38 may further have an output 31 providing data back to the control processor module 36.
- processing undertaken by one of the local stages 38 may indicate that the exposure or focus of the camera should be adjusted to optimize suitability of an upcoming frame of captured data for a particular type of processing (e.g., object identification).
- This focus/exposure information can be used as predictive setup data for the camera the next time a frame of the same or similar type is captured.
- the control processor module 36 can set up a frame request using a filtered or time-series prediction sequence of focus information from previous frames, or a sub-set of those frames.
- Error and status reporting functions may also be accomplished using outputs 31.
- Each stage may also have one or more other outputs 33 for providing data to other processes or modules - locally within the cell phone, or remote ("in the cloud").
- Data (in packet form, or in other format) may be directed to such outputs in accordance with instructions in packet 57, or otherwise.
- a processing module 38 may make a data flow selection based on some result of processing it performs. E.g., if an edge detection stage discerns a sharp contrast image, then an outgoing packet may be routed to an external service provider for FFT processing. That provider may return the resultant FFT data to other stages.
- Fig. 19 shows one arrangement. Instructions 58d originally in packet 57 specify a condition, and specify a location in a memory 79 from which replacement subsequent instructions (58e' - 58g') can be read, and substituted into the packet header, if the condition is met. If the condition is not met, execution proceeds in accordance with header instructions already in the packet.
- conditional instructions can be provided in the packet.
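The conditional substitution of Fig. 19 might be sketched like this; the memory addressing scheme and field names are hypothetical:

```python
def apply_conditional(packet, memory, condition_met):
    """Sketch of Fig. 19: if the condition in field 58d is met, replace the
    remaining header fields with substitute instructions read from memory;
    otherwise execution proceeds per the instructions already present."""
    cond_field = packet["header"].pop(0)              # field 58d
    if condition_met:
        packet["header"] = list(memory[cond_field["alt_addr"]])
    return packet

memory = {0x10: [{"op": "fft"}, {"op": "classify"}]}  # the 58e'-58g' analog
packet = {"header": [{"cond": "sharp-contrast", "alt_addr": 0x10},
                     {"op": "blur"}]}
packet = apply_conditional(packet, memory, condition_met=True)
```

Since the substitute instructions live in memory 79 (which can include a cloud component), the branch targets can be updated without re-authoring the packet.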
- a packet architecture is still used, but one or more of the header fields do not include explicit instructions. Rather, they simply point to memory locations from which corresponding instructions (or data) are retrieved, e.g., by the corresponding processing stage 38.
- Memory 79 (which can include a cloud component) can also facilitate adaptation of processing flow even if conditional branching is not employed.
- a processing stage may yield output data that determines parameters of a filter or other algorithm to be applied by a later stage (e.g., a convolution kernel, a time delay, a pixel mask, etc).
- processing stage 38 produces parameters that are stored in memory 79.
- a subsequent processing stage 38c later retrieves these parameters, and uses them in execution of its assigned operation.
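That memory-mediated hand-off of parameters between stages might be sketched as follows. The labeled key and the toy parameter derivation are illustrative assumptions:

```python
memory = {}   # shared memory 79 (could include a cloud component)

def early_stage(data):
    """An earlier stage 38: derive a parameter for a later stage and park
    it in memory, labeled by its destination module."""
    kernel_width = max(data) - min(data)            # toy parameter derivation
    memory["stage38c.kernel_width"] = kernel_width  # labeled for stage 38c
    return data

def later_stage_38c(data):
    """Stage 38c: retrieve the parameters left for it and use them."""
    width = memory["stage38c.kernel_width"]
    return [x * width for x in data]

out = later_stage_38c(early_stage([1, 3, 2]))
```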
- the information in memory can be labeled to identify the module/provider from which it originated, or to which it is destined (if known), or other addressing arrangements can be used.
- each of the processing stages 38 comprises hardware circuitry dedicated to a particular task.
- the first stage 38 may be a dedicated edge-detection processor.
- the third stage 38c may be a dedicated FFT processor.
- Other stages may be dedicated to other processes. These may include DCT, wavelet, Haar, Hough, and Fourier-Mellin transform processors, filters of different sorts (e.g., Wiener, low pass, bandpass, highpass), and stages for performing all or part of operations such as facial recognition, optical character recognition, computation of eigenvalues, extraction of shape, color and texture feature data, barcode decoding, watermark decoding, object segmentation, pattern recognition, age and gender detection, emotion classification, orientation determination, compression, decompression, log-polar mapping, convolution, interpolation, decimation/down-sampling/anti-aliasing; correlation, performing square-root and squaring operations, array multiplication, perspective transformation, butterfly operations (combining results of smaller DFTs into a larger DFT, or decomposing a larger DCT into sub transforms), etc.
- each of the processing blocks in Fig. 16 may be dynamically reconfigurable, as circumstances warrant.
- a block may be configured as an FFT processing module.
- the next instant it may be configured as a filter stage, etc.
- the hardware processing chain may be configured as a barcode reader; the next it may be configured as a facial recognition system, etc.
- Such hardware reconfiguration information can be downloaded from the cloud, or from services such as the Apple AppStore. And the information needn't be statically resident on the phone once downloaded - it can be summoned from the cloud/AppStore whenever needed.
- the hardware reconfiguration data can be downloaded to the cell phone each time it is turned on, or otherwise initialized - or whenever a particular function is initialized. Gone would be the dilemma of dozens of different versions of an application being deployed in the market at any given time - depending on when different users last downloaded updates, and the conundrums that companies confront in supporting disparate versions of products in the field. Each time a device or application is initialized, the latest version of all or selected functionalities is downloaded to the phone.
- the respective purpose processors may be chained in a fixed order.
- the edge detection processor may be first, the FFT processor may be third, and so on.
- the processing modules may be interconnected by one or more busses (and/or a crossbar arrangement or other interconnection architecture) that permit any stage to receive data from any stage, and to output data to any stage.
- Another interconnect method is a network on a chip (effectively a packet-based LAN; similar to crossbar in adaptability, but programmable by network protocols). Such arrangements can also support having one or more stages iteratively process data - taking output as input, to perform further processing.
- one such iterative arrangement is shown by stages 38a/38b in Fig. 16. Output from stage 38a can be taken as input to stage 38b. Stage 38b can be instructed to do no processing on the data, but simply apply it again back to the input of stage 38a. This can loop as many times as desired. When iterative processing by stage 38a is completed, its output can be passed to a next stage 38c in the chain.
- stage 38b can perform its own type of processing on the data processed by stage 38a. Its output can be applied to the input of stage 38a. Stage 38a can be instructed to apply, again, its process to the data produced by stage 38b, or to pass it through. Any serial combination of stage 38a/38b processing can thus be achieved.
- the roles of stages 38a and 38b in the foregoing can also be reversed.
- stages 38a and 38b can be operated to (1) apply a stage 38a process one or more times to data; (2) apply a stage 38b process one or more times to data; (3) apply any combination and order of 38a and 38b processes to data; or (4) simply pass the input data to the next stage, without processing.
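The serial combinations just described can be sketched compactly. The plan-string encoding and the stand-in stage functions below are hypothetical illustrations; the specification describes the behavior in terms of instructions conveyed to the stages, not any particular encoding.

```python
# Hypothetical sketch: any order of stage 38a / 38b processing (including
# pure pass-through) expressed as a plan string.

def apply_stage_plan(data, stage_a, stage_b, plan):
    """plan: e.g. 'aab' applies stage_a twice then stage_b once;
    an empty plan simply passes the input data to the next stage."""
    for step in plan:
        data = stage_a(data) if step == "a" else stage_b(data)
    return data

# Stand-in processes: stage_a scales the data, stage_b offsets it.
stage_a = lambda xs: [2 * x for x in xs]
stage_b = lambda xs: [x + 1 for x in xs]

print(apply_stage_plan([1, 2], stage_a, stage_b, "aab"))  # → [5, 9]
```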
- the camera stage can be incorporated into an iterative processing loop.
- a packet may be passed from the camera to a processing module that assesses focus.
- examples include an FFT stage - looking for high frequency image components; an edge detector stage - looking for strong edges; etc.
- Sample edge detection algorithms include Canny, Sobel, and differential. Edge detection is also useful for object tracking.
- An output from such a processing module can loop back to the camera's controller module and vary a focus signal. The camera captures a subsequent frame with the varied focus signal, and the resulting image is again provided to the processing module that assesses focus. This loop continues until the processing module reports focus within a threshold range is achieved.
- the packet header, or a parameter in memory, can specify an iteration limit, e.g., specifying that the iterating should terminate and output an error signal if focus meeting the specified requirement is not achieved within ten iterations.
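The capture/assess/adjust loop with an iteration limit might be sketched as follows. The camera model and scoring heuristic are toy assumptions for illustration; only the loop structure (feedback to the camera controller, termination on a threshold or iteration limit) corresponds to the text above.

```python
def autofocus(capture, assess, focus, max_iters=10, threshold=0.9):
    """Loop: capture a frame, assess focus, vary the focus signal,
    repeat - bounded by an iteration limit, as the packet header or a
    memory parameter may specify."""
    for i in range(max_iters):
        frame = capture(focus)
        score, adjustment = assess(frame)
        if score >= threshold:
            return focus, i + 1          # focus achieved; iterations used
        focus += adjustment
    raise RuntimeError("no adequate focus within iteration limit")

# Toy camera model: sharpness peaks at focus setting 5.0.
capture = lambda f: f
def assess(frame):
    score = max(0.0, 1.0 - abs(frame - 5.0) / 5.0)
    return score, (1.0 if frame < 5.0 else -1.0)

print(autofocus(capture, assess, 0.0))  # → (5.0, 6)
```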
- image or other data may be processed in two or more parallel paths.
- the output of stage 38d may be applied to two subsequent stages, each of which starts a respective branch of a fork in the processing.
- Those two chains can be processed independently thereafter, or data resulting from such processing can be combined - or used in conjunction - in a subsequent stage. (Each of those processing chains, in turn, can be forked, etc.)
- a fork commonly will appear much earlier in the chain. That is, in most implementations, a parallel processing chain will be employed to produce imagery for human - as opposed to machine - consumption.
- a parallel process may fork immediately following the camera sensor 12, as shown by juncture 17 in Fig. 13.
- the processing for the human visual system 13 includes operations such as noise reduction, white balance, and compression.
- Processing for object identification 14, in contrast, may include the operations detailed in this specification.
- the different modules may finish their processing at different times. They may output data as they finish - asynchronously, as the pipeline or other interconnection network permits. When the pipeline/network is free, a next module can transfer its completed results.
- Flow control may involve some arbitration, such as giving one path or data a higher priority. Packets may convey priority data - determining their precedence in case arbitration is needed. For example, many image processing operations/modules make use of Fourier domain data, such as produced by an FFT module. The output from an FFT module may thus be given a high priority, and precedence over others in arbitrating data traffic, so that the Fourier data that may be needed by other modules can be made available with a minimum of delay.
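A toy arbiter may clarify the precedence idea. The priority values and the source labels are invented for illustration; the specification only says packets may convey priority data, with FFT output favored so Fourier-domain data reaches dependent modules with minimal delay.

```python
import heapq
import itertools

class PacketArbiter:
    """Grants interconnect access by packet priority; lower number means
    higher precedence. FFT output gets top precedence here, since many
    modules consume Fourier-domain data. Values are illustrative only."""
    PRIORITY = {"fft": 0, "edge": 2, "default": 5}

    def __init__(self):
        self._heap = []
        self._order = itertools.count()   # FIFO tie-break among equals

    def submit(self, packet):
        prio = self.PRIORITY.get(packet.get("source"), self.PRIORITY["default"])
        heapq.heappush(self._heap, (prio, next(self._order), packet))

    def next_packet(self):
        return heapq.heappop(self._heap)[2]

arb = PacketArbiter()
arb.submit({"source": "edge", "id": 1})
arb.submit({"source": "fft", "id": 2})
print(arb.next_packet()["id"])  # → 2 (FFT output takes precedence)
```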
- some or all of the processing stages are not dedicated purpose processors, but rather are general purpose microprocessors programmed by software.
- the processors are hardware- reconfigurable.
- some or all may be field programmable gate arrays, such as Xilinx Virtex series devices.
- they may be digital signal processing cores, such as Texas Instruments TMS320 series devices.
- PicoChip devices such as the PC302 and PC312 multicore DSPs.
- Their programming model allows each core to be coded independently (e.g., in C), and then to communicate with others over an internal interconnect mesh.
- the associated tools particularly lend themselves to use of such processors in cellular equipment.
- a processor can include a region of configuration logic - mixed with dedicated logic. This allows configurable logic in a pipeline, with dedicated pipeline or bus interface circuitry.
- An implementation can also include one or more modules with a small CPU and RAM, with programmable code space for firmware, and workspace for processing - essentially a dedicated core.
- modules can perform fairly extensive computations - configurable as needed by the process that is using the hardware at the time. All such devices can be deployed in a bus, crossbar or other interconnection architecture that again permits any stage to receive data from, and output data to, any stage. (An FFT or other transform processor implemented in this fashion may be reconfigured dynamically to process blocks of 16x16, 64x64, 4096x4096, 1x64, 32x128, etc.)
- some processing modules are replicated - permitting parallel execution on parallel hardware. For example, several FFT modules may be processing data simultaneously.
- a packet conveys instructions that serve to reconfigure hardware of one or more of the processing modules. As a packet enters a module, the header causes the module to reconfigure the hardware before the image-related data is accepted for processing.
- the architecture is thus configured on the fly by packets (which may convey image related data, or not).
- the packets can similarly convey firmware to be loaded into a module having a CPU core, or into an application- or cloud-based layer; likewise with software instructions.
- the module configuration instructions may be received over a wireless or other external network; it needn't always be resident on the local system. If the user requests an operation for which local instructions are not available, the system can request the configuration data from a remote source. Instead of conveying the configuration data/instructions themselves, the packet may simply convey an index number, pointer, or other address information. This information can be used by the processing module to access a corresponding memory store from which the needed data/instructions can be retrieved. Like a cache, if the local memory store is not found to contain the needed data/instructions, they can then be requested from another source (e.g., across an external network). Such arrangements bring the dynamic of routability down to the hardware layer - configuring the module as data arrives at it.
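The cache-like resolution of a packet's configuration index can be sketched simply. The resolver below, and its fetch callback standing in for a network request, are hypothetical; the behavior shown (local store first, remote source on a miss, cached thereafter) follows the text above.

```python
def make_resolver(local_store, fetch_remote):
    """Resolve a packet's configuration index: consult the local memory
    store first; on a miss, request the data/instructions from a remote
    source (fetch_remote stands in for an external network request) and
    cache the result for reuse."""
    cache = dict(local_store)
    def resolve(index):
        if index not in cache:
            cache[index] = fetch_remote(index)
        return cache[index]
    return resolve

remote_calls = []
def fetch_remote(index):
    remote_calls.append(index)
    return f"config-{index}"

resolve = make_resolver({7: "config-7"}, fetch_remote)
print(resolve(7), resolve(42), resolve(42))   # second 42 hits the cache
print(remote_calls)  # → [42]
```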
- GPUs (graphics processing units)
- Many computer systems employ GPUs as auxiliary processors to handle operations such as graphics rendering.
- Cell phones increasingly include GPU chips to allow the phones to serve as gaming platforms; these can be employed to advantage in certain implementations of the present technology.
- a GPU can be used to perform bilinear and bicubic interpolation, projective transformations, filtering, etc.
- a GPU is used to correct for lens aberrations and other optical distortion.
- Cell phone cameras often exhibit optical non-linearities, such as barrel distortion, focus anomalies at the perimeter, etc. This is particularly a problem when decoding digital watermark information from captured imagery.
- With a GPU, the image can be treated as a texture map, and applied to a correction surface.
- texture mapping is used to put a picture of bricks or a stone wall onto a surface, e.g., of a dungeon.
- Texture memory data is referenced, and mapped onto a plane or polygon as it is drawn.
- In the present context it is the image that is applied to a surface.
- the surface is shaped so that the image is drawn with an arbitrary, correcting transform.
- Steganographic calibration signals in a digitally watermarked image can be used to discern the distortion by which an image has been transformed.
- Each patch of a watermarked image can be characterized by affine transformation parameters, such as translation and scale.
- An error function for each location in the captured frame can thereby be derived. From this error information, a corresponding surface can be devised such that, when the distorted image is projected onto it by the GPU, the image appears in its counter-distorted, original form.
- a lens can be characterized in this fashion with a reference watermark image.
- Once the associated correction surface has been devised, it can be re-used with other imagery captured through that optical system (since the associated distortion is fixed). Other imagery can be projected onto this correction surface by the GPU to correct the lens distortion. (Different focal depths, and apertures, may require characterization of different correction functions, since the optical path through the lens may be different.)
- When a new image is captured, it can be initially rectilinearized, to rid it of keystone/trapezoidal perspective effect. Once rectilinearized (e.g., re-squared relative to the camera lens), the local distortions can be corrected by mapping the rectilinearized image onto the correction surface, using the GPU.
- the correction model is in essence a polygon surface, where the tilts and elevations correspond to focus irregularities.
- Each region of the image has a local transform matrix allowing for correction of that piece of the image.
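The per-region correction can be sketched in miniature. The function and the sample transforms below are hypothetical stand-ins: real correction would warp full pixel neighborhoods on the GPU, with affine parameters recovered from the watermark calibration signals, whereas this toy merely remaps individual coordinates by the local matrix of their region.

```python
def correct_point(x, y, region_transforms, region_size=16):
    """Apply the local affine transform for the image region containing
    (x, y). region_transforms maps (col, row) region indices to
    ((a, b, c, d), (tx, ty)) affine parameters - a toy stand-in for the
    per-region correction matrices derived from calibration."""
    key = (x // region_size, y // region_size)
    (a, b, c, d), (tx, ty) = region_transforms[key]
    return (a * x + b * y + tx, c * x + d * y + ty)

# Two regions: one undistorted (identity), one with slight local scale.
transforms = {
    (0, 0): ((1.0, 0.0, 0.0, 1.0), (0.0, 0.0)),
    (1, 0): ((1.02, 0.0, 0.0, 1.02), (-0.32, 0.0)),
}
print(correct_point(8, 4, transforms))   # identity region → (8.0, 4.0)
print(correct_point(16, 0, transforms))  # locally rescaled region
```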
- the same arrangement can likewise be used to correct distortion of a lens in an image projection system.
- Before projection, the image is mapped - like a texture - onto a correction surface synthesized to counteract lens distortion.
- the lens distortion counteracts the correction surface distortion earlier applied, causing a corrected image to be projected from the system.
- depth of field is one of the parameters that can be employed by camera 32 in gathering exposures. Although a lens can precisely focus at only one distance, the decrease in sharpness is gradual on either side of the focused distance. (The depth of field depends on the point spread function of the optics - including the lens focal length and aperture.) As long as the captured pixels yield information useful for the intended operation, they need not be in perfect focus.
- Sometimes focusing algorithms hunt for, but fail to achieve, focus - wasting cycles and battery life. Better, in some instances, is to simply grab frames at a series of different focus settings.
- a search tree of focus depths, or depths of field may be used. This is particularly useful where an image may include multiple subjects of potential interest - each at a different plane.
- the system may capture a frame focused at 6 inches and another at 24 inches.
- the different frames may reveal that there are two objects of interest within the field of view - one better captured in one frame, the other better captured in the other.
- the 24 inch-focused frame may be found to have no useful data, but the 6 inch-focused frame may include enough discriminatory frequency content to see that there are two or more subject image planes.
- one or more frames with other focus settings may then be captured.
- a region in the 24 inch-focused frame may have one set of Fourier attributes, and the same region in the 6 inch-focused frame may have a different set of Fourier attributes, and from the difference between the two frames a next trial focus setting may be identified (e.g., at 10 inches), and a further frame at that focus setting may be captured.
- Feedback is applied - not necessarily to obtain perfect focus lock, but in accordance with search criteria to make decisions about further captures that may reveal additional useful detail.
- the search may fork and branch, depending on the number of subjects discerned, and associated Fourier, etc., information, until satisfactory information about all subjects has been gathered.
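One possible heuristic for choosing the next trial focus, consistent with the 6-inch/24-inch example above, is to weight the next depth toward the frame whose Fourier content showed more discriminatory detail. This weighting rule is an invented illustration; the specification leaves the search criteria open.

```python
def next_trial_focus(depth1, score1, depth2, score2):
    """Pick a next focus depth to try, weighted toward the depth whose
    captured frame showed more useful frequency content. Hypothetical
    heuristic - a stand-in for whatever search criteria are employed."""
    total = score1 + score2
    if total == 0:
        return (depth1 + depth2) / 2.0   # no information: bisect
    return (depth1 * score1 + depth2 * score2) / total

# Frames at 6" and 24"; the 6" frame carries most of the useful content,
# so the next trial lands near it (cf. the 10-inch example in the text).
print(next_trial_focus(6.0, 0.8, 24.0, 0.1))
```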
- a related approach is to capture and buffer plural frames as a camera lens system is undergoing adjustment to an intended focus setting. Analysis of the frame finally captured at the intended focus may suggest that intermediate focus frames would reveal useful information, e.g., about subjects not earlier apparent or significant. One or more of the frames earlier captured and buffered can then be recalled and processed to provide information whose significance was not earlier recognized.
- Camera control can also be responsive to spatial coordinate information.
- the camera set-up module may request images of not just certain exposure parameters, but also of certain subjects, or locations.
- when a camera is in the correct position to capture a specific subject (which may have been previously user-specified, or identified by a computer process), one or more frames of image data can be captured automatically.
- the orientation of the camera is controlled by stepper motors or other electromechanical arrangements, so that the camera can autonomously set the azimuth and elevation to capture image data from a particular direction, to capture a desired subject. Electronic or fluid steering of the lens direction can also be utilized.
- the camera setup module may instruct the camera to capture a sequence of frames.
- such frames can also be aligned and combined to obtain super-resolution images.
- super-resolution can be achieved by diverse methods. For example, the frequency content of the images can be analyzed, related to each other by linear transform, affine-transformed to correct alignment, then overlaid and combined. In addition to other applications, this can be used in decoding digital watermark data from imagery. If the subject is too far from the camera to obtain satisfactory image resolution normally, the resolution may be doubled by such super-resolution techniques to obtain the higher resolution needed for successful watermark decoding.
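The align-and-combine step can be shown in a deliberately simplified 1-D form. Real super-resolution relies on sub-pixel, 2-D registration; the sketch below assumes known integer offsets, so it illustrates only the alignment-then-combination structure, not a working super-resolution method.

```python
def combine_aligned_frames(frames, offsets):
    """Average plural frames after aligning them by (known, integer)
    sample offsets - a toy 1-D stand-in for the overlay-and-combine
    step (real registration is sub-pixel and two-dimensional)."""
    length = min(len(f) - o for f, o in zip(frames, offsets))
    return [sum(f[o + i] for f, o in zip(frames, offsets)) / len(frames)
            for i in range(length)]

frames = [[10, 20, 30, 40], [0, 10, 20, 30, 40]]  # second frame shifted by 1
print(combine_aligned_frames(frames, [0, 1]))  # → [10.0, 20.0, 30.0, 40.0]
```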
- in the arrangements earlier described, each processing stage substituted the results of its processing for the input data contained in the packet when received.
- the processed data can be added to the packet body, while maintaining the data originally present. In such case the packet grows during processing - as more information is added. While this may be disadvantageous in some contexts, it can also provide advantages. For example, it may obviate the need to fork a processing chain into two packets or two threads.
- an FFT stage may add frequency domain information to a packet containing original pixel domain imagery. Both of these may be used by a subsequent stage, e.g., in performing sub-pixel alignment for super-resolution processing.
- a focus metric may be extracted from imagery and used - with accompanying image data - by a subsequent stage. It will be recognized that the detailed arrangements can be used to control the camera to generate different types of image data on a per-frame basis, and to control subsequent stages of the system to process each such frame differently.
- the system may capture a first frame under conditions selected to optimize green watermark detection, capture a second frame under conditions selected to optimize barcode reading, capture a third frame under conditions selected to optimize facial recognition, etc. Subsequent stages may be directed to process each of these frames differently, in order to best extract the data sought. All of the frames may be processed to sense illumination variations. Every other frame may be processed to assess focus, e.g., by computing 16 x 16 pixel FFTs at nine different locations within the image frame. (Or there may be a fork that allows all frames to be assessed for focus, and the focus branch may be disabled when not needed, or reconfigured to serve another purpose.) Etc., etc.
- frame capture can be tuned to capture the steganographic calibration signals present in a digital watermark signal, without regard to successful decoding of the watermark payload data itself.
- captured image data can be at a lower resolution - sufficient to discern the calibration signals, but insufficient to discern the payload.
- the camera can expose the image without regard to human perception, e.g., overexposing so image highlights are washed-out, or underexposed so other parts of the image are indistinguishable. Yet such an exposure may be adequate to capture the watermark orientation signal. (Feedback can of course be employed to capture one or more subsequent image frames - redressing one or more shortcomings of a previous image frame.)
- Some digital watermarks are embedded in specific color channels (e.g., blue), rather than across colors as modulation of image luminance (see, e.g., commonly-owned patent application 12/337,029 to Reed).
- exposure can be selected to yield maximum dynamic range in the blue channel (e.g., 0-255 in an 8-bit sensor), without regard to exposure of other colors in the image.
- One frame may be captured to maximize dynamic range of one color, such as blue, and a later frame may be captured to maximize dynamic range of another color channel, such as yellow (i.e., along the red-green axis). These frames may then be aligned, and the blue- yellow difference determined.
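The blue-yellow difference computation can be sketched as follows, taking yellow as the average of the red and green components. The pixel representation and function name are assumptions for illustration; frame alignment is taken as already done, per the text above.

```python
def blue_yellow_difference(frame_blue, frame_yellow):
    """Per-pixel blue-minus-yellow signal from two aligned frames: one
    exposed to maximize blue dynamic range, one exposed for the
    red+green (yellow) axis. Pixels are (r, g, b) tuples; alignment of
    the two frames is assumed already performed."""
    diff = []
    for (_, _, b), (r, g, _) in zip(frame_blue, frame_yellow):
        yellow = (r + g) / 2.0
        diff.append(b - yellow)
    return diff

frame_blue   = [(90, 80, 200), (90, 80, 100)]
frame_yellow = [(120, 100, 40), (120, 100, 40)]
print(blue_yellow_difference(frame_blue, frame_yellow))  # → [90.0, -10.0]
```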
- the frames may have wholly different exposure times, depending on lighting, subject, etc.
- the system has an operational mode in which it captures and processes imagery even when the user is not intending to "snap" a picture. If the user pushes a shutter button, the otherwise-scheduled image capture/processing operations may be suspended, and a consumer photo taking mode can take precedence. In this mode, capture parameters and processes designed to enhance human visual system aspects of the image can be employed instead.
- the Fig. 10A packets may be established prior to the capture of image data by the camera, in which case the visual keyvector processing and packaging module serves to insert the pixel data - or more typically, sub-sets or super-sets of the pixel data - into earlier-formed packets. Similarly, in Fig. 16, the packets need not be created until after the camera has captured image data.
- one or more of the processing stages can be remote from the cell phone.
- One or more pixel packets can be routed to the cloud (or through the cloud) for processing.
- the results can be returned to the cell phone, or forwarded to another cloud processing stage (or both).
- Once back at the cell phone one or more further local operations may be performed. Data may then be sent back out to the cloud, etc. Processing can thus alternate between the cell phone and the cloud.
- result data is usually presented to the user back at the cell phone.
- different vendors will offer competing cloud services for specialized processing tasks. For example, Apple, Google and Facebook, may each offer cloud-based facial recognition services.
- a user device would transmit a packet of pre-processed data to the chosen service for further processing.
- the header of the packet can indicate the user, the requested service, and - optionally - micropayment instructions.
- the header could convey an index or other identifier by which a desired transaction is looked-up in a cloud database, or which serves to arrange an operation, or a sequence of processes for some transaction - such as a purchase, a posting on
- a server may examine the incoming packet, look-up the user's iPhoto account, access facial recognition data for the user's friends from that account, compute facial recognition features from image data conveyed with the packet, determine a best match, and return result information (e.g., a name of a depicted individual) back to the originating device.
- a server may undertake similar operations, but would refer to the user's Picasa account. Ditto for Facebook. Identifying a face from among faces for dozens or hundreds of known friends is easier than identifying faces of strangers.
- Other vendors may offer services of the latter sort. For example, L-I Identity Solutions, Inc. maintains databases of images from government-issued credentials - such as drivers' licenses. With appropriate permissions, it may offer facial recognition services drawing from such databases.
- a service may support one, a few, or dozens of different types of barcode.
- the decoded data may be returned to the phone, or the service provider can access further data indexed by the decoded data, such as product information, instructions, purchase options, etc., and return such further data to the phone. (Or both can be provided.)
- Another service is digital watermark reading.
- an OCR service provider may further offer translation services, e.g., converting processed image data into ASCII symbols, and then submitting the ASCII words to a translation engine to render them in a different language.
- Other services are sampled in Fig. 2. (Practicality prevents enumeration of the myriad other services, and component operations, that may also be provided.)
- the output from the remote service provider is commonly returned to the cell phone.
- the remote service provider will return processed image data. In some cases it may return ASCII or other such data.
- the remote service provider may produce other forms of output, including audio (e.g., MP3) and/or video (e.g., MPEG4 and Adobe Flash). Video returned to the cell phone from the remote provider may be presented on the cell phone display.
- such video presents a user interface screen, inviting the user to touch or gesture within the displayed presentation to select information or an operation, or issue an instruction.
- Software in the cell phone can receive such user input and undertake responsive operations, or present responsive information.
- the data provided back to the cell phone from the remote service provider can include JavaScript or other such instructions.
- When run by the cell phone, the JavaScript provides a response associated with the processed data referred out to the remote provider.
- Remote processing services can be provided under a variety of different financial models.
- An Apple iPhone service plan may be bundled with a variety of remote services at no additional cost, e.g., iPhoto-based facial recognition.
- Other services may bill on a per-use, monthly subscription, or other usage plans. Some services will doubtless be highly branded, and marketed. Others may compete on quality; others on price.
- stored data may indicate preferred providers for different services. These may be explicitly identified (e.g., send all FFT operations to the Fraunhofer Institute service), or they can be specified by other attributes.
- a cell phone user may direct that all remote service requests are to be routed to providers that are ranked as fastest in a periodically updated survey of providers (e.g., by Consumers Union). The cell phone can periodically check the published results for this information, or it can be checked dynamically when a service is requested.
- Another user may specify that service requests are to be routed to service providers that have highest customer satisfaction scores - again by reference to an online rating resource. Still another user may specify that requests should be routed to the providers having highest customer satisfaction scores - but only if the service is provided for free; else route to the lowest cost provider.
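The last rule above (highest satisfaction, but only if free; else lowest cost) can be expressed directly. The provider records and field names below are invented for illustration; any real implementation would consult an online rating resource, as the text describes.

```python
def choose_provider(providers, free_required_for_top=True):
    """Route to the provider with the highest customer-satisfaction
    score, but only if its service is free; otherwise route to the
    lowest-cost provider. Field names are illustrative only."""
    best = max(providers, key=lambda p: p["satisfaction"])
    if best["cost"] == 0 or not free_required_for_top:
        return best["name"]
    return min(providers, key=lambda p: p["cost"])["name"]

providers = [
    {"name": "A", "satisfaction": 4.8, "cost": 2},
    {"name": "B", "satisfaction": 4.1, "cost": 0},
    {"name": "C", "satisfaction": 3.9, "cost": 1},
]
print(choose_provider(providers))  # best-rated A isn't free → cheapest: B
```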
- the user may, in a particular case, specify a particular service provider - trumping any selection that would be made by the stored profile data.
- the user's request for service can be externally posted, and several service providers may express interest in performing the requested operation.
- the request can be sent to several specific service providers for proposals (e.g., to Amazon, Google and Microsoft).
- the different providers' responses may be presented to the user, who selects between them, or a selection may be made automatically - based on previously stored rules.
- one or more competing service providers can be provided user data with which they start performing, or wholly perform, the subject operation before a service provider selection is finally made - giving such providers a chance to speed their response times, and encounter additional real-world data. (See, also, the earlier discussion of remote service providers, including auction-based services, e.g., in connection with Figs. 7-12.)
- certain external service requests may pass through a common hub (module), which is responsible for distributing the requests to appropriate service providers.
- results from certain external service requests may similarly be routed through a common hub.
- payloads decoded by different service providers from different digital watermarks, payloads decoded from different barcodes, or fingerprints computed from different content objects, may be routed through a common hub, which may compile statistics and aggregate information (akin to Nielsen's monitoring services - surveying consumer encounters with different data).
- the hub may also (or alternatively) be provided with a quality or confidence metric associated with each decoding/computing operation. This may help reveal packaging issues, print issues, media corruption issues, etc., that need consideration.
- a pipe manager 51 (which may be realized as the cell phone-side portion of the query router and response manager of Fig. 7) performs a variety of functions relating to communicating across a data pipe 52. (It will be recognized that pipe 52 is a data construct that may comprise a variety of communication channels.)
- one function performed by pipe manager 51 is to negotiate for needed communication resources.
- the cell phone can employ a variety of communication networks and commercial data carriers, e.g., cellular data, WiFi, Bluetooth, etc. - any or all of which may be utilized. Each may have its own protocol stack.
- the pipe manager 51 interacts with respective interfaces for these data channels - determining the availability of bandwidth for different data payloads.
- the pipe manager may alert the cellular data carrier local interface and network that there will be a payload ready for transmission starting in about 450 milliseconds. It may further specify the size of the payload (e.g., two megabits), its character (e.g., block data), and a needed quality of service (e.g., data throughput rate). It may also specify a priority level for the transmission, so that the interface and network can service such transmission ahead of lower-priority data exchanges, in the event of a conflict.
- the pipe manager knows the expected size of the payload due to information provided by the control processor module 36 (the control processor module specifies the particular processing that will yield the payload, and so it can estimate the size of the resulting data).
- the control processor module can also predict the character of the data, e.g., whether it will be available as a fixed block or intermittently in bursts, the rate at which it will be provided for transmission, etc.
- the control processor module 36 can also predict the time at which the data will be ready for transmission.
- the priority information is known by the control processor module. In some instances the control processor module autonomously sets the priority level. In other instances the priority level is dictated by the user, or by the particular application being serviced.
- the user may expressly signal (through the cell phone's graphical user interface), or a particular application may regularly require, that an image-based action is to be processed immediately. This may be the case, for example, where further action from the user is expected based on the results of the image processing. In other cases the user may expressly signal, or a particular application may normally permit, that an image-based action can be performed whenever convenient (e.g., when needed resources have low or nil utilization). This may be the case, for example, if a user is posting a snapshot to a social networking site such as Facebook, and would like the image annotated with names of depicted individuals - through facial recognition processing. Intermediate prioritization (expressed by the user, or by the application) can also be employed, e.g., process within a minute, ten minutes, an hour, a day, etc.
- control processor module 36 informs the pipe manager of the expected data size, character, timing, and priority, so that the pipe manager can use same in negotiating for the desired service. (In other embodiments, less or more information can be provided.)
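The negotiation described above (size, timing, character, quality of service, priority) can be sketched as a reservation request against a carrier interface. The class, function, and field names below are invented; the example merely shows the information the pipe manager announces and a carrier's grant/decline response, using the two-megabit figures from the text.

```python
class CarrierInterface:
    """Toy carrier: grants a reservation if the payload fits capacity."""
    def __init__(self, capacity_bits):
        self.capacity_bits = capacity_bits

    def reserve(self, request):
        return request["size_bits"] <= self.capacity_bits

def negotiate(carrier, size_bits, ready_in_ms, character, qos_bps, priority):
    """Pipe manager's side of the negotiation: announce payload size,
    timing, character, needed quality of service and priority; the
    carrier interface accepts or declines the reservation."""
    request = {"size_bits": size_bits, "ready_in_ms": ready_in_ms,
               "character": character, "qos_bps": qos_bps,
               "priority": priority}
    return request, carrier.reserve(request)

carrier = CarrierInterface(capacity_bits=5_000_000)
request, granted = negotiate(carrier, 2_000_000, 450, "block", 1_000_000, "high")
print(granted)  # → True: the two-megabit payload fits
```

If the carrier declines, the control processor module can reschedule the underlying processing or buffer the payload, as described in the lines that follow.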
- the pipe manager may establish a secure socket connection with a particular computer in the cloud that is to receive that particular data payload, and identify the user. If the cloud computer is to perform a facial recognition operation, it may prepare for the operation by retrieving from Apple/Google/Facebook the facial recognition features, and associated names, for friends of the specified user.
- the pipe manager enables pre-warming of the remote computer, to ready it for the expected service request. (The service request may not follow.) In some instances the user may operate the shutter button, and the cell phone may not know what operation will follow.
- the pipe manager - or control processor module - may pre-warm several processes. Or it may predict, based on past experience, what operation will be undertaken, and warm appropriate resources. (E.g., if the user performed facial recognition operations following the last three shutter operations, there's a good chance the user will request facial recognition again.)
- the cell phone may actually start performing component operations for various of the possible functions before any has been selected - particularly those operations whose results may be useful to several of the functions. Pre-warming can also include resources within the cell phone: configuring processors, loading caches, etc.
- control processor module 36 may change the schedule of image processing, buffer results, or take other responsive action.
- the carrier or interface may respond to the pipe manager that the requested transmission cannot be accommodated, e.g., at the requested time or with the requested quality of service.
- the pipe manager may report same to the control processor module 36.
- the control processor module may abort the process that was to result in the two megabit data service requirement and reschedule it for later.
- the control processor module may decide that the two megabit payload may be generated as originally scheduled, and the results may be locally buffered for transmission when the carrier and interface are able to do so. Or other action may be taken.
- the control processor module causes the system to process frames of image data, identifying apparent faces in the field of view (e.g., oval shapes, with two seeming eyes in expected positions). These may be highlighted by rectangles on the cell phone's viewfinder (screen) display.
- imaging devices may additionally (or alternatively) have different image-processing modes.
- One mode may be selected by the user to obtain names of people depicted in a photo (e.g., through facial recognition).
- Another mode may be selected to perform optical character recognition of text found in an image frame.
- Another may trigger operations relating to purchasing a depicted item. Ditto for selling a depicted item. Ditto for obtaining information about a depicted object, scene or person (e.g., from
- These modes may be selected by the user in advance of operating a shutter control, or after.
- plural shutter controls (physical or GUI) are provided for the user - respectively invoking different of the available operations.
- the device infers what operation(s) is/are possibly desired, rather than having the user expressly indicate same.
- the pipeline manager 51 may report back to the control processor module (or to application software) that the requested service cannot be provided. Due to a bottleneck or other constraint, the manager 51 may report that identification of only three of the depicted faces can be accommodated within service quality parameters considered to constitute an "immediate" basis. Another three faces may be recognized within two seconds, and recognition of the full set of faces may be expected in five seconds. (This may be due to a constraint by the remote service provider, rather than the carrier, per se.)
- the control processor module 36 may respond to this report in accordance with an algorithm, or by reference to a rule set stored in a local or remote data structure.
- the algorithm or rule set may conclude that for facial recognition operations, delayed service should be accepted on whatever terms are available, and the user should be alerted (through the device GUI) that there will be a delay of about N seconds before full results are available.
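One way such a rule set might be organized is a table keyed by operation type, consulted when a degraded-service report arrives. All names, rules, and the message format here are hypothetical illustrations, not the patent's implementation:

```python
# Hypothetical rule set: how a degraded-service report is handled,
# per operation type.  Unlisted operations fall through to a default.
RULES = {
    "face_recognition": {"action": "accept_delayed", "alert_user": True},
    "ocr":              {"action": "reschedule",     "alert_user": False},
}
DEFAULT = {"action": "abort", "alert_user": False}

def handle_service_exception(operation, expected_delay_s):
    """Return the action to take, plus an optional user-facing alert."""
    rule = RULES.get(operation, DEFAULT)
    message = None
    if rule["alert_user"]:
        message = "Full results will be delayed about %d seconds" % expected_delay_s
    return rule["action"], message

action, msg = handle_service_exception("face_recognition", 5)
# facial recognition: accept delayed service and alert the user via the GUI
```

Other entries could route the operation to a less-preferred provider, or suppress the alert, per the service-exception handling described in the following passages.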
- the reported cause of the expected delay may also be exposed to the user.
- Other service exceptions may be handled differently - in some cases with the operation aborted or rescheduled or routed to a less-preferred provider, and/or with the user not alerted.
- the pipeline manager may also query resources out in the cloud - to ensure that they are able to perform whatever services are requested (within specified parameters).
- These cloud resources can include, e.g., data networks and remote computers. If any responds in the negative, or with a service level qualification, this too can be reported back to the control processor module 36, so that appropriate action can be taken.
- the control process 36 may issue corresponding instructions to the pipe manager and/or other modules, as necessary.
- the pipe manager can also act as a flow control manager - orchestrating the transfer of data from the different modules out of the cell phone, resolving conflicts, and reporting errors back to the control processor module 36.
- a robotic toolkit - a set of free software tools for robot and sensor applications, available as open source from sourceforge-dot-net.
- An illustration of the Player Project architecture is shown in Fig. 19A.
- the mobile robot - which typically has a relatively low performance processor - communicates with a fixed server having a relatively higher performance processor
- Various sensor peripherals are coupled to the mobile robot (client) processor through respective drivers, and an API.
- services may be invoked by the server processor from software libraries, through another API.
- the CMU CMVision library is shown in Fig. 19A.
- the Player Project includes "Stage" software that simulates a population of mobile robots moving in a 2D environment, with various sensors and processing - including visual blob detection. ("Gazebo" extends the Stage model to 3D.)
- Fig. 2OA discussed below, also provides a layer of abstraction between the sensors, the locally- available operations, and the externally-available operations.
- Certain embodiments of the present technology can be implemented using a local process & remote process paradigm akin to that of the Player Project, connected by a packet network and inter-process & intra-process communication constructs familiar to artisans (e.g., named pipes, sockets, etc.).
- a protocol by which different processes may communicate; this may take the form of a message-passing paradigm and message queue, or a more network-centric approach where collisions of keyvectors are addressed after the fact (re-transmission; dropping, if timely in nature; etc.).
- data from sensors on the mobile device (e.g., microphone, camera)
- the instruction(s) associated with data may not be express; they can be implicit (such as Bayer conversion) or session specific - based on context or user desires (in a photo taking mode, face detection may be presumed.)
- keyvectors from each sensor are created and packaged by device driver software processes that abstract the hardware specific embodiments of the sensor and provide a fully formed keyvector adhering to a selected protocol.
- the device driver software can then place the formed keyvector on an output queue unique to that sensor, or in a common message queue shared by all the sensors. Regardless of approach, local processes can consume the keyvectors and perform the needed operations before placing the resultant keyvectors back on the queue. Those keyvectors that are to be processed by remote services are then placed in packets and transmitted directly to a remote process for additional processing, or to a remote service that distributes the keyvectors - similar to a router. It should be clear to the reader that commands to initialize or set up any of the sensors or processes in the system can be distributed in a similar fashion from a Control Process (e.g., box 36 in Fig. 16).
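The queue-and-dispatch arrangement just described might be sketched as follows. The operation names and the local/remote split are illustrative assumptions; in the described system a keyvector would carry sensor data plus its associated (express or implicit) instructions:

```python
import queue

# Illustrative split: operations the handset performs itself.
LOCAL_OPS = {"bayer_conversion", "histogram"}

def dispatch(keyvector_queue, local_workers, remote_sender):
    """Drain a shared keyvector queue: keyvectors whose operation is
    supported locally go to local worker processes; the rest are handed
    to `remote_sender`, standing in for packetizing and transmission to
    a remote process, or to a router-like distribution service."""
    routed = []
    while True:
        try:
            kv = keyvector_queue.get_nowait()
        except queue.Empty:
            break
        if kv["op"] in LOCAL_OPS:
            local_workers.append(kv)       # consume locally
            routed.append(("local", kv["op"]))
        else:
            remote_sender(kv)              # package and transmit
            routed.append(("remote", kv["op"]))
    return routed

q = queue.Queue()
for op in ("bayer_conversion", "face_recognition"):
    q.put({"op": op, "data": b""})
sent, local = [], []
dispatch(q, local, sent.append)   # bayer stays local; faces go remote
```

In practice the "remote sender" would be a socket or named pipe; the single shared queue shown here could equally be one queue per sensor, as the text notes.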
- branch prediction arose to meet the needs of increasingly complex processor hardware; it allows processors with lengthy pipelines to fetch data and instructions (and in some cases, execute the instructions), without waiting for conditional branches to be resolved.
- a similar science can be applied in the present context - predicting what action a human user will take. For example, as discussed above, the just-detailed system may "pre-warm" certain processors, or communication channels, in anticipation that certain data or processing operations will be forthcoming.
- Recent releases are good prospects (except those rated G, or rated high for violence - stored profile data indicates the user just doesn't have a history of watching those). So are movies that she's watched in the past (as indicated by historical rental records - also available to the phone).
- Google Streetview can suggest she's looking at business signage along 5th Avenue.
- feature recognition reference data for this geography should be downloaded into the cache for rapid matching against to-be-acquired image data.
- the cache should be loaded in a rational fashion - so that the most likely object is considered first.
- Google Streetview for that location includes metadata indicating 5th Avenue has signs for a Starbucks, a Nordstrom store, and a Thai restaurant.
- Stored profile data for the user reveals she visits Starbucks daily (she has their branded loyalty card); she is a frequent clothes shopper (albeit with a Macy's, rather than a Nordstrom's charge card); and she's never eaten at a Thai restaurant. Perhaps the cache should be loaded so as to most quickly identify the Starbucks sign, followed by Nordstrom, followed by the Thai restaurant.
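Loading the cache "in a rational fashion" could amount to ordering the reference data by a per-user affinity score, as in this hypothetical sketch (the scores are invented for illustration; a real system might derive them from loyalty-card, purchase, and browsing history):

```python
def cache_priority(signs, profile):
    """Order feature-recognition reference data so the most likely
    object for this user is matched first against to-be-acquired
    imagery.  `profile` maps an object to an illustrative affinity
    score derived from stored user data."""
    return sorted(signs, key=lambda s: profile.get(s, 0), reverse=True)

# Daily Starbucks visitor, frequent clothes shopper, never eaten Thai:
profile = {"Starbucks": 0.9, "Nordstrom": 0.5, "Thai restaurant": 0.1}
order = cache_priority(["Thai restaurant", "Starbucks", "Nordstrom"], profile)
# → ['Starbucks', 'Nordstrom', 'Thai restaurant']
```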
- Low resolution imagery captured for presentation on the viewfinder fails to trigger the camera's feature that highlights probable faces (e.g., for exposure optimization purposes). That helps. There's no need to pre-warm the complex processing associated with facial recognition. She touches the virtual shutter button, capturing a frame of high resolution imagery, and image analysis gets underway - trying to recognize what's in the field of view, so that the camera application can overlay graphical links related to objects in the captured frame. (Or this may happen without user action - the camera may be watching proactively.)
- visual "baubles" (Fig. 0) are overlaid on the captured imagery. Tapping on any of the baubles pulls up a screen of information, such as a ranked list of links.
- a screen of information such as a ranked list of links.
- Google web search which ranks search results in an order based on aggregate user data
- the camera application attempts a ranking customized to the user's profile. If a Starbucks sign or logo is found in the frame, the Starbucks link gets top position for this user.
- the cell phone application may have a capitalistic bent and be willing to promote a link by a position or two (although perhaps not to the top position) if circumstances warrant.
- the cell phone routinely sent IP packets to the web servers at addresses associated with each of the links, alerting them that an iPhone user had recognized their corporate signage from a particular latitude/longitude.
- the Thai restaurant server responds back in an instant - offering to the next two customers 25% off any one item (the restaurant's point of sale system indicates only four tables are occupied and no order is pending; the cook is idle).
- the restaurant server offers three cents if the phone will present the discount offer to the user in its presentation of search results, or five cents if it will also promote the link to second place in the ranked list, or ten cents if it will do that and be the only discount offer presented in the results list. (Starbucks also responded with an incentive, but not as attractive).
- the cell phone quickly accepts the restaurant's offer, and payments are quickly made - either to the user (e.g., defraying the monthly phone bill) or more likely to the phone carrier (e.g., AT&T). Links are presented to Starbucks, the Thai restaurant, and Nordstrom, in that order, with the restaurant's link noting the discount for the next two customers.
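The phone's evaluation of competing incentive offers might be sketched as below. The policy (promotion limited to one position, never into the top slot; highest payment among eligible offers wins) and the field names are assumptions for illustration:

```python
def best_offer(offers, max_promote=1):
    """Choose among incentive offers received from pinged servers.
    Policy (illustrative): a link may be promoted by at most
    `max_promote` positions; among eligible offers, take the one with
    the highest payment.  Returns None if nothing qualifies."""
    eligible = [o for o in offers if o["promote_by"] <= max_promote]
    return max(eligible, key=lambda o: o["payment_cents"], default=None)

offers = [
    # Thai restaurant: 5 cents to promote its discount link to 2nd place
    {"vendor": "thai_restaurant", "payment_cents": 5, "promote_by": 1},
    # Starbucks responded too, but less attractively
    {"vendor": "starbucks",       "payment_cents": 3, "promote_by": 1},
]
print(best_offer(offers)["vendor"])   # thai_restaurant
```

A fuller model would also handle exclusivity terms (e.g., the ten-cent sole-discount tier) and route the payment to the user or the carrier, as the scenario describes.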
- Google's AdWord technology has already been noted. It decides, based on factors including an auction-determined payment, which ads to present as Sponsored Links adjacent the results of a Google web search. Google has adapted this technology to present ads on third party web sites and blogs, based on the particular contents of those sites, terming the service AdSense. In accordance with another aspect of the present technology, the AdWord/AdSense technology is extended to visual image search on cell phones.
- the cell phone again pings the servers of companies for whom links will be presented - helping them track their physical world-based online visibility.
- the pings can include the location of the user, and an identification of the object that prompted the ping.
- alldiscountbooks-dot-net may check inventory and find it has a significant overstock of Snowball. As in the example given earlier, it may offer an extra payment for some extra promotion (e.g., including "We have 732 copies - cheap!" in the presented link).
- a company may also offer additional bandwidth to serve information to a customer.
- a user may capture video imagery from an electronic billboard, and want to download a copy to show to friends.
- the user's cell phone identifies the content as a popular clip of user generated content (e.g., by reference to an encoded watermark), and finds that the clip is available from several sites - the most popular of which is YouTube, followed by MySpace.
- MySpace may offer to upgrade the user's baseline wireless service from 3 megabits per second to 10 megabits per second, so the video will download in a third of the time. This upgraded service can be limited to the video download, or it can extend longer.
- the link presented on the screen of the user's cell phone can be amended to highlight the availability of the faster service. (Again, MySpace may make an associated payment.)
- the quality of service (e.g., bandwidth) is managed by pipe manager 51.
- Instructions from MySpace may request that the pipe manager start requesting augmented service quality, and setting up the expected high bandwidth session, even before the user selects the MySpace link.
- vendors may negotiate preferential bandwidth for their content.
- MySpace may make a deal with AT&T, for example, that all MySpace content delivered to AT&T phone subscribers be delivered at 10 megabits per second - even though most subscribers normally only receive 3 megabits per second service. The higher quality service may be highlighted to the user in the presented links.
- the information presented by a mobile phone in response to visual stimuli is a function both of (1) the user's preferences, and (2) third party competition for that user's attention, probably based on the user's demographic profile.
- Demographically identical users, but with different tastes in food, will likely be presented with different baubles, or associated information, when viewing a street crowded with restaurants.
- Users with identical tastes in food and other preference information - but differing in a demographic factor (e.g., age, gender) - may likewise be presented with different baubles or associated information.
- simulation models of human computer interaction with the physical world can be based on tools and techniques from fields as diverse as robotics and audience measurement.
- An example of this might be the number of expected mobile devices in a museum at a particular time; the particular sensors that such devices are likely to be using; and what stimuli are expected to be captured by those sensors (e.g., where are they pointing the camera, what is the microphone hearing, etc.). Additional information can include assumptions about social relationships between users: Are they likely to share common interests?
- modeling can range from generalized heuristics derived from observations at past events (e.g., how many people used their cell phone cameras to capture imagery from the Portland Trailblazers' scoreboard during a basketball game, etc.), to more evolved predictive models that are based on innate human behavior (e.g., people are more likely to capture imagery from a scoreboard during overtime than during a game's half-time).
- Such models can inform many aspects of the experience for the users, in addition to the business entities involved in provisioning and measuring the experience.
- Metrics for more static environments may range from Revenue Per Unit (RPU), based on the digital traffic created on the digital service provider network (how much bandwidth is being consumed), to more evolved models of Click Through Rates (CTR) for particular sensor stimuli.
- the Mona Lisa painting in the Louvre is likely to have a much higher CTR than other paintings in the museum, informing matters such as priority for content provisioning, e.g., content related to the Mona Lisa should be cached as close to the edge of the cloud as possible, if not pre-loaded onto the mobile device itself when the user approaches or enters the museum. (Of equal importance is the role that CTR plays in monetizing the experience and environment.)
- the museum may provide content related to Rodin and his works on servers or infrastructure (e.g., router caches) that serve the garden. Moreover, because the visitors comprise a pre-established social group, the museum may expect some social connectivity. So the museum may enable sharing capabilities (e.g., ad hoc networking) that might not otherwise be used. If one student queries the museum's online content to learn more about a particular Rodin sculpture, the system may accompany delivery of the solicited information with a prompt inviting the student to share this information with others in the group.
- the museum server can suggest particular "friends" of the student with whom such information might be shared - if such information is publicly accessible from Facebook or other social networking data source.
- a social networking data source can also provide device identifiers, IP addresses, profile information, etc., for the student's friends - which may be leveraged to assist the dissemination of educational material to others in the group.
- These other students may find this particular information relevant, since it was of interest to another in their group - even if the original student's name is not identified. If the original student is identified with the conveyed information, then this may heighten the information's interest to others in the group.
- Detection of a socially- linked group may be inferred from review of the museum's network traffic. For example, if a device sends packets of data to another, and the museum's network handles both ends of the communication - dispatch and delivery, then there's an association between two devices in the museum. If the devices are not ones that have historical patterns of network usage, e.g., employees, then the system can conclude that two visitors to the museum are socially connected. If a web of such communications is detected - involving several unfamiliar devices, then a social group of visitors can be discerned. The size of the group can be gauged by the number of different participants in such network traffic.
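Discerning a social group from such a web of communications amounts to finding connected components among unfamiliar devices. A minimal sketch, assuming packet logs are available as (source, destination) pairs and that familiar devices (e.g., employees) are already identified:

```python
from collections import defaultdict

def discern_groups(packets, known_devices):
    """Build an undirected graph from traffic where the network handles
    both dispatch and delivery, skipping devices with historical usage
    patterns (e.g., employees).  Connected components among the
    remaining devices serve as a rough proxy for socially linked
    visitor groups; component size gauges group size."""
    graph = defaultdict(set)
    for src, dst in packets:
        if src in known_devices or dst in known_devices:
            continue
        graph[src].add(dst)
        graph[dst].add(src)
    groups, seen = [], set()
    for node in list(graph):
        if node in seen:
            continue
        stack, comp = [node], set()      # depth-first traversal
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(graph[n] - comp)
        seen |= comp
        groups.append(comp)
    return groups

packets = [("a", "b"), ("b", "c"), ("d", "e"), ("x", "employee1")]
groups = discern_groups(packets, {"employee1"})
# two visitor groups emerge: {'a','b','c'} and {'d','e'}
```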
- Demographic information about the group can be inferred from external addresses with which data is exchanged; middle schoolers may have a high incidence of MySpace traffic; college students may communicate with external addresses at a university domain; senior citizens may demonstrate a different traffic profile. All such information can be employed in automatically adapting the information and services provided to the visitors - as well as providing useful information to the museum's administration and marketing departments.
- Examples include rights clearance for associated content, rendering virtual worlds and other synthesized content, throttling down routine time-insensitive network traffic, queuing commercial resources that may be invoked as people purchase souvenir books/music from Amazon (caching pages, authenticating users to financial sites), propagating links for post-game interviews (some prebuilt/edited and ready to go), caching the Twitter feeds of the star players, buffering video from city center showing the hometown crowds watching on a Jumbotron display - erupting with joy at the buzzer, etc.; anything relating to the experience or follow-on actions should be prepped/cached in advance, where possible.
- VAST (Digital Video Ad Serving Template)
- VAST helps standardize the service of video ads to independent web sites (replacing old-style banner ads), commonly based on a bit of Javascript included in the web page code - code that also aids in tracking traffic and managing cookies. VAST can also insert promotional messages in the pre-roll and post-roll viewing of other video content delivered by the web site.
- the web site owner doesn't concern itself with selling or running the advertisements, yet at the end of the month the web site owner receives payment based on audience viewership/impressions.
- physical stimuli presented to users in the real world - sensed by mobile technology - can be the basis for payments to the parties involved.
- Dynamic environments in which stimulus presented to users and their mobile devices can be controlled provide new opportunities for measurement and utilization of metrics such as CTR.
- Background music, content on digital displays, illumination, etc. can be modified to maximize CTR and shape traffic.
- illumination on particular signage can be increased, or flash, as a targeted individual passes by.
- digital signage, music, etc. can all be modified overtly (changing the advertising to suit the interests of the expected audience) or covertly (changing the linked experience, e.g., to take the user to the Japanese-language website), to maximize the CTR.
- Mechanisms may be introduced as well to contend with rogue or un-approved sensor stimuli.
- stimuli (posts, music, digital signage, etc.)
- This can be accomplished through the use of simple blocking mechanisms that are geography-specific (not dissimilar to region coding on DVDs), indicating that all attempts within specific GPS coordinates to route a keyvector to a specific place in the cloud must be mediated by a routing service or gateway managed by the domain owner.
- content rules such as the Movielabs Content Recognition Rules related to conflicting media content (c.f., www.movielabs-dot-com/CRR), parental controls provided by carriers to the device, or by adhering to DMCA Automatic Take Down Notices.
- licenses play a key role in determining how content can be consumed, shared, modified etc.
- a result of extracting semantic meaning from stimulus presented to the user (and the user's mobile device), and/or the location in which stimulus is presented, can be issuance of a license to desired content or experiences (games, etc.) by third parties.
- passengers disembarking from an international flight may be granted location-based or time-limited licenses to translation services or navigation services (e.g., an augmented reality system overlaying directions for baggage claim, bathrooms, etc., on camera-captured scenes) for their mobile devices, while they transit through customs, are in the airport, for 90 minutes after their arrival, etc.
- Such arrangements can serve as metaphors for experience, and as filtering mechanisms.
- One embodiment in which sharing of experiences is triggered by sensor stimuli is through broadcast social networks (e.g., Twitter) and syndication protocols (e.g., RSS web feeds/channels).
- Other users, entities or devices can subscribe to such broadcasts/feeds as the basis for subsequent communication (social, information retrieval, etc.), as logging of activities (e.g., a person's daily journal), or measurement (audience, etc.).
- Traffic associated with such networks/feeds can also be measured by devices at a particular location - allowing users to look back in time to understand who was communicating what at a particular moment. This enables searching for and mining additional information, e.g., was my friend here last week?
- Such traffic also enables real-time monitoring of how users share experiences. Monitoring "tweets" about a performer's song selection during a concert may cause the performer to alter the songs to be played for the remainder of a concert. The same is true for brand management. For example, if users share their opinions about a car during a car show, live keyword filtering on the traffic can allow the brand owner to re-position certain products for maximum effect (e.g., the new model of Corvette should spend more time on the spinning platform, etc.).
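The live keyword filtering mentioned here can be sketched as a simple stream filter over broadcast messages; the term list and return shape are illustrative:

```python
from collections import Counter

def keyword_filter(messages, keywords):
    """Live keyword filtering over a broadcast stream (e.g., tweets):
    return the messages mentioning any tracked term, plus per-term
    counts a brand owner could monitor in real time."""
    hits = Counter()
    matched = []
    for msg in messages:
        low = msg.lower()
        found = [k for k in keywords if k in low]
        if found:
            matched.append(msg)
            hits.update(found)
    return matched, hits

tweets = ["Love the new Corvette!", "lunch was great",
          "corvette on the spinning platform - wow"]
matched, hits = keyword_filter(tweets, ["corvette"])
# 2 of 3 messages mention the tracked brand term
```

Production systems would use a streaming API and smarter matching (stemming, phrase queries), but the monitoring loop is the same: count mentions, then adjust the presentation (song list, show-floor rotation) accordingly.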
- Predicting the user's action or intent is one form of optimization. Another form involves configuring the processing so as to improve performance.
- One step in the process is to determine which operations need to occur. This determination can be based on express requests from the user, historical patterns of usage, context and status, etc. Many operations are high level functions, which involve a number of component operations - performed in a particular order. For example, optical character recognition may require edge detection, followed by region-of-interest segmentation, followed by template pattern matching. Facial recognition may involve skintone detection, Hough transforms (to identify oval-shaped areas), identification of feature locations, etc.
- the system can identify the component operations that may need to be performed, and the order in which their respective results are required. Rules and heuristics can be applied to help determine whether these operations should be performed locally or remotely.
- the rules may specify that simple operations, such as color histograms and thresholding, should generally be performed locally.
- complex operations may usually default to outside providers. Scheduling can be determined based on which operations are preconditions to other operations. This can also influence whether an operation is performed locally or remotely (local performance may provide quicker results - allowing subsequent operations to be started with less delay).
- the rules may seek to identify the operation whose output(s) is used by the greatest number of subsequent operations, and perform this operation first (its respective precedent(s) permitting). Operations that are preconditions to successively fewer other operations are performed successively later.
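The precondition-aware ordering described above - run first the operation whose output feeds the most other operations, preconditions permitting - can be sketched as follows (operation names are illustrative):

```python
def schedule(deps):
    """`deps` maps each operation to the set of operations it requires.
    Order the operations so that, among those whose preconditions are
    satisfied, the one whose output is consumed by the most other
    operations runs first."""
    # count how many other operations consume each operation's output
    consumers = {op: 0 for op in deps}
    for needs in deps.values():
        for p in needs:
            consumers[p] += 1
    done, order = set(), []
    while len(order) < len(deps):
        ready = [op for op in deps if op not in done and deps[op] <= done]
        op = max(ready, key=lambda o: consumers[o])   # most-consumed first
        order.append(op)
        done.add(op)
    return order

# The OCR chain from the text: edge detection, then segmentation,
# then template pattern matching.
ocr = {"edge_detect": set(),
       "segment": {"edge_detect"},
       "template_match": {"segment"}}
print(schedule(ocr))   # ['edge_detect', 'segment', 'template_match']
```

With a branching graph (e.g., edge detection feeding both segmentation and a face-localization step), the consumer counts break ties in favor of the operation with the most dependents - the tree-like ordering the next passage describes.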
- the operations, and their sequence, may be conceived as a tree structure - with the most globally important performed first, and operations of lesser relevance to other operations performed later.
- Such determinations may also be tempered (or dominated) by other factors.
- the limited processing capability of the cell phone may mean that processing locally is slower than processing remotely (e.g., where a more robust, parallel, architecture might be available to perform the operation).
- the delays of establishing communication with a remote server, and establishing a session may make local performance of an operation quicker.
- the speed with which results are returned may be important, or not.
- Still another factor is user preferences.
- the user may set parameters influencing where, and when, operations are performed. For example, a user may specify that an operation may be referred to remote processing by a domestic service provider, but if none is available, then the operation should be performed locally.
- Routing constraints are another factor.
- the cell phone will be in a WiFi or other service area (e.g., in a concert arena) in which the local network provider places limits or conditions on remote service requests that may be accessed through that network.
- the local network may be configured to block access to external image processing service providers for the duration of the concert. In this case, services normally routed for external execution should be performed locally.
- a related factor is current hardware utilization. Even if a cell phone is equipped with hardware that is well configured for a certain task, it may be so busy and backlogged that the system may refer a next task of this sort to an external resource for completion. Another factor may be the length of the local processing chain, and the risk of a stall. Pipelined processing architectures may become stalled for intervals as they wait for data needed to complete an operation. Such a stall can cause all other subsequent operations to be similarly delayed.
- the risk of a possible stall can be assessed (e.g., by historical patterns, or knowledge that completion of an operation requires further data whose timely availability is not assured - such as a result from another external process) and, if the risk is great enough, the operation may be referred for external processing to avoid stalling the local processing chain.
- Geographical considerations of different sorts can also be factors.
- One is network proximity to the service provider.
- Another is whether the cell phone has unlimited access to the network (as in a home region), or a pay-per-use arrangement (as when roaming in another country).
- Information about the remote service provider(s) can also be factored. Is the service provider offering immediate turnaround, or are requested operations placed in a long queue, behind other users awaiting service? Once the provider is ready to process the task, what speed of execution is expected? Costs may also be key factors, together with other attributes of importance to the user (e.g., whether the service provider meets "green" standards of environmental responsibility). A great many other factors can also be considered, as may be appropriate in particular contexts. Sources for such data can include the various elements shown in the illustrative block diagrams, as well as external resources.
- A conceptual illustration of the foregoing is provided in Fig. 19B. Based on the various factors, a determination can be made as to whether an operation should be performed locally, or remotely. (The same factors may be assessed in determining the order in which operations should be performed.)
- the different factors can be quantified by scores, which can be combined in polynomial fashion to yield an overall score, indicating how an operation should be handled.
- Such an overall score serves as a metric indicating the relative suitability of the operation for remote or external processing. (A similar scoring approach can be employed to choose between different service providers.)
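A polynomial combination of factor scores might look like the following sketch, where each factor score (0..1) expresses how strongly that factor favors remote execution. The factor names, weights, and 0.5 decision threshold are assumptions for illustration:

```python
def remote_suitability(factors, weights):
    """Combine per-factor scores into an overall metric of an
    operation's suitability for remote processing.  Each score in
    `factors` is 0..1, where 1 means the factor strongly favors
    external execution; `weights` reflects each factor's importance.
    Above 0.5, refer the operation to a remote provider."""
    total = sum(weights.values())
    score = sum(weights[f] * factors.get(f, 0.0) for f in weights) / total
    return score, ("remote" if score > 0.5 else "local")

# A complex operation, busy local hardware, cheap network access:
factors = {"complexity": 0.9, "cpu_load": 0.8, "network_cost": 0.2}
weights = {"complexity": 3, "cpu_load": 2, "network_cost": 2}
score, where = remote_suitability(factors, weights)   # → remote
```

The same scoring approach can rank candidate service providers: compute a score per provider and choose the maximum.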
- a given operation may be performed locally at one instant, and performed remotely at a later instant (or vice versa). Or, the same operation may be performed on two sets of keyvector data at the same time - one locally, and one remotely. While described in the context of determining whether an operation should be performed locally or remotely, the same factors can influence other matters as well. For example, they can also be used in deciding what information is conveyed by keyvectors.
- unprocessed pixel data from a captured image may be sent to a remote service provider to make this determination.
- the cell phone may perform initial processing, such as edge detection, and then package the edge-detected data in keyvector form, and route to an external provider to complete the OCR operation.
- the cell phone may perform all of the component OCR operations up until the last (template matching), and send out data only for this last operation.
- the OCR operation may be completed wholly by the cell phone, or different components of operation can be performed alternately by the cell phone and remote service provider(s), etc.
- the Pepsi Center may provide wireless communication services to patrons, through its own WiFi or other network. Naturally, the Pepsi Center is reluctant for its network resources to be used for the benefit of competitors, such as Coca Cola.
- the host network may thus influence cloud services that can be utilized by its patrons (e.g., by making some inaccessible, or by giving lower priority to data traffic of certain types, or with certain destinations).
- the domain owner may exert control over what operations a mobile device is capable of performing.
- This control can influence the local/remote decision, as well as the type of data conveyed in keyvector packets.
- Another example is a gym, which may want to impede usage of cell phone cameras, e.g., by interfering with access to remote service providers for imagery, as well as photo sharing sites such as Flickr and Picasa.
- Still another example is a school which, for privacy reasons, may want to discourage facial recognition of its students and staff. In such case, access to facial recognition service providers can be blocked, or granted only on a moderated case-by-case basis.
- Venues may find it difficult to stop individuals from using cell phone cameras - or using them for particular purposes, but they can take various actions to impede such use (e.g., by denying services that would promote or facilitate such use).
- opportunity cost given the current state of the device, e.g., what other processes should take priority, such as a voice call, GPS navigation, etc.
- is there a pattern of exposure to stimuli, such as a user walking through an airport terminal being repeatedly exposed to CNN as it is presented at each gate?
- a particular embodiment may undertake some suitability testing before engaging a processing resource for what may be more than a threshold number of clock cycles.
- a simple suitability test is to make sure the image data is potentially useful for the intended purpose, as contrasted with data that can be quickly disqualified from analysis. For example, a frame that is entirely black (e.g., captured in the user's pocket) can be disqualified immediately. Adequate focus can also be checked quickly before committing to an extended operation.
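Such a quick disqualification check might be sketched as follows. The intensity and focus thresholds, and the gradient-based focus metric, are illustrative assumptions:

```python
# Hypothetical pre-check before committing processing cycles to a frame:
# disqualify frames that are essentially black (e.g., captured in a
# pocket) or badly out of focus. Thresholds are illustrative.

def mean_intensity(pixels):
    """Average pixel value of a 2D grayscale frame."""
    flat = [p for row in pixels for p in row]
    return sum(flat) / len(flat)

def focus_measure(pixels):
    """Crude sharpness metric: mean absolute horizontal gradient."""
    total = count = 0
    for row in pixels:
        for a, b in zip(row, row[1:]):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0

def frame_is_usable(pixels, min_intensity=10, min_focus=2.0):
    if mean_intensity(pixels) < min_intensity:
        return False            # effectively all black - disqualify
    return focus_measure(pixels) >= min_focus
```

Only frames passing both checks would then be committed to a processing resource for more than the threshold number of clock cycles.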
- a GPU-equipped cell phone may invoke instructions - when its camera is activated in a user photo-shoot mode - that configure 20 clusters of scalar processors in the GPU. (Such a cluster is sometimes termed a "stream processor.")
- each cluster is configured to perform a Hough transform on a small tile from a captured image frame - looking for one or more oval shapes that may be candidate faces.
- the GPU thus processes the entire frame in parallel, by 20 concurrent Hough transforms.
- the GPU may be reconfigured into a lesser number of stream processors - one dedicated to analyzing each candidate oval shape, to determine positions of eye pupils, nose location, and distance across the mouth.
- associated parameters would be packaged in keyvector form, and transmitted to a cloud service that checks the keyvectors of analyzed facial parameters against known templates, e.g., of the user's Facebook friends. (Or, such checking could also be performed by the GPU, or by another processor in the cell phone.)
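The tiled-parallel processing described above can be illustrated in miniature as follows. A thread pool stands in for the GPU's stream processor clusters, and `detect_ovals` is a hypothetical placeholder for a Hough-transform oval detector:

```python
# Sketch of the tiled-parallel idea: a captured frame is split into 20
# tiles, and a candidate-shape detector runs on each tile concurrently.
from concurrent.futures import ThreadPoolExecutor

def split_into_tiles(frame, n_tiles=20):
    """Split a frame (list of pixel rows) into n_tiles row bands."""
    step = max(1, len(frame) // n_tiles)
    return [frame[i:i + step] for i in range(0, len(frame), step)]

def detect_ovals(tile):
    """Placeholder detector: reports the tile's brightest pixel as a
    'candidate' when it exceeds a threshold (a real system would run a
    Hough transform here)."""
    best = max(p for row in tile for p in row)
    return [best] if best > 200 else []

def find_candidates(frame):
    tiles = split_into_tiles(frame)
    with ThreadPoolExecutor(max_workers=len(tiles)) as pool:
        results = pool.map(detect_ovals, tiles)   # one worker per tile
    return [c for tile_hits in results for c in tile_hits]
```

As in the described arrangement, the pool could then be reconfigured with fewer, larger workers, each devoted to analyzing one candidate region in detail.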
- a processing stage 38 monitors, e.g., the average intensity, redness, greenness or other coloration of the image data contained in the bodies of packets.
- This intensity data can be applied to an output 33 of that stage.
- each packet can convey a timestamp indicating the particular time (absolute, or based on a local clock) at which the image data was captured. This time data, too, can be provided on output 33.
- a synchronization processor 35 coupled to such an output 33 can examine the variation in frame-to-frame intensity (or color), as a function of timestamp data, to discern its periodicity. Moreover, this module can predict the next time instant at which the intensity (or color) will have a maximum, minimum, or other particular state.
- a phase-locked loop may control an oscillator that is synced to mirror the periodicity of an aspect of the illumination. More typically, a digital filter computes a time interval that is used to set or compare against timers - optionally with software interrupts. A digital phase-locked loop or delay-locked loop can also be used.
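A simple digital-filter version of such periodicity estimation might look like the following sketch, which averages the intervals between successive intensity maxima (taken from packet timestamps) and extrapolates the next one. The peak-picking logic is an illustrative simplification of what a PLL or delay-locked loop would do:

```python
# Sketch: given (timestamp, intensity) samples from packet outputs,
# estimate the illumination period and predict the next maximum.

def next_maximum(samples):
    """samples: list of (timestamp, intensity) tuples, sorted by time.

    Returns the predicted timestamp of the next intensity maximum,
    or None if too few peaks have been observed.
    """
    peaks = []
    for prev, cur, nxt in zip(samples, samples[1:], samples[2:]):
        if cur[1] > prev[1] and cur[1] > nxt[1]:   # local maximum
            peaks.append(cur[0])
    if len(peaks) < 2:
        return None
    intervals = [b - a for a, b in zip(peaks, peaks[1:])]
    period = sum(intervals) / len(intervals)       # averaged interval
    return peaks[-1] + period
```

Control processor module 36 could poll such a routine to learn when, e.g., green illumination is next expected to peak, and schedule a frame capture for that instant.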
- Control processor module 36 can poll the synchronization module 35 to determine when a lighting condition is expected to have a desired state. With this information, control processor module 36 can direct setup module 34 to capture a frame of data under favorable lighting conditions for a particular purpose. For example, if the camera is imaging an object suspected of having a digital watermark encoded in a green color channel, processor 36 may direct camera 32 to capture a frame of imagery at an instant that green illumination is expected to be at a maximum, and direct processing stages 38 to process that frame for detection of such a watermark.
- the camera phone may be equipped with plural LED light sources that are usually operated in tandem to produce a flash of white light illumination on a subject. Operated individually or in different combinations, however, they can cast different colors of light on the subject.
- the phone processor may control the component LED sources individually, to capture frames with non-white illumination. If capturing an image that is to be read to decode a green-channel watermark, only green illumination may be applied when the frame is captured. Or a camera may capture plural successive frames - with different LEDs illuminating the subject. One frame may be captured at a 1/250th second exposure with a corresponding period of red-only illumination; a subsequent frame may be captured at a 1/100th second exposure with a corresponding period of green-only illumination, etc.
- These frames may be analyzed separately, or may be combined, e.g., for analysis in the aggregate. Or a single frame of imagery may be captured over an interval of 1/100th of a second, with the green LED activated for that entire interval, and the red LED activated for 1/250th of a second during that 1/100th second interval.
- the instantaneous ambient illumination can be sensed (or predicted, as above), and the component LED colored light sources can be operated in a responsive manner (e.g., to counteract orangeness of tungsten illumination by adding blue illumination from a blue LED).
- a processing stage 38 may be instructed to break a packet into multiple packets - such as by splitting image data into 16 tiled smaller sub-images. Thus, more packets may be present at the end of the system than were produced at the beginning.
- a single packet may contain a collection of data from a series of different images (e.g., images taken sequentially - with different focus, aperture, or shutter settings; a particular example is a set of focus regions from five images taken with focus bracketing, or depth of field bracketing - overlapping, abutting, or disjoint.)
- This set of data may then be processed by later stages - either as a set, or through a process that selects one or more excerpts of the packet payload that meet specified criteria (e.g., a focus sharpness metric).
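Selection of a payload excerpt meeting a specified sharpness criterion might be sketched as follows. The gradient-based sharpness metric is an illustrative stand-in for a real focus-sharpness measure:

```python
# Sketch: a packet body carrying focus-bracketed regions from several
# captures; a later stage selects the excerpt best meeting a focus
# sharpness criterion.

def sharpness(region):
    """Mean absolute difference between horizontal neighbors."""
    diffs = [abs(a - b) for row in region for a, b in zip(row, row[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0

def select_sharpest(packet_body, min_sharpness=0.0):
    """packet_body: list of (label, region) pairs from bracketed captures.

    Returns the label of the sharpest qualifying excerpt, or None.
    """
    scored = [(label, sharpness(region)) for label, region in packet_body]
    qualifying = [s for s in scored if s[1] >= min_sharpness]
    return max(qualifying, key=lambda s: s[1])[0] if qualifying else None
```

Alternatively, all qualifying excerpts could be passed on together, for processing of the bracketed set in the aggregate.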
- each processing stage 38 generally substituted the result of its processing for the data originally received in the body of the packet. In other arrangements this need not be the case.
- a stage may output a result of its processing to a module outside the depicted processing chain, e.g., on an output 33. (Or, as noted, a stage may maintain - in the body of the output packet - the data originally received, and augment it with further data - such as the result(s) of its processing.)
- Each stage typically conducts a handshaking exchange with an adjoining stage - each time data is passed to or received from the adjoining stage.
- Such handshaking is routine to the artisan familiar with digital system design, so is not belabored here.
- One function that benefits from multiple cameras is distinguishing objects.
- a single camera is unable to distinguish a human face from a picture of a face (e.g., as may be found in a magazine, on a billboard, or on an electronic display screen).
- the 3D aspect of the picture can readily be discerned, allowing a picture to be distinguished from a person. (Depending on the implementation, it may be the 3D aspect of the person that is actually discerned.)
- a processor can determine the device's distance from landmarks whose location may be precisely known. This allows refinement of other geolocation data available to the device (e.g., by WiFi node identification, GPS, etc.)
- a cell phone may have one, two (or more) sensors
- such a device may also have one, two (or more) projectors.
- Individual projectors are being deployed in cell phones by CKing (the N70 model, distributed by China Vision) and Samsung (the MPB200). LG and others have shown prototypes. (These projectors are understood to use Texas Instruments electronically-steerable digital micro-mirror arrays, in conjunction with LED or laser illumination.)
- Microvision offers the PicoP Display Engine, which can be integrated into a variety of devices to yield projector capability, using a micro-electro-mechanical scanning mirror (in conjunction with laser sources and an optical combiner).
- Other suitable projection technologies include 3M's liquid crystal on silicon (LCOS) and Displaytech's ferroelectric LCOS systems.
- Consider, for example, two cameras imaging a digitally watermarked object.
- One camera's view of the object gives one measure of a transform that can be discerned from the object's surface (e.g., by encoded calibration signals). This information can be used to correct a view of the object by the other camera. And vice versa.
- the two cameras can iterate, yielding a comprehensive characterization of the object surface. (One camera may view a better-illuminated region of the surface, or see some edges that the other camera can't see. One view may thus reveal information that the other does not.)
- the Fig. 16 architecture can be expanded to include a projector, which projects a pattern onto an object, for capture by the camera system.
- Processing of the resulting image by modules 38 provides information about the surface topology of the object. This 3D topology information can be used as a clue in identifying the object.
- shape information allows a surface to be virtually re-mapped to any other configuration, e.g., flat. Such remapping serves as a sort of normalization operation.
- system 30 operates a projector to project a reference pattern into the camera's field of view. While the pattern is being projected, the camera captures a frame of image data. The resulting image is processed to detect the reference pattern, and therefrom characterize the 3D shape of an imaged object. Subsequent processing then follows, based on the 3D shape data.
- with collimated laser illumination (e.g., from the PicoP Display Engine), the pattern will be in focus regardless of distance to the object onto which the pattern is projected. This can be used as an aid to adjust focus of a cell phone camera onto an arbitrary subject. Because the projected pattern is known in advance by the camera, the captured image data can be processed to optimize detection of the pattern - such as by correlation. (Or the pattern can be selected to facilitate detection - such as a checkerboard that appears strongly at a single frequency in the image frequency domain when properly focused.) Once the camera is adjusted for optimum focus of the known, collimated pattern, the projected pattern can be discontinued, and the camera can then capture a properly focused image of the underlying subject onto which the pattern was projected.
- Synchronous detection can also be employed.
- the pattern may be projected during capture of one frame, and then off for capture of the next.
- the two frames can then be subtracted.
- the common imagery in the two frames generally cancels - leaving the projected pattern at a much higher signal to noise ratio.
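The subtraction step of this synchronous-detection approach can be sketched simply:

```python
# Sketch: subtract a pattern-off frame from a pattern-on frame. Static
# scene content common to both frames cancels, leaving the projected
# pattern at a much higher signal-to-noise ratio.

def subtract_frames(frame_on, frame_off):
    """Element-wise difference of two equal-sized 2D frames."""
    return [[a - b for a, b in zip(row_on, row_off)]
            for row_on, row_off in zip(frame_on, frame_off)]
```

In practice the frames would be captured in quick succession so that camera and scene motion between them is negligible; otherwise the frames may need registration before subtraction.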
- a projected pattern can be used to determine correct focus for several subjects in the camera's field of view.
- a child may pose in front of the Grand Canyon.
- the laser-projected pattern allows the camera to focus on the child in a first frame, and on the background in a second frame. These frames can then be composited - taking from each the portion properly in focus.
- If a lens arrangement is used in the cell phone's projector system, it can also be used for the cell phone's camera system.
- a mirror can be controllably moved to steer the camera or the projector to the lens.
- a beamsplitter arrangement 80 can be used (Fig. 20).
- the body of a cell phone 81 incorporates a lens 82, which provides light to a beam-splitter 84. Part of the illumination is routed to the camera sensor 12. The other part of the optical path goes to a micro-mirror projector system 86.
- Lenses used in cell phone projectors typically have larger apertures than those used for cell phone cameras, so the camera may gain significant performance advantages (e.g., enabling shorter exposures) by use of such a shared lens.
- the beam splitter 84 can be asymmetrical - not equally favoring both optical paths.
- the beam-splitter can be a partially-silvered element that couples a smaller fraction (e.g., 2%, 8%, or 25%) of externally incident light to the sensor path 83.
- the beam-splitter may thus serve to couple a larger fraction (e.g., 98%, 92%, or 75%) of illumination from the micro-mirror projector externally, for projection.
- the camera sensor 12 receives light of an intensity conventional for a cell phone camera (notwithstanding the larger aperture lens), while the light output from the projector is only slightly dimmed by the lens-sharing arrangement.
- a camera head is separate - or detachable - from the cell phone body.
- the cell phone body is carried in a user's pocket or purse, while the camera head is adapted for looking out over a user's pocket (e.g., in a form factor akin to a pen, with a pocket clip, and with a battery in the pen barrel).
- the two communicate by Bluetooth or other wireless arrangement, with capture instructions sent from the phone body, and image data sent from the camera head.
- Such configuration allows the camera to constantly survey the scene in front of the user - without requiring that the cell phone be removed from the user's pocket/purse.
- a strobe light for the camera is separate - or detachable - from the cell phone body.
- the light (which may incorporate LEDs) can be placed near the image subject, providing illumination from a desired angle and distance.
- the strobe can be fired by a wireless command issued by the cell phone camera system.
- the two projectors can project alternating or otherwise distinguishable patterns (e.g., simultaneous, but of differing color, pattern, polarization, etc) into the camera's field of view.
- One user's phone may process an image with Hough transform and other eigenface extraction techniques, and then share the resulting keyvector of eigenface data with others in the user's social circle (either by pushing same to them, or allowing them to pull it).
- One or more of these socially-affiliated devices may then perform facial template matching that yields an identification of a formerly-unrecognized face in the imagery captured by the original user.
- Such arrangement takes a personal experience, and makes it a public experience. Moreover, the experience can become a viral experience, with the keyvector data shared - essentially without bounds - to a great number of further users.
- Another hardware arrangement suitable for use with certain implementations of the present technology uses the Mali-400 ARM graphics multiprocessor architecture, which includes plural fragment processors that can be devoted to the different types of image processing tasks referenced in this document.
- OpenGL ES2.0 defines hundreds of standardized graphics function calls for systems that include multiple CPUs and multiple GPUs (a direction in which cell phones are increasingly migrating). OpenGL ES2.0 attends to routing of different operations to different of the processing units - with such details being transparent to the application software. It thus provides a consistent software API usable with all manner of GPU/CPU hardware.
- the OpenGL ES2.0 standard is extended to provide a standardized graphics processing library not just across different CPU/GPU hardware, but also across different cloud processing hardware - again with such details being transparent to the calling software.
- Java Specification Requests (JSRs)
- JSRs have been defined to standardize certain Java-implemented tasks. JSRs increasingly are designed for efficient implementations on top of OpenGL ES2.0 class hardware.
- some or all of the image processing operations noted in this specification can be implemented as JSRs - providing standardized implementations that are suitable across diverse hardware platforms.
- the extended standards specification can also support the Query Router and Response Manager functionality detailed earlier - including both static and auction-based service providers.
- OpenCV is a computer vision library available under an open source license, permitting coders to invoke a variety of functions - without regard to the particular hardware that is being utilized to perform them.
- OpenCV has been ported to the Symbian operating system (e.g., as used on Nokia cell phones).
- OpenCV provides support for a large variety of operations, including high level tasks such as facial recognition, gesture recognition, motion tracking/understanding, segmentation, etc., as well as an extensive assortment of more atomic, elemental vision/image processing operations.
- CMVision is another package of computer vision tools that can be employed in certain embodiments of the present technology - this package compiled by researchers at Carnegie Mellon University.
- Still another hardware architecture makes use of a field programmable object array (FPOA) arrangement, in which hundreds of diverse 16-bit "objects" are arrayed in a gridded node fashion, with each being able to exchange data with neighboring devices through very high bandwidth channels.
- the functionality of each can be reprogrammed, as with FPGAs.
- different of the image processing tasks can be performed by different of the FPOA objects. These tasks can be redefined on the fly, as needed (e.g., an object may perform SIFT processing in one state; FFT processing in another state; log-polar processing in a further state, etc.).
- certain embodiments of the present technology employ "extended depth of field" imaging systems (see, e.g., patents 7,218,448, 7,031,054 and 5,748,371).
- Such arrangements include a mask in the imaging path that modifies the optical transfer function of the system so as to be insensitive to the distance between the object and the imaging system. The image quality is then uniformly poor over the depth of field. Digital post processing of the image compensates for the mask modifications, restoring image quality, but retaining the increased depth of field.
- the cell phone camera can capture imagery having both nearer and further subjects all in focus (i.e., with greater high frequency detail), without requiring longer exposures - as would normally be required.
- new metadata regarding identified objects or groupings of pixels related to depth within the frame can produce simple "depth map" information, setting the stage for 3D video capture and storage of video streams using emerging standards on transmission of depth information.
- the cell phone may have the capability to perform a given operation locally, but may decide instead to have it performed by a cloud resource. The decision of whether to process locally or remotely can be based on "costs,” including bandwidth costs, external service provider costs, power costs to the cell phone battery, intangible costs in consumer (dis-)satisfaction by delaying processing, etc.
- the phone may decide to process the data locally, or to forward it for remote processing when the phone is closer to the cell site or the battery has been recharged.
- a set of stored rules can be applied to the relevant variables to establish a net "cost function" for different approaches (e.g., process locally, process remotely, defer processing), and these rules may indicate different outcomes depending on the states of these variables.
- Cellular networks include tower stations that are, in large part, software-defined radios - employing processors to perform - digitally - some or all of the operations traditionally performed by analog transmitting and receiving radio circuits, such as mixers, filters, demodulators, etc. Even smaller cell stations, so-called “femtocells,” typically have powerful signal processing hardware for such purposes.
- the PicoChip processors noted earlier, and other field programmable object arrays, are widely deployed in such applications.
- Radio signal processing and image signal processing have many commonalities, e.g., employing FFT processing to convert sampled data to the frequency domain, applying various filtering operations, etc.
- Cell station equipment, including its processors, is designed to meet peak consumer demands. This means that significant processing capability is often left unused.
- this spare radio signal processing capability at cellular tower stations is repurposed in connection with image (and/or audio or other) signal processing for consumer wireless devices. Since an FFT operation is the same - whether processing sampled radio signals or image pixels - the repurposing is often straightforward: configuration data for the hardware processing cores needn't be changed much, if at all. And because 3G/4G networks are so fast, a processing task can be delegated quickly from a consumer device to a cell station processor, and the results returned with similar speed. In addition to the speed and computational muscle that such repurposing of cell station processors affords, another benefit is reducing the power consumption of the consumer devices.
- Before sending image data for processing, a cell phone can quickly inquire of the cell tower station with which it is communicating, to confirm that the station has unused capacity sufficient to undertake the intended image processing operation.
- This query can be sent by the packager/router of Fig. 10; the local/remote router of Fig. 10A, the query router and response manager of Fig. 7; the pipe manager 51 of Fig. 16, etc. Alerting the cell tower/base station of forthcoming processing requests, and/or bandwidth requirements, allows the cell site to better allocate its processing and bandwidth resources in anticipation of meeting such needs.
- Cell sites are at risk of becoming bottlenecked: undertaking service operations that exhaust their processing or bandwidth capacity. When this occurs, they must triage by unexpectedly throttling back the processing/bandwidth provided to one or more users, so others can be served. This sudden change in service is undesirable, since changing the parameters with which the channel was originally established (e.g., the bit rate at which video can be delivered), forces data services using the channel to reconfigure their respective parameters (e.g., requiring ESPN to provide a lower quality video feed). Renegotiating such details once the channel and services have been originally setup invariably causes glitches, e.g., video delivery stuttering, dropped syllables in phone calls, etc.
- cell sites tend to adopt a conservative strategy - allocating bandwidth/processing resources parsimoniously, in order to reserve capacity for possible peak demands. But this approach impairs the quality of service that might otherwise be normally provided - sacrificing typical service in anticipation of the unexpected.
- a cell phone sends alerts to the cell tower station, specifying bandwidth or processing needs that it anticipates will be forthcoming.
- the cell phone asks to reserve a bit of future service capacity.
- the tower station still has a fixed capacity.
- knowing that a particular user will be needing e.g., a bandwidth of 8 Mbit/s for 3 seconds, commencing in 200 milliseconds, allows the cell site to take such anticipated demand into account as it serves other users.
- the cell site may determine that it has excess capacity at present, but expects to be more heavily burdened in a half second. In this case it may use the present excess capacity to speed throughput to one or more video subscribers, e.g., those for whom it has collected several packets of video data in a buffer memory, ready for delivery. These video packets may be sent through the enlarged channel now, in anticipation that the video channel will be slowed in a half second. Again, this is practical because the cell site has useful information about future bandwidth demands.
- the service reservation message sent from the cell phone may also include a priority indicator. This indicator can be used by the cell site to determine the relative importance of meeting the request on the stated terms, in case arbitration between conflicting service demands is required.
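A minimal sketch of such reservation handling at the cell site, including the priority indicator, might look like the following. The units, data layout, and admission rule are illustrative:

```python
# Sketch: a cell site with fixed capacity accepts anticipatory
# reservations ("8 Mbit/s for 3 s, starting in 200 ms") and admits each
# new request only if capacity already committed over the overlapping
# interval leaves room for it.

class CellSite:
    def __init__(self, capacity_mbps):
        self.capacity = capacity_mbps
        self.reservations = []   # (start_ms, end_ms, mbps, priority)

    def committed(self, start_ms, end_ms):
        """Total bandwidth already reserved over the given interval."""
        return sum(r[2] for r in self.reservations
                   if r[0] < end_ms and r[1] > start_ms)

    def request(self, start_ms, duration_ms, mbps, priority=0):
        """Try to reserve future capacity; the priority value is stored
        so conflicting demands could later be arbitrated against it."""
        end_ms = start_ms + duration_ms
        if self.committed(start_ms, end_ms) + mbps <= self.capacity:
            self.reservations.append((start_ms, end_ms, mbps, priority))
            return True
        return False
```

Knowing the committed-demand timeline in advance, the site could also use present excess capacity opportunistically (e.g., pre-delivering buffered video packets), as described above.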
- Such anticipatory service requests from cell phones can also allow the cell site to provide higher quality sustained service than would normally be allocated.
- Cell sites are understood to employ statistical models of usage patterns, and allocate bandwidth accordingly.
- the allocations are typically set conservatively, in anticipation of realistic worst case usage scenarios, e.g., encompassing scenarios that occur 99.99% of the time. (Some theoretically possible scenarios are sufficiently improbable that they may be disregarded in bandwidth allocations. However, on the rare occasions when such improbable scenarios occur - as when thousands of subscribers sent cell phone picture messages from Washington DC during the Obama inauguration, some subscribers may simply not receive service.)
- Anticipatory service requests can also be communicated from the cell phone (or the cell site) to other cloud processes that are expected to be involved in the requested services, allowing them to similarly allocate their resources anticipatorily. Such anticipatory service requests may also serve to alert the cloud process to pre-warm associated processing. Additional information may be provided from the cell phone, or elsewhere, for this purpose, such as encryption keys, image dimensions (e.g., to configure a cloud FPOA to serve as an FFT processor for a 1024 x 768 image, to be processed in 16x16 tiles, and output coefficients for 32 spectral frequency bands), etc.
- the cloud resource may alert the cell phone of any information it expects might be requested from the phone in performance of the expected operation, or action it might request the cell phone to perform, so that the cell phone can similarly anticipate its own forthcoming actions and prepare accordingly.
- the cloud process may, under certain conditions, request a further set of input data, such as if it assesses that data originally provided is not sufficient for the intended purpose (e.g., the input data may be an image without sufficient focus resolution, or not enough contrast, or needing further filtering).
- Anticipatory service requests generally relate to events that may commence in a few tens or hundreds of milliseconds - occasionally in a few seconds. Situations in which the action will commence tens or hundreds of seconds in the future will be rare. However, while the period of advance warning may be short, significant advantages can be derived: if the randomness of the next second is reduced, each second, then overall system randomness can be reduced considerably. Moreover, the events to which the requests relate can, themselves, be of longer duration - such as transmission of a large image file, which may take ten seconds or more.
- Desirably, any operation that takes more than a threshold interval of time to complete (e.g., a few hundred microseconds, a millisecond, ten milliseconds, etc. - depending on implementation) should be prepped anticipatorily, if possible. (In some instances, of course, the anticipated service is never requested, in which case such preparation may be for naught.)
- the cell phone processor may selectively activate a Peltier device or other thermoelectric cooler coupled to the image sensor, in circumstances when thermal image noise (Johnson noise) is a potential problem. For example, if a cell phone detects a low light condition, it may activate a cooler on the sensor to try and enhance the image signal to noise ratio. Or the image processing stages can examine captured imagery for artifacts associated with thermal noise, and if such artifacts exceed a threshold, then the cooling device can be activated. (One approach captures a patch of imagery, such as a 16x16 pixel region, twice in quick succession. Absent random factors, the two patches should be identical - perfectly correlated.
- The deviation of the correlation from 1.0 is a measure of noise - presumably thermal noise.)
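The patch-correlation noise check just described might be sketched as follows. The activation threshold is an illustrative assumption:

```python
# Sketch: capture the same patch twice in quick succession; departure
# of the correlation coefficient from 1.0 indicates noise (presumably
# thermal), and can trigger activation of the sensor cooler.

def correlation(patch_a, patch_b):
    """Pearson correlation coefficient of two equal-sized 2D patches."""
    a = [p for row in patch_a for p in row]
    b = [p for row in patch_b for p in row]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5 if var_a and var_b else 1.0

def should_activate_cooler(patch_a, patch_b, max_departure=0.05):
    """True when the patches disagree more than the tolerated amount."""
    return (1.0 - correlation(patch_a, patch_b)) > max_departure
```

A 16x16 pixel region, as mentioned above, keeps this check cheap enough to run routinely before longer captures.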
- A short interval after the cooling device is activated, a substitute image can be captured - the interval depending on the thermal response time of the cooler/sensor.
- a cooler may be activated, since the increased switching activity by circuitry on the sensor increases its temperature, and thus its thermal noise. (Whether to activate a cooler can also be application dependent, e.g., the cooler may be activated when capturing imagery from which watermark data may be read, but not activated when capturing imagery from which barcode data may be read.)
- packets in the Fig. 16 arrangement can convey a variety of instructions and data - in both the header and the packet body.
- a packet can additionally, or alternatively, contain a pointer to a cloud object, or to a record in a database.
- the cloud object/database record may contain information such as object properties, useful for object recognition (e.g., fingerprint or watermark properties for a particular object).
- the packet may contain the watermark payload, and the header (or body) may contain one or more database references where that payload can be associated with related information.
- a watermark payload read from a business card may be looked-up in one database; a watermark decoded from a photograph may be looked-up in another database, etc.
- a system may apply multiple different watermark decoding algorithms to a single image (e.g., MediaSec, Digimarc ImageBridge, Civolution, etc.). Depending on which application performed a particular decoding operation, the resulting watermark payload may be sent off to a corresponding destination database.
- the destination database address can be included in the application, or in configuration data. (Commonly, the addressing is performed indirectly, with an intermediate data store containing the address of the ultimate database, permitting relocation of the database without changing each cell phone application.)
- the system may perform an FFT on captured image data to obtain frequency domain information, and then feed that information to several watermark decoders operating in parallel - each applying a different decoding algorithm.
- the data is sent to a database corresponding to that format/technology of watermark.
- Plural such database pointers can be included in a packet, and used conditionally - depending on which watermark decoding operation (or barcode reading operation, or fingerprint calculation, etc.) yields useful data.
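Such conditional use of plural database pointers might be sketched as follows. The decoder names and database addresses are hypothetical:

```python
# Sketch: a packet carries several database references; whichever
# decoding operation (watermark, barcode, fingerprint, etc.) yields
# useful data determines where the resulting payload is routed.

PACKET_DB_POINTERS = {
    "watermark_a": "db://vendor-a/payloads",
    "watermark_b": "db://vendor-b/payloads",
    "barcode":     "db://barcode/lookup",
}

def route_payload(decoder_results, pointers=PACKET_DB_POINTERS):
    """decoder_results: mapping of decoder name -> payload, or None if
    that decoder found nothing. Returns destination -> payload pairs
    for only the decoders that succeeded."""
    return {pointers[name]: payload
            for name, payload in decoder_results.items()
            if payload is not None and name in pointers}
```

As noted above, the addressing would commonly be indirect: the stored pointer would reference an intermediate data store holding the ultimate database address, so the database can be relocated without updating each phone.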
- the system may send a facial image to an intermediary cloud service, in a packet containing an identifier of the user (but not containing the user's Apple iPhoto, or Picasa, or Facebook user name).
- the intermediary cloud service can take the provided user identifier, and use it to access a database record from which the user's names on these other services are obtained.
- the intermediary cloud service can then route the facial image data to Apple's server - with the user's iPhoto user name; to Picasa's server with the user's Google user name; and to Facebook's server with the user's Facebook user name.
- Those respective services can then perform facial recognition on the imagery, and return the names of persons identified from the user's iPhoto/Picasa/Facebook accounts (directly to the user, or through the intermediary service).
- the intermediate cloud service - which may serve large numbers of users - can keep informed of the current addresses for relevant servers (and alternate proximate servers, in case the user is away from home), rather than requiring each cell phone to keep such data current.
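A minimal sketch of the intermediary routing just described follows: the phone sends only an opaque user identifier, and the intermediary maps it to per-service account names before forwarding. The account records, identifier format, and `send` callback are all hypothetical.

```python
# The intermediary's lookup table: opaque user id -> per-service names.
ACCOUNT_DB = {
    "user-7731": {"iphoto": "alice_w", "picasa": "alice.w", "facebook": "alice.white"},
}

def route_facial_image(packet, send):
    """Forward the image to each service, paired with the user's name there.

    The phone never transmits the service-specific user names itself."""
    names = ACCOUNT_DB[packet["user_id"]]
    results = []
    for service, account in names.items():
        results.append(send(service, account, packet["image"]))
    return results

sent = route_facial_image(
    {"user_id": "user-7731", "image": b"face-bytes"},
    send=lambda svc, acct, img: (svc, acct),   # stub transport for illustration
)
```

Because only the intermediary holds the account mapping, server addresses and account names can change without touching software on each phone.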
- Facial recognition applications can be used not just to identify persons, but also to identify relationships between individuals depicted in imagery.
- data maintained by iPhoto/Picasa/Facebook may contain not just facial recognition features, and associated names, but also terms indicating relationships between the named faces and the account owner (e.g., father, boyfriend, sibling, pet, roommate, etc.).
- when searching a user's image collection for, e.g., all pictures of "David Smith," the user's collection may also be searched for all pictures depicting "sibling."
- the application software in which photos are reviewed can present differently colored frames around different recognized faces - in accordance with associated relationship data (e.g., blue for siblings, red for boyfriends, etc.).
- the user's system can access such information stored in accounts maintained by the user's network "friends."
- a face that may not be recognized by facial recognition data associated with the user's account at Picasa may be recognized by consulting Picasa facial recognition data associated with the account of the user's friend "David Smith."
- Relationship data indicated by David Smith's account can be similarly used to present, and organize, the user's photos.
- the earlier unrecognized face may thus be labeled with indicia indicating the person is David Smith's roommate. This essentially remaps the relationship information (e.g., mapping "roommate" - as indicated in David Smith's account - to "David Smith's roommate" in the user's account).
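The remapping step is simple enough to express directly. A minimal sketch, with a hypothetical helper name:

```python
# Re-express a relationship label from a friend's account relative to
# the current user, per the "roommate" example above.

def remap_relationship(label, friend_name):
    """Map e.g. "roommate" (in David Smith's account) to
    "David Smith's roommate" (in the user's account)."""
    return f"{friend_name}'s {label}"

tag = remap_relationship("roommate", "David Smith")
```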
- the hardware in cell phones was originally introduced for specific purposes.
- the microphone, for example, was used only for voice transmission over the cellular network: feeding an A/D converter that fed a modulator in the phone's radio transceiver.
- the camera was used only to capture snapshots. Etc.
- Such an arrangement is shown in Fig. 20A, with the intermediate software layer being labeled "Reference Platform."
- In Fig. 20A, hardware elements are shown in dashed boxes, including processing hardware on the bottom, and peripherals on the left.
- the box "IC HW" is "intuitive computing hardware," and comprises the earlier-discussed hardware that supports the different processing of image related data, such as modules 38 in Fig. 16, the configurable hardware of Fig. 6, etc.
- DSP is a general purpose digital signal processor, which can be configured to perform specialized operations
- CPU is the phone's primary processor
- GPU is a graphics processor unit.
- OpenCL and OpenGL are APIs through which graphics processing services (performed on the CPU and/or GPU) can be invoked.
- the reference platform establishes a standard interface through which different applications can interact with hardware, exchange information, and request services (e.g., by API calls). Similarly, the platform establishes a standard interface through which the different technologies can be accessed, and through which they can send and receive data to other of the system components. Likewise with the cloud services, for which the reference platform may also attend to details of identifying a service provider - whether by reverse auction, heuristics, etc. In cases where a service is available both from a technology in the cell phone, and from a remote service provider, the reference platform may also attend to weighing the costs and benefits of the different options, and deciding which should handle a particular service request.
- An application may call for the system to read text from an object in front of the cell phone. It needn't concern itself with the particular control parameters of the image sensor, nor the image format requirements of the OCR engine.
- An application may call for a read of the emotion of a person in front of the cell phone. A corresponding call is passed to whatever technology in the phone supports such functionality, and the results are returned in a standardized form.
- if an improved technology becomes available, it can be added to the phone, and through the reference platform the system takes advantage of its enhanced capabilities.
- growing/changing collections of sensors, and growing/evolving sets of service providers can be set to the tasks of deriving meaning from input stimuli (audio as well as visual, e.g., speech recognition) through use of such an adaptable architecture.
- Arasan Chip Systems, Inc. offers a Mobile Industry Processor Interface UniPro Software Stack, a layered, kernel-level stack that aims to simplify integration of certain technologies into cell phones. That arrangement may be extended to provide the functionality detailed above. (The Arasan protocol is focused primarily on transport layer issues, but involves layers down to hardware drivers as well. The Mobile Industry Processor Interface Alliance is a large industry group working to advance cell phone technologies.)

Leveraging Existing Image Collections, E.g., for Metadata
- An illustrative embodiment according to one aspect of the present technology works as follows.
- known image processing operations may be applied, e.g., to correct color or contrast, to perform ortho-normalization, etc. on the captured image.
- Known image object segmentation or classification techniques may also be used to identify an apparent subject region of the image, and isolate same for further processing.
- the image data is then processed to determine characterizing features that are useful in pattern matching and recognition.
- Color, shape, and texture metrics are commonly used for this purpose. Images may also be grouped based on layout and eigenvectors (the latter being particularly popular for facial recognition). Many other technologies can of course be employed, as noted elsewhere in this specification.
- a search is conducted through one or more publicly-accessible image repositories for images with similar metrics, thereby identifying apparently similar images.
- Flickr and other such repositories may calculate eigenvectors, color histograms, keypoint descriptors, FFTs, or other classification data on images at the time they are uploaded by users, and collect same in an index for public search.
- the search may yield the collection of apparently similar telephone images found in Flickr, depicted in Fig. 22.
- Metadata is then harvested from Flickr for each of these images, and the descriptive terms are parsed and ranked by frequency of occurrence.
- the descriptors harvested from such an operation, and their incidence of occurrence, may be as follows: Cisco (18)
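The harvest-and-rank step maps naturally onto a frequency counter. A sketch follows; the tag lists stand in for metadata that would actually be fetched from Flickr for the matched images.

```python
# Rank harvested metadata descriptors by frequency of occurrence.
from collections import Counter

harvested = [                      # hypothetical per-image tag lists
    ["Cisco", "Telephone", "VOIP"],
    ["Cisco", "Phone"],
    ["Telephone", "Cisco"],
]

counts = Counter(tag for tags in harvested for tag in tags)
ranked = counts.most_common()      # descriptors ordered by incidence
```

`Counter.most_common()` yields the (term, count) pairs in descending order of count, which is exactly the ranked listing described above.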
- the inferred metadata can be augmented or enhanced, if desired, by known image recognition/classification techniques.
- image recognition/classification techniques seek to provide automatic recognition of objects depicted in images. For example, by recognizing a TouchTone keypad layout and a coiled cord, such a classifier may label the Fig. 21 image using the terms Telephone and Facsimile Machine.
- the terms returned by the image classifier can be added to the list and given a count value. (An arbitrary value, e.g., 2, may be used, or a value dependent on the classifier's reported confidence in the discerned identification can be employed.)
- the position of the term(s) in the list may be elevated.
- One way to elevate a term's position is by increasing its count value by a percentage (e.g., 30%).
- Another way is to increase its count value to one greater than the next-above term that is not discerned by the image classifier. (Since the classifier returned the term "Telephone" but not the term "Cisco," this latter approach could rank the term Telephone with a count value of "19" - one above Cisco.)
- a variety of other techniques for augmenting/enhancing the inferred metadata with that resulting from the image classifier are straightforward to implement.
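The two elevation strategies just described can be sketched as follows, using the hypothetical count values from the Fig. 21 example:

```python
# Two ways to elevate a classifier-confirmed term in the ranked list.

def elevate_by_percent(counts, term, pct=30):
    """Boost the term's count by a fixed percentage (e.g., 30%)."""
    counts[term] = round(counts[term] * (1 + pct / 100.0))

def elevate_above_neighbor(counts, term, classifier_terms):
    """Raise the term's count to one greater than the next-above term
    that was NOT discerned by the image classifier."""
    higher = [c for t, c in counts.items()
              if c > counts[term] and t not in classifier_terms]
    if higher:
        counts[term] = min(higher) + 1

counts = {"Cisco": 18, "Telephone": 12}
elevate_above_neighbor(counts, "Telephone", classifier_terms={"Telephone"})
# Telephone is now ranked one above Cisco, i.e. 19
```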
- a revised listing of metadata, resulting from the foregoing, may be as follows:
- the list of inferred metadata can be restricted to those terms that have the highest apparent reliability, e.g., count values.
- a variety of operations can be undertaken.
- One option is to submit the metadata, along with the captured content or data derived from the captured content (e.g., the Fig. 21 image, image feature data such as eigenvectors, color histograms, keypoint descriptors, FFTs, machine readable data decoded from the image, etc), to a service provider that acts on the submitted data, and provides a response to the user.
- the service provider - or the user's device - can submit the metadata descriptors to one or more other services, e.g., a web search engine such as Google, to obtain a richer set of auxiliary information that may help better discern/infer/intuit an appropriate response desired by the user. Or the information obtained from Google (or other such database resource) can be used to augment/refine the response delivered by the service provider to the user. (In some cases, the metadata - possibly accompanied by the auxiliary information received from Google - can allow the service provider to produce an appropriate response to the user, without even requiring the image data.)
- one or more images obtained from Flickr may be substituted for the user's image. This may be done, for example, if a Flickr image appears to be of higher quality (using sharpness, illumination histogram, or other measures), and if the image metrics are sufficiently similar. (Similarity can be judged by a distance measure appropriate to the metrics being used. One embodiment checks whether the distance measure is below a threshold. If several alternate images pass this screen, then the closest image is used.) Or substitution may be used in other circumstances. The substituted image can then be used instead of (or in addition to) the captured image in the arrangements detailed herein.
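The substitution screen just described - a distance threshold, then choosing the closest survivor - can be sketched as below. The metric distances and image identifiers are illustrative values, not actual Flickr data.

```python
# Select a substitute image: candidates must fall under a similarity-
# distance threshold; among those passing, the closest is used.

def choose_substitute(candidates, threshold):
    """candidates: list of (image_id, distance) tuples; returns an id or None."""
    passing = [c for c in candidates if c[1] < threshold]
    if not passing:
        return None                          # no candidate is similar enough
    return min(passing, key=lambda c: c[1])[0]

pick = choose_substitute(
    [("flickr-101", 0.42), ("flickr-205", 0.18), ("flickr-330", 0.71)],
    threshold=0.5,
)
```

In practice the distance measure would be chosen to suit the metrics in use (e.g., Euclidean distance between color histograms or keypoint descriptors).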
- the substitute image data is submitted to the service provider.
- data for several substitute images are submitted.
- the original image data - together with one or more alternative sets of image data - are submitted.
- the service provider can use the redundancy to help reduce the chance of error - assuring an appropriate response is provided to the user. (Or the service provider can treat each submitted set of image data individually, and provide plural responses to the user. The client software on the cell phone can then assess the different responses, and pick between them (e.g., by a voting arrangement), or combine the responses, to help provide the user an enhanced response.)
- one or more related public image(s) may be composited or merged with the user's cell phone image. The resulting hybrid image can then be used in the different contexts detailed in this disclosure.
- a still further option is to use apparently-similar images gleaned from Flickr to inform enhancement of the user's image. Examples include color correction/matching, contrast correction, glare reduction, removing foreground/background objects, etc.
- By such arrangement, for example, the system may discern that the Fig. 21 image has foreground components (apparently Post-It notes) on the telephone that should be masked or disregarded.
- the user's image data can be enhanced accordingly, and the enhanced image data used thereafter.
- the user's image may suffer some impediment, e.g., such as depicting its subject from an odd perspective, or with poor lighting, etc.
- This impediment may cause the user's image not to be recognized by the service provider (i.e., the image data submitted by the user does not seem to match any image data in the database being searched).
- data from similar images identified from Flickr may be submitted to the service provider as alternatives - hoping they might work better.
- Another approach - one that opens up many further possibilities - is to search Flickr for one or more images with similar image metrics, and collect metadata as described herein (e.g., Telephone, Cisco, Phone, VOIP). Flickr is then searched a second time, based on metadata. Plural images with similar metadata can thereby be identified. Data for these further images (including images with a variety of different perspectives, different lighting, etc.) can then be submitted to the service provider - notwithstanding that they may "look" different than the user's cell phone image. When doing metadata-based searches, identity of metadata may not be required. For example, in the second search of Flickr just-referenced, four terms of metadata may have been associated with the user's image: Telephone, Cisco, Phone and VOIP. A match may be regarded as an instance in which a subset (e.g., three) of these terms is found.
- Another approach is to rank matches based on the rankings of shared metadata terms.
- An image tagged with Telephone and Cisco would thus be ranked as a better match than an image tagged with Phone and VOIP.
- a threshold (e.g., 60% of metadata terms in common) may alternatively be applied in judging whether two images match.
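The subset-matching and rank-weighted scoring just described can be sketched as follows. The tag weights are the hypothetical count values from the Fig. 21 example; the helper names are illustrative.

```python
# Metadata matching: a candidate matches if it shares at least k of the
# query's tags; matches can then be ranked by the count-weights of the
# shared terms, so higher-ranked tags (Telephone, Cisco) count for more.

def metadata_match(query_tags, candidate_tags, k=3):
    return len(set(query_tags) & set(candidate_tags)) >= k

def match_score(candidate_tags, weights):
    return sum(weights.get(t, 0) for t in candidate_tags)

query = ["Telephone", "Cisco", "Phone", "VOIP"]
weights = {"Telephone": 19, "Cisco": 18, "Phone": 10, "VOIP": 7}

ok = metadata_match(query, ["Telephone", "Cisco", "Phone", "Desk"])  # 3 shared
score_a = match_score(["Telephone", "Cisco"], weights)               # 19 + 18
score_b = match_score(["Phone", "VOIP"], weights)                    # 10 + 7
```

An image tagged with Telephone and Cisco thus outscores one tagged with Phone and VOIP, as stated above.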
- Geolocation data (e.g., GPS tags) can also be used to get a metadata toe-hold.
- GPS info captured with the image identifies the location of the image subject.
- Public databases (including Flickr) can map GPS coordinates to place names: inputting GPS descriptors for the photograph yields the textual descriptors Paris and Eiffel.
- Google Images can be queried with the terms Eiffel and Paris to retrieve other, more perhaps conventional images of the Eiffel tower. One or more of those images can be submitted to the service provider to drive its process. (Alternatively, the GPS information from the user's image can be used to search Flickr for images from the same location; yielding imagery of the Eiffel Tower that can be submitted to the service provider.)
- GPS info can be automatically propagated across a collection of imagery that share visible features (by image metrics such as eigenvectors, color histograms, keypoint descriptors, FFTs, or other classification techniques), or that have a metadata match.
- the user's image (e.g., a photo of a fountain) can be submitted to a process that identifies matching Flickr/Google images of that fountain on a feature-recognition basis. To each of those images the process can add GPS information from the user's image.
- a second level of searching can also be employed. From the set of fountain images identified from the first search based on similarity of appearance, metadata can be harvested and ranked, as above. Flickr can then be searched a second time, for images having metadata that matches within a specified threshold (e.g., as reviewed above). To those images, too, GPS information from the user's image can be added. Alternatively, or in addition, a first set of images in Flickr/Google similar to the user's image of the fountain can be identified - not by pattern matching, but by GPS -matching (or both). Metadata can be harvested and ranked from these GPS-matched images. Flickr can be searched a second time for a second set of images with similar metadata. To this second set of images, GPS information from the user's image can be added.
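The propagation step at the heart of both search levels can be sketched simply. This is an assumption-laden illustration: image records are modeled as dicts, and existing coordinates are left intact rather than overwritten (a design choice the text does not dictate).

```python
# Propagate the user's GPS coordinates onto images matched by appearance
# or by metadata, skipping images that already carry their own.

def propagate_gps(source_gps, matched_images):
    for img in matched_images:
        img.setdefault("gps", source_gps)   # only fills missing coordinates
    return matched_images

tagged = propagate_gps(
    (48.8584, 2.2945),                      # user's capture location
    [{"id": "img1"}, {"id": "img2", "gps": (40.0, -105.0)}],
)
```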
- Another approach to geolocating imagery is by searching Flickr for images having similar image characteristics (e.g., gist, eigenvectors, color histograms, keypoint descriptors, FFTs, etc.), and assessing geolocation data in the identified images to infer the probable location of the original image.
- See Hays et al., IM2GPS: Estimating Geographic Information from a Single Image, Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 2008. Techniques detailed in the Hays paper are suited for use in conjunction with certain embodiments of the present technology (including use of probability functions for quantifying the uncertainty of inferential techniques).
- When geolocation data is captured by the camera, it is highly reliable. Also generally reliable is metadata (location or otherwise) that is authored by the proprietor of the image. However, when metadata descriptors (geolocation or semantic) are inferred or estimated, or authored by a stranger to the image, uncertainty and other issues arise. Desirably, such intrinsic uncertainty should be memorialized in some fashion so that later users thereof (human or machine) can take this uncertainty into account.
- One approach is to segregate uncertain metadata from device-authored or creator-authored metadata. For example, different data structures can be used. Or different tags can be used to distinguish such classes of information. Or each metadata descriptor can have its own sub-metadata, indicating the author, creation date, and source of the data. The author or source field of the sub-metadata may have a data string indicating that the descriptor was inferred, estimated, deduced, etc., or such information may be a separate sub-metadata tag.
- Each uncertain descriptor may be given a confidence metric or rank. This data may be determined by the public, either expressly or inferentially. An example is the case when a user sees a Flickr picture she believes to be from Yellowstone, and adds a "Yellowstone" location tag, together with a "95%" confidence tag (her estimation of certainty about the contributed location metadata). She may add an alternate location metatag, indicating "Montana," together with a corresponding 50% confidence tag. (The confidence tags needn't sum to 100%. Just one tag can be contributed - with a confidence less than 100%. Or several tags can be contributed - possibly overlapping, as in the case with Yellowstone and Montana.)
- the combined contributions can be assessed to generate aggregate information.
- Such information may indicate, for example, that 5 of 6 users who contributed metadata tagged the image as Yellowstone, with an average 93% confidence; that 1 of 6 users tagged the image as Montana, with a 50% confidence, and 2 of 6 users tagged the image as Glacier National park, with a 15% confidence, etc.
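An aggregation of the sort just described can be sketched as below. For simplicity the sketch assumes one tag per contributor; in practice a contributor may supply several overlapping tags (as with Yellowstone and Montana above), and the vote counting would be per-user rather than per-tag.

```python
# Aggregate crowd-contributed location tags with per-tag confidences into
# (count, average-confidence) summaries.

def aggregate(tags):
    """tags: list of (label, confidence) pairs from individual contributors."""
    summary = {}
    for label, conf in tags:
        entry = summary.setdefault(label, {"count": 0, "total": 0.0})
        entry["count"] += 1
        entry["total"] += conf
    return {label: (e["count"], round(e["total"] / e["count"], 2))
            for label, e in summary.items()}, len(tags)

votes = [("Yellowstone", 0.95), ("Yellowstone", 0.90), ("Yellowstone", 0.95),
         ("Montana", 0.50)]
summary, n = aggregate(votes)
```

The result captures, e.g., that 3 of 4 contributors tagged the image Yellowstone with an average 93% confidence, and 1 of 4 tagged it Montana with 50% confidence.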
- Inferential determination of metadata reliability can be performed, either when express estimates made by contributors are not available, or routinely.
- Crowd-sourcing techniques are known to parcel image-identification tasks to online workers, and collect the results.
- prior art arrangements are understood to seek simple, short-term consensus on identification. Better, it seems, is to quantify the diversity of opinion collected about image contents (and optionally its variation over time, and information about the sources relied-on), and use that richer data to enable automated systems to make more nuanced decisions about imagery, its value, its relevance, its use, etc.
- known crowd-sourcing image identification techniques may identify the Fig. 35 image with the identifiers "soccer ball" and "dog." These are the consensus terms from one or several viewers.
- the sub-metadata may indicate, for example, that the tag "football" was contributed by a 21 year old male in Brazil on June 18, 2008. It may further indicate that the tags "afternoon," "evening" and "morning" were contributed by an automated image classifier at the University of Texas that made these judgments on July 2, 2008 based, e.g., on the angle of illumination on the subjects. Those three descriptors may also have associated probabilities assigned by the classifier, e.g., 50% for afternoon, 30% for evening, and 20% for morning (each of these percentages may be stored as a sub-metatag). One or more of the metadata terms contributed by the classifier may have a further sub-tag pointing to an on-line glossary that aids in understanding the assigned terms.
- such a sub-tag may give the URL of a computer resource that associates the term "afternoon" with a definition, or synonyms, indicating that the term means noon to 7pm.
- the glossary may further indicate a probability density function, indicating that the mean time meant by "afternoon” is 3:30 pm, the median time is 4:15 pm, and the term has a Gaussian function of meaning spanning the noon to 7 pm time interval.
- Expertise of the metadata contributors may also be reflected in sub-metadata.
- the term "fescue" may have sub-metadata indicating it was contributed by a 45 year old grass seed farmer in Oregon.
- An automated system can conclude that this metadata term was contributed by a person having unusual expertise in a relevant knowledge domain, and may therefore treat the descriptor as highly reliable (albeit maybe not highly relevant). This reliability determination can be added to the metadata collection, so that other reviewers of the metadata can benefit from the automated system's assessment.
- Assessment of the contributor's expertise can also be self-made by the contributor. Or it can be made otherwise, e.g., by reputational rankings using collected third party assessments of the contributor's metadata contributions. (Such reputational rankings are known, e.g., from public assessments of sellers on EBay, and of book reviewers on Amazon.) Assessments may be field-specific, so a person may be judged (or self-judged) to be knowledgeable about grass types, but not about dog breeds. Again, all such information is desirably memorialized in sub-metatags (including sub-sub-metatags, when the information is about a sub-metatag).
- an image may accumulate - over time - a lengthy catalog of contributed geographic descriptors.
- An automated system (e.g., a server at Flickr) can periodically review and distill such contributions. The process can apply known clustering algorithms to identify clusters of similar coordinates, and average same to generate a mean location for each cluster. For example, a photo of a geyser may be tagged by some people with latitude/longitude coordinates in Yellowstone, and by others with latitude/longitude coordinates of Hells Gate Park in New Zealand. These coordinates thus form two distinct clusters that would be separately averaged.
- the distilled (averaged) value may be given a confidence of 70%.
- Outlier data can be maintained, but given a low probability commensurate with its outlier status.
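A minimal sketch of the cluster-and-average distillation follows. The naive nearest-seed clustering and the 1-degree threshold are assumptions for illustration; a production system would use an established clustering algorithm (e.g., k-means or DBSCAN).

```python
# Group contributed (lat, lon) tags into clusters of nearby coordinates,
# then average each cluster to a mean location.

def cluster_coords(coords, max_dist=1.0):
    clusters = []
    for lat, lon in coords:
        for c in clusters:
            seed_lat, seed_lon = c[0]
            if abs(lat - seed_lat) < max_dist and abs(lon - seed_lon) < max_dist:
                c.append((lat, lon))
                break
        else:
            clusters.append([(lat, lon)])          # start a new cluster
    # Average each cluster component-wise to get its mean location.
    return [tuple(sum(v) / len(c) for v in zip(*c)) for c in clusters]

# Yellowstone-area tags vs. a New Zealand outlier form two clusters.
means = cluster_coords([(44.46, -110.83), (44.48, -110.81), (-38.06, 175.10)])
```

The outlier cluster is retained, but (per the text) would carry a low probability commensurate with its minority status.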
- Such distillation of the data by a proprietor can be stored in metadata fields that are readable by the public, but not writable. The same or other approach can be used with added textual metadata - e.g., it can be accumulated and ranked based on frequency of occurrence, to give a sense of relative confidence.
- Consider again the Fig. 21 cell phone photo of a desk phone. Flickr can be searched based on image metrics to obtain a collection of subject-similar images (e.g., as detailed above).
- a data extraction process (e.g., watermark decoding, fingerprint calculation, barcode- or OCR-reading) can be applied to those images; information gleaned thereby can be added to the metadata for the Fig. 21 image, and/or submitted to a service provider with image data (either for the Fig. 21 image, and/or for related images).
- a second search can be conducted for similarly-tagged images.
- a search of Flickr may find a photo of the underside of the user's phone - with OCR-readable data - as shown in Fig. 36.
- the extracted information can be added to the metadata for the Fig. 21 image, and/or submitted to a service provider to enhance the response it is able to provide to the user.
- a cell phone user may be given the ability to look around corners and under objects - by using one image as a portal to a large collection of related images.
- cell phones and related portable devices 110 typically include a display 111 and a keypad 112.
- In addition to a numeric (or alphanumeric) keypad, there is often a multi-function controller 114.
- One popular controller has a center button 118, and four surrounding buttons 116a, 116b, 116c and 116d (also shown in Fig. 37).
- An illustrative usage model is as follows.
- a system responds to an image 128 (either optically captured or wirelessly received) by displaying a collection of related images to the user, on the cell phone display.
- the user captures an image and submits it to a remote service.
- the service determines image metrics for the submitted image (possibly after pre-processing, as detailed above), and searches (e.g., Flickr) for visually similar images. These images are transmitted to the cell phone (e.g., by the service, or directly from Flickr), and they are buffered for display.
- the service can prompt the user, e.g., by instructions presented on the display, to repeatedly press the right-arrow button 116b on the four-way controller (or press-and-hold) to view a sequence of pattern-similar images (130, Fig. 45A). Each time the button is pressed, another one of the buffered apparently-similar images is displayed. By techniques like those earlier described, or otherwise, the remote service can also search for images that are similar in geolocation to the submitted image. These too can be sent to and buffered at the cell phone. The instructions may advise that the user can press the left-arrow button 116d of the controller to review these GPS-similar images (132, Fig. 45A).
- the service can search for images that are similar in metadata to the submitted image (e.g., based on textual metadata inferred from other images, identified by pattern matching or GPS matching). Again, these images can be sent to the phone and buffered for immediate display. The instructions may advise that the user can press the up-arrow button 116a of the controller to view these metadata-similar images (134, Fig. 45A).
- By pressing the right, left, and up buttons, the user can review images that are similar to the captured image in appearance, location, or metadata descriptors.
- the user can press the down button 116c.
- This action identifies the currently-viewed picture to the service provider. The process then repeats with the user-selected image as the base, and with button presses enabling review of images that are similar to that base image in appearance (116b), location (116d), or metadata (116a).
- This process can continue indefinitely.
- the user can press the center button 118 of the four-way controller.
- This action submits the then-displayed image to a service provider for further action (e.g., triggering a corresponding response, as disclosed, e.g., in earlier-cited documents).
- This action may involve a different service provider than the one that provided all the alternative imagery, or they can be the same. (In the latter case the finally-selected image need not be sent to the service provider, since that service provider knows all the images buffered by the cell phone, and may track which image is currently being displayed.)
- Another example of this user interface technique is presentation of search results from EBay for auctions listing Xbox 360 game consoles.
- One dimension can be price (e.g., pushing button 116b yields a sequence of screens showing Xbox 360 auctions, starting with the lowest-priced ones); another can be seller's geographical proximity to user (closest to furthest, shown by pushing button 116d); another can be time until end of auction (shortest to longest, presented by pushing button 116a). Pressing the middle button 118 can load the full web page of the auction being displayed.
- a related example is a system that responds to a user-captured image of a car by identifying the car (using image features and associated database(s)), searching EBay and Craigslist for similar cars, and presenting the results on the screen.
- Pressing button 116b presents screens of information about cars offered for sale (e.g., including image, seller location, and price) based on similarity to the input image (same model year/same color first, and then nearest model years/colors), nationwide. Pressing button 116d yields such a sequence of screens, but limited to the user's state (or metropolitan region, or a 50 mile radius of the user's location, etc).
- Pressing button 116a yields such a sequence of screens, again limited geographically, but this time presented in order of ascending price (rather than closest model year/color). Again, pressing the middle button loads the full web page (EBay or Craigslist) of the car last-displayed.
- Another embodiment is an application that helps people recall names. A user sees a familiar person at a party, but can't remember his name. Surreptitiously the user snaps a picture of the person, and the image is forwarded to a remote service provider. The service provider extracts facial recognition parameters and searches social networking sites (e.g., FaceBook, MySpace, Linked-In), or a separate database containing facial recognition parameters for images on those sites, for similar-appearing faces.
- (If authorized, the service may provide the user's sign-on credentials to the sites, allowing searching of information that is not otherwise publicly accessible.) Names and other information about similar-appearing persons located via the searching are returned to the user's cell phone - to help refresh the user's memory.
- When data is returned from the remote service, the user may push button 116b to scroll through matches in order of closest-similarity - regardless of geography. Thumbnails of the matched individuals with associated name and other profile information can be displayed, or just full screen images of the person can be presented - with the name overlaid. When the familiar person is recognized, the user may press button 118 to load the full FaceBook/MySpace/Linked-In page for that person. Alternatively, instead of presenting images with names, just a textual list of names may be presented, e.g., all on a single screen - ordered by similarity of face-match; SMS text messaging can suffice for this last arrangement.
- Pushing button 116d may scroll through matches in order of closest-similarity, of people who list their residence as within a certain geographical proximity (e.g., same metropolitan area, same state, same campus, etc.) of the user's present location or the user's reference location (e.g., home).
- Pushing button 116a may yield a similar display, but limited to persons who are "Friends" of the user within a social network (or who are Friends of Friends, or who are within another specified degree of separation of the user).
- a related arrangement is a law enforcement tool in which an officer captures an image of a person and submits same to a database containing facial portrait/eigenvalue information from government driver license records and/or other sources.
- Pushing button 116b causes the screen to display a sequence of images/biographical dossiers about persons nationwide having the closest facial matches.
- Pushing button 116d causes the screen to display a similar sequence, but limited to persons within the officer's state.
- Button 116a yields such a sequence, but limited to persons within the metropolitan area in which the officer is working.
- Fig. 45B shows browsing screens in just two dimensions. (Pressing the right button yields a first sequence 140 of information screens; pressing the left button yields a different sequence 142 of information screens.)
- a single UI control can be employed to navigate in the available dimensions of information.
- a joystick is one such device.
- Another is a roller wheel (or scroll wheel).
- Portable device 110 of Fig. 44 has a roller wheel 124 on its side, which can be rolled-up or rolled-down. It can also be pressed-in to make a selection (e.g., akin to buttons 116c or 118 of the earlier-discussed controller). Similar controls are available on many mice.
- opposing buttons navigate the same dimension of information - just in opposite directions (e.g., forward/reverse). In the particular interface discussed above, it will be recognized that this is not the case (although in other implementations, it may be so). Pressing the right button 116b, and then pressing the left button 116d, does not return the system to its original state. Instead, pressing the right button gives, e.g., a first similar-appearing image, and pressing the left button gives the first similarly-located image.
- buttons 120a - 120f, around the periphery of the controller 114. Any of these - if pressed - can serve to reverse the scrolling order.
- By pressing, e.g., button 120a, the scrolling (presentation) direction associated with nearby button 116b can be reversed. So if button 116b normally presents items in order of increasing cost, activation of button 120a can cause the function of button 116b to switch, e.g., to presenting items in order of decreasing cost.
- If, in reviewing screens resulting from use of button 116b, the user "overshoots" and wants to reverse direction, she can push button 120a, and then push button 116b again. The screen(s) earlier presented would then appear in reverse order - starting from the present screen.
- operation of such a button can cause the opposite button 116d to scroll back through the screens presented by activation of button 116b, in reverse order.
- a textual or symbolic prompt can be overlaid on the display screen in all these embodiments - informing the user of the dimension of information that is being browsed, and the direction (e.g., browsing by cost: increasing).
- a single button can perform multiple functions. For example, pressing button 116b can cause the system to start presenting a sequence of screens, e.g., showing pictures of houses for sale nearest the user's location - presenting each for 800 milliseconds (an interval set by preference data entered by the user). Pressing button 116b a second time can cause the system to stop the sequence - displaying a static screen of a house for sale. Pressing button 116b a third time can cause the system to present the sequence in reverse order, starting with the static screen and going backwards through the screens earlier presented.
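The three-press behavior just described can be sketched as a small state machine. This is a hypothetical illustration only; the class, state names, and default interval are invented, not part of the specification:

```python
class SlideshowButton:
    """Cycles a presentation through play -> paused -> reverse states."""

    def __init__(self, interval_ms=800):
        self.interval_ms = interval_ms   # per-screen dwell, from user preferences
        self.state = "idle"

    def press(self):
        transitions = {
            "idle": "playing_forward",    # first press: start the sequence
            "playing_forward": "paused",  # second press: freeze on current screen
            "paused": "playing_reverse",  # third press: walk back through screens
            "playing_reverse": "paused",  # later presses toggle pause/reverse
        }
        self.state = transitions[self.state]
        return self.state

b = SlideshowButton()
states = [b.press() for _ in range(3)]
```

Under this sketch, three presses yield the start/stop/reverse sequence described above.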
- buttons 116a, 116b, etc. can operate likewise (but control different sequences of information, e.g., houses closest in price, and houses closest in features).
- this base image may be presented throughout the display - e.g., as a thumbnail in a corner of the display.
- a button on the device (e.g., 126a, or 120b)
- Touch interfaces are gaining in popularity, such as in products available from Apple and Microsoft (detailed, e.g., in Apple's patent publications 20060026535, 20060026536, 20060250377, 20080211766, etc.).
- Each button press noted above can have a counterpart gesture in the vocabulary of the touch screen system.
- different touch-screen gestures can invoke display of the different types of image feeds just reviewed.
- a brushing gesture to the right may present a rightward-scrolling series of image frames 130 of imagery having similar visual content (with the initial speed of scrolling dependent on the speed of the user gesture, and with the scrolling speed decelerating - or not - over time).
- a brushing gesture to the left may present a similar leftward-scrolling display of imagery 132 having similar GPS information.
- a brushing gesture upward may present an upward-scrolling display of imagery 134 that is similar in metadata. At any point the user can tap one of the displayed images to make it the base image, with the process repeating.
- gestures can invoke still other actions.
- One such action is displaying overhead imagery corresponding to the GPS location associated with a selected image.
- the imagery can be zoomed in/out with other gestures.
- the user can select for display photographic imagery, map data, data from different times of day or different dates/seasons, and/or various overlays (topographic, places of interest, and other data, as is known from Google Earth), etc.
- Icons or other graphics may be presented on the display depending on contents of particular imagery.
- One such arrangement is detailed in Digimarc's published application 20080300011. "Curbside" or "street-level" imagery - rather than overhead imagery - can also be displayed. It will be recognized that certain embodiments of the present technology include a shared general structure.
- An initial set of data (e.g., an image, or metadata such as descriptors or geocode information, or image metrics such as eigenvalues) is presented. From this, a second set of data (e.g., images, or image metrics, or metadata) is obtained. From that second set of data, a third set of data is compiled (e.g., images with similar image metrics or similar metadata, or image metrics, or metadata). Items from the third set of data can be used as a result of the process, or the process may continue, e.g., by using the third set of data in determining fourth data (e.g., a set of descriptive metadata can be compiled from the images of the third set).
- a sixth set of data can be obtained from the fifth (e.g., identifying clusters of GPS data with which images in the fifth set are tagged), and so on.
- the sets of data can be images, or they can be other forms of data (e.g., image metrics, textual metadata, geolocation data, decoded OCR-, barcode-, watermark-data, etc).
- Any data can serve as the seed.
- the process can start with image data, or with other information, such as image metrics, textual metadata (aka semantic metadata), geolocation information (e.g., GPS coordinates), decoded OCR/barcode/watermark data, etc.
- From the seed, a first type of information (image metrics, semantic metadata, GPS info, decoded info, etc.) can be used to obtain a first set of information-similar images.
- From that first set, a second, different type of information (image metrics/semantic metadata/GPS/decoded info, etc.) can be harvested and used to obtain a second set of information-similar images.
- From that second set, a third, different type of information (image metrics/semantic metadata/GPS/decoded info, etc.) can likewise be harvested and used to obtain a third set of information-similar images.
- the seed can be the payload from a product barcode. This can generate a first collection of images depicting the same barcode. This can lead to a set of common metadata. That can lead to a second collection of images based on that metadata. Image metrics may be computed from this second collection, and the most prevalent metrics can be used to search and identify a third collection of images. The images thus identified can be presented to the user using the arrangements noted above.
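The barcode-seeded chain just described can be sketched as follows. This is an illustrative sketch only: the search functions are stand-ins for real services (e.g., a Flickr query), and the tiny in-memory corpus is invented:

```python
from collections import Counter

# Invented miniature image corpus; each record carries a decoded barcode
# payload (if any) and a set of metadata tags.
CORPUS = [
    {"id": 1, "barcode": "0123", "tags": {"soda", "cola"}},
    {"id": 2, "barcode": "0123", "tags": {"cola", "vending"}},
    {"id": 3, "barcode": None,   "tags": {"cola", "diner"}},
]

def images_with_barcode(payload):
    # stand-in for a search service keyed on decoded barcode data
    return [im for im in CORPUS if im["barcode"] == payload]

def common_metadata(images):
    # keep tags shared by two or more images (the "common" metadata)
    counts = Counter(t for im in images for t in im["tags"])
    return {t for t, n in counts.items() if n >= 2}

def images_with_tags(tags):
    # stand-in for a metadata-based image search
    return [im for im in CORPUS if im["tags"] & tags]

first = images_with_barcode("0123")   # seed: barcode payload -> first collection
tags = common_metadata(first)         # shared metadata from that collection
second = images_with_tags(tags)       # metadata -> second collection
```

A real system would continue by computing image metrics over the second collection and searching on the most prevalent metrics, as the text describes.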
- Certain embodiments of the present technology may be regarded as employing an iterative, recursive process by which information about one set of images (a single image in many initial cases) is used to identify a second set of images, which may be used to identify a third set of images, etc.
- the function by which each set of images is related to the next relates to a particular class of image information, e.g., image metrics, semantic metadata, GPS, decoded info, etc.
- the relation between one set of images and the next is a function not just of one class of information, but two or more.
- a seed user image may be examined for both image metrics and GPS data. From these two classes of information a collection of images can be determined - images that are similar in both some aspect of visual appearance and location. Other pairings, triplets, etc., of relationships can naturally be employed - in the determination of any of the successive sets of images.
- Some embodiments of the present technology analyze a consumer cell phone picture, and heuristically determine information about the picture's subject. For example, is it a person, place, or thing? From this high level determination, the system can better formulate what type of response might be sought by the consumer - making operation more intuitive.
- the consumer might be interested in adding the depicted person as a FaceBook "friend.” Or sending a text message to that person. Or publishing an annotated version of the photo to a web page. Or simply learning who the person is.
- the consumer might be interested in the local geography, maps, and nearby attractions.
- the consumer may be interested in information about the object (e.g., its history, others who use it), or in buying or selling the object, etc.
- an illustrative system/service can identify one or more actions that it expects the consumer will find most appropriately responsive to the cell phone image. One or all of these can be undertaken, and cached on the consumer's cell phone for review. For example, scrolling a thumbwheel on the side of the cell phone may present a succession of different screens - each with different information responsive to the image subject. (Or a screen may be presented that queries the consumer as to which of a few possible actions is desired.)
- the system can monitor which of the available actions is chosen by the consumer.
- the consumer's usage history can be employed to refine a Bayesian model of the consumer's interests and desires, so that future responses can be better customized to the user.
- Based on location information (e.g., latitude/longitude in XMP- or EXIF-metadata), a search of Flickr can be undertaken for a first set of images - taken from the same (or nearby) location. Perhaps there are 5 or 500 images in this first set.
- Metadata from this set of images is collected.
- the metadata can be of various types. One is words/phrases from a title given to an image. Another is information in metatags assigned to the image - usually by the photographer (e.g., naming the photo subject and certain attributes/keywords), but additionally by the capture device (e.g., identifying the camera model, the date/time of the photo, the location, etc.). Another is words/phrases in a narrative description of the photo authored by the photographer. Some metadata terms may be repeated across different images. Descriptors common to two or more images can be identified (clustered), and the most popular terms may be ranked. (Such a listing is shown at "A" in Fig. 46A. Here, and in other metadata listings, only partial results are given for expository convenience.)
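The clustering and ranking of descriptors just described can be sketched as a simple term count. The tag lists below are invented for illustration:

```python
from collections import Counter

# Hypothetical metadata harvested from a set of co-located Flickr images.
image_tags = [
    ["rockefeller center", "skating", "new york"],
    ["new york", "prometheus", "rockefeller center"],
    ["manhattan", "new york"],
]

counts = Counter(tag for tags in image_tags for tag in tags)
# keep descriptors common to two or more images, ranked by popularity
ranked = [(t, n) for t, n in counts.most_common() if n >= 2]
```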
- Place-Centric Processing: Terms that relate to place can be identified using various techniques.
- One is to use a database with geographical information to look-up location descriptors near a given geographical position.
- Yahoo's GeoPlanet service returns a hierarchy of descriptors such as "Rockefeller Center,” "10024” (a zip code), "Midtown Manhattan,” “New York,” “Manhattan,” “New York,” and “United States,” when queried with the latitude/longitude of the Rockefeller Center.
- the same service can return names of adjoining/sibling neighborhoods/features on request, e.g.,
- Nearby street names can be harvested from a variety of mapping programs, given a set of latitude/longitude coordinates or other location info.
- a glossary of nearby place-descriptors can be compiled in such manner.
- the metadata harvested from the set of Flickr images can then be analyzed, by reference to the glossary, to identify the terms that relate to place (e.g., that match terms in the glossary).
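The glossary-matching step can be sketched as a set intersection. The glossary contents and tags below are invented for illustration (a real glossary might be compiled from a service such as Yahoo's GeoPlanet, per the preceding discussion):

```python
# Hypothetical glossary of nearby place-descriptors for a given location.
place_glossary = {"rockefeller center", "midtown manhattan", "manhattan",
                  "new york", "10024", "5th avenue"}

def place_terms(metadata):
    """Return the image descriptors that match the place glossary."""
    return {term.lower() for term in metadata} & place_glossary

tags = ["Rockefeller Center", "Prometheus", "New York", "statue"]
matched = place_terms(tags)
```

The fraction of an image's descriptors found in the glossary then serves as one of the metrics discussed below.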
- Images may have metadata that is exclusively place-related. These images are likely place-centric, rather than person-centric or thing-centric.
- One rule looks at the number of metadata descriptors associated with an image, and determines the fraction that is found in the glossary of place-related terms. This is one metric.
- Placement of the place-related metadata is another metric. Consideration can also be given to the particularity of the place-related descriptor. A descriptor "New York" or "USA" may be less indicative that an image is place-centric than a more particular descriptor, such as "Rockefeller Center" or "Grand Central Station." This can yield a third metric.
- a related, fourth metric considers the frequency of occurrence (or improbability) of a term - either just within the collected metadata, or within a superset of that data. "RCA Building” is more relevant, from this standpoint, than "Rockefeller Center” because it is used much less frequently.
- the combination can be a straight sum of four factors, each ranging from 0 to 100. More likely, however, some metrics will be weighted more heavily.
- the following equation employing metrics M1, M2, M3 and M4 can be employed to yield a score S, with the factors A, B, C, D and exponents W, X, Y and Z determined experimentally, or by Bayesian techniques: S = A·M1^W + B·M2^X + C·M3^Y + D·M4^Z
- a different analysis can be employed to estimate the person-centric-ness of each image in the set obtained from Flickr.
- a glossary of relevant terms can be compiled - this time terms associated with people.
- the person name glossary can be global - rather than associated with a particular locale. (However, different glossaries may be appropriate in different countries.)
- Such a glossary can be compiled from various sources, including telephone directories, lists of most popular names, and other reference works where names appear.
- the list may start, "Aaron, Abigail, Adam, Addison, Adrian, Aidan, Aiden, Alex, Alexa, Alexander, Alexandra, Alexis, Allison, Alyssa, Amelia, Andrea, Andrew, Angel, Angelina, Anna, Anthony, Antonio, Ariana, Arianna, Ashley, Aubrey, Audrey, Austin, Autumn, Ava, Avery."
- First names alone can be considered, or last names can be considered too. (Some names may be a place name or a person name. Searching for adjoining first/last names and/or adjoining place names can help distinguish ambiguous cases. E.g., Elizabeth Smith is a person; Elizabeth NJ is a place.)
- Adjectives and adverbs that are usually applied to people may also be included in the person-term glossary (e.g., happy, boring, blonde, etc), as can the names of objects and attributes that are usually associated with people (e.g., t-shirt, backpack, sunglasses, tanned, etc.).
- Verbs associated with people can also be employed (e.g., surfing, drinking).
- Such terms can sometimes identify thing-centric images rather than person-centric ones: the term "sunglasses" may appear in metadata for an image depicting sunglasses alone; "happy" may appear in metadata for an image depicting a dog.
- glossary terms can be associated with respective confidence metrics, by which any results based on such terms may be discounted or otherwise acknowledged to have different degrees of uncertainty.
- If an image is not associated with any person-related metadata, then the image can be adjudged likely not person-centric. Conversely, if all of the metadata is person-related, the image is likely person-centric. For other cases, metrics like those reviewed above can be assessed and combined to yield a score indicating the relative person-centric-ness of each image, e.g., based on the number, placement, particularity and/or frequency/improbability of the person-related metadata associated with the image.
- While analysis of metadata gives useful information about whether an image is person-centric, other techniques can also be employed - either alternatively, or in conjunction with metadata analysis.
- One technique is to analyze the image looking for continuous areas of skin-tone colors. Such areas characterize many person-centric images, but are less frequently found in images of places and things.
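As a rough illustration of the skin-tone heuristic, the fraction of pixels falling inside a crude RGB skin range can be computed. The rule and the sample pixel values below are assumptions for illustration, not values from the specification:

```python
def skin_tone_fraction(pixels):
    """pixels: iterable of (r, g, b) tuples, 0-255. Returns skin-pixel fraction."""
    def is_skin(r, g, b):
        # crude rule: red dominant, moderate green, modest blue
        return r > 95 and g > 40 and b > 20 and r > g > b and (r - g) > 15
    pix = list(pixels)
    hits = sum(1 for p in pix if is_skin(*p))
    return hits / len(pix) if pix else 0.0

# three invented flesh-tone pixels plus one dark pixel
face_patch = [(200, 150, 120)] * 3 + [(30, 30, 30)]
frac = skin_tone_fraction(face_patch)
```

A practical system would of course also require the skin-tone pixels to be spatially contiguous, per the "continuous areas" language above.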
- a related technique is facial recognition. This science has advanced to the point where even inexpensive point-and-shoot digital cameras can quickly and reliably identify faces within an image frame (e.g., to focus or expose the image based on such subjects).
- Facial recognition algorithms can be applied to the set of reference images obtained from Flickr, to identify those that have evident faces, and identify the portions of the images corresponding to the faces.
- Another form of further processing is to look for the existence of (1) one or more faces in the image, together with (2) person-descriptors in the metadata associated with the image.
- the facial recognition data can be used as a "plus” factor to increase a person-centric score of an image based on metadata or other analysis.
- The "plus" can take various forms, e.g., a score (on a 0-100 scale) can be increased by 10, or increased by 10%, or increased by half the remaining distance to 100, etc.
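The three variants can be sketched as follows (the function and mode names are invented):

```python
def plus_factor(score, mode="half_remaining"):
    """Boost a 0-100 score when facial-recognition evidence agrees."""
    if mode == "add10":
        boosted = score + 10
    elif mode == "pct10":
        boosted = score * 1.10
    else:  # "half_remaining": half the remaining distance to 100
        boosted = score + (100 - score) / 2
    return min(boosted, 100.0)   # clamp so the score stays on the 0-100 scale

b1 = plus_factor(80, "add10")
b2 = plus_factor(80, "pct10")
b3 = plus_factor(80)
```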
- a photo tagged with "Elizabeth” metadata is more likely a person-centric photo if the facial recognition algorithm finds a face within the image than if no face is found.
- the absence of any face in an image can be used as a "plus” factor to increase the confidence that the image subject is of a different type, e.g., a place or a thing.
- an image tagged with Elizabeth as metadata, but lacking any face increases the likelihood that the image relates to a place named Elizabeth, or a thing named Elizabeth - such as a pet.
- A similar confidence-increasing check is whether the facial recognition algorithm identifies a face as a female, and the metadata includes a female name.
- This requires that the glossary - or other data structure - have data that associates genders with at least some names.
- the age of the depicted person(s) can be estimated using automated techniques (e.g., as detailed in patent 5,781,650, to Univ. of Central Florida). Names found in the image metadata can also be processed to estimate the age of the thus-named person(s). This can be done using public domain information about the statistical distribution of a name as a function of age (e.g., from published Social Security Administration data, and web sites that detail most popular names from birth records). Thus, names Mildred and Gertrude may be associated with an age distribution that peaks at age 80, whereas Madison and Alexis may be associated with an age distribution that peaks at age 8.
- Finding statistically-likely correspondence between metadata name and estimated person age can further increase the person-centric score for an image.
- Statistically unlikely correspondence can be used to decrease the person-centric score. (Estimated information about the age of a subject in the consumer's image can also be used to tailor the intuited response(s), as may information about the subject's gender.)
- this information can be used to reduce a person-centric score for the image (perhaps down to zero).
- Thing-Centric Processing: A thing-centric image is the third type of image that may be found in the set of images obtained from Flickr in the present example.
- a glossary of nouns can be compiled - either from the universe of Flickr metadata or some other corpus (e.g., WordNet), and ranked by frequency of occurrence. Nouns associated with places and persons can be removed from the glossary.
- the glossary can be used in the manners identified above to conduct analyses of the images' metadata, to yield a score for each.
- Another approach uses pattern matching to identify thing-centric images - matching each against a library of known thing-related images. Still another approach is based on earlier-determined scores for person-centric and place-centric. A thing-centric score may be assigned in inverse relationship to the other two scores (i.e., if an image scores low for being person-centric, and low for being place-centric, then it can be assigned a high score for being thing-centric).
- The foregoing techniques can produce three scores for each image in the set, indicating rough confidence/probability/likelihood that the image is (1) person-centric, (2) place-centric, or (3) thing-centric. These scores needn't add to 100% (although they may). Sometimes an image may score high in two or more categories. In such case the image may be regarded as having multiple relevance, e.g., as both depicting a person and a thing.
- the set of images downloaded from Flickr may next be segregated into groups, e.g., A, B and C, depending on whether identified as primarily person-centric, place-centric, or thing-centric. However, since some images may have split probabilities (e.g., an image may have some indicia of being place-centric, and some indicia of being person-centric), identifying an image wholly by its high score ignores useful information. Preferable is to calculate a weighted score for the set of images - taking each image's respective scores in the three categories into account.
- a sample of images from Flickr - all taken near Rockefeller Center - may suggest that 60% are place-centric, 25% are person-centric, and 15% are thing-centric.
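Rather than hard-assigning each image to its top category, the weighted tally suggested above can be sketched as follows (the per-image scores are invented):

```python
def category_distribution(image_scores):
    """image_scores: list of (person, place, thing) score triples per image.

    Each image contributes 1.0 total, split in proportion to its scores,
    so split probabilities are not discarded."""
    totals = [0.0, 0.0, 0.0]
    for scores in image_scores:
        s = sum(scores)
        if s:
            for i, v in enumerate(scores):
                totals[i] += v / s
    n = len(image_scores)
    return tuple(round(t / n, 2) for t in totals)

# three hypothetical reference images with (person, place, thing) scores
dist = category_distribution([(10, 80, 10), (50, 40, 10), (0, 90, 10)])
```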
- This information gives useful insight into the tourist's cell phone image - even without regard to the contents of the image itself (except its geocoding). That is, chances are good that the image is place-centric, with less likelihood it is person-centric, and still less probability it is thing-centric. (This ordering can be used to determine the order of subsequent steps in the process - allowing the system to more quickly give responses that are most likely to be appropriate.)
- This type-assessment of the cell phone photo can be used - alone - to help determine an automated action provided to the tourist in response to the image. However, further processing can better assess the image's contents, and thereby allow a more particularly-tailored action to be intuited.
- Similarity Assessments and Metadata Weighting: Place-centric images may be characterized by straight lines (e.g., architectural edges), repetitive patterns (windows), or large areas of uniform texture and similar color near the top of the image (sky).
- person-centric images will also tend to have different appearances than the other two classes of image, yet have common attributes within the person-centric class.
- person-centric images will usually have faces - generally characterized by an ovoid shape with two eyes and a nose, areas of flesh tones, etc.
- While thing-centric images are perhaps the most diverse, images from any given geography may tend to have unifying attributes or features. Photos geocoded at a horse track will depict horses with some frequency; photos geocoded from Independence National Historical Park in Philadelphia will tend to depict the Liberty Bell regularly, etc. By determining whether the cell phone image is more similar to place-centric, person-centric, or thing-centric images in the set of Flickr images, more confidence in the subject of the cell phone image can be achieved (and a more accurate response can be intuited and provided to the consumer).
- a fixed set of image assessment criteria can be applied to distinguish images in the three categories.
- the detailed embodiment determines such criteria adaptively.
- this embodiment examines the set of images and determines which image features/characteristics/metrics most reliably (1) group like- categorized images together (similarity); and (2) distinguish differently-categorized images from each other (difference).
- attributes that may be measured and checked for similarity/difference behavior within the set of images are dominant color; color diversity; color histogram; dominant texture; texture diversity; texture histogram; edginess; wavelet-domain transform coefficient histograms, and dominant wavelet coefficients; frequency-domain transform coefficient histograms and dominant frequency coefficients (which may be calculated in different color channels); eigenvalues; keypoint descriptors; geometric class probabilities; symmetry; percentage of image area identified as facial; image autocorrelation; low-dimensional "gists" of the image; etc. (Combinations of such metrics may be more reliable than the characteristics individually.)
- One way to determine which metrics are most salient for these purposes is to compute a variety of different image metrics for the reference images. If the results within a category of images for a particular metric are clustered (e.g., if, for place-centric images, the color histogram results are clustered around particular output values), and if images in other categories have few or no output values near that clustered result, then that metric would appear well suited for use as an image assessment criteria.
- the system may determine that an edginess score of >40 is reliably associated with images that score high as place-centric; a facial area score of >15% is reliably associated with images that score high as person-centric; and a color histogram that has a local peak in the gold tones, together with frequency content for yellow that peaks at lower image frequencies, is somewhat associated with images that score high as thing-centric.
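One plausible way to score a candidate metric's salience, in the spirit of the clustering test above, is to compare within-category spread to between-category separation (a Fisher-style ratio). The function name and data below are invented for illustration:

```python
from statistics import mean, pstdev

def salience(values_by_category):
    """values_by_category: {category: [metric values]}. Higher is better:
    tight clusters per category, with well-separated category means."""
    means = {c: mean(v) for c, v in values_by_category.items()}
    spread = mean(pstdev(v) for v in values_by_category.values())
    separation = pstdev(means.values())
    return separation / spread if spread else float("inf")

# hypothetical edginess scores measured on reference images of each class
edginess = {"place": [45, 50, 55], "person": [10, 12, 14], "thing": [20, 25, 30]}
score = salience(edginess)
```

A metric with a high salience score (here, well above 1) would be well suited as an image assessment criterion; one whose categories overlap would score near zero.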
- the analysis techniques found most useful in grouping/distinguishing the different categories of images can then be applied to the user's cell phone image.
- the results can then be analyzed for proximity - in a distance measure sense (e.g., multi-dimensional space) - with the characterizing features associated with different categories of images. (This is the first time that the cell phone image has been processed in this particular embodiment.)
- the cell phone image may score a 60 for thing-centric, a 15 for place-centric, and a 0 for person-centric (on a scale of 0-100). This is a second, better set of scores that can be used to classify the cell phone image (the first being the statistical distribution of co-located photos found in Flickr).
- the similarity of the user's cell phone image may next be compared with individual images in the reference set. Similarity metrics identified earlier can be used, or different measures can be applied.
- the time or processing devoted to this task can be apportioned across the three different image categories based on the just-determined scores. E.g., the process may spend no time judging similarity with reference images classed as 100% person-centric, but instead concentrate on judging similarity with reference images classed as thing- or place-centric (with more effort - e.g., four times as much effort - being applied to the former than the latter).
- a similarity score is generated for most of the images in the reference set (excluding those that are assessed as 100% person-centric).
- Metadata from the reference images are again assembled - this time weighted in accordance with each image's respective similarity to the cell phone image. (The weighting can be linear or exponential.) Since metadata from similar images is weighted more than metadata from dissimilar images, the resulting set of metadata is tailored to more likely correspond to the cell phone image.
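The similarity-weighted aggregation can be sketched as follows (similarities and tags are invented; the linear weighting shown is one of the options the text mentions):

```python
from collections import defaultdict

def weighted_metadata(references):
    """references: list of (similarity, tags) pairs; similarity in 0..1.

    Each reference image's tags are credited in proportion to that image's
    similarity to the query image, then ranked by accumulated weight."""
    scores = defaultdict(float)
    for sim, tags in references:
        for tag in tags:
            scores[tag] += sim
    return sorted(scores.items(), key=lambda kv: -kv[1])

# hypothetical reference images with similarity scores and tags
refs = [
    (0.9, ["prometheus", "rockefeller center"]),
    (0.5, ["rockefeller center", "skating rink"]),
    (0.1, ["new york"]),
]
ranked = weighted_metadata(refs)
```

Tags carried by highly similar images dominate the resulting list, so the metadata set is tailored to more likely correspond to the query image.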
- the top N (e.g., 3) metadata descriptors may be used.
- descriptors that - on a weighted basis - comprise an aggregate M% of the metadata set.
- the thus-identified metadata may comprise "Rockefeller Center,” “Prometheus,” and “Skating rink,” with respective scores of 19, 12 and 5 (see “B” in Fig. 46B).
- the system can begin determining what responses may be most appropriate for the consumer. In the exemplary embodiment, however, the system continues by further refining its assessment of the cell phone image. (The system may begin determining appropriate responses while also undertaking the further processing.)
- Processing a Second Set of Reference Images: Flickr is queried for images having the identified metadata.
- the query can be geographically limited to the cell phone's geolocation, or a broader (or unlimited) geography may be searched. (Or the query may run twice, so that half of the images are co-located with the cell phone image, and the others are remote, etc.)
- the search may first look for images that are tagged with all of the identified metadata. In this case, 60 images are found. If more images are desired, Flickr may be searched for the metadata terms in different pairings, or individually. (In these latter cases, the distribution of selected images may be chosen so that the metadata occurrence in the results corresponds to the respective scores of the different metadata terms, i.e., 19/12/5.)
- Metadata from this second set of images can be harvested, clustered, and may be ranked ("C” in Fig. 46B). (Noise words ("and, of, or,” etc.) can be eliminated. Words descriptive only of the camera or the type of photography may also be disregarded (e.g., "Nikon,” “D80,” “HDR,” “black and white,” etc.). Month names may also be removed.)
- the analysis performed earlier - by which each image in the first set of images was classified as person- centric, place-centric or thing-centric - can be repeated on images in the second set of images.
- Appropriate image metrics for determining similarity/difference within and between classes of this second image set can be identified (or the earlier measures can be employed). These measures are then applied, as before, to generate refined scores for the user's cell phone image, as being person-centric, place-centric, and thing-centric.
- the cell phone image may score a 65 for thing-centric, 12 for place- centric, and 0 for person-centric. (These scores may be combined with the earlier-determined scores, e.g., by averaging, if desired.)
- Similarity between the user's cell phone image and each image in the second set can be determined. Metadata from each image can then be weighted in accordance with the corresponding similarity measure. The results can then be combined to yield a set of metadata weighted in accordance with image similarity.
- Some metadata - often including highly ranked terms - will be of relatively low value in determining image-appropriate responses for presentation to the consumer; "New York" and "Manhattan" are a few examples. Generally more useful will be metadata descriptors that are relatively unusual.
- a measure of "unusualness” can be computed by determining the frequency of different metadata terms within a relevant corpus, such as Flickr image tags (globally, or within a geolocated region), or image tags by photographers from whom the respective images were submitted, or words in an encyclopedia, or in Google's index of the web, etc.
- the terms in the weighted metadata list can be further weighted in accordance with their unusualness (i.e., a second weighting).
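This unusualness weighting resembles an inverse document frequency term. A sketch follows; the corpus counts are invented (a real system might use global Flickr tag frequencies, per the preceding discussion):

```python
import math

# hypothetical frequencies of terms within a reference corpus
corpus_freq = {"new york": 100000, "manhattan": 60000,
               "prometheus": 300, "skating rink": 900}
CORPUS_SIZE = 1_000_000

def unusualness(term):
    # rarer terms yield larger values (IDF-style)
    return math.log(CORPUS_SIZE / corpus_freq.get(term, 1))

def reweight(weighted_terms):
    """Apply the second, unusualness weighting to similarity-weighted terms."""
    return sorted(((t, w * unusualness(t)) for t, w in weighted_terms),
                  key=lambda kv: -kv[1])

# similarity-weighted terms: the common "new york" initially ranks highest
terms = [("new york", 30), ("prometheus", 12), ("skating rink", 5)]
reranked = reweight(terms)
```

In this sketch the rare descriptor "prometheus" overtakes the common "new york" after the second weighting, which is the intended effect.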
- the result of such successive processing may yield the list of metadata shown at "D" in Fig. 46B (each shown with its respective score).
- This information (optionally in conjunction with a tag indicating the person/place/thing determination) allows responses to the consumer to be well-correlated with the cell phone photo.
- this set of inferred metadata for the user's cell phone photo was compiled entirely by automated processing of other images, obtained from public sources such as Flickr, in conjunction with other public resources (e.g., listings of names, places).
- the inferred metadata can naturally be associated with the user's image. More importantly for the present application, however, it can help a service provider decide how best to respond to submission of the user's image.
- In Fig. 50, the system just described can be viewed as one particular application of an "image juicer" that receives image data from a user, and applies different forms of processing so as to gather, compute, and/or infer information that can be associated with the image.
- the information can be forwarded by a router to different service providers.
- These providers may be arranged to handle different types of information (e.g., semantic descriptors, image texture data, keypoint descriptors, eigenvalues, color histograms, etc.) or different classes of images (e.g., photo of friend, photo of a can of soda, etc.).
- Outputs from these service providers are sent to one or more devices (e.g., the user's cell phone) for presentation or later reference.
- a tree structure can be used, with an image first being classed into one of a few high level groupings (e.g., person/place/thing), and then each group being divided into further subgroups. In use, an image is assessed through different branches of the tree until the limits of available information allow no further progress to be made. Actions associated with the terminal leaf or node of the tree are then taken.
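The traversal just described might be sketched as follows (the `Node` class and classifier functions are hypothetical illustrations, not taken from the specification): descend the tree while a classifier at each node can still assign the image to a child, and return the actions cached at the node where progress stops.

```python
class Node:
    def __init__(self, name, classify=None, actions=None, children=None):
        self.name = name
        self.classify = classify      # fn(image_info) -> child name, or None
        self.actions = actions or []  # actions taken if traversal stops here
        self.children = children or {}

def traverse(node, image_info):
    """Descend the classification tree until the limits of available
    information allow no further progress; return the terminal actions."""
    while node.classify:
        child_name = node.classify(image_info)
        if child_name is None or child_name not in node.children:
            break  # no further progress can be made
        node = node.children[child_name]
    return node.actions
```

A partially classified image (e.g., known to be a "thing" but not which kind) simply stops higher in the tree and receives the more generic actions stored there.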
- Part of a simple tree structure is shown in Fig. 51. (Each node spawns three branches, but this is for illustration only; more or fewer branches can of course be used.) If the subject of the image is inferred to be an item of food (e.g., if the image is associated with food-related metadata), three different screens of information can be cached on the user's phone. One starts an online purchase of the depicted item at an online vendor. (The choice of vendor, and payment/shipping details, can be obtained from user profile data.) The second screen shows nutritional information about the product. The third presents a map of the local area - identifying stores that sell the depicted product. The user switches among these responses using a roller wheel 124 on the side of the phone (Fig. 44).
- one screen presented to the user gives the option of posting a copy of the photo to the user's FaceBook page, annotated with the person(s)'s likely name(s).
- Determining the names of persons depicted in a photo can be done by submitting the photo to the user's account at Picasa.
- Picasa performs facial recognition operations on submitted user images, and correlates facial eigenvectors with individual names provided by the user, thereby compiling a user-specific database of facial recognition information for friends and others depicted in the user's prior images.
- Picasa's facial recognition is understood to be based on technology detailed in patent 6,356,659 to Google.
- Apple's iPhoto software and Facebook's Photo Finder software include similar facial recognition functionality.
- Another screen starts a text message to the individual, with the addressing information having been obtained from the user's address book, indexed by the Picasa-determined identity. The user can pursue any or all of the presented options by switching between the associated screens.
- the system will have earlier undertaken an attempted recognition of the person using publicly available facial recognition information.
- Such information can be extracted from photos of known persons.
- VideoSurf is one vendor with a database of facial recognition features for actors and other persons.
- L-I Corp. maintains databases of driver's license photos and associated data which may - with appropriate safeguards - be employed for facial recognition purposes.
- the screen(s) presented to the user can show reference photos of the persons matched (together with a "match" score), as well as dossiers of associated information compiled from the web and other databases.
- a further screen gives the user the option of sending a "Friend" invite to the recognized person on MySpace, or another social networking site where the recognized person is found to have a presence.
- a still further screen details the degree of separation between the user and the recognized person. (E.g., my brother David has a classmate Steve, who has a friend Matt, who has a friend Tom, who is the son of the depicted person.) Such relationships can be determined from association information published on social networking sites.
- At least one alternative response to each image may be open-ended - allowing the user to navigate to different information, or specify a desired response - making use of whatever image/metadata processed information is available.
- Google, per se, is not necessarily best for this function, because current Google searches require that all search terms be found in the results. Better is a search engine that does fuzzy searching, and is responsive to differently-weighted keywords - not all of which need be found.
- the results can indicate different seeming relevance, depending on which keywords are found, where they are found, etc. (A result including "Prometheus” but lacking "RCA Building” would be ranked more relevant than a result including the latter but lacking the former.)
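A minimal sketch of such weighted, fuzzy scoring (the function and term weights below are hypothetical) is to sum the weights of whichever keywords a result actually contains, rather than requiring all of them:

```python
def relevance(doc_text, weighted_terms):
    """Score a document against differently-weighted keywords; unlike a
    strict AND search, not every term need be present."""
    text = doc_text.lower()
    return sum(weight for term, weight in weighted_terms
               if term.lower() in text)
```

With terms weighted, say, ("Prometheus", 60) and ("RCA Building", 35), a result mentioning only "Prometheus" outranks one mentioning only "RCA Building", as described above.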
- the results from such a search can be clustered by other concepts.
- results may be clustered because they share the theme "art deco." Others may be clustered because they deal with corporate history of RCA and GE. Others may be clustered because they concern the works of the architect Raymond Hood. Others may be clustered as relating to 20th century American sculpture, or Paul Manship. Other concepts found to produce distinct clusters may include John Rockefeller, The Mitsubishi Group, Columbia University, Radio City Music Hall, The Rainbow Room Restaurant, etc.
- Information from these clusters can be presented to the user on successive UI screens, e.g., after the screens on which prescribed information/actions are presented.
- the order of these screens can be determined by the sizes of the information clusters, or the keyword-determined relevance.
- Still a further response is to present to the user a Google search screen - pre-populated with the twice-weighted metadata as search terms. The user can then delete terms that aren't relevant to his/her interest, and add other terms, so as to quickly execute a web search leading to the information or action desired by the user.
- the system response may depend on people with whom the user has a "friend" relationship in a social network, or some other indicia of trust. For example, if little is known about user Ted, but there is a rich set of information available about Ted's friend Alice, that rich set of information may be employed in determining how to respond to Ted, in connection with a given content stimulus.
- One such technique simply examines which responsive screen(s) are selected by users in particular contexts. As such usage patterns become evident, the most popular responses can be moved earlier in the sequence of screens presented to the user.
- the usage patterns can be tailored in various dimensions of context. Males between 40 and 60 years of age, in New York, may demonstrate interest in different responses following capture of a snapshot of a statue by a 20th century sculptor, than females between 13 and 16 years of age in Beijing. Most persons snapping a photo of a food processor in the weeks before Christmas may be interested in finding the cheapest online vendor of the product; most persons snapping a photo of the same object the week following Christmas may be interested in listing the item for sale on EBay or Craigslist. Etc. Desirably, usage patterns are tracked with as many demographic and other descriptors as possible, so as to be most-predictive of user behavior.
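A rough sketch of such context-keyed usage tracking (the class and context key are illustrative assumptions, not from the specification) might tally which response screens are selected per demographic context and reorder accordingly:

```python
from collections import Counter, defaultdict

class ResponseRanker:
    """Track which response screens users select in a given context, and
    reorder the candidate screens so the most popular come first."""
    def __init__(self):
        self.counts = defaultdict(Counter)  # context -> screen -> picks

    def record(self, context, screen):
        self.counts[context][screen] += 1

    def order(self, context, screens):
        picks = self.counts[context]
        # Most-selected first; ties keep their original order.
        return sorted(screens, key=lambda s: -picks[s])
```

The context key can carry as many descriptors as are available (e.g., age band, gender, locale, subject class, time of year).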
- More sophisticated techniques can also be applied, drawing from the rich sources of expressly- and inferentially-linked data sources now available. These include not only the web and personal profile information, but all manner of other digital data we touch and in which we leave traces, e.g., cell phone billing statements, credit card statements, shopping data from Amazon and EBay, Google search history, browsing history, cached web pages, cookies, email archives, phone message archives from Google Voice, travel reservations on Expedia and Orbitz, music collections on iTunes, cable television subscriptions, Netflix movie choices, GPS tracking information, social network data and activities, activities and postings on photo sites such as Flickr and Picasa and video sites such as YouTube, the times of day memorialized in these records, etc. (our "digital life log"). Moreover, this information is potentially available not just for the user, but also for the user's friends/family, for others having demographic similarities with the user, and ultimately everyone else (with appropriate anonymization and/or privacy safeguards).
- the network of interrelationships between these data sources is smaller than the network of web links analyzed by Google, but is perhaps richer in the diversity and types of links. From it can be mined a wealth of inferences and insights, which can help inform what a particular user is likely to want done with a particular snapped image.
- Semantic Map compiled by Cognition Technologies, Inc., a database that can be used to analyze words in context, in order to discern their meaning.
- This functionality can be used, e.g., to resolve homonym ambiguity in analysis of image metadata (e.g., does "bow" refer to a part of a ship, a ribbon adornment, a performer's thank-you, or a complement to an arrow? Proximity to terms such as "Carnival cruise," "satin," "Carnegie Hall" or "hunting" can provide the likely answer).
- Patent 5,794,050 (FRCD Corp.) details underlying technologies.
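As a toy sketch of proximity-based sense resolution (the sense lexicon below is a hypothetical stand-in for a resource like the Semantic Map; a real system would use a far richer knowledge base), each candidate sense carries cue words, and the sense whose cues best overlap the surrounding metadata wins:

```python
SENSES = {  # hypothetical sense lexicon for "bow": sense -> cue words
    "ship":    {"cruise", "deck", "stern", "hull"},
    "ribbon":  {"satin", "gift", "dress"},
    "gesture": {"carnegie", "stage", "applause"},
    "archery": {"arrow", "hunting", "quiver"},
}

def disambiguate(nearby_terms, senses=SENSES):
    """Pick the sense of an ambiguous tag (e.g. 'bow') whose cue words
    best overlap the other metadata terms found nearby."""
    nearby = {t.lower() for t in nearby_terms}
    best = max(senses, key=lambda s: len(senses[s] & nearby))
    return best if senses[best] & nearby else None
```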
- NLP techniques can also be used to augment image metadata with other relevant descriptors - which can be used as additional metadata in the embodiments detailed herein.
- a close-up image tagged with the descriptor "hibiscus stamens” can - through NLP techniques - be further tagged with the term "flower.”
- Flickr has 460 images tagged with "hibiscus” and “stamen,” but omitting "flower.”
- Patent 7,383,169 (Microsoft) details how dictionaries and other large works of language can be processed by NLP techniques to compile lexical knowledge bases that serve as daunting sources of such information.
- NLP techniques can reach nuanced understandings about our historical interests and actions - information that can be used to model (predict) our present interests and forthcoming actions. This understanding can be used to dynamically decide what information should be presented, or what action should be undertaken, responsive to a particular user capturing a particular image (or to other stimulus). Truly intuitive computing will then have arrived.
- other processing activities will be started in parallel with those detailed. For example, if initial processing of the first set of reference images suggests that the snapped image is place- centric, the system can request likely-useful information from other resources before processing of the user image is finished. To illustrate, the system may immediately request a street map of the nearby area, together with a satellite view, a street view, a mass transit map, etc. Likewise, a page of information about nearby restaurants can be compiled, together with another page detailing nearby movies and show- times, and a further page with a local weather forecast. These can all be sent to the user's phone and cached for later display (e.g., by scrolling a thumb wheel on the side of the phone).
- the reference sets of imagery may nonetheless be compiled based on location.
- Location information for the input image can be inferred from various indirect techniques.
- a wireless service provider through which a cell phone image is relayed may identify the particular cell tower from which the tourist's transmission was received. (If the transmission originated through another wireless link, such as WiFi, its location may also be known.)
- the tourist may have used his credit card an hour earlier at a Manhattan hotel, allowing the system (with appropriate privacy safeguards) to infer that the picture was taken somewhere near Manhattan.
- features depicted in an image are so iconic that a quick search for similar images in Flickr can locate the user (e.g., as being at the Eiffel Tower, or at the Statue of Liberty).
- GeoPlanet was cited as one source of geographic information. However, a number of other geoinformation databases can alternatively be used. GeoNames-dot-org is one. (It will be recognized that the "-dot-" convention, and omission of the usual http preamble, is used to prevent the reproduction of this text by the Patent Office from being indicated as a live hyperlink).
- GeoNames' free data is available as a web service.
- Google's GeoSearch API which allows retrieval of and interaction with data from Google Earth and Google Maps.
- archives of aerial imagery are growing exponentially. Part of such imagery is from a straight-down perspective, but off-axis the imagery increasingly becomes oblique. From two or more different oblique views of a location, a 3D model can be created. As the resolution of such imagery increases, sufficiently rich sets of data are available that - for some locations - a view of a scene as if taken from ground level may be synthesized. Such views can be matched with street level photos, and metadata from one can augment metadata for the other.
- Fig. 47 the embodiment particularly described above made use of various resources, including Flickr, a database of person names, a word frequency database, etc. These are just a few of the many different information sources that might be employed in such arrangements.
- Other social networking sites, shopping sites (e.g., Amazon, EBay), weather and traffic sites, online thesauruses, caches of recently- visited web pages, browsing history, cookie collections, Google, other digital repositories (as detailed herein), etc. can all provide a wealth of additional information that can be applied to the intended tasks. Some of this data reveals information about the user's interests, habits and preferences - data that can be used to better infer the contents of the snapped picture, and to better tailor the intuited response(s).
- Although Fig. 47 shows a few lines interconnecting the different items, these are illustrative only.
- One additional action is to refine the just-detailed process by receiving user-related input, e.g., after the processing of the first set of Flickr images.
- the system identified "Rockefeller Center,” “Prometheus,” and “Skating rink” as relevant metadata to the user-snapped image.
- the system may query the user as to which of these terms is most relevant (or least relevant) to his/her particular interest.
- the further processing (e.g., further search, etc.) can be focused accordingly.
- the user may touch a region to indicate an object of particular relevance within the image frame.
- Image analysis and subsequent acts can then focus on the identified object.
- Some of the database searches can be iterative/recursive. For example, results from one database search can be combined with the original search inputs and used as inputs for a further search.
- the first set of images in the arrangement detailed above may be selectively chosen to more likely be similar to the subject image.
- Flickr can be searched for images taken at about the same time of day. Lighting conditions will be roughly similar, e.g., so that matching a night scene to a daylight scene is avoided, and shadow/shading conditions might be similar.
- Flickr can be searched for images taken in the same season/month. Issues such as seasonal disappearance of the ice skating rink at Rockefeller Center, and snow on a winter landscape, can thus be mitigated.
- the camera/phone is equipped with a magnetometer, inertial sensor, or other technology permitting its bearing (and/or azimuth/elevation) to be determined, then Flickr can be searched for shots with this degree of similarity too.
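The reference-set filters described above (time of day, season, compass bearing) might be sketched as follows; the field names and tolerances are illustrative assumptions, not drawn from the specification:

```python
def filter_candidates(candidates, capture, hour_tol=2, bearing_tol=30):
    """Keep reference images whose capture conditions resemble the user's
    shot: similar time of day, same month, and similar compass bearing."""
    kept = []
    for img in candidates:
        if abs(img["hour"] - capture["hour"]) > hour_tol:
            continue  # avoid matching a night scene to a daylight scene
        if img["month"] != capture["month"]:
            continue  # seasonal issues (snow, disappearing skating rink)
        if capture.get("bearing") is not None and img.get("bearing") is not None:
            diff = abs(img["bearing"] - capture["bearing"]) % 360
            if min(diff, 360 - diff) > bearing_tol:
                continue  # shot from a very different direction
        kept.append(img)
    return kept
```

Images lacking bearing data are not excluded on that ground, since many reference photos will not carry magnetometer metadata.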
- the sets of reference images collected from Flickr desirably comprise images from many different sources (photographers) - so they don't tend towards use of the same metadata descriptors.
- Images collected from Flickr may be screened for adequate metadata. For example, images with no metadata (except, perhaps, an arbitrary image number) may be removed from the reference set(s). Likewise, images with less than 2 (or 20) metadata terms, or without a narrative description, may be disregarded.
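A simple screening predicate for this (field names are hypothetical) might discard purely numeric tags and then apply the minimum-term and narrative-description rules:

```python
def has_adequate_metadata(image, min_terms=2, require_description=False):
    """Screen out reference images whose metadata is too thin to be useful
    (e.g., nothing beyond an arbitrary image number)."""
    tags = [t for t in image.get("tags", []) if not t.isdigit()]
    if len(tags) < min_terms:
        return False
    if require_description and not image.get("description"):
        return False
    return True
```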
- Flickr is often mentioned in this specification, but other collections of content can of course be used. Images in Flickr commonly have specified license rights for each image. These include “all rights reserved,” as well as a variety of Creative Commons licenses, through which the public can make use of the imagery on different terms. Systems detailed herein can limit their searches through Flickr for imagery meeting specified license criteria (e.g., disregard images marked "all rights reserved").
- Other image collections are in some respects preferable.
- the database at images.google-dot-com seems better at ranking images based on metadata-relevance than Flickr.
- Flickr and Google maintain image archives that are publicly accessible.
- Many other image archives are private.
- Embodiments of the present technology can find application with both - including some hybrid contexts in which both public and proprietary image collections are used (e.g., Flickr is used to find an image based on a user image, and the Flickr image is submitted to a private database to find a match and determine a corresponding response for the user).
- services such as Flickr for providing data (e.g., images and metadata)
- other sources can of course be used.
- the index may include metadata and metrics for images, together with pointers to the nodes at which the images themselves are stored.
- the peers may include cameras, PDAs, and other portable devices, from which image information may be available nearly instantly after it has been captured.
- imagery e.g., similar geolocation; similar image metrics; similar metadata, etc.
- these data are generally reciprocal, so if the system discovers - during processing of Image A, that its color histogram is similar to that of Image B, then this information can be stored for later use. If a later process involves Image B, the earlier-stored information can be consulted to discover that Image A has a similar histogram - without analyzing Image B.
- Such relationships are akin to virtual links between the images.
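Because similarity is reciprocal, such "virtual links" can be keyed on an unordered pair of image identifiers, so a relationship discovered while processing Image A is immediately available when Image B is later processed. A minimal sketch (class and method names are hypothetical):

```python
class SimilarityCache:
    """Store similarity relationships discovered during processing; one
    computation serves both images in later lookups."""
    def __init__(self):
        self._links = {}

    def store(self, img_a, img_b, metric, score):
        # frozenset makes the pair order-independent (reciprocal).
        self._links[(frozenset((img_a, img_b)), metric)] = score

    def lookup(self, img_a, img_b, metric):
        return self._links.get((frozenset((img_a, img_b)), metric))
```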
- Images can be assigned Digital Object Identifiers (DOI) for this purpose.
- the International DOI Foundation has implemented the CNRI Handle System so that such resources can be resolved to their current location through the web site at doi-dot-org.
- Another alternative is for the images to be assigned and digitally watermarked with identifiers tracked by Digimarc For Images service.
- image identifiers may be used instead of imagery, per se (e.g., as a data proxy).
- the cell phone/camera may provide location data in one or more other reference systems, such as Yahoo's GeoPlanet ID - the Where on Earth ID (WOEID).
- Location metadata can be used for identifying other resources in addition to similarly-located imagery.
- Web pages for example, can have geographical associations (e.g., a blog may concern the author's neighborhood; a restaurant's web page is associated with a particular physical address).
- the web service GeoURL-dot-org is a location-to-URL reverse directory that can be used to identify web sites associated with particular geographies.
- GeoURL supports a variety of location tags, including their own ICBM meta tags, as well as Geo Tags.
- Other systems that support geotagging include RDF, Geo microformat, and the GPSLongitude/GPSLatitude tags commonly used in XMP- and EXIF-camera metainformation.
- Flickr uses a syntax established by Geobloggers, e.g.
- the metadata may also be examined for dominant language, and if not English (or other particular language of the implementation), the metadata and the associated image may be removed from consideration.
- Fig. 48A shows an arrangement in which - if an image is identified as person-centric - it is next referred to two other engines. One identifies the person as family, friend or stranger. The other identifies the person as child or adult. The latter two engines work in parallel, after the first has completed its work.
- Fig. 48B shows engines performing family/friend/stranger and child/adult analyses - at the same time the person/place/thing engine is undertaking its analysis. If the latter engine determines the image is likely a place or thing, the results of the first two engines will likely not be used.
- Specialized online services can be used for certain types of image discrimination/identification. For example, one web site may provide an airplane recognition service: when an image of an aircraft is uploaded to the site, it returns an identification of the plane by make and model.
- Fig. 49 shows that different analysis engines may provide their outputs to different response engines. Often the different analysis engines and response engines may be operated by different service providers. The outputs from these response engines can then be consolidated/coordinated for presentation to the consumer.
- This consolidation may be performed by the user's cell phone - assembling inputs from different data sources; or such task can be performed by a processor elsewhere.
- One example of the technology detailed herein is a homebuilder who takes a cell phone image of a drill that needs a spare part.
- the image is analyzed, the drill is identified by the system as a Black and Decker DR250B, and the user is provided various info/action options.
- These include reviewing photos of drills with similar appearance, reviewing photos of drills with similar descriptors/features, reviewing the user's manual for the drill, seeing a parts list for the drill, buying the drill new from Amazon or used from EBay, listing the builder's drill on EBay, buying parts for the drill, etc.
- the builder chooses the "buying parts" option and proceeds to order the necessary part.
- Fig. 41. Another example is a person shopping for a home.
- the system refers the image both to a private database of MLS information, and a public database such as Google.
- the system responds with a variety of options, including reviewing photos of the nearest houses offered for sale; reviewing photos of houses listed for sale that are closest in value to the pictured home, and within the same zip-code; reviewing photos of houses listed for sale that are most similar in features to the pictured home, and within the same zip-code; neighborhood and school information, etc. (Fig. 43.)
- a first user snaps an image of Paul Simon at a concert.
- the system automatically posts the image to the user's Flickr account - together with metadata inferred by the procedures detailed above. (The name of the artist may have been found in a search of Google for the user's geolocation; e.g., a Ticketmaster web page revealed that Paul Simon was playing that venue that night.)
- the first user's picture, a moment later, is encountered by a system processing a second concert-goer's photo of the same event, from a different vantage.
- the second user is shown the first user's photo as one of the system's responses to the second photo.
- the system may also alert the first user that another picture of the same event - from a different viewpoint - is available for review on his cell phone, if he'll press a certain button twice.
- the content is the network.
- Google is limited to analysis and exploitation of links between digital content
- the technology detailed herein allows the analysis and exploitation of links between physical content as well (and between physical and electronic content).
- the device may be provided with different actuator buttons - each invoking a different operation with the captured image information.
- the user can indicate - at the outset - the type of action intended (e.g., identify faces in image per Picasa or VideoSurf information, and post to my FaceBook page; or try and identify the depicted person, and send a "friend request" to that person's MySpace account).
- the function of a sole actuator button can be controlled in accordance with other UI controls on the device. For example, repeated pressing of a Function Select button can cause different intended operations to be displayed on the screen of the UI (just as familiar consumer cameras have different photo modes, such as Close-up, Beach, Nighttime, Portrait, etc.). When the user then presses the shutter button, the selected operation is invoked.
- Metadata inferred by the processes detailed herein can be saved in conjunction with the imagery (qualified, perhaps, as to its confidence).
- Business rules can dictate a response appropriate to a given situation. These rules and responses may be determined by reference to data collected by web indexers, such as Google, etc., using intelligent routing. Crowdsourcing is not generally suitable for real-time implementations. However, inputs that stymie the system and fail to yield a corresponding action (or yield actions from which user selects none) can be referred offline for crowdsource analysis - so that next time it's presented, it can be handled better.
- Fig. 57A shows that web pages on the internet relate in a point-to-point fashion. For example, web page 1 may link to web pages 2 and 3. Web page 3 may link to page 2. Web page 2 may link to page 4. Etc.
- Fig. 57B shows the contrasting network associated with image-based navigation. The individual images are linked to a central node (e.g., a router), which then links to further nodes (e.g., response engines) in accordance with the image information.
- the "router" here does not simply route an input packet to a destination determined by address information conveyed with the packet - as in the familiar case with internet traffic routers. Rather, the router takes image information and decides what to do with it, e.g., to which responsive system the image information should be referred.
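A minimal sketch of such content-driven routing (the registry shape and handler names are illustrative assumptions): the router checks the image information against registered predicates and dispatches to the first matching response engine, falling back to offline analysis when nothing matches:

```python
def route(image_info, registry):
    """Unlike a packet router, inspect the image information itself and
    decide which responsive system should receive it.

    registry: list of (predicate, handler) pairs, checked in order.
    """
    for matches, handler in registry:
        if matches(image_info):
            return handler(image_info)
    # Inputs that stymie the system can be referred for offline analysis.
    return "refer for offline/crowdsource analysis"
```

For the wearable-computer example, a predicate recognizing a business card would dispatch to a handler that OCRs the name and phone number into a contacts database.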
- Routers can be stand-alone nodes on a network, or they can be integrated with other devices. (Or their functionality can be distributed between such locations.)
- a wearable computer may have a router portion (e.g., a set of software instructions) - which takes image information from the computer, and decides how it should be handled. (For example, if it recognizes the image information as being an image of a business card, it may OCR name, phone number, and other data, and enter it into a contacts database.)
- the particular response for different types of input image information can be determined by a registry database, e.g., of the sort maintained by a computer's operating system, or otherwise.
- response engines can be stand-alone nodes on a network, they can also be integrated with other devices (or their functions distributed).
- a wearable computer may have one or several different response engines that take action on information provided by the router portion.
- Fig. 52 shows an arrangement employing several computers (A-E), some of which may be wearable computers (e.g., cell phones).
- the computers include the usual complement of processor, memory, storage, input/output, etc.
- the storage or memory can contain content, such as images, audio and video.
- the computers can also include one or more routers and/or response engines. Standalone routers and response engines may also be coupled to the network.
- the computers are networked, shown schematically by link 150.
- This connection can be by any known networking arrangement, including the internet and/or wireless links (WiFi, WiMax, Bluetooth, etc.).
- Software in at least certain of the computers includes a peer-to-peer (P2P) client, which makes at least some of that computer's resources available to other computers on the network, and reciprocally enables that computer to employ certain resources of the other computers.
- computer A may obtain image, video and audio content from computer B.
- Sharing parameters on computer B can be set to determine which content is shared, and with whom.
- Data on computer B may specify, for example, that some content is to be kept private; some may be shared with known parties (e.g., a tier of social network "Friends"); and other content may be freely shared. (Other information, such as geographic position information, may also be shared - subject to such parameters.)
- the sharing parameters may also specify sharing based on the content age. For example, content/information older than a year might be shared freely, and content older than a month might be shared with a tier of friends (or in accordance with other rule-based restrictions). In other arrangements, fresher content might be the type most liberally shared. E.g., content captured or stored within the past hour, day or week might be shared freely, and content from within the past month or year might be shared with friends.
- An exception list can identify content - or one or more classes of content - that is treated differently than the above-detailed rules (e.g., never shared or always shared).
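The age-based sharing rules and exception list might be combined as in the following sketch (the tier names, thresholds, and item shape are hypothetical; the example uses the "older content is shared more liberally" variant):

```python
import time

def sharing_tier(item, now=None, never_share=(), always_share=()):
    """Decide how freely an item is shared, based on its age and an
    exception list. Exceptions override the age-based rules."""
    if item["id"] in never_share:
        return "private"
    if item["id"] in always_share:
        return "public"
    now = now if now is not None else time.time()
    age_days = (now - item["captured"]) / 86400
    if age_days > 365:
        return "public"   # older than a year: shared freely
    if age_days > 30:
        return "friends"  # older than a month: shared with a friend tier
    return "private"
```

The opposite policy (fresher content shared most liberally) is just a reversal of the threshold comparisons.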
- the computers can also share their respective router and response engine resources across the network.
- computer A does not have a response engine suitable for a certain type of image information, it can pass the information to computer B for handling by its response engine.
- the "peer" groupings can be defined geographically, e.g., computers that find themselves within a particular spatial environment (e.g., an area served by a particular WiFi system). The peers can thus establish dynamic, ad hoc subscriptions to content and services from nearby computers. When the computer leaves that environment, the session ends.
- Some researchers foresee the day when all of our experiences are captured in digital form.
- Certain embodiments incorporating aspects of the present technology are well suited for use with such experiential digital content - either as input to a system (i.e., the system responds to the user's present experience), or as a resource from which metadata, habits, and other attributes can be mined (including service in the role of the Flickr archive in the embodiments earlier detailed).
- the user's desire can be expressed by a deliberate action by the user, e.g., pushing a button, or making a gesture with head or hand.
- the system takes data from the current experiential environment, and provides candidate responses.
- Electroencephalography can be used to generate a signal that triggers the system's response (or triggers one of several different responses, e.g., responsive to different stimuli in the current environment).
- Skin conductivity, pupil dilation, and other autonomous physiological responses can also be optically or electrically sensed, and provide a triggering signal to the system.
- Eye tracking technology can be employed to identify which object in a field of view captured by an experiential-video sensor is of interest to the user. If Tony is sitting in a bar, and his eye falls on a bottle of unusual beer in front of a nearby woman, the system can identify his point of focal attention, and focus its own processing efforts on pixels corresponding to that bottle. With a signal from Tony, such as two quick eye-blinks, the system can launch an effort to provide candidate responses based on that beer bottle - perhaps also informed by other information gleaned from the environment (time of day, date, ambient audio, etc.) as well as Tony's own personal profile data.
- the system may quickly identify the beer as Doppelbock, e.g., by pattern matching from the image (and/or OCR). With that identifier it finds other resources indicating the beer originates from Bavaria, where it is brewed by monks of St. Francis of Paula. Its 9% alcohol content also is distinctive.
- the system learns that his buddy Geoff is fond of Doppelbock, and most recently drank a bottle in a pub in Dublin. Tony's glancing encounter with the bottle is logged in his own experiential archive, where Geoff may later encounter same. The fact of the encounter may also be real-time-relayed to Geoff in Prague, helping populate an on-going data feed about his friends' activities.
- the bar may also provide an experiential data server, to which Tony is wirelessly granted access.
- the server maintains an archive of digital data captured in the bar, and contributed by patrons.
- the server may also be primed with related metadata & information the management might consider of interest to its patrons, such as the Wikipedia page on the brewing methods of the monks of St Paul, what bands might be playing in weeks to come, or what the night's specials are. (Per user preference, some users require that their data be cleared when they leave the bar; others permit the data to be retained.)
- Tony's system may routinely check the local environment's experiential data server to see what odd bits of information might be found.
- P2P networks such as BitTorrent have permitted sharing of audio, image and video content
- arrangements like that shown in Fig. 52 allow networks to share a contextually-richer set of experiential content.
- a basic tenet of P2P networks is that, even in the face of technologies that mine the long tail of content, the vast majority of users are interested in similar content (the score of tonight's NBA game, the current episode of Lost, etc.), and that, given sufficient bandwidth and protocols, the most efficient mechanism to deliver similar content to users is not by sending individual streams, but by piecing the content together based on what your "neighbors" have on the network.
- This same mechanism can be used to provide metadata related to enhancing an experience, such as being at the bar drinking a Doppelbock, or watching a highlight of tonight's NBA game on a phone while at the bar.
- the protocol used in the ad-hoc network described above might leverage P2P protocols with the experience server providing a peer registration service (similar to early P2P networks) or in a true P2P modality, with all devices in the ad-hoc network advertising what experiences (metadata, content, social connections, etc.) they have available, either for free, for payment, or for barter of information in-kind, etc. Apple's Bonjour software is well suited for this sort of application.
- Tony's cell phone may simply retrieve the information on Doppelbock by posting the question to the peer network and receive a wealth of information from a variety of devices within the bar or the experience server, without ever knowing the source.
- the experience server may also act as a data-recorder, recording the experiences of those within the ad-hoc network, providing a persistence to experience in time and place. Geoff may visit the same bar at some point in the future and see what threads of communication or connections his friend Tony made two weeks earlier, or possibly even leave a note for Tony to retrieve the next time he is at the bar.
- the ability to mine the social threads represented by the traffic on the network can also enable the proprietors of the bar to augment the experiences of the patrons by orchestrating interaction or introductions. This may include people with shared interests, singles, etc., or in the form of gaming by allowing people to opt-in to theme-based games, where patrons piece together clues to find the true identity of someone in the bar or unravel a mystery (similar to the board game Clue).
- the demographic information as it relates to audience measurement is of material value to proprietors as they consider which beers to stock next, where to advertise, etc.
- Certain portable devices such as the Apple iPhone, offer single-button access to pre-defined functions. Among these are viewing prices of favorite stocks, viewing a weather forecast, and viewing a general map of the user's location. Additional functions are available, but the user must undertake a series of additional manipulations, e.g., to reach a favorite web site, etc.
- An embodiment of certain aspects of the present technology allows these further manipulations to be shortcut by capturing distinctive imagery. Capturing an image of the user's hand may link the user to a babycam back home - delivering real time video of a newborn in a crib. Capturing an image of a wristwatch may load a map showing traffic conditions along some part of a route on the user's drive home, etc. Such functionality is shown in Figs. 53-55.
- a user interface for the portable device includes a set-up/training phase that allows the user to associate different functions with different visual signs.
- the user is prompted to capture a picture, and enter the URL and name of an action that is to be associated with the depicted object.
- the URL is one type of response; others can also be used - such as launching a JAVA application, etc.
- the system then characterizes the snapped image by deriving a set of feature vectors by which similar images can be recognized (e.g., through pattern/template matching).
- the feature vectors are stored in a data structure (Fig. 55), in association with the function name and associated URL.
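A minimal sketch of the Fig. 55 data structure and its closest-match lookup follows. The feature vectors, names, URLs, and the distance threshold are all made up for illustration:

```python
import math

# Each stored visual sign pairs a feature vector with a function name and URL.
SIGNS = []

def register_sign(features, name, url):
    SIGNS.append({"features": features, "name": name, "url": url})

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_sign(features, threshold=0.5):
    """Return the closest stored sign, or None if nothing is near enough."""
    best = min(SIGNS, key=lambda s: euclidean(features, s["features"]), default=None)
    if best and euclidean(features, best["features"]) <= threshold:
        return best
    return None

# Hypothetical entries, per the examples in the text.
register_sign([0.9, 0.1, 0.3], "Babycam", "http://example.com/babycam")
register_sign([0.2, 0.8, 0.5], "Traffic", "http://example.com/traffic")
```

The threshold-gated nearest-neighbor lookup also gives the behavior described later: a captured image only needs to be classified among a small universe of stored signs, so even a low-quality match suffices.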
- the user may capture several images of the same visual sign - perhaps from different distances and perspectives, and with different lighting and backgrounds.
- the feature extraction algorithm processes the collection to extract a feature set that captures shared similarities of all of the training images.
- the extraction of image features, and storage of the data structure can be performed at the portable device, or at a remote device (or in distributed fashion).
- the device can check each image captured by the device for correspondence with one of the stored visual signs. If any is recognized, the corresponding action can be launched. Else, the device responds with the other functions available to the user upon capturing a new image.
- the portable device is equipped with two or more shutter buttons.
- Manipulation of one button captures an image and executes an action - based on a closest match between the captured image and a stored visual sign.
- Manipulation of another button captures an image without undertaking such an action.
- the device UI can include a control that presents a visual glossary of signs to the user, as shown in Fig. 54.
- thumbnails of different visual signs are presented on the device display, in association with names of the functions earlier stored - reminding the user of the defined vocabulary of signs.
- the control that launches this glossary of signs can - itself - be an image.
- One image suitable for this function is a generally featureless frame.
- An all-dark frame can be achieved by operating the shutter with the lens covered.
- An all-light frame can be achieved by operating the shutter with the lens pointing at a light source.
- Another substantially featureless frame (of intermediate density) may be achieved by imaging a patch of skin, or wall, or sky. (To be substantially featureless, the frame should be closer to featureless than matching one of the other stored visual signs. In other embodiments, "featureless" can be concluded if the image has a texture metric below a threshold value.)
- a threshold can be set - by the user with a UI control, or by the manufacturer - to establish how "light" or "dark" such a frame must be in order to be interpreted as a command. For example, 8-bit (0-255) pixel values from a million pixel sensor can be summed. If the sum is less than 900,000, the frame may be regarded as all-dark. If greater than 254 million, the frame may be regarded as all-light. Etc.
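The summation test just described can be sketched as follows, using the thresholds from the example above (a one-megapixel, 8-bit sensor is assumed):

```python
def classify_frame(pixels, dark_sum=900_000, light_sum=254_000_000):
    """Classify a frame as 'dark', 'light', or 'other' by summing its
    8-bit pixel values. The default thresholds follow the illustrative
    numbers in the text for a million-pixel sensor."""
    total = sum(pixels)
    if total < dark_sum:
        return "dark"
    if total > light_sum:
        return "light"
    return "other"
```

An all-dark result might then trigger the glossary display, and an all-light result another of the special responses described below.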
- One of the other featureless frames can trigger another special response. It can cause the portable device to launch all of the stored functions/URLs (or, e.g., a certain five or ten) in the glossary.
- the device can cache the resulting frames of information, and present them successively when the user operates one of the phone controls, such as button 116b or scroll wheel 124 in Fig. 44, or makes a certain gesture on a touch screen. (This function can be invoked by other controls as well.)
- the third of the featureless frames can send the device's location to a map server, which can then transmit back multiple map views of the user's location. These views may include aerial views and street map views at different zoom levels, together with nearby street-level imagery. Each of these frames can be cached at the device, and quickly reviewed by turning a scroll wheel or other UI control.
- the user interface desirably includes controls for deleting visual signs, and editing the name/functionality assigned to each.
- the URLs can be defined by typing on a keypad, or by navigating otherwise to a desired destination and then saving that destination as the response corresponding to a particular image.
- Training of the pattern recognition engine can continue through use, with successive images of the different visual signs each serving to refine the template model by which that visual sign is defined.
- a hand can define many different signs, with fingers arranged in different positions (fist, one through five fingers, thumb-forefinger OK sign, open palm, thumbs-up, American sign language signs, etc).
- Apparel and its components (e.g., shoes, buttons), and features from common surroundings (e.g., a telephone), may also be used.
- a software program or web service may present a list of options to the user. Rather than manipulating a keyboard to enter, e.g., choice #3, the user may capture an image of three fingers - visually symbolizing the selection. Software recognizes the three finger symbol as meaning the digit 3, and inputs that value to the process.
- visual signs can form part of authentication procedures, e.g., to access a bank or social- networking web site.
- the user may be shown a stored image (to confirm that the site is authentic) and then be prompted to submit an image of a particular visual type (earlier defined by the user, but not now specifically prompted by the site).
- the web site checks features extracted from the just-captured image for correspondence with an expected response, before permitting the user to access the web site.
- Other embodiments can respond to a sequence of snapshots within a certain period (e.g., 10 seconds) - a grammar of imagery.
- An image sequence of "wristwatch," "four fingers," "three fingers" can set an alarm clock function on the portable device to chime at 7 am.
- the visual signs may be gestures that include motion - captured as a sequence of frames (e.g., video) by the portable device.
- Context data (e.g., indicating the user's geographic location, time of day, month, etc.) can also be used to tailor the response. For example, when a user is at work, the response to a certain visual sign may be to fetch an image from a security camera from the user's home. At home, the response to the same sign may be to fetch an image from a security camera at work.
- the response needn't be visual. Audio or other output (e.g., tactile, smell, etc.) can of course be employed.
- the just-described technology allows a user to define a glossary of visual signs and corresponding customized responses.
- An intended response can be quickly invoked by imaging a readily-available subject.
- the captured image can be of low quality (e.g., overexposed, blurry), since it only needs to be classified among, and distinguished from, a relatively small universe of alternatives.
- Another aspect of the present technology is to perform one or more visual intelligence pre-processing operations on image information captured by a camera sensor. These operations may be performed without user request, and before other image processing operations that the camera customarily performs.
- Fig. 56 is a simplified diagram showing certain of the processing performed in an exemplary camera, such as a cell phone camera.
- an image sensor comprising an array of photodiodes. (CCD or CMOS sensor technologies are commonly used.)
- the resulting analog electrical signals are amplified, and converted to digital form by A/D converters.
- the outputs of these A/D converters provide image data in its most raw, or "native," form.
- Bayer interpolation (de-mosaicing)
- the photodiodes of the sensor array each typically capture only a single color of light: red, green or blue (R/G/B), due to a color filter array.
- This array is comprised of a tiled 2x2 pattern of filter elements: one red, a diagonally-opposite one blue, and the other two green.
- Bayer interpolation effectively "fills in the blanks" of the sensor's resulting R/G/B mosaic pattern, e.g., providing a red signal where there is a blue filter, etc.
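The "fill in the blanks" step can be illustrated with a toy bilinear demosaic for an RGGB mosaic. This is a sketch only; real camera pipelines use more sophisticated, edge-aware interpolation:

```python
import numpy as np

def demosaic_bilinear(raw):
    """Toy bilinear demosaic for an RGGB Bayer mosaic. `raw` is an HxW
    array of sensor values; returns an HxWx3 RGB array where each pixel's
    missing colors are the average of the nearest same-color sensor sites."""
    h, w = raw.shape
    y, x = np.mgrid[0:h, 0:w]
    masks = [
        (y % 2 == 0) & (x % 2 == 0),   # red sites
        (y % 2) != (x % 2),            # green sites (two per 2x2 tile)
        (y % 2 == 1) & (x % 2 == 1),   # blue sites
    ]

    def box3(a):
        # Sum of each pixel's 3x3 neighborhood (zero-padded at the edges).
        p = np.pad(a, 1)
        return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3))

    rgb = np.zeros((h, w, 3))
    for c, mask in enumerate(masks):
        known = np.where(mask, raw, 0.0)
        counts = box3(mask.astype(float))
        rgb[..., c] = box3(known) / np.maximum(counts, 1.0)
    return rgb
```

Because same-color sites are at most one pixel away in the 2x2 tiling, averaging each pixel's 3x3 neighborhood of known samples supplies, e.g., a red estimate where there is a blue filter.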
- Another common operation is white balance correction. This process adjusts the intensities of the component R/G/B colors in order to render certain colors (especially neutral colors) correctly.
- JPEG compression is most commonly used.
- the processed, compressed image data is then stored in a buffer memory. Only at this point is the image information commonly available to other processes and services of the cell phone (e.g., by calling a system API).
- One such process that is commonly invoked with this processed image data is to present the image to the user on the screen of the camera. The user can then assess the image and decide, e.g., whether (1) to save it to the camera's memory card, (2) to transmit it in a picture message, (3) to delete it, etc.
- the image stays in the buffer memory.
- the only use made of the processed image data is to display same on the screen of the cell phone.
- Fig. 57 shows an exemplary embodiment of the presently-discussed aspect of the technology. After converting the analog signals into digital native form, one or more other processes are performed.
- One such process is to perform a Fourier transformation (e.g., an FFT) on the native image data. This converts the spatial-domain representation of the image into a frequency-domain representation.
- a Fourier-domain representation of the native image data can be useful in various ways. One is to screen the image for likely barcode data.
- One familiar 2D barcode is a checkerboard-like array of light- and dark-squares.
- the size of the component squares, and thus their repetition spacing, gives a pair of notable peaks in the Fourier-domain representation of the image at a corresponding frequency.
- the peaks may be phase-spaced ninety degrees in the UV plane, if the pattern recurs in equal frequency in both the vertical and horizontal directions.
- These peaks extend significantly above other image components at nearby image frequencies - with the peaks often having a magnitude twice- to five- or ten- times (or more) that of nearby image frequencies.
- Fourier transform information can be analyzed for telltale signs associated with an image of a barcode.
- a template-like approach can be used.
- the template can comprise a set of parameters against which the Fourier transform information is tested - to see if the data has indicia associated with a barcode-like pattern.
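A minimal sketch of such a screen, looking for a spectral peak that stands well above nearby image frequencies, might be as follows. The 2x ratio and 7x7 neighborhood are illustrative parameters of the kind such a template could hold:

```python
import numpy as np

def has_barcode_peaks(img, ratio=2.0):
    """Screen an image's Fourier magnitude for a barcode-like peak: a
    checkerboard-like 2D barcode produces spectral peaks well above
    nearby image frequencies (2x to 10x, per the text)."""
    spec = np.abs(np.fft.fft2(img))
    spec[0, 0] = 0.0                                  # ignore the DC term
    i, j = np.unravel_index(np.argmax(spec), spec.shape)
    peak = spec[i, j]
    # Magnitudes at nearby frequencies: a 7x7 window around the peak.
    lo_i, hi_i = max(i - 3, 0), min(i + 4, spec.shape[0])
    lo_j, hi_j = max(j - 3, 0), min(j + 4, spec.shape[1])
    nearby = spec[lo_i:hi_i, lo_j:hi_j].copy()
    nearby[i - lo_i, j - lo_j] = 0.0                  # exclude the peak itself
    return peak > ratio * nearby.mean()
```

A production screen would test several template parameters (peak count, symmetry, the ninety-degree phase spacing noted above) rather than a single ratio.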
- the Fourier data is consistent with an image depicting a 2D barcode
- corresponding information can be routed for further processing (e.g., sent from the cell phone to a barcode-responsive service).
- This information can comprise the native image data, and/or the Fourier transform information derived from the image data.
- the full image data needn't be sent.
- a down-sampled version of the image data (e.g., one-fourth the resolution in both the horizontal and vertical directions) can be sent.
- patches of the image data having the highest likelihood of depicting part of a barcode pattern can be sent.
- patches of the image data having the lowest likelihood of depicting a barcode can be omitted.
- the transmission can be prompted by the user.
- the camera UI may ask the user if information should be directed for barcode processing.
- the transmission is dispatched immediately upon a determination that the image frame matches the template, indicating possible barcode data. No user action is involved.
- the Fourier transform data can be tested for signs of other image subjects as well.
- A 1D barcode, for example, is characterized by a significant amplitude component at a high frequency (going "across the pickets"), and another significant amplitude spike at a low frequency (going along the pickets).
- the Fourier-Mellin (F-M) transform is also useful in characterizing various image subjects/components - including the barcodes noted above.
- the F-M transform has the advantage of being robust to scale and rotation of the image subject (scale/rotation invariance). In an exemplary embodiment, if the scale of the subject increases (as by moving the camera closer), the F-M transform pattern shifts up; if the scale decreases, the F-M pattern shifts down. Similarly, if the subject is rotated clockwise, the F-M pattern shifts right; if rotated counterclockwise, the F-M pattern shifts left.
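The scale-to-shift and rotation-to-shift behavior arises because the Fourier-Mellin transform amounts to resampling the Fourier magnitude onto a log-radius/angle grid. A crude nearest-neighbor sketch (the grid sizes are illustrative, and real implementations use proper interpolation):

```python
import numpy as np

def log_polar_magnitude(img, n_r=64, n_theta=64):
    """Resample the (shifted) Fourier magnitude of `img` onto a
    log-radius x angle grid. Scaling the image subject then becomes a
    shift along the log-r axis; rotating it, a shift along theta."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = spec.shape
    cy, cx = h / 2, w / 2
    r_max = min(cy, cx)
    rs = np.exp(np.linspace(0.0, np.log(r_max), n_r))      # log-spaced radii
    ts = np.linspace(0.0, 2 * np.pi, n_theta, endpoint=False)
    out = np.zeros((n_r, n_theta))
    for i, r in enumerate(rs):
        for j, t in enumerate(ts):
            y = int(round(cy + r * np.sin(t)))
            x = int(round(cx + r * np.cos(t)))
            if 0 <= y < h and 0 <= x < w:
                out[i, j] = spec[y, x]
    return out
```

Matching two such arrays by correlation then tolerates scale and rotation of the subject, since both appear only as translations of the pattern.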
- This robustness makes F-M data important in recognizing patterns that may be affine-transformed, such as in facial recognition, character recognition, object recognition, etc.
- the arrangement shown in Fig. 57 applies a Mellin transform to the output of the Fourier transform process, to yield F-M data.
- the F-M can then be screened for attributes associated with different image subjects.
- text is characterized by plural symbols of approximately similar size, composed of strokes in a foreground color that contrast with a larger background field.
- Vertical edges tend to dominate (albeit slightly inclined with italics), with significant energy also being found in the horizontal direction. Spacings between strokes usually fall within a fairly narrow range.
- a template can define tests by which the F-M data is screened to indicate the likely presence of text in the captured native image data. If the image is determined to include likely-text, it can be dispatched to a service that handles this type of data (e.g., an optical character recognition, or OCR, engine). Again, the image (or a variant of the image) can be sent, or the transform data can be sent, or some other data.
- a watermark orientation signal is a distinctive signal present in some watermarks that can serve as a sign that a watermark is present.
- the templates may be compiled by testing with known images (e.g., "training"). By capturing images of many different text presentations, the resulting transform data can be examined for attributes that are consistent across the sample set, or (more likely) that fall within bounded ranges. These attributes can then be used as the template by which images containing likely-text are identified. (Likewise for faces, barcodes, and other types of image subjects.)
- Fig. 57 shows that a variety of different transforms can be applied to the image data. These are generally shown as being performed in parallel, although one or more can be performed sequentially - either all operating on the same input image data, or one transform using an output of a previous transform (as is the case with the Mellin transform). Although not all shown (for clarity of illustration), outputs from each of the other transform processes can be examined for characteristics that suggest the presence of a certain image type. If found, related data is then sent to a service appropriate to that type of image information. In addition to Fourier transform and Mellin transform processes, processes such as eigenface computation can also be applied.
- the outputs from some processes may be input to other processes.
- an output from one of the boxes labeled ETC in Fig. 57 is provided as an input to the Fourier transform process.
- This ETC box can be, for example, a filtering operation.
- Sample filtering operations include median, Laplacian, Wiener, Sobel, high-pass, low-pass, bandpass, Gabor, signum, etc. (Digimarc's patents 6,442,284, 6,483,927, 6,516,079, 6,614,914, 6,631,198, 6,724,914, 6,988,202, 7,013,021 and 7,076,082 show various such filters.)
- a single service may handle different data types, or data that passes different screens.
- a facial recognition service may receive F-M transform data, or eigenface data. Or it may receive image information that has passed one of several different screens (e.g., its F-M transform passed one screen, or its eigenface representation passed a different screen).
- data can be sent to two or more different services.
- It is desirable that some or all of the processing shown in Fig. 57 be performed by circuitry integrated on the same substrate as the image sensors. (Some of the operations may be performed by programmable hardware - either on the substrate or off - responsive to software instructions.) While the foregoing operations are described as immediately following conversion of the analog sensor signals to digital form, in other embodiments such operations can be performed after other processing operations (e.g., Bayer interpolation, white balance correction, JPEG compression, etc.).
- Some of the services to which information is sent may be provided locally in the cell phone. Or they can be provided by a remote device, with which the cell phone establishes a link that is at least partly wireless. Or such processing can be distributed among various devices.
- Foveon and panchromatic image sensors can alternately be used. So can high dynamic range sensors, and sensors using Kodak's Truesense Color Filter Pattern (which adds panchromatic sensor pixels to the usual Bayer array of red/green/blue sensor pixels). Sensors with infrared output data can also advantageously be used. For example, sensors that output infrared image data (in addition to visible image data, or not) can be used to identify faces and other image subjects with temperature differentials - aiding in segmenting image subjects within the frame.
- One processing chain produces data to be rendered into perceptual form for use by human viewers.
- This chain typically includes at least one of a de-mosaic processor, a white balance module, and a JPEG image compressor, etc.
- the second processing chain produces data to be analyzed by one or more machine-implemented algorithms, and in the illustrative example includes a Fourier transform processor, an eigenface processor, etc.
- Such processing architectures are further detailed in application 61/176,739, cited earlier.
- one or more appropriate image-responsive services can begin formulating candidate responses to the visual stimuli before the user has even decided what to do with the captured image.
- Motion is most commonly associated with video, and the techniques detailed herein can be used when capturing video content.
- motion/temporal implications are also present with "still" imagery.
- some image sensors are read sequentially, top row to bottom row.
- the image subject may move within the image frame (i.e., due to camera movement or subject movement).
- An exaggerated view of this effect is shown in Fig. 60, depicting an imaged "E" captured as the sensor is moved to the left.
- the vertical stroke of the letter is further from the left edge of the image frame at the bottom than the top, due to movement of the sensor while the pixel data is being clocked-out.
- the phenomenon also arises when the camera assembles data from several frames to generate a single "still” image.
- many consumer imaging devices rapidly capture plural frames of image data, and composite different aspects of the data together (using software provided, e.g., by FotoNation, Inc., now Tessera Technologies, Inc.). For example, the device may take three exposures - one exposed to optimize appearance of faces detected in the image frame, another exposed in accordance with the background, and another exposed in accordance with the foreground. These are melded together to create a pleasing montage.
- the camera captures a burst of frames and, in each, determines whether persons are smiling or blinking. It may then select different faces from different frames to yield a final image.
- Detection of motion can be accomplished in the spatial domain (e.g., by reference to movement of feature pixels between frames), or in a transform domain.
- Fourier transform and DCT data are exemplary.
- the system may extract the transform domain signature of an image component, and track its movement across different frames - identifying its motion.
- One illustrative technique deletes, e.g., the lowest N frequency coefficients - leaving just high frequency edges, etc. (The highest M frequency coefficients may be disregarded as well.)
- a thresholding operation is performed on the magnitudes of the remaining coefficients - zeroing those below a value (such as 30% of the mean). The resulting coefficients serve as the signature for that image region.
- the transform may be based, e.g., on tiles of 8x8 pixels.
- If a pattern corresponding to this signature is found at a nearby location within another (or the same) image frame (using known similarity testing, such as correlation), movement of that image region can be identified.
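A sketch of this signature-and-search scheme follows, using an FFT on 8x8 tiles with the drop-lowest-coefficients and 30%-of-mean thresholding described above. The exhaustive search and normalized-correlation scoring are purely illustrative:

```python
import numpy as np

def tile_signature(tile, drop_low=2, keep_frac=0.3):
    """Transform-domain signature for a tile: discard the lowest-frequency
    coefficients (including DC), then zero coefficients whose magnitude
    falls below 30% of the mean of the remainder."""
    mags = np.abs(np.fft.fft2(tile))
    mags[:drop_low, :drop_low] = 0.0
    mags[mags < keep_frac * mags.mean()] = 0.0
    return mags

def find_signature(signature, frame, tile=8):
    """Find the tile position in `frame` whose signature best correlates
    (normalized) with `signature`. An exhaustive search, for clarity."""
    h, w = frame.shape
    best_score, best_pos = -1.0, None
    for y in range(h - tile + 1):
        for x in range(w - tile + 1):
            cand = tile_signature(frame[y:y + tile, x:x + tile])
            denom = np.linalg.norm(signature) * np.linalg.norm(cand) + 1e-12
            score = float((signature * cand).sum() / denom)
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos
```

Tracking the best-match position of a region's signature from frame to frame identifies that region's motion.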
- a rough analogy is user interaction with Google. Bare search terms aren't sent to a Google mainframe, as if from a dumb terminal. Instead, the user's computer formats a query as an HTTP request, including the internet protocol address of the originating computer (indicative of location), and makes available cookie information by which user language preferences, desired safe search filtering, etc., can be discerned. This structuring of relevant information serves as a precursor to Google's search process, allowing Google to perform the search process more intelligently - providing faster and better results to the user.
- Fig. 61 shows some of the metadata that may be involved in an exemplary system.
- the left-most column of information types may be computed directly from the native image data signals taken from the image sensor. (As noted, some or all of these can be computed using processing arrangements integrated with the sensor on a common substrate.) Additional information may be derived by reference to these basic data types, as shown by the second column of information types. This further information may be produced by processing in the cell phone, or external services can be employed (e.g., the OCR recognition service shown in Fig. 57 can be within the cell phone, or can be a remote server, etc.; similarly with the operations shown in Fig. 50.).
- In RGB imagery, one channel conveys red luminance, a second conveys green luminance, and a third conveys blue luminance.
- In CMYK, the channels respectively convey cyan, magenta, yellow, and black information.
- Ditto with YUV - commonly used with video: a luma, or brightness, channel (Y), and two color channels (U and V).
- LAB: also brightness, with two color channels.
- alpha is provided to convey opacity information - indicating the extent to which background subjects are visible through the imagery.
- the alpha channel is not much used (except, most notably, in computer generated imagery and radiology). Certain implementations of the present technology use the alpha channel to transmit information derived from image data.
- the different channels of image formats commonly have the same size and bit-depth.
- the red channel may convey 8-bit data (allowing values of 0-255 to be represented), for each pixel in a 640 x 480 array.
- the green and blue channels are also commonly 8 bits, and co-extensive with the image size (e.g., 8 bits x 640 x 480). Every pixel thus has a red value, a green value, a blue value, and an alpha value.
- the composite image representation is commonly known as RGBA.
- Fig. 62 shows a picture that a user may snap with a cell phone.
- a processor in the cell phone may apply an edge detection filter (e.g., a Sobel filter) to the image data, yielding an edge map.
- Each pixel of the image is either determined to be part of an edge, or not. So this edge information can be conveyed in just one bit plane of the eight bit planes available in the alpha channel.
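Packing the one-bit edge map into a single bit plane of the alpha channel might be sketched as follows (NumPy, with illustrative helper names; the remaining seven alpha bit planes stay free for other derived information):

```python
import numpy as np

def pack_edges_in_alpha(rgb, edges):
    """Place a bitonal edge map in bit plane 0 of an alpha channel,
    yielding an RGBA image per the scheme described in the text."""
    alpha = edges.astype(np.uint8) & 1   # edge map occupies one bit plane
    return np.dstack([rgb, alpha])

def unpack_edges(rgba):
    """Recover the edge map from bit plane 0 of the alpha channel."""
    return (rgba[..., 3] & 1).astype(bool)
```

A receiving service thus gets the derived edge information "for free," in aligned correspondence with the RGB pixels, inside a standard RGBA container.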
- Such an alpha channel payload is shown in Fig. 63.
- the cell phone camera may also apply known techniques to identify faces within the image frame.
- the red, green and blue image data from pixels corresponding to facial regions can be combined to yield a grey-scale representation, and this representation can be included in the alpha channel - e.g., in aligned correspondence with the identified faces in the RGB image data.
- An alpha channel conveying both edge information and greyscale faces is shown in Fig. 64. (An 8-bit greyscale is used for faces in the illustrated embodiment, although a shallower bit-depth, such as 6- or 7-bits, can be used in other arrangements - freeing other bit planes for other information.)
- the camera may also perform operations to locate the positions of the eyes and mouth in each detected face.
- Markers can be transmitted in the alpha channel - indicating the scale and positions of these detected features.
- a simple form of marker is a "smiley face" bit-mapped icon, with the eyes and mouth of the icon located at the positions of the detected eyes and mouth.
- the scale of the face can be indicated by the length of the iconic mouth, or by the size of a surrounding oval (or the space between the eye markers).
- the tilt of the face can be indicated by the angle of the mouth (or the angle of the line between the eyes, or the tilt of a surrounding oval).
- the cell phone processing yields a determination of the genders of persons depicted in the image, this too can be represented in the extra image channel. For example, an oval line circumscribing the detected face of a female may be made dashed or otherwise patterned. The eyes may be represented as cross-hairs or Xs instead of blackened circles, etc. Ages of depicted persons may also be approximated, and indicated similarly.
- the processing may also classify each person's emotional state by visual facial clues, and an indication such as surprise/happiness/sadness/anger/neutral can be represented. (See, e.g., Su, "A simple approach to facial expression recognition," Proceedings of the 2007 Int'l Conf on Computer Engineering and Applications, Queensland, Australia, 2007, pp. 456-461. See also patent publications 20080218472 (Emotiv Systems, Pty), and 20040207720 (NTT DoCoMo)).
- a confidence metric output by the analysis process can also be represented in an iconic fashion, such as by the width of the line, or the scale or selection of pattern elements.
- Fig. 65 shows different pattern elements that can be used to denote different information, including gender and confidence, in an auxiliary image plane.
- the portable device may also perform operations culminating in optical character recognition of alphanumeric symbols and strings depicted in the image data.
- the device may recognize the string "LAS VEGAS" in the picture.
- This determination can be memorialized by a PDF417 2D barcode added to the alpha channel.
- the barcode can be in the position of the OCR'd text in the image frame, or elsewhere.
- PDF417 is exemplary only.
- Other barcodes - such as 1D, Aztec, Datamatrix, High Capacity Color Barcode, Maxicode, QR Code, Semacode, and ShotCode - or other machine-readable data symbologies - such as OCR fonts and data glyphs - can naturally be used. Glyphs can be used both to convey arbitrary data, and also to form halftone image depictions. (See in this regard Xerox's patent 6,419,162, and Hecht, "Printed Embedded Data Graphical User Interfaces," IEEE Computer Magazine, Vol. 34, No. 3, 2001, pp. 47-55.)
- Fig. 66 shows an alpha channel representation of some of the information determined by the device.
- While Figs. 62-66 showed a variety of information that can be conveyed in the alpha channel, and different representations of same, still more are shown in the example of Figs. 67-69. These involve a cell phone picture of a new GMC truck and its owner.
- the cell phone in this example processed the image data to recognize the model, year and color of the truck, recognize the text on the truck grill and the owner's t-shirt, recognize the owner's face, and recognize areas of grass and sky.
- the sky was recognized by its position at the top of the frame, its color histogram within a threshold distance of expected norms, and a spectral composition weak in certain frequency coefficients (e.g., a substantially "flat” region).
- the grass was recognized by its texture and color.
- Fig. 68 shows an illustrative graphical, bitonal representation of the discerned information, as added to the alpha channel of the Fig. 67 image.
- Fig. 69 shows the different planes of the composite image: red, green, blue, and alpha.
- the portion of the image area detected as depicting grass is indicated by a uniform array of dots.
- the image area depicting sky is represented as a grid of lines. (If trees had been particularly identified, they could have been labeled using one of the same patterns, but with different size/spacing/etc. Or an entirely different pattern could have been used.)
- the face information is encoded in a second PDF417 barcode.
- This second barcode is oriented at 90 degrees relative to the truck barcode, and is scaled differently, to help distinguish the two distinct symbols to downstream decoders. (Other different orientations could be used, and in some cases are preferable, e.g., 30 degrees, 45 degrees, etc.)
- the facial barcode is oval in shape, and may be outlined with an oval border (although this is not depicted).
- the center of the barcode is placed at the mid-point of the person's eyes.
- the width of the barcode is twice the distance between the eyes.
- the height of the oval barcode is four times the distance between the mouth and a line joining the eyes.
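The oval barcode placement rules just stated can be worked through numerically; a sketch, assuming an upright face so that the eye line is horizontal and the mouth-to-eye-line distance reduces to a vertical offset:

```python
import math

def facial_barcode_geometry(left_eye, right_eye, mouth):
    # Per the rules above: the barcode is centered at the midpoint of
    # the eyes, its width is twice the eye spacing, and its height is
    # four times the distance from the mouth to the line joining the eyes.
    cx = (left_eye[0] + right_eye[0]) / 2
    cy = (left_eye[1] + right_eye[1]) / 2
    eye_dist = math.hypot(right_eye[0] - left_eye[0],
                          right_eye[1] - left_eye[1])
    width = 2 * eye_dist
    # upright-face simplification: eye line is horizontal
    height = 4 * abs(mouth[1] - cy)
    return (cx, cy), width, height

# eyes 40 px apart, mouth 30 px below the eye line
center, w, h = facial_barcode_geometry((100, 200), (140, 200), (120, 230))
```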
- the payload of the facial barcode conveys information discerned from the face.
- In a simple arrangement, the barcode merely indicates the apparent presence of a face.
- Alternatively, eigenvectors computed from the facial image can be encoded. If a particular face is recognized, information identifying the person can be encoded.
- If the processor makes a judgment about the likely gender of the subject, this information can be conveyed in the barcode too.
- Persons appearing in imagery captured by consumer cameras and cell phones are not random: a significant percentage are of recurring subjects, e.g., the owner's children, spouse, friends, the user himself/herself, etc. There are often multiple previous images of these recurring subjects distributed among devices owned or used by the owner, e.g., PDA, cell phone, home computer, network storage, etc. Many of these images are annotated with names of the persons depicted. From such reference images, sets of characterizing facial vectors can be computed, and used to identify subjects in new photos.
- Such a library of reference facial vectors can be checked to try to identify the person depicted in the Fig. 67 photograph, and the identification can be represented in the barcode.
- the identification can comprise the person's name, and/or other identifier(s) by which the matched face is known, e.g., an index number in a database or contact list, a telephone number, a FaceBook user name, etc.
- A variety of further information may be included in the Fig. 68 alpha channel. For example, locations in the frame where a processor suspects text is present, but OCRing did not successfully decode alphanumeric symbols (on the tires perhaps, or other characters on the person's shirt), can be identified by adding a corresponding visual clue (e.g., a pattern of diagonal lines). An outline of the person (rather than just an indication of his face) can also be detected by a processor, and indicated by a corresponding border or fill pattern.
- While the examples of Figs. 62-66 and Figs. 67-69 show various different ways of representing semantic metadata in the alpha channel, still more techniques are shown in the example of Figs. 70-71. Here a user has captured a snapshot of a child at play (Fig. 70).
- the child's face is turned away from the camera, and is captured with poor contrast.
- the processor makes a likely identification by referring to the user's previous images: the user's firstborn child Matthew Doe (who seems to be found in countless of the user's archived photos).
- the alpha channel in this example conveys an edge-detected version of the user's image.
- Superimposed over the child's head is a substitute image of the child's face.
- This substitute image can be selected for its composition (e.g., depicting two eyes, nose and mouth) and better contrast.
- In some arrangements, each person known to the system has an iconic facial image that serves as a visual proxy for the person in different contexts.
- some PDAs store contact lists that include facial images of the contacts. The user (or the contacts) provides facial images that are easily recognized - iconic. These iconic facial images can be scaled to match the head of the person depicted in an image, and added to the alpha channel at the corresponding facial location.
- a 2D barcode can convey other of the information discerned from processing of the image data or otherwise available (e.g., the child's name, a color histogram, exposure metadata, how many faces were detected in the picture, the ten largest DCT or other transform coefficients, etc.).
- To make the 2D barcode as robust as possible to compression and other image processing operations, its size may not be fixed, but rather may be dynamically scaled based on circumstances - such as image characteristics.
- the processor analyzes the edge map to identify regions with uniform edginess (i.e., within a thresholded range). The largest such region is selected. The barcode is then scaled and placed to occupy a central area of this region. (In subsequent processing, the edginess where the barcode was substituted can be largely recovered by averaging the edginess at the center points adjoining the four barcode sides.)
- In determining where to place a barcode, region size may be weighed against edginess: low edginess is preferred.
- a smaller region of lower edginess may be chosen over a larger region of higher edginess.
- The size of each candidate region, minus a scaled value of edginess in the region, can serve as a metric to determine which region should host the barcode.
- This is the arrangement used in Fig. 71, resulting in placement of the barcode in a region to the left of Matthew's head - rather than in a larger, but edgier, region to the right.
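The size-minus-scaled-edginess metric can be sketched as follows; the scaling constant `k` and the `(size, edginess)` pair representation of candidate regions are assumptions for illustration:

```python
def pick_barcode_region(regions, k=2.0):
    # Score each candidate region as (size - k * edginess), so that a
    # smaller but smoother region can beat a larger but edgier one.
    # `regions` is a list of (size, edginess) pairs; `k` is an assumed
    # scaling constant. Returns the index of the winning region.
    scores = [size - k * edginess for size, edginess in regions]
    return max(range(len(regions)), key=scores.__getitem__)

# region 0 is large but edgy; region 1 is smaller but smooth
best = pick_barcode_region([(1000, 400), (700, 50)])
```

With these numbers the scores are 200 and 600, so the smaller, smoother region wins - the behavior described for Fig. 71.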
- While the Fig. 70 photo is relatively "edgy" (as contrasted, e.g., with the Fig. 62 photo), much of the edginess may be irrelevant.
- Accordingly, the edge data may be filtered to preserve only the principal edges (e.g., those indicated by continuous line contours).
- Within regions of the image, a processor can convey additional data.
- For example, in each region the processor may insert a pattern indicating the particular color histogram bin into which that region's image colors fall. (In a 64-bin histogram, requiring 64 different patterns, bin 2 may encompass colors in which the red channel has values of 0-63, the green channel has values of 0-63, and the blue channel has values of 64-127, etc.) Other image metrics can similarly be conveyed.
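The 64-bin scheme in the parenthetical implies quantizing each channel into four 64-value ranges; a sketch, using 1-based bin numbering so the stated example (red 0-63, green 0-63, blue 64-127) lands in bin 2:

```python
def histogram_bin(r, g, b):
    # Quantize each 8-bit channel into four ranges (0-63, 64-127,
    # 128-191, 192-255) and combine into one of 4*4*4 = 64 bins.
    # Bin numbering is 1-based, matching the example in the text.
    return (r // 64) * 16 + (g // 64) * 4 + (b // 64) + 1

bin_id = histogram_bin(10, 20, 100)   # red 0-63, green 0-63, blue 64-127
```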
- a human can distinguish a man embracing a woman, in front of a sign stating "WELCOME TO Fabulous LAS VEGAS NEVADA."
- the human can see greyscale faces, and an outline of the scene.
- the person can additionally identify a barcode conveying some information, and can identify two smiley face icons showing the positions of faces.
- a viewer to whom the frame of graphical information in Fig. 68 is rendered can identify an outline of a person, can read the LSU TIGERS from the person's shirt, and make out what appears to be the outline of a truck (aided by the clue of the GMC text where the truck's grill would be). From presentation of the Fig. 71 alpha channel data a human can identify a child sitting on the floor, playing with toys.
- The barcode in Fig. 71, like the barcode in Fig. 66, conspicuously indicates to an inspecting human the presence of information, albeit not its content.
- In other cases, graphical content in the alpha channel may not be informative to a human upon inspection. For example, if the child's name is steganographically encoded as a digital watermark in a noise-like signal in Fig. 71, even the presence of information in that noise may go undetected by the person.
- the sensor chip in a portable device may have on-chip processing that performs certain analyses, and adds resulting data to the alpha channel.
- the device may have another processor that performs further processing - on the image data and/or on the results of the earlier analyses - and adds a representation of those further results to the alpha channel. These further results may be based, in part, on data acquired wirelessly from a remote source.
- a consumer camera may link by Bluetooth to the user's PDA, to obtain facial information from the user's contact files.
- The composite image file may be transmitted from the portable device to an intermediate network node, which can perform more complex, resource-intensive processing - such as more sophisticated facial recognition and pattern matching.
- Such a node can also employ a variety of remote resources to augment the alpha channel with additional data, e.g., links to Wikipedia entries - or Wikipedia content itself, information from telephone database and image database lookups, etc.
- the thus-supplemented image may then be forwarded to an image query service provider (e.g., SnapNow, MobileAcuity, etc.), which can continue the process and/or instruct a responsive action based on the information thus-provided.
- the alpha channel may thus convey an iconic view of what all preceding processing has discerned or learned about the image. Each subsequent processor can readily access this information, and contribute still more. All this within the existing workflow channels and constraints of long-established file formats.
- the provenance of some or all of the discerned/inferred data is indicated.
- stored data may indicate that OCRing which yielded certain text was performed by a Verizon server having a unique identifier, such as MAC address of 01-50-F3-83-AB-CC or network identifier PDX-LA002290.corp.verizon-dot-com, on August 28, 2008, 8:35 pm.
- Such information can be stored in the alpha channel, in header data, in a remote repository to which a pointer is provided, etc.
- a capture device may write its information to bit plane #1.
- An intermediate node may store its contributions in bit plane #2. Etc. Certain bit planes may be available for shared use.
- bit planes may be allocated for different classes or types of semantic information.
- Information relating to faces or persons in the image may always be written to bit plane #1.
- Information relating to places may always be written to bit plane #2.
- Edge map data may always be found in bit plane #3, together with color histogram data (e.g., represented in 2D barcode form).
- Other content labeling (e.g., grass, sand, sky) may always be written to bit plane #4.
- Textual information such as related links or textual content obtained from the web may be found in bit plane #5.
- ASCII symbols may be included as bit patterns, e.g., with each symbol taking 8 bits in the plane.
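Packing ASCII into a bit plane at eight bits per symbol, as just described, can be sketched as follows; the row-major layout and MSB-first bit order within each symbol are assumptions:

```python
def ascii_to_bitplane_bits(text):
    # Pack ASCII text into a run of 0/1 values for storage in one
    # bit plane of the alpha channel, eight bits per symbol, MSB first.
    bits = []
    for ch in text:
        code = ord(ch)
        bits.extend((code >> i) & 1 for i in range(7, -1, -1))
    return bits

def bitplane_bits_to_ascii(bits):
    # Inverse: regroup the bit run into bytes and decode as ASCII.
    chars = []
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        chars.append(chr(byte))
    return "".join(chars)

bits = ascii_to_bitplane_bits("LAS VEGAS")
```

A downstream processor reading bit plane #5 would apply the inverse function to recover the text.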
- An index to the information conveyed in the alpha channel can be compiled, e.g., in an EXIF header associated with the image, allowing subsequent systems to speed their interpretation and processing of such data.
- the index can employ XML-like tags, specifying the types of data conveyed in the alpha channel, and optionally other information (e.g., their locations).
- Locations can be specified as the location of the upper-most bit (or upper-left-most bit) in the bit-plane array, e.g., by X-, Y- coordinates. Or a rectangular bounding box can be specified by reference to two corner points (e.g., specified by X, Y coordinates) - detailing the region where information is represented.
- the index may convey information such as: a male face is present in bit plane #1 of the alpha channel, with a top pixel at location (637,938); a female face is similarly present, with a top pixel located at (750,1012); OCR'd text encoded as a PDF417 barcode is found in bit plane #1 in the rectangular area with corner points (75,450) and (1425,980); and bit plane #1 also includes an edge map of the image.
- More or less information can naturally be provided.
- a different form of index, with less information, may specify, e.g.:
- bit plane #1 of the alpha channel includes 2 faces, a PDF417 barcode, and an edge map.
- An index with more information may specify data including the rotation angle and scale factor for each face, the LAS VEGAS payload of the PDF417 barcode, the angle of the PDF417 barcode, the confidence factors for subjective determinations, names of recognized persons, a lexicon or glossary detailing the semantic significance of each pattern used in the alpha channels (e.g., the patterns of Fig. 65, and the graphical labels used for sky and grass in Fig. 68), the sources of auxiliary data (e.g., of the superimposed child's face in Fig. 71, or the remote reference image data that served as basis for the conclusion that the truck in Fig. 67 is a Sierra Z71), etc.
- the index can convey information that is also conveyed in the bit planes of the alpha channel.
- different forms of representation are used in the alpha channel's graphical representations, versus the index.
- For example, in the alpha channel the femaleness of the second face is represented by the '+'s representing the eyes; in the index the femaleness is represented by the XML tag <FemaleFace1>. Redundant representation of information can serve as a check on data integrity.
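An XML-like index of this sort could be generated as follows. This is a sketch: the tag names (MaleFace1, PDF417Barcode, EdgeMap, and so on, in the spirit of the FemaleFace1 tag mentioned in the text), the `bitplane` and `bbox` attributes, and the overall schema are all assumptions.

```python
import xml.etree.ElementTree as ET

def build_alpha_index(entries):
    # Build an XML-like index of alpha-channel contents, of the sort
    # that might be stored in an EXIF header. `entries` is a list of
    # (tag_name, optional_bounding_box) pairs; bounding boxes are
    # (x1, y1, x2, y2) corner points, as described in the text.
    root = ET.Element("AlphaIndex")
    for tag, bbox in entries:
        el = ET.SubElement(root, tag)
        el.set("bitplane", "1")  # all example content lives in plane #1
        if bbox:
            el.set("bbox", ",".join(str(v) for v in bbox))
    return ET.tostring(root, encoding="unicode")

index_xml = build_alpha_index([
    ("MaleFace1", None),
    ("FemaleFace1", None),
    ("PDF417Barcode", (75, 450, 1425, 980)),
    ("EdgeMap", None),
])
```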
- Instead of being conveyed in header information such as EXIF data, a bit plane of the alpha channel can itself serve to convey the index information, e.g., bit plane #1.
- One such arrangement encodes the index information as a 2D barcode.
- the barcode may be scaled to fill the frame, to provide maximum robustness to possible image degradation.
- some or all of the index information is replicated in different data stores. For example, it may be conveyed both in EXIF header form, and as a barcode in bit plane #1. Some or all of the data may also be maintained remotely, such as by Google, or other web storage "in the cloud.” Address information conveyed by the image can serve as a pointer to this remote storage.
- the pointer (which can be a URL, but more commonly is a UID or index into a database which - when queried - returns the current address of the sought-for data) can be included within the index, and/or in one or more of the bit planes of the alpha channel. Or the pointer can be steganographically encoded within the pixels of the image data (in some or all of the composite image planes) using digital watermarking technology.
- some or all the information described above as stored in the alpha channel can additionally, or alternatively, be stored remotely, or encoded within the image pixels as a digital watermark. (The picture itself, with or without the alpha channel, can also be replicated in remote storage, by any device in the processing chain.)
- Some image formats include more than the four planes detailed above. Geospatial imagery and other mapping technologies commonly represent data with formats that extend to a half-dozen or more information planes. For example, multispectral space-based imagery may have separate image planes devoted to (1) red, (2) green, (3) blue, (4) near infrared, (5) mid-infrared, (6) far infrared, and (7) thermal infrared.
- the techniques detailed above can convey derived/inferred image information using one or more of the auxiliary data planes available in such formats.
- If alpha channel information is later overwritten, the overwriting processor may copy the overwritten information into remote storage, and include a link or other reference to it in the alpha channel, or index, or image - in case same is later needed.
- In representing information in the alpha channel, consideration may be given to degradations to which this channel may be subjected.
- JPEG compression commonly discards high frequency details that do not meaningfully contribute to a human's perception of an image. Such discarding of information based on the human visual system, however, can work to disadvantage when applied to information that is present for other purposes (although human viewing of the alpha channel is certainly possible and, in some cases, useful).
- The information in the alpha channel can be represented by features that would not likely be regarded as visually irrelevant. Different types of information may be represented by different features, so that the most important persist through even severe compression. Thus, for example, the presence of faces in Fig. 66 is signified by bold ovals. The locations of the eyes may be less relevant, so are represented by smaller features. Patterns shown in Fig. 65 may not be reliably distinguished after compression, and so might be reserved to represent secondary information - where loss is less important. With JPEG compression, the most-significant bit-plane is best preserved, whereas lesser-significant bit-planes are increasingly corrupted. Thus, the most important metadata should be conveyed in the most-significant bit planes of the alpha channel - to enhance survivability.
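The survivability-driven allocation just described - most important metadata in the most-significant bit planes - can be sketched as a simple priority mapping, assuming an 8-bit alpha channel in which plane 7 is the MSB:

```python
def assign_bitplanes(metadata_by_priority):
    # Map metadata classes, listed most-important first, onto bit
    # planes from most-significant (7, best preserved under JPEG)
    # downward, per the survivability argument above.
    return {7 - i: item for i, item in enumerate(metadata_by_priority)}

planes = assign_bitplanes(["faces", "places", "edge map", "web links"])
```

The class names here are illustrative, echoing the per-class plane allocation discussed earlier.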
- JPEG compression may be applied to the red, green and blue image channels, but lossless (or less lossy) compression may be applied to the alpha channel.
- Since the various bit planes of the alpha channel may convey different information, they may be compressed separately - rather than as bytes of 8-bit depth. (If compressed separately, lossy compression may be more acceptable.)
- compression schemes known from facsimile technology can be used, including Modified Huffman, Modified READ, run length encoding, and ITU-T T.6. Hybrid compression techniques are thus well-suited for such files.
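Of the facsimile-derived schemes mentioned, plain run-length encoding of a bitonal bit-plane row is the simplest to illustrate (Modified Huffman and ITU-T T.6 build on the same run representation). A sketch, adopting the fax convention that each row begins with a white (0) run, possibly of length zero:

```python
def rle_encode(bits):
    # Run-length encode one bitonal bit-plane row: emit the lengths
    # of alternating 0-runs and 1-runs, starting with a 0-run (which
    # may have length zero if the row begins with a 1).
    runs = []
    current, count = 0, 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append(count)
            current, count = b, 1
    runs.append(count)
    return runs

def rle_decode(runs):
    # Inverse: expand the run lengths back into alternating 0s and 1s.
    bits, value = [], 0
    for run in runs:
        bits.extend([value] * run)
        value ^= 1
    return bits

row = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1]
encoded = rle_encode(row)
```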
- Alpha channel conveyance of metadata can be arranged to progressively transmit and decode in general correspondence with associated imagery features, when using compression arrangements such as JPEG 2000. That is, since the alpha channel is presenting semantic information in the visual domain (e.g., iconically), it can be represented so that layers of semantic detail decompress at the same rate as the image.
- In JPEG 2000, a wavelet transform is used to generate data representing the image.
- JPEG 2000 packages and processes this transform data in a manner yielding progressive transmission and decoding. For example, when rendering a JPEG 2000 image, the gross details of the image appear first, with successively finer details following. Similarly with transmission.
- the information in the alpha channel can be arranged similarly (Fig. 77B).
- Information about the truck can be represented with a large, low frequency (shape-dominated) symbology.
- Information indicating the presence and location of the man can be encoded with a next-most-dominant representation.
- Information corresponding to the GMC lettering on the truck grill, and lettering on the man's shirt, can be represented in the alpha channel with a finer degree of detail.
- The finest level of salient detail in the image (e.g., the minutiae of the man's face) can be represented with the finest degree of detail in the alpha channel. (As may be noted, the illustrative alpha channel of Fig. 68 doesn't quite follow this model.)
- If the alpha channel conveys its information in the form of machine-readable symbologies (e.g., barcodes, digital watermarks, glyphs, etc.), the order of alpha channel decoding can be deterministically controlled. Symbologies with the largest features are decoded first; those with the finest features are decoded last.
- the alpha channel can convey barcodes at several different sizes (all in the same bit frame, e.g., located side-by-side, or distributed among bit frames).
- the alpha channel can convey plural digital watermark signals, e.g., one at a gross resolution (e.g., corresponding to 10 watermark elements, or "waxels" to the inch), and others at successively finer resolutions (e.g., 50, 100, 150 and 300 waxels per inch).
- With data glyphs, a range of larger and smaller sizes of glyphs can be used, and they will decode relatively earlier or later.
- JPEG2000 is the most common of the compression schemes exhibiting progressive behavior, but there are others. JPEG, with some effort, can behave similarly. The present concepts are applicable whenever such progressivity exists.
- Receiving nodes can also use the conveyed data to enhance stored profile information relating to the user.
- a node receiving the Fig. 66 metadata can note Las Vegas as a location of potential interest.
- a system receiving the Fig. 68 metadata can infer that GMC Z71 trucks are relevant to the user, and/or to the person depicted in that photo. Such associations can serve as launch points for tailored user experiences.
- the metadata also allows images with certain attributes to be identified quickly, in response to user queries. (E.g., find pictures showing GMC Sierra Z71 trucks.) Desirably, web-indexing crawlers can check the alpha channels of images they find on the web, and add information from the alpha channel to the compiled index to make the image more readily identifiable to searchers. As noted, an alpha channel-based approach is not essential for use of the technologies detailed in this specification.
- Another alternative is a data structure indexed by coordinates of image pixels. The data structure can be conveyed with the image file (e.g., as EXIF header data), or stored at a remote server.
- one entry in the data structure corresponding to pixel (637,938) in Fig. 66 may indicate that the pixel forms part of a male's face.
- a second entry for this pixel may point to a shared sub-data structure at which eigenface values for this face are stored. (The shared sub-data structure may also list all the pixels associated with that face.)
- a data record corresponding to pixel (622,970) may indicate the pixel corresponds to the left-side eye of the male's face.
- a data record indexed by pixel (155,780) may indicate that the pixel forms part of text recognized (by OCRing) as the letter "L", and also falls within color histogram bin 49, etc. The provenance of each datum of information may also be recorded.
- Alternatively, each pixel may be assigned a sequential number by which it is referenced.
- the entries may form a linked list, in which each pixel includes a pointer to a next pixel with a common attribute (e.g., associated with the same face).
- a record for a pixel may include pointers to plural different sub- data structures, or to plural other pixels - to associate the pixel with plural different image features or data.
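The pixel-indexed store with linked-list chaining described above can be sketched as follows. The class and field names are assumptions; the example data echoes the male-face pixels (637,938) and (622,970) mentioned earlier.

```python
class PixelMetadata:
    # Sketch of a pixel-indexed metadata store: each (x, y) coordinate
    # maps to a list of records, and records sharing an attribute are
    # chained via a "next" pixel pointer, forming a linked list.
    def __init__(self):
        self.records = {}   # (x, y) -> list of record dicts

    def add(self, pixel, attribute, next_pixel=None):
        self.records.setdefault(pixel, []).append(
            {"attribute": attribute, "next": next_pixel})

    def chain(self, start, attribute):
        # Follow the linked list of pixels sharing `attribute`.
        # (No cycle detection in this sketch.)
        pixel, seen = start, []
        while pixel is not None:
            seen.append(pixel)
            nxt = None
            for rec in self.records.get(pixel, []):
                if rec["attribute"] == attribute:
                    nxt = rec["next"]
                    break
            pixel = nxt
        return seen

store = PixelMetadata()
store.add((637, 938), "male_face", next_pixel=(622, 970))
store.add((622, 970), "male_face")   # left-side eye; end of chain
```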
- If the data structure is stored remotely, a pointer to the remote store can be included with the image file, e.g., steganographically encoded in the image data, expressed with EXIF data, etc.
- If the image is steganographically watermarked, the origin of the watermark can be used as a base from which pixel references are specified as offsets (instead of using, e.g., the upper left corner of the image). Such an arrangement allows pixels to be correctly identified despite corruptions such as cropping or rotation.
- the metadata written to a remote store is desirably available for search.
- a web crawler encountering the image can use the pointer in the EXIF data or the steganographically encoded watermark to identify a corresponding repository of metadata, and add metadata from that repository to its index terms for the image (despite being found at different locations).
- information derived or inferred from processes such as those shown in Figs. 50, 57 and 61 can be sent by other transmission arrangements, e.g., dispatched as packetized data using WiFi or WiMax, transmitted from the device using Bluetooth, sent as SMS short text or MMS multimedia messages, shared to another node in a low power peer-to-peer wireless network, conveyed with wireless cellular transmission or wireless data service, etc.

Texting, Etc.
- a tip/tilt interface is used in connection with a typing operation, such as for composing text messages sent by a Short Message Service (SMS) protocol from a PDA, a cell phone, or other portable wireless device.
- a user activates a tip/tilt text entry mode using any of various known means (e.g., pushing a button, entering a gesture, etc.).
- a scrollable user interface appears on the device screen, presenting a series of icons. Each icon has the appearance of a cell phone key, such as a button depicting the numeral "2" and the letters "abc.”
- the user tilts the device left or right to scroll backwards or forwards thru the series of icons, to reach a desired button.
- the user then tips the device towards or away from themselves to navigate between the three letters associated with that icon (e.g., tipping away navigates to "a"; no tipping corresponds to "b"; and tipping towards navigates to "c").
- After navigating to the desired letter, the user takes an action to select that letter. This action may be pressing a button on the device (e.g., with the user's thumb), or another action may signal the selection. The user then proceeds as described to select subsequent letters. By this arrangement, the user enters a series of text symbols without the constraints of big fingers on tiny buttons or UI features.
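The tilt-to-scroll, tip-to-pick navigation just described can be modeled as follows; the key layout (phone-style multi-letter keys) follows the "abc" button example, while the step granularity and the away/none/toward labels are assumptions:

```python
# Phone-style keys, as on the "2 abc" button described above.
KEYS = ["abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz"]

def select_letter(key_index, tilt_steps, tip):
    # Left/right tilt scrolls through the row of key icons (one key
    # per step; negative steps scroll left), and tipping away/level/
    # toward picks the first/middle/last letter of the reached key.
    idx = (key_index + tilt_steps) % len(KEYS)
    letters = KEYS[idx]
    pos = {"away": 0, "none": 1, "toward": len(letters) - 1}[tip]
    return letters[pos]

# start on "abc", tilt right one step to "def", tip away
letter = select_letter(0, 1, "away")
```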
- the device needn't be a phone; it may be a wristwatch, keyfob, or have another small form factor.
- the device may have a touch-screen. After navigating to a desired character, the user may tap the touch screen to effect the selection. When tipping/tilting the device, the corresponding letter can be displayed on the screen in an enlarged fashion (e.g., on the icon representing the button, or overlaid elsewhere) to indicate the user's progress in navigation.
- While accelerometers or other physical sensors are employed in certain embodiments, others use a 2D optical sensor (e.g., a camera).
- the user can point the sensor to the floor, to a knee, or to another subject, and the device can then sense relative physical motion by sensing movement of features within the image frame (up/down; left/right).
- the image frame captured by the camera need not be presented on the screen; the symbol selection UI, alone, may be displayed. (Or, the UI can be presented as an overlay on the background image captured by the camera.)
- another dimension of motion may also be sensed: up/down. This can provide an additional degree of control (e.g., shifting to capital letters, or shifting from characters to numbers, or selecting the current symbol, etc).
- the device has several modes: one for entering text; another for entering numbers; another for symbols; etc.
- the user can switch between these modes by using mechanical controls (e.g., buttons), or through controls of a user interface (e.g., touches or gestures or voice commands). For example, while tapping a first region of the screen may select the currently-displayed symbol, tapping a second region of the screen may toggle the mode between character-entry and numeric-entry. Or one tap in this second region can switch to character-entry (the default); two taps in this region can switch to numeric-entry; and three taps in this region can switch to entry of other symbols.
- such an interface can also include common words or phrases (e.g., signature blocks) to which the user can tip/tilt navigate, and then select.
- a first list may be standardized (pre-programmed by the device vendor), and include statistically common words.
- a second list may comprise words and/or phrases that are associated with a particular user (or a particular class of users). The user may enter these words into such a list, or the device can compile the list during operation - determining which words are most commonly entered by the user. (The second list may exclude words found on the first list, or not.) Again, the user can switch between these lists as described above.
- the sensitivity of the tip/tilt interface is adjustable by the user, to accommodate different user preferences and skills.
- the degree of tilt can correspond to different actions. For example, tilting the device between 5 and 25 degrees can cause the icons to scroll, but tilting the device beyond 30 degrees can insert a line break (if to the left) or can cause the message to be sent (if to the right).
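The angle thresholds in this example can be expressed directly in code; a sketch, in which the sign convention (negative angles mean leftward tilt) and the handling of angles outside the stated ranges are assumptions:

```python
def tilt_action(angle_degrees):
    # Map a tilt angle to an action per the example thresholds above:
    # 5-25 degrees scrolls the icons; beyond 30 degrees to the left
    # inserts a line break; beyond 30 to the right sends the message.
    a = abs(angle_degrees)
    if 5 <= a <= 25:
        return "scroll_left" if angle_degrees < 0 else "scroll_right"
    if a > 30:
        return "line_break" if angle_degrees < 0 else "send"
    return "none"   # dead zone / ambiguous range

action = tilt_action(15)
```

Angles between 25 and 30 degrees fall in a dead zone here, which gives a little hysteresis between scrolling and the more drastic actions.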
- a portable device captures - and may present - geometric information relating to the device's position (or that of a subject).
- Digimarc's published patent application 20080300011 teaches various arrangements by which a cell phone can be made responsive to what it "sees," including overlaying graphical features atop certain imaged objects.
- the overlay can be warped in accordance with the object's perceived affine distortion.
- Steganographic calibration signals by which affine distortion of an imaged object can be accurately quantified are detailed, e.g., in Digimarc's patents 6,614,914 and 6,580,809; and in patent publications 20040105569, 20040101157, and 20060031684.
- Digimarc's patent 6,959,098 teaches how distortion can be characterized by such watermark calibration signals in conjunction with visible image features (e.g., edges of a rectilinear object). From such affine distortion information, the 6D location of a watermarked object relative to the imager of a cell phone can be determined. There are various ways 6D location can be described. One is by three location parameters: x, y, z, and three angle parameters: tip, tilt, rotation.
- Fig. 58 shows how a cell phone can display affine parameters (e.g., derived from imagery or otherwise).
- the camera can be placed in this mode through a UI control (e.g., tapping a physical button, making a touchscreen gesture, etc.).
- the device's rotation from (an apparent) horizontal orientation is presented at the top of the cell phone screen.
- the cell phone processor can make this determination by analyzing the image data for one or more generally parallel elongated straight edge features, averaging them to determine a mean, and assuming that this is the horizon. If the camera is conventionally aligned with the horizon, this mean line will be horizontal. Divergence of this line from horizontal indicates the camera's rotation.
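- The averaging analysis described above might be sketched as follows (illustrative Python, not code from the specification; edge segments are assumed to arrive as endpoint pairs from some edge detector):

```python
import math

def camera_rotation(edge_segments):
    """Estimate camera rotation from generally parallel elongated edges.

    edge_segments: list of ((x0, y0), (x1, y1)) endpoints.
    Returns the mean divergence from horizontal, in degrees; zero when
    the presumed horizon line is level in the frame.
    """
    angles = [math.degrees(math.atan2(y1 - y0, x1 - x0))
              for (x0, y0), (x1, y1) in edge_segments]
    return sum(angles) / len(angles)   # mean line taken as the horizon
```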
- This information can be presented textually (e.g., "12 degrees right"), and/or a graphical representation showing divergence from horizontal can be presented.
- many cell phones include accelerometers, or other tilt detectors, which output data from which the cell phone processor can discern the device's angular orientation.
- the camera captures a sequence of image frames (e.g., video) when in this mode of operation.
- a second datum indicates the angle by which features in the image frame have been rotated since image capture began.
- this information can be gleaned by analysis of the image data, and can be presented in text form, and/or graphically.
- the graphic can comprise a circle, with a line - or arrow - through the center showing real-time angular movement of the camera to the left or right.
- the device can track changes in the apparent size of edges, objects, and/or other features in the image, to determine the amount by which scale has changed since image capture started. This indicates whether the camera has moved towards or away from the subject, and by how much.
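- A minimal sketch of this scale tracking (an assumed Python illustration; "size" stands for any apparent feature dimension, e.g., an edge length in pixels, and the classification threshold is an assumption):

```python
def scale_change(ref_size_px, current_size_px):
    """Ratio of a feature's current apparent size to its size when
    capture began; > 1.0 means the camera moved toward the subject."""
    return current_size_px / ref_size_px

def describe_motion(ref_size_px, current_size_px, threshold=0.02):
    """Classify camera motion from the scale ratio."""
    s = scale_change(ref_size_px, current_size_px)
    if s > 1 + threshold:
        return "toward subject"
    if s < 1 - threshold:
        return "away from subject"
    return "steady"
```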
- the information can be presented textually and graphically.
- the graphical presentation can comprise two lines: a reference line, and a second, parallel line whose length changes in real time in accordance with the scale change (larger than the reference line for movement of the camera closer to the subject, and smaller for movement away).
- other such geometric data can also be derived and presented, e.g., translation, differential scaling, tip angle (i.e., forward/backward), etc.
- the determinations detailed above can be simplified if the camera field of view includes a digital watermark having steganographic calibration/orientation data of the sort detailed in the referenced patent documents.
- the information can also be derived from other features in the imagery.
- data from one or more accelerometers or other position sensing arrangements in the device - either alone or in conjunction with image data - can be used to generate the presented information.
- such information can also be used, e.g., in sensing gestures made with the device by a user, in providing context by which remote system responses can be customized, etc.
- a cell phone functions as a state machine, e.g., changing aspects of its functioning based on image-related information previously acquired.
- the image- related information can be focused on the natural behavior of the camera user, typical environments in which the camera is operated, innate physical characteristics of the camera itself, the structure and dynamic properties of scenes being imaged by the camera, and many other such categories of information.
- the resulting changes in the camera's function can be directed toward improving image analysis programs resident on a camera-device or remotely located at some image-analysis server.
- Image analysis is construed very broadly, covering a range of analysis from digital watermark reading, to object and facial recognition, to 2-D and 3-D barcode reading and optical character recognition, all the way through scene categorization analysis and more.
- (If edges tend to be longer on the right sides of the images, this tends to indicate that the images were taken from a right-oblique view. Differences in illumination across foreground subjects can also be used - brighter illumination on the right side of subjects suggests that the right side was closer to the lens. Etc.)
- this particular user may habitually adopt a grip of the phone that inclines the top of the camera five degrees towards the user (i.e., to the left). This results in the captured image subjects generally being skewed with an apparent rotation of five degrees.
- Such recurring biases can be discerned by examining a collection of images captured by that user with that cell phone. Once identified, data memorializing these idiosyncrasies can be stored in a memory, and used to optimize image recognition processes performed by the device.
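- One way such a bias might be memorialized and applied (a hypothetical Python sketch; the use of the median is an assumption, chosen so occasional deliberate tilts do not dominate the learned bias):

```python
import statistics

def learn_rotation_bias(observed_skews_deg):
    """Distill a user's habitual rotation bias from the apparent skews
    measured across a collection of captured images."""
    return statistics.median(observed_skews_deg)

def residual_rotation(frame_skew_deg, bias_deg):
    """Rotation left to resolve after removing the habitual bias -
    recognition can then search a narrower range of rotations."""
    return frame_skew_deg - bias_deg
```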
- the device may generate a first output (e.g., a tentative object identification) from a given image frame at one time, but generate a second, different output (e.g., a different object identification) from the same image frame at a later time - due to intervening use of the camera.
- a characteristic pattern of the user's hand jitter may also be inferred by examination of plural images.
- the device may notice that the images captured by the user during weekday hours of 9:00 - 5:00 are routinely illuminated with a spectrum characteristic of fluorescent lighting, to which a rather extreme white-balancing operation needs to be applied to try and compensate. With a priori knowledge of this tendency, the device can expose photos captured during those hours differently than with its baseline exposure parameters - anticipating the fluorescent illumination, and allowing a better white balance to be achieved.
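- That anticipation might be sketched as follows (hypothetical Python; the preset names and the weekday 9:00-5:00 rule simply restate the example above):

```python
from datetime import datetime

def choose_preset(when):
    """Pick an exposure/white-balance preset for a capture time, using
    the learned pattern that weekday office-hours shots tend to be
    fluorescent-lit."""
    if when.weekday() < 5 and 9 <= when.hour < 17:   # Mon-Fri, 9:00-5:00
        return "fluorescent"
    return "baseline"
```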
- the device derives information that models some aspect of the user's customary behavior or environmental variables.
- the device then adapts some aspect of its operation accordingly.
- the device may also adapt to its own peculiarities or degradations. These include non-uniformities in the photodiodes of the image sensor, dust on the image sensor, mars on the lens, etc. Again, over time, the device may detect a recurring pattern, e.g.: (a) that one pixel gives a 2% lower average output signal than adjoining pixels; (b) that a contiguous group of pixels tends to output signals that are about 3 digital numbers lower than averages would otherwise indicate; (c) that a certain region of the photosensor does not seem to capture high frequency detail - imagery in that region is consistently a bit blurry, etc.
- the device can deduce, e.g., that (a) the gain for the amplifier serving this pixel is low; (b) dust or other foreign object is occluding these pixels; and (c) a lens flaw prevents light falling in this region of the photosensor from being properly focused, etc. Appropriate compensations can then be applied to mitigate these shortcomings.
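- The compensations might look like the following (an illustrative Python sketch; it assumes the defects have already been characterized into a per-pixel gain map and a set of occluded pixel locations):

```python
def compensate_sensor(raw, gain_map, occluded):
    """raw and gain_map are 2-D lists of equal shape; occluded is a set
    of (row, col) positions known to be blocked by dust, etc."""
    h, w = len(raw), len(raw[0])
    # correct gain non-uniformity (e.g., a pixel reading 2% low)
    img = [[raw[y][x] * gain_map[y][x] for x in range(w)] for y in range(h)]
    out = [row[:] for row in img]
    # infill occluded pixels from the mean of their valid neighbors
    for (y, x) in occluded:
        vals = [img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x) and (ny, nx) not in occluded]
        out[y][x] = sum(vals) / len(vals)
    return out
```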
- a Fourier-transformed set of image data may be preferentially routed to a quick 2-D barcode detection function which may otherwise have been de-prioritized.
- Fourier transformed data may be shipped to a specialized pattern recognition routine.
- the human visual system has different sensitivity to imagery at different spatial frequencies. Different image frequencies convey different impressions. Low frequencies give global information about an image, such as its orientation and general shape. High frequencies give fine details and edges. As shown in Fig. 72, the sensitivity of the human visual system peaks at spatial frequencies of about 10 cycles/mm on the retina, and falls away steeply on either side. (Perception also depends on contrast between features sought to be distinguished - the vertical axis.) Image features with spatial frequencies and contrast in the cross-hatched zone are usually not perceivable by humans. Fig. 73 shows an image with the low and high frequencies depicted separately (on the left and right, respectively).
- Digital watermarking of print media can be effected by tinting the page (before, during or after printing) with an inoffensive background pattern that steganographically conveys auxiliary payload data.
- Different columns of text can be encoded with different payload data, e.g., permitting each news story to link to a different electronic resource (see, e.g., Digimarc's patents 6,985,600, 6,947,571 and 6,724,912).
- the close- focus shortcoming of portable imaging devices is overcome by embedding a lower frequency digital watermark (e.g., with a spectral composition centered on the left side of Fig. 72, above the curve). Instead of encoding different watermarks in different columns, the page is marked with a single watermark that spans the page - encoding an identifier for that page.
- the decoded watermark serves to index a data structure that returns information to the device, to be presented on its display screen.
- the display presents a map of the newspaper page layout, with different articles/advertisements shown in different colors.
- Figs. 74 and 75 illustrate one particular embodiment. The original page is shown in Fig. 74. The layout map displayed on the user device screen is shown in Fig. 75.
- the user simply touches the portion of the displayed map corresponding to the story of interest. (If the device is not equipped with a touch screen, the map of Fig. 75 can be presented with indicia identifying the different map zones, e.g., 1, 2, 3... or A, B, C... The user can then operate the device's numeric or alphanumeric user interface (e.g., keypad) to identify the article of interest.)
- the user's selection is transmitted to a remote server (which may be the same one that served the layout map data to the portable device, or another one), which then consults with stored data to identify information responsive to the user's selection. For example, if the user touches the region in the lower right of the page map, the remote system may instruct a server at buick-dot-com to transmit a page for presentation on the user device, with more information about the Buick Lucerne. Or the remote system can send the user device a link to that page, and the device can then load the page.
- the remote system can cause the user device to present a menu of options, e.g., for a news article the user may be given options to: listen to a related podcast; see earlier stories on the same topic; order reprints; download the article as a Word file, etc.
- the remote system can send the user a link to a web page or menu page by email, so that the user can review same at a later time. (A variety of such different responses to user-expressed selections can be provided, as are known from the art cited herein.)
- the system may cause the user device to display a screen showing a reduced scale version of the newspaper page itself - like that shown in Fig. 74. Again, the user can simply touch the article of interest to trigger an associated response. Or, instead of presenting a graphical layout of the page, the remote system can return titles of all the content on the page (e.g., "Banks Owe Billions", "McCain Pins Hopes", "Buick Lucerne"). These titles are presented in menu form on the device screen, and the user touches the desired item (or enters a corresponding number/letter selection).
- the layout map for each printed newspaper and magazine page is typically generated by the publishing company as part of its layout process, e.g., using automated software from vendors such as Quark, Impress and Adobe, etc.
- Existing software thus knows what articles and advertisements appear in what spaces on each printed page.
- These same software tools, or others, can be adapted to take this layout map information, associate corresponding links or other data for each story/advertisement, and store the resulting data structure in a web-accessible server from which portable devices can access same.
- delivery of a page map to the user device from a remote server is not required.
- a region of a page spanning several items of content is encoded with a single watermark payload.
- the user captures an image including content of interest.
- the watermark identifying the page is decoded.
- the captured image is displayed on the device screen, and the user touches the content region of particular interest.
- the coordinates of the user's selection within the captured image data are recorded.
- Fig. 76 is illustrative.
- the user has used an Apple iPhone, a T-Mobile Android phone, or the like to capture an image of an excerpt from a watermarked newspaper page, and has then touched an article of interest.
- the location of the touch within the image frame is known to the touch screen software, e.g., as an offset from the upper left corner, measured in pixels.
- (the display may have a resolution of 480x320 pixels).
- the touch may be at pixel position (200,160).
- the watermark spans the page, and is shown in Fig. 76 by the dashed diagonal lines.
- the watermark has an origin, but the origin point is not within the image frame captured by the user.
- the watermark decoder software knows the scale of the image and its rotation. It also knows the offset of the captured image frame from the watermark's origin.
- the software can determine that the upper left corner of the captured image frame corresponds to a point 1.6 inches below, and 2.3 inches to the right, of the top left corner of the originally printed page (assuming the watermark origin is at the top left corner of the page). From the decoded scale information, the software can discern that the 480 pixel width of the captured image corresponds to an area of the originally printed page 12 inches in width. The software finally determines the position of the user's touch, as an offset from the upper left corner of the originally-printed page.
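- The arithmetic in this example can be sketched directly (Python; the numbers are the ones used in the text, and the watermark origin is assumed to sit at the page's top left corner, as stated above):

```python
def touch_to_page_coords(touch_px, frame_offset_in, frame_width_in,
                         frame_width_px=480):
    """Map a touch position (pixels, from the captured frame's upper
    left corner) to coordinates on the originally printed page (inches,
    from the page's top left corner)."""
    in_per_px = frame_width_in / frame_width_px   # 12 in / 480 px = 0.025
    off_x, off_y = frame_offset_in    # frame corner offset vs. page corner
    tx, ty = touch_px
    return (off_x + tx * in_per_px, off_y + ty * in_per_px)

# The text's example: frame corner 2.3 in right of / 1.6 in below the
# page corner, 480 px spanning 12 in, touch at pixel (200, 160):
x, y = touch_to_page_coords((200, 160), (2.3, 1.6), 12.0)
# x = 7.3 in from the left page edge, y = 5.6 in below the top
```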
- the device sends these coordinates to the remote server, together with the payload of the watermark (identifying the page).
- the server looks up the layout map of the identified page (from an appropriate database in which it was stored by the page layout software) and, by reference to the coordinates, determines in which of the articles/advertisements the user's touch fell.
- the remote system then returns to the user device responsive information related to the indicated article, as noted above.
- the useful watermark information is recovered from those regions of the page that are unprinted, e.g., from "white space" between columns, between lines, at the end of paragraphs, etc.
- the inked characters are "noise" that is best ignored.
- the blurring of printed portions of the page introduced by focus deficiencies of PDA cameras can be used to define a mask - identifying areas that are heavily inked. Those portions may be disregarded when decoding watermark data.
- the blurred image data can be thresholded. Any image pixels having a value darker than a threshold value can be ignored. Put another way, only image pixels having a value lighter than a threshold are input to the watermark decoder. The "noise" contributed by the inked characters is thus filtered out.
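- A minimal sketch of that filtering step (assumed Python illustration; a real decoder would operate on the surviving pixels rather than on a masked copy):

```python
def mask_inked_regions(blurred_gray, threshold=128):
    """blurred_gray: 2-D list of 0-255 gray values (0 = black ink,
    255 = white paper). Pixels darker than the threshold are flagged
    None so the watermark decoder ignores them."""
    return [[p if p >= threshold else None for p in row]
            for row in blurred_gray]
```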
- Image search functionality in certain of the foregoing embodiments can be implemented using Pixsimilar image search software and/or the Visual Search Developer's Kit (SDK), both from Idée Inc. (Toronto, ON).
- a tool for automatically generating descriptive annotations for imagery is ALIPR (Automatic Linguistic Indexing of Pictures), as detailed in patent 7,394,947 (Penn State).
- Content-based image retrieval (CBIR) essentially involves (1) abstracting a characterization of an image - usually mathematically; and (2) using such characterizations to assess similarity between images.
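- A toy illustration of these two steps (hypothetical Python; a practical system would use far richer characterizations, such as the CEDD and FCTH descriptors discussed below):

```python
def histogram_feature(pixels, bins=4):
    """Step 1: characterize an image mathematically - here, a crude
    normalized gray-level histogram over a flat list of 0-255 values."""
    hist = [0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1
    total = float(len(pixels))
    return [h / total for h in hist]

def similarity(feat_a, feat_b):
    """Step 2: assess similarity between characterizations - here,
    histogram intersection (1.0 = identical distributions)."""
    return sum(min(a, b) for a, b in zip(feat_a, feat_b))
```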
- Two papers surveying these fields are Smeulders et al, "Content-Based Image Retrieval at the End of the Early Years," IEEE Trans. Pattern Anal. Mach. Intell, Vol. 22, No. 12, pp. 1349-1380, 2000, and Datta et al, "Image Retrieval: Ideas, Influences and Trends of the New Age," ACM Computing Surveys, Vol. 40, No. 2, April 2008.
- the task of identifying like-appearing imagery from large image databases is a familiar operation in the issuance of drivers licenses. That is, an image captured from a new applicant is commonly checked against a database of all previous driver license photos, to check whether the applicant has already been issued a driver's license (possibly under another name).
- Methods and systems known from the driver's license field can be employed in the arrangements detailed herein. (Examples include Identix patent 7,369,685 and L-I Corp. patents 7,283,649 and 7,130,454.)
- image feature extraction algorithms known as CEDD (Color and Edge Directivity Descriptor) and FCTH (Fuzzy Color And Texture Histogram - "A Low Level Feature for Accurate Image Retrieval") can also be employed. Sample C# usage:
- Bitmap ImageData = new Bitmap("c:/file.jpg");
- CEDD GetCEDD = new CEDD();
- FCTH GetFCTH = new FCTH();
- var CEDDTable = GetCEDD.Apply(ImageData);
- var FCTHTable = GetFCTH.Apply(ImageData, 2);
- CEDD and FCTH can be combined, to yield improved results, using the Joint Composite Descriptor file available from the just-cited web page. Chatzichristofis has made available an open source program "img(Finder)" (see the web page savvash.blogspot-dot-com/2008/07/image-retrieval-in-facebook-dot-html) - a content based image retrieval desktop application that retrieves and indexes images from the FaceBook social networking site, using CEDD and FCTH.
- a user connects to FaceBook with their personal account data, and the application downloads information from the images of the user, as well as the user's friends' image albums, to index these images for retrieval with the CEDD and FCTH features. The index can thereafter be queried by a sample image.
- Chatzichristofis has also made available an online search service "img(Anaktisi)" to which a user uploads a photo, and the service searches one of 11 different image archives for similar images - using image metrics including CEDD and FCTH. See orpheus.ee.duth-dot-gr/anaktisi/. (The image archives include Flickr). In the associated commentary to the Anaktisi search service, Chatzichristofis explains:
- the Moving Picture Experts Group defines a standard for content-based access to multimedia data in their MPEG-7 standard. This standard identifies a set of image descriptors that maintain a balance between the size of the feature and the quality of the retrieval results.
- High retrieval scores in content-based image retrieval systems can be attained by adopting relevance feedback mechanisms. These mechanisms require the user to grade the quality of the query results by marking the retrieved images as being either relevant or not. Then, the search engine uses this grading information in subsequent queries to better satisfy users' needs. It is noted that while relevance feedback mechanisms were first introduced in the information retrieval field, they currently receive considerable attention in the CBIR field. The vast majority of relevance feedback techniques proposed in the literature are based on modifying the values of the search parameters so that they better represent the concept the user has in mind. Search parameters are computed as a function of the relevance values assigned by the user to all the images retrieved so far. For instance, relevance feedback is frequently formulated in terms of the modification of the query vector and/or in terms of adaptive similarity metrics.
- an Auto Relevance Feedback (ARF) technique is introduced which is based on the proposed descriptors.
- the goal of the proposed Automatic Relevance Feedback (ARF) algorithm is to optimally readjust the initial retrieval results based on user preferences.
- the user selects from the first round of retrieved images one as being relevant to his/her initial retrieval expectations. Information from these selected images is used to alter the initial query image descriptor.
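- The query-descriptor alteration might resemble a classic Rocchio-style update (a hypothetical Python sketch; the weights are assumptions, not values from the cited work):

```python
def refine_query(query, relevant, alpha=0.7, beta=0.3):
    """Move the query descriptor toward the mean of the images the user
    marked as relevant, so the next retrieval round better matches the
    user's initial retrieval expectations."""
    n = float(len(relevant))
    dim = len(query)
    mean_rel = [sum(vec[i] for vec in relevant) / n for i in range(dim)]
    return [alpha * q + beta * m for q, m in zip(query, mean_rel)]
```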
- Another open source CBIR system is GIFT (the GNU Image Finding Tool). After GIFT has indexed a collection of images, the GIFT server and its client can then be used to search the indexed images based on image similarity.
- the system is further described at the web page gnu-dot- org/software/gift/gift-dot-html.
- the latest version of the software can be found at the ftp server ftp.gnu-dot- org/gnu/gift.
- Still another open source CBIR system is Fire, written by Tom Deselaers and others at RWTH Aachen University, available for download from the web page www-i6.informatik.rwth-aachen-dot-de/~deselaers/fire/. Fire makes use of technology described, e.g., in Deselaers et al, "Features for Image Retrieval: An Experimental Comparison", Information Retrieval, Vol. 11, No. 2, The Netherlands, Springer, pp. 77-107, March, 2008.
- Embodiments of the present invention are generally concerned with objects depicted in imagery, rather than full frames of image pixels. Recognition of objects within imagery (sometimes termed computer vision) is a large science with which the reader is presumed to be familiar.
- Edges and centroids are among the image features that can be used to aid in recognizing objects in images.
- Shape contexts are another (cf. Belongie et al, Matching with Shape Contexts, IEEE Workshop on Content Based Access of Image and Video Libraries, 2000.)
- Robustness to affine transformations (e.g., scale invariance, rotation invariance) is a desirable property of such feature extraction techniques.
- Methods based on the Hough transform and the Fourier-Mellin transform exhibit rotation-invariant properties.
- SIFT, discussed below, is an image recognition technique with this and other advantageous properties.
- processing of imagery contemplated in this specification can make use of various other techniques, which can go by various names. Included are image analysis, pattern recognition, feature extraction, feature detection, template matching, facial recognition, eigenvectors, etc. (All these terms are generally used interchangeably in this specification.)
- the interested reader is referred to Wikipedia, which has an article on each of the just-listed topics, including a tutorial and citations to related information.
- Image metrics of the sort discussed are sometimes regarded as metadata, namely "content-dependent metadata.” This is in contrast to "content-descriptive metadata" - which is the more familiar sense in which the term metadata is used.
- a first level includes simply overtly or covertly encoding contact instructions in the surface features of an object, such as an IP address;
- a second level includes presenting public-key information to a device, either explicitly through overt symbology or more subtly through digital watermarking; and
- a third level includes unique patterns or digital watermarking that can only be acquired by actively taking a photograph of an object.
- the interface presented on the user's cell phone may be customized, in accordance with user preferences, and/or to facilitate specific task-oriented interactions with the device (e.g., a technician may pull up a "debug" interface for a thermostat, while an office worker may pull up a temperature setting control).
- a manufacturer effectively enables a mobile GUI for that device.
- such technology includes using a mobile phone to obtain identification information corresponding to a device. By reference to the obtained identification information, application software corresponding to said device is then identified, and downloaded to the mobile phone. This application software is then used in facilitating user interaction with the device.
- the mobile phone serves as a multi- function controller - adapted to control a particular device through use of application software identified by reference to information corresponding to that device.
- such technology includes using a mobile phone to sense information from a housing of a device. Through use of this sensed information, other information is encrypted using a public key corresponding to the device.
- such technology includes using a mobile phone to sense analog information from a device.
- This sensed analog information is converted to digital form, and corresponding data is transmitted from the cell phone.
- This transmitted data is used to confirm user proximity to the device, before allowing a user to interact with the device using the mobile phone.
- such technology includes using a user interface on a user's cell phone to receive an instruction relating to control of a device.
- This user interface is presented on a screen of the cell phone in combination with a cell phone-captured image of the device.
- Information corresponding to the instruction is signaled to the user, in a first fashion, while the instruction is pending; and in a second fashion once the instruction has been successfully performed.
- the present technology includes initializing a transaction with a device, using a user interface presented on a screen of a user cell phone, while the user is in proximity to the device. Later, the cell phone is used for a purpose unrelated to the device. Still later, the user interface is recalled and used to engage in a further transaction with the device.
- a mobile phone including a processor, a memory, a sensor, and a display.
- Instructions in the memory configure the processor to enable the following acts: sense information from a proximate first device; download first user interface software corresponding to the first device, by reference to the sensed information; interact with the first device by user interaction with the downloaded first user interface software; recall from the memory second user interface software corresponding to a second device, the second user interface software having been earlier downloaded to the mobile phone; and interact with the second device by user interaction with the recalled second user interface software, regardless of whether the user is proximate to said second device.
- such technology includes a mobile phone including a processor, a memory, and a display. Instructions in the memory configure the processor to present a user interface that allows a user to select between several other device-specific user interfaces stored in memory, for using the mobile phone to interact with plural different external devices.
- Figs. 78 and 79 show a prior art WiFi-equipped thermostat 512. Included are a temperature sensor 514, a processor 516, and a user interface 518. The user interface includes various buttons 518, an LCD display screen 520, and one or more indicator lights 522. A memory 524 stores programming and data for the thermostat. Finally, a WiFi transceiver 526 and antenna 528 allow communication with remote devices.
- the WiFi transceiver comprises the GainSpan GS1010 SoC (System on Chip) device.
- thermostat 530 includes a temperature sensor 514 and a processor
- the memory 534 may store the same programming and data as memory 524. However, this memory 534 includes a bit more software to support the functionality described below. (For expository convenience, the software associated with this aspect of the present technology is given a name: ThingPipe software.)
- the thermostat memory thus has ThingPipe code that cooperates with other code on other devices - such as cell phones - to implement the detailed functionality.
- Thermostat 530 can include the same user interface 518 as thermostat 512. However, significant economies may be achieved by omitting many of the associated parts, such as the LCD display and buttons. The depicted thermostat thus includes only indicator lights 522, and even these may be omitted.
- Thermostat 530 also includes an arrangement through which its identity can be sensed by a cell phone.
- the WiFi emissions from the thermostat may be employed (e.g., by the device's MAC identifier). However, other means are preferred, such as indicia that can be sensed by the camera of a cell phone.
- a steganographic digital watermark is one such indicia that can be sensed by a cell phone camera.
- Digital watermark technology is detailed in the assignee's patents, including 6,590,996 and 6,947,571.
- the watermark data can be encoded in a texture pattern on the exterior of the thermostat, on an adhesive label, on pseudo wood-grain trim on the thermostat, etc. (Since steganographic encoding is hidden, it is not depicted in Fig. 80.)
- a suitable indicia is a 1D or 2D bar code or other overt symbology, such as the bar code 536 shown in Fig. 80. This may be printed on the thermostat housing, applied by an adhesive label, etc.
- Fig. 81 shows an exemplary cell phone 540, such as the Apple iPhone device. Included are conventional elements including a processor 542, a camera 544, a microphone, an RF transceiver, a network adapter, a display, and a user interface.
- the user interface includes physical controls as well as a touch-screen sensor. (Details of the user interface, and associated software, are provided in Apple's patent publication 20080174570.)
- the memory 546 of the phone includes the usual operating system and application software. In addition it includes ThingPipe software for performing the functions detailed in this specification.
- a user captures an image depicting the digitally-watermarked thermostat 530 using the cell phone camera 544.
- the processor 542 in the cell phone pre-processes the captured image data (e.g., by applying a Wiener filter or other filtering, and/or compressing the image data), and wirelessly transmits the processed data to a remote server 552 (Fig. 82) - together with information identifying the cell phone.
- the wireless communication can be by WiFi to a nearby wireless access point, and then by internet to the server 552. Or the cell phone network can be employed, etc.
- Server 552 applies a decoding algorithm to the processed image data received from the cell phone, extracting steganographically encoded digital watermark data.
- This decoded data - which may comprise an identifier of the thermostat - is transmitted by internet to a router 554, together with the information identifying the cell phone.
- Router 554 receives the identifier and looks it up in a namespace database 555.
- the namespace database 555 examines the most significant bits of the identifier, and conducts a query to identify a particular server responsible for that group of identifiers.
- the server 556 thereby identified by this process has data pertaining to that thermostat. (Such an arrangement is akin to the Domain Name Servers employed in internet routing. Patent 6,947,571 has additional disclosure on how watermarked data can be used to identify a server that knows what to do with such data.)
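- The namespace lookup might be sketched as follows (hypothetical Python; the 32-bit identifier length, the 2-bit prefix, and the server names are all assumptions):

```python
# Assumed example routing data: most significant bits -> responsible server.
PREFIX_TABLE = {
    0b00: "server-a.example.com",
    0b01: "server-b.example.com",
    0b10: "server-c.example.com",
    0b11: "server-d.example.com",
}

def route(identifier, id_bits=32, prefix_bits=2):
    """Select the server responsible for an identifier group, keyed by
    the identifier's most significant bits (akin to DNS delegation)."""
    prefix = identifier >> (id_bits - prefix_bits)
    return PREFIX_TABLE[prefix]
```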
- Router 554 polls identified server 556 for information. For example, router 554 may solicit from server 556 current data relating to the thermostat (e.g., current temperature setpoint and ambient temperature, which server 556 may obtain from the thermostat by a link that includes WiFi). Additionally, server 556 is requested to provide information about a graphical user interface suitable for display on the Apple iPhone 540 to control that particular thermostat. This information may comprise, for example, a JavaScript application that runs on the cell phone 540, and presents a GUI suited for use with the thermostat. This information is passed back to the cell phone - directly, or through server 552. The returned information may include the IP address of the server 556, so that the cell phone can thereafter exchange data directly with server 556.
- the ThingPipe software in the cell phone 540 responds to the received information by presenting the graphical user interface for thermostat 530 on its screen.
- This GUI can include the ambient and setpoint temperature for the thermostat - whether received from the server 556, or directly (such as by WiFi) from the thermostat. Additionally, the presented GUI includes controls that the user can operate to change the settings. To raise the setpoint temperature, the user touches a displayed control corresponding to this operation (e.g., an "Increase Temperature" button). The setpoint temperature presented in the UI display immediately increments in response to the user's action - perhaps in flashing or other distinctive fashion to indicate that the request is pending.
- the user's touch also causes the ThingPipe software to transmit corresponding data from the cell phone 540 to the thermostat (which transmission may include some or all of the other devices shown in Fig. 82, or it may go directly to the thermostat - such as by WiFi).
- the thermostat increases its set temperature per the user's instructions. It then issues a confirmatory message that is relayed back from the thermostat to the cell phone. On receipt of the confirmatory message, the flashing of the incremented temperature indicator ceases, and the setpoint temperature is then displayed in static form.
- the confirmatory message may be rendered to the user as a visible signal - such as the text "Accepted" presented on the display, or an audible chime, or a voice saying "OK."
- the displayed UI is presented as an overlay on the screen of the cell phone, atop the image earlier captured by the user depicting the thermostat.
- Features of the UI are presented in registered alignment with any corresponding physical controls (e.g., buttons) shown in the captured image.
- the graphical overlay may outline these in the displayed image in a distinctive format, such as with scrolling dashed red lines. These are the graphical controls that the user touches to raise or lower the setpoint temperature.
- This is shown schematically by Fig. 83, in which the user has captured an image 560 of part of the thermostat. Included in the image is at least a part of the watermark 562 (shown to be visible for sake of illustration).
- the cell phone processor overlays scrolling dashed lines atop the image - outlining the "+" and "-" buttons.
- when the phone's touch-screen user interface senses user touches in these outlined regions, it reports same to the ThingPipe software in the cell phone. It, in turn, interprets these touches as commands to increment or decrement the thermostat temperature, and sends such instructions to the thermostat (e.g., through server 552 and/or 556). Meanwhile, it increments a "SET TEMPERATURE" graphic that is also overlaid atop the image, and causes it to flash until the confirmatory message is received back from the thermostat.
- the registered overlay of a graphical user interface atop captured image data is enabled by the encoded watermark data on the thermostat housing.
- Calibration data in the watermark permits the scale, translation and rotation of the thermostat's placement within the image to be precisely determined. If the watermark is reliably placed on the thermostat in a known spatial relationship with other device features (e.g., buttons and displays), then the positions of these features within the captured image can be determined by reference to the watermark. (Such technology is further detailed in applicant's published patent application 20080300011.)
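Once the watermark's calibration data yields the scale, rotation and translation of the thermostat within the frame, known device-frame feature positions (e.g., the "+" and "-" buttons) can be mapped to image pixels with a similarity transform. A minimal sketch, with all names and coordinates invented for illustration:

```python
import math

def device_to_image(points, scale, rotation_deg, tx, ty):
    """Map device-frame feature coordinates into captured-image pixel
    coordinates, using scale, rotation and translation recovered from
    the watermark's calibration signal. (A real decoder supplies these
    parameters; the values used here are hypothetical.)"""
    th = math.radians(rotation_deg)
    c, s = math.cos(th), math.sin(th)
    return [(scale * (c * x - s * y) + tx,
             scale * (s * x + c * y) + ty) for x, y in points]

# A button 10 units right of the watermark origin, with the device seen
# at 2x scale, unrotated, offset (100, 50) pixels in the frame:
pts = device_to_image([(10.0, 0.0)], scale=2.0, rotation_deg=0.0,
                      tx=100.0, ty=50.0)
```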
- the registered overlay of a UI may still be used.
- the outlined buttons presented on the cell phone screen can indicate corresponding buttons on the phone's keypad that the user should press to activate the outlined functionality. For example, an outlined box around the "+" button may periodically flash orange with the number "2" - indicating that the user should press the "2" button on the cell phone keypad to increase the thermostat temperature setpoint.
- while overlay of the graphical user interface onto the captured image of the thermostat in registered alignment is believed easiest to implement through use of watermarks, other arrangements are possible. For example, if the size and scale of a barcode, and its position on the thermostat, are known, then the locations of the thermostat features for overlay purposes can be geometrically determined. Similarly with an image fingerprint-based approach (including SIFT). If the nominal appearance of the thermostat is known (e.g., by server 556), then the relative locations of features within the captured image can be discerned by image analysis. In one particular arrangement, the user captures a frame of imagery depicting the thermostat, and this frame is buffered for static display by the phone. The overlay is then presented in registered alignment with this static image.
- the user moves the camera, the static image persists, and the overlaid UI is similarly static.
- the user captures a stream of images (e.g., video capture), and the overlay is presented in registered alignment with features in the image even if they move from frame to frame.
- the overlay may move across the screen, in correspondence with movement of the depicted thermostat within the cell phone screen.
- Such an arrangement can allow the user to move the camera to capture different aspects of the thermostat - perhaps revealing additional features/controls. Or it permits the user to zoom the camera in so that certain features (and the corresponding graphical overlays) are revealed, or appear at a larger scale on the cell phone's touchscreen display.
- the user may selectively freeze the captured image at any time, and then continue to work with the (then static) overlaid user interface control - without regard to keeping the thermostat in the camera's field of view.
- thermostat 530 is of the sort that has no visible controls
- the UI displayed on the cell phone can be of any format. If the cell phone has a touch-screen, thermostat controls may be presented on the display. If there is no touch-screen, the display can simply present a corresponding menu. For example, it can instruct the user to press "2" to increase the temperature setpoint, press "8" to decrease the temperature setpoint, etc.
- the command is relayed to the thermostat as described above, and a confirmatory message is desirably returned back - for rendering to the user by the ThingPipe software.
- the displayed user interface is a function of the device with which the phone is interacting (i.e., the thermostat), and may also be a function of the capabilities of the cell phone itself (e.g., whether it has a touch-screen, the dimension of the screen, etc).
- Instructions and data enabling the cell phone's ThingPipe software to create these different UIs may be stored at the server 556 that administers the thermostat, and are delivered to the memory 546 of the cell phone with which the thermostat interacts.
- Another example of a device that can be so-controlled is a WiFi-enabled parking meter.
- the user captures an image of the parking meter with the cell phone camera (e.g., by pressing a button, or the image capture may be free-running - such as every second or several). Processes occur generally as detailed above.
- the ThingPipe software processes the image data, and router 554 identifies a server 556a responsible for ThingPipe interactions with that parking meter.
- the server returns UI instructions, optionally with status information for that meter (e.g., time remaining; maximum allowable time). These data are displayed on the cell phone UI, e.g., overlaid on the captured image of the meter, together with controls/instructions for purchasing time.
- the user interacts with the cell phone to add two hours of time to the meter.
- a corresponding payment is debited, e.g., from the user's credit card account - stored as encrypted profile information in the cell phone or in a remote server.
- the user interface on the cell phone confirms that the payment has been satisfactorily made, and indicates the number of minutes purchased from the meter.
- a display at the streetside meter may also reflect the purchased time.
- the user leaves the meter and attends to other business, and may use the cell phone for other purposes.
- the cell phone may lapse into a low power mode - darkening the screen.
- the downloaded application software tracks the number of minutes remaining on the meter. It can do this by periodically querying the associated server for the data. Or it can track the time countdown independently. At a given point, e.g., with ten minutes remaining, the cell phone sounds an alert.
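The independent countdown tracking could look like the sketch below, with times in seconds and the ten-minute alert threshold from the example; the class and method names are assumptions:

```python
class MeterSession:
    """Tracks time remaining on a parking meter and fires a one-shot
    alert when it first drops below a threshold (ten minutes here)."""

    def __init__(self, purchased_seconds, now, alert_threshold=10 * 60):
        self.expires_at = now + purchased_seconds
        self.alert_threshold = alert_threshold
        self.alerted = False

    def remaining(self, now):
        return max(0, self.expires_at - now)

    def tick(self, now):
        # Called periodically; returns True exactly once, at the alert point.
        if not self.alerted and self.remaining(now) <= self.alert_threshold:
            self.alerted = True
            return True
        return False

# Two hours purchased at t=0; the alert fires with ten minutes left.
session = MeterSession(purchased_seconds=2 * 3600, now=0)
```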
- the user sees that the cell phone has been returned to its active state, and the meter UI has been restored to the screen.
- the displayed UI reports the time remaining, and gives the user the opportunity to purchase more time.
- the user purchases another 30 minutes of time.
- the completed purchase is confirmed on the cell phone display - showing 40 minutes of time remaining.
- the display on the streetside meter can be similarly updated.
- the block diagram of the parking meter is similar to that of the thermostat of Fig. 80, albeit without the temperature sensor.
- Fig. 85 shows an alarm clock 580 employing aspects of the present technology. Like other alarm clocks it includes a display 582, a physical UI 584 (e.g., buttons), and a controlling processor 586. This clock, however, also includes a Bluetooth wireless interface 588, and a memory 590 in which are stored ThingPipe and Bluetooth software for execution by the processor. The clock also has means for identifying itself, such as a digital watermark or barcode, as described above.
- the user captures an image of the clock.
- An identifier is decoded from the imagery, either by the cell phone processor, or by a processor in a remote server 552b. From the identifier, the router identifies a further server 556b that is knowledgeable about such clocks. The router passes the identifier to the further server, together with the address of the cell phone.
- the server uses the decoded watermark identifier to look up that particular clock, and recall instructions about its processor, display, and other configuration data. It also provides instructions by which the particular display of cell phone 540 can present a standardized clock interface, through which the clock parameters can be set. The server packages this information in a file, which is transmitted back to the cell phone.
- the cell phone receives this information, and presents the user interface detailed by server 556b on the screen. It is a familiar interface - appearing each time this cell phone is used to interact with a hotel alarm clock - regardless of the clock's model or manufacturer. (In some cases the phone may simply recall the UI from a UI cache, e.g., in the cell phone, since it is used frequently.)
- the cell phone communicates, by Bluetooth, with the clock. (Parameters sent from server 556b may be required to establish the session.)
- the time displayed on the clock is presented on the cell phone UI, together with a menu of options.
- One of the options presented on the cell phone screen is "SET ALARM."
- the UI shifts to a further screen 595 (Fig. 86) inviting the user to enter the desired alarm time by pressing the digit keys on the phone's keypad.
- the instruction to set the alarm time to 5:30 a.m. is received by Bluetooth.
- the ThingPipe software in the alarm clock memory understands the formatting by which data is conveyed by the Bluetooth signal, and parses out the desired time, and the command to set the alarm.
- the alarm clock processor then sets the alarm to ring at the designated time.
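The source does not specify the ThingPipe wire format; the sketch below assumes a minimal packed message (one command byte, then hour and minute bytes) simply to illustrate the parse-and-dispatch step. All constants and names are hypothetical:

```python
import struct

CMD_SET_ALARM = 0x01  # hypothetical command code

def encode_set_alarm(hour, minute):
    """Pack a set-alarm command as three unsigned bytes."""
    return struct.pack("BBB", CMD_SET_ALARM, hour, minute)

def parse(message):
    """Parse a received ThingPipe message into a (command, ...) tuple."""
    cmd, hour, minute = struct.unpack("BBB", message)
    if cmd == CMD_SET_ALARM:
        return ("set_alarm", hour, minute)
    raise ValueError("unknown command %#x" % cmd)

# Setting the alarm to 5:30 a.m., as in the example above:
msg = encode_set_alarm(5, 30)
```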
- the cell phone and clock communicate directly - rather than through one or more intermediary computers.
- the other computers were consulted by the cell phone to obtain the programming particulars for the clock but, once obtained, were not contacted further.
- the user interface does not integrate itself (e.g., in registered alignment) with the image of the clock captured by the user. This refinement is omitted in favor of presenting a consistent user interface experience - independent of the particular clock being programmed.
- a watermark is preferred by the present applicants to identify the particular device.
- any other known identification technology can be used, including those noted above.
- Knowing the locations of the devices allows enhanced functionality to be realized. For example, it allows devices to be identified by their position (e.g., unique latitude/longitude/elevation coordinates) - rather than by an identifier (e.g., watermarked or otherwise). Moreover, it allows proximity between a cell phone and other ThingPipe devices to be determined.
- the ThingPipe software may be launched by the user, or it may already be running in the background. This software communicates the cell phone's current position to server 552, and requests identification of other ThingPipe-enabled devices nearby.
- "Nearby" is, of course, dependent on the implementation. It may be, for example, 10 feet, 10 meters, 50 feet, 50 meters, etc. (This parameter can be defined by the cell phone user, or a default value can be employed.)
- Server 552 checks a database identifying the current locations of other ThingPipe-enabled devices, and returns data to the cell phone identifying those that are nearby. A listing 598 (Fig.
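The proximity query can be sketched as below - a toy device-location table filtered by straight-line distance, with positions in meters in a local frame. All names and coordinates are invented for illustration:

```python
import math

# Hypothetical device-location database, (x, y) in meters.
DEVICES = {
    "THERMOSTAT": (2.0, 1.0),
    "PRINTER":    (8.0, 0.0),
    "PROJECTOR": (40.0, 3.0),
}

def nearby(phone_pos, radius_m=10.0):
    """Return names of devices within radius_m of the phone, sorted."""
    px, py = phone_pos
    return sorted(name for name, (x, y) in DEVICES.items()
                  if math.hypot(x - px, y - py) <= radius_m)
```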
- if the cell phone's location module includes a magnetometer, or other means to determine the direction the device is facing, the displayed listing can also include directional clues with the distance, e.g., "4' to your left."
- the user selects the THERMOSTAT from the displayed list (e.g., by touching the screen - if a touchscreen, or entering the associated digit on a keypad).
- the phone then establishes a ThingPipe session with the thus-identified device, as detailed above. (In this example the thermostat user interface is not overlaid atop an image of the thermostat, since no image was captured.)
- authorization may again extend to anyone who approaches the meter and captures a picture (or otherwise senses its identifier from short range).
- the user is able to recall the corresponding UI at a later time and engage in further transactions with the device. This is fine, to a point. Perhaps twelve hours from the time of image capture is a suitable time interval within which a user can interact with the meter.
- the user's authorization to interact with the device may be terminated when a new user initiates a session with the meter (e.g., by capturing an image of the device and initiating a transaction of the sort identified above).
- a memory storing data setting the user's authorization period can be located in the meter, or it can be located elsewhere, e.g., in server 556a.
- a corresponding ID for the user would also normally be stored. This can be the user's telephone number, a MAC identifier for the phone device, or some other generally unique identifier.
- the thermostat there may be stricter controls about who is authorized to change the temperature, and for how long. Perhaps only supervisors in an office can set the temperature. Other personnel may be granted lesser privileges, e.g., to simply view the current ambient temperature.
- a memory storing such data can be located in the thermostat, in the server 556, or elsewhere.
- ThingPipe application can have a "Recent UI" menu option that, when selected, summons a list of pending or recent sessions. Selecting any recalls the corresponding UI, allowing the user to continue an earlier interaction with a particular device.
- Physical user interfaces - such as for thermostats and the like - are fixed. All users are presented with the same physical display, knobs, dials, etc. All interactions must be force-fit into this same physical vocabulary of controls. Implementations of the aspects of the present technology can be more diverse. Users may have stored profile settings - customizing cell phone UIs to their particular preferences - globally, and/or on a per-device basis. For example, a color-blind user may so-specify, causing a gray scale interface to always be presented - instead of colors which may be difficult for the user to discriminate. A person with farsighted vision may prefer that information be displayed in the largest possible font - regardless of aesthetics. Another person may opt for text to be read from the display, such as by a synthesized voice.
- One particular thermostat UI may normally present text indicating the current date; a user may prefer that the UI not be cluttered with such information, and may specify - for that UI - that no date information should be shown.
- the user interface can also be customized for specific task-oriented interactions with the object.
- a technician may invoke a "debug" interface for a thermostat, in order to trouble-shoot an associated HVAC system; an office worker may invoke a simpler UI that simply presents the current and set-point temperatures.
- a first security level includes simply encoding (covertly or overtly) contact instructions for the object in the surface features of the object, such as an IP address.
- the session simply starts with the cell phone collecting contact information from the device. (Indirection may be involved; the information on the device may refer to a remote repository that stores the contact information for the device.)
- a second level includes public-key information, explicitly present on the device through overt symbology, more subtly hidden through steganographic digital watermarking, indirectly accessed, or otherwise- conveyed.
- machine readable data on the device may provide the device's public key - with which transmissions from the user must be encrypted.
- the user's transmissions may also convey the user's public key - by which the device can identify the user, and with which data/instructions returned to the cell phone are encrypted.
- a thermostat in a mall may use such technology. All passers-by may be able to read the thermostat's public key. However, the thermostat may only grant control privileges to certain users - identified by their respective public keys.
- a third level includes preventing control of the device unless the user submits unique patterns or digital watermarking that can only be acquired by actively taking a photograph of the device. That is, it is not simply enough to send an identifier corresponding to the device. Rather, minutiae evidencing the user's physical proximity to the device must also be captured and transmitted. Only by capturing a picture of the device can the user obtain the necessary data; the image pixels essentially prove the user is nearby and took the picture.
- all patterns previously submitted may be cached - either at a remote server, or at the device, and checked against new data as it is received. If the identical pattern is submitted a second time, it may be disqualified - as an apparent playback attack (i.e., each image of the device should have some variation at the pixel level). In some arrangements the appearance of the device is changed over time (e.g., by a display that presents a periodically-changing pattern of pixels), and the submitted data must correspond to the device within an immediately preceding interval of time (e.g., 5 seconds, or 5 minutes).
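The pattern-caching check against playback attacks might look like the following sketch, which stores a digest of each submitted pixel pattern and rejects an exact resubmission (the time-window variant in the text is omitted for brevity; names are illustrative):

```python
import hashlib

class ReplayGuard:
    """Caches digests of previously submitted pixel patterns; two
    genuine photographs of a device should differ at the pixel level,
    so a bit-for-bit repeat is treated as an apparent playback attack."""

    def __init__(self):
        self.seen = set()

    def accept(self, pixel_bytes: bytes) -> bool:
        digest = hashlib.sha256(pixel_bytes).hexdigest()
        if digest in self.seen:
            return False          # identical pattern: apparent replay
        self.seen.add(digest)
        return True

guard = ReplayGuard()
first = bytes(range(16))  # stand-in for captured pixel data
```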
- any analog information can be sensed from the device or its environment, and used to establish user proximity to the device.
- One simple application of this arrangement is a scavenger hunt - where taking a picture of a device proves the user's presence at the device.
- a more practical application is industrial settings, where there is concern about people remotely trying to access devices without physically being there.
- SIFT is an acronym for Scale-Invariant Feature Transform, a computer vision technology pioneered by David Lowe and described in various of his papers including "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, 60, 2 (2004).
- SIFT works by identification and description - and subsequent detection - of local image features.
- the SIFT features are local and based on the appearance of the object at particular interest points, and are invariant to image scale, rotation and affine transformation. They are also robust to changes in illumination, noise, and some changes in viewpoint. In addition to these properties, they are distinctive, relatively easy to extract, allow for correct object identification with low probability of mismatch and are straightforward to match against a (large) database of local features.
- Object description by a set of SIFT features is also robust to partial occlusion; as few as three SIFT features from an object are enough to compute its location and pose.
- the technique starts by identifying local image features - termed keypoints - in a reference image. This is done by convolving the image with Gaussian blur filters at different scales (resolutions), and determining differences between successive Gaussian-blurred images. Keypoints are those image features having maxima or minima of the difference of Gaussians occurring at multiple scales. (Each pixel in a difference-of-Gaussian frame is compared to its eight neighbors at the same scale, and to the nine corresponding pixels in each of the neighboring scales. If the pixel value is a maximum or minimum among all these pixels, it is selected as a candidate keypoint.)
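The difference-of-Gaussians step can be sketched in a few lines. This is a simplified single-octave illustration (the scale list and helper name are assumptions), not Lowe's full pyramid:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_candidates(image, sigmas=(1.0, 1.6, 2.56, 4.1)):
    """Candidate keypoints: blur at several scales, difference successive
    blurs, and keep pixels that are extrema among their 26 neighbors
    (8 in-scale, 9 in each adjacent scale). Single-octave sketch."""
    img = np.asarray(image, dtype=np.float64)
    blurred = [gaussian_filter(img, s) for s in sigmas]
    dog = np.stack([b2 - b1 for b1, b2 in zip(blurred, blurred[1:])])
    # 3x3x3 neighborhood extrema across (scale, y, x).
    maxima = dog == maximum_filter(dog, size=3)
    minima = dog == minimum_filter(dog, size=3)
    cands = np.argwhere(maxima | minima)
    # Only interior scales have neighbors both above and below.
    return cands[(cands[:, 0] > 0) & (cands[:, 0] < dog.shape[0] - 1)]

# Example: a soft blob yields a candidate at its center, at an interior scale.
blob = np.zeros((33, 33))
blob[16, 16] = 255.0
blob = gaussian_filter(blob, sigma=2.0)
cands = dog_candidates(blob)
```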
- the above procedure typically identifies many keypoints that are unsuitable, e.g., due to having low contrast (thus being susceptible to noise), or due to having poorly determined locations along an edge (the Difference of Gaussians function has a strong response along edges, yielding many candidate keypoints, but many of these are not robust to noise).
- These unreliable keypoints are screened out by performing a detailed fit on the candidate keypoints to nearby data for accurate location, scale, and ratio of principal curvatures. This rejects keypoints that have low contrast, or are poorly located along an edge.
- this process starts by - for each candidate keypoint - interpolating nearby data to more accurately determine keypoint location. This is often done by a Taylor expansion with the keypoint as the origin, to determine a refined estimate of maxima/minima location.
- the value of the second-order Taylor expansion can also be used to identify low contrast keypoints. If the contrast is less than a threshold (e.g., 0.03), the keypoint is discarded.
- a variant of a corner detection procedure is applied. Briefly, this involves computing the principal curvature across the edge, and comparing to the principal curvature along the edge. This is done by solving for eigenvalues of a second order Hessian matrix.
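The curvature comparison can be sketched as below. Discrete second differences give the 2x2 Hessian of the difference-of-Gaussians image; rather than solving for the eigenvalues explicitly, their ratio is tested through the trace and determinant, rejecting a keypoint as edge-like when trace(H)^2/det(H) exceeds (r+1)^2/r (Lowe uses r = 10). The helper name is illustrative:

```python
import numpy as np

def is_edge_like(dog, y, x, r=10.0):
    """True if the keypoint at (y, x) in the DoG image sits on an edge:
    one large and one small principal curvature, or a degenerate/saddle
    Hessian (non-positive determinant)."""
    d = np.asarray(dog, dtype=np.float64)
    dxx = d[y, x + 1] - 2 * d[y, x] + d[y, x - 1]
    dyy = d[y + 1, x] - 2 * d[y, x] + d[y - 1, x]
    dxy = (d[y + 1, x + 1] - d[y + 1, x - 1]
           - d[y - 1, x + 1] + d[y - 1, x - 1]) / 4.0
    tr, det = dxx + dyy, dxx * dyy - dxy * dxy
    return det <= 0 or tr * tr / det > (r + 1) ** 2 / r

# A 1-D ridge is rejected; an isolated peak (curved both ways) is kept.
ridge = np.zeros((5, 5)); ridge[2, :] = 1.0
peak = np.zeros((5, 5)); peak[2, 2] = 1.0
```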
- the keypoint descriptor is computed as a set of orientation histograms on (4 x 4) pixel neighborhoods.
- the foregoing procedure is applied to training images to compile a reference database.
- An unknown image is then processed as above to generate keypoint data, and the closest-matching image in the database is identified by a Euclidian distance-like measure.
- a "best-bin-first" algorithm is typically used instead of a pure Euclidean distance calculation, to achieve several orders of magnitude speed improvement.
- a "no match" output is produced if the distance score for the best match is close - e.g., within 25% - to the distance score for the next-best match.
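This distance-ratio check can be sketched as follows, reporting None ("no match") when the best distance is within 25% of the runner-up's; the function name and the exact ratio are illustrative:

```python
import numpy as np

def match_descriptor(query, database, ratio=0.75):
    """Nearest-neighbor match with a distance-ratio rejection: if the
    best match's distance exceeds ratio * (second-best distance), the
    match is ambiguous and None ("no match") is returned."""
    dists = np.linalg.norm(np.asarray(database, dtype=float)
                           - np.asarray(query, dtype=float), axis=1)
    order = np.argsort(dists)
    best, second = order[0], order[1]
    if dists[best] > ratio * dists[second]:
        return None               # too close to the runner-up: "no match"
    return int(best)
```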
- an image may be matched by clustering. This identifies features that belong to the same reference image - allowing unclustered results to be discarded as spurious.
- a Hough transform can be used - identifying clusters of features that vote for the same object pose.
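The Hough-style pose voting can be sketched as below: each feature match votes for a quantized (object, rotation, scale) bin, and only bins gathering several consistent votes survive, so isolated matches are discarded as spurious. Bin sizes and all names are illustrative:

```python
import math
from collections import Counter

def pose_bin(match, rot_bin_deg=30.0):
    """Quantize one feature match's pose hypothesis into a voting bin."""
    obj, rotation_deg, scale = match
    return (obj, round(rotation_deg / rot_bin_deg), round(math.log2(scale)))

def cluster_votes(matches, min_votes=3):
    """Keep only pose hypotheses supported by several consistent votes."""
    votes = Counter(pose_bin(m) for m in matches)
    return [pose for pose, n in votes.items() if n >= min_votes]

# Three consistent thermostat matches survive; a lone outlier does not.
matches = [("thermostat", 10.0, 1.0),
           ("thermostat", 12.0, 1.1),
           ("thermostat", 8.0, 0.95),
           ("clock", 200.0, 4.0)]
```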
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Human Computer Interaction (AREA)
- General Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Health & Medical Sciences (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Telephone Function (AREA)
- Information Transfer Between Computers (AREA)
- Studio Devices (AREA)
- Telephonic Communication Services (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
Description
Claims
Priority Applications (16)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020167032337A KR101763132B1 (en) | 2008-08-19 | 2009-08-19 | Methods and systems for content processing |
KR1020117006167A KR101680044B1 (en) | 2008-08-19 | 2009-08-19 | Methods and systems for content processing |
CA2734613A CA2734613C (en) | 2008-08-19 | 2009-08-19 | Methods and systems for content processing |
EP09808792.7A EP2313847A4 (en) | 2008-08-19 | 2009-08-19 | Methods and systems for content processing |
CN200980141567.8A CN102216941B (en) | 2008-08-19 | 2009-08-19 | For the method and system of contents processing |
US13/011,618 US8805110B2 (en) | 2008-08-19 | 2011-01-21 | Methods and systems for content processing |
US13/964,014 US9749607B2 (en) | 2009-07-16 | 2013-08-09 | Coordinated illumination and image signal capture for enhanced signal detection |
US14/078,171 US8929877B2 (en) | 2008-09-12 | 2013-11-12 | Methods and systems for content processing |
US14/456,784 US9886845B2 (en) | 2008-08-19 | 2014-08-11 | Methods and systems for content processing |
US14/590,669 US9565512B2 (en) | 2008-09-12 | 2015-01-06 | Methods and systems for content processing |
US15/425,817 US9918183B2 (en) | 2008-09-12 | 2017-02-06 | Methods and systems for content processing |
US15/687,153 US10223560B2 (en) | 2009-07-16 | 2017-08-25 | Coordinated illumination and image signal capture for enhanced signal detection |
US15/889,013 US10922957B2 (en) | 2008-08-19 | 2018-02-05 | Methods and systems for content processing |
US16/291,366 US10713456B2 (en) | 2009-07-16 | 2019-03-04 | Coordinated illumination and image signal capture for enhanced signal detection |
US16/927,730 US11386281B2 (en) | 2009-07-16 | 2020-07-13 | Coordinated illumination and image signal capture for enhanced signal detection |
US17/174,712 US11587432B2 (en) | 2008-08-19 | 2021-02-12 | Methods and systems for content processing |
Applications Claiming Priority (26)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US9008308P | 2008-08-19 | 2008-08-19 | |
US61/090,083 | 2008-08-19 | ||
US9670308P | 2008-09-12 | 2008-09-12 | |
US61/096,703 | 2008-09-12 | ||
US10064308P | 2008-09-26 | 2008-09-26 | |
US61/100,643 | 2008-09-26 | ||
US10390708P | 2008-10-08 | 2008-10-08 | |
US61/103,907 | 2008-10-08 | ||
US11049008P | 2008-10-31 | 2008-10-31 | |
US61/110,490 | 2008-10-31 | ||
US12/271,692 | 2008-11-14 | ||
US12/271,692 US8520979B2 (en) | 2008-08-19 | 2008-11-14 | Methods and systems for content processing |
US16926609P | 2009-04-14 | 2009-04-14 | |
US61/169,266 | 2009-04-14 | ||
US17482209P | 2009-05-01 | 2009-05-01 | |
US61/174,822 | 2009-05-01 | ||
US17673909P | 2009-05-08 | 2009-05-08 | |
US61/176,739 | 2009-05-08 | ||
US12/484,115 | 2009-06-12 | ||
US12/484,115 US8385971B2 (en) | 2008-08-19 | 2009-06-12 | Methods and systems for content processing |
US12/498,709 US20100261465A1 (en) | 2009-04-14 | 2009-07-07 | Methods and systems for cell phone interactions |
US12/498,709 | 2009-07-07 | ||
US22619509P | 2009-07-16 | 2009-07-16 | |
US61/226,195 | 2009-07-16 | ||
US23454209P | 2009-08-17 | 2009-08-17 | |
US61/234,542 | 2009-08-17 |
Related Parent Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/271,692 Continuation-In-Part US8520979B2 (en) | 2008-08-19 | 2008-11-14 | Methods and systems for content processing |
US12/498,709 Continuation-In-Part US20100261465A1 (en) | 2008-08-19 | 2009-07-07 | Methods and systems for cell phone interactions |
US13/888,939 Continuation-In-Part US9008315B2 (en) | 2009-07-16 | 2013-05-07 | Shared secret arrangements and optical data transfer |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/011,618 Continuation US8805110B2 (en) | 2008-08-19 | 2011-01-21 | Methods and systems for content processing |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010022185A1 true WO2010022185A1 (en) | 2010-02-25 |
Family
ID=43760038
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2009/054358 WO2010022185A1 (en) | 2008-08-19 | 2009-08-19 | Methods and systems for content processing |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP2313847A4 (en) |
KR (2) | KR101680044B1 (en) |
CN (1) | CN102216941B (en) |
CA (1) | CA2734613C (en) |
WO (1) | WO2010022185A1 (en) |
Cited By (45)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102170471A (en) * | 2011-04-14 | 2011-08-31 | 宋健 | A real-time audio and video signal transmission method and system replacing satellite network |
WO2011116309A1 (en) * | 2010-03-19 | 2011-09-22 | Digimarc Corporation | Intuitive computing methods and systems |
WO2012005947A2 (en) * | 2010-07-07 | 2012-01-12 | Spinella Ip Holdings, Inc. | System and method for transmission, processing, and rendering of stereoscopic and multi-view images |
WO2012005955A2 (en) * | 2010-06-29 | 2012-01-12 | Microsoft Corporation | Content authoring and propagation at various fidelities |
EP2426645A1 (en) * | 2010-09-06 | 2012-03-07 | Sony Corporation | Image processing device, program, and image processing method |
US8175617B2 (en) | 2009-10-28 | 2012-05-08 | Digimarc Corporation | Sensor-based mobile search, related methods and systems |
CN102511013A (en) * | 2010-07-23 | 2012-06-20 | 索尼公司 | Imaging device, method for controlling same, and program |
CN102812497A (en) * | 2011-03-03 | 2012-12-05 | 松下电器产业株式会社 | Video provision device, video provision method, and video provision program capable of providing vicarious experience |
WO2013000142A1 (en) * | 2011-06-30 | 2013-01-03 | 深圳市君盛惠创科技有限公司 | Mobile phone user identity authentication method, cloud server and network system |
CN103018162A (en) * | 2011-09-22 | 2013-04-03 | 致茂电子股份有限公司 | System and method for processing video data for testing |
US8520979B2 (en) | 2008-08-19 | 2013-08-27 | Digimarc Corporation | Methods and systems for content processing |
KR20130118897A (en) * | 2010-11-04 | 2013-10-30 | 디지맥 코포레이션 | Smartphone-based methods and systems |
CN103442218A (en) * | 2013-08-27 | 2013-12-11 | 宁波海视智能系统有限公司 | Video signal pre-processing method of multi-mode behavior recognition and description |
US8627096B2 (en) | 2011-07-14 | 2014-01-07 | Sensible Vision, Inc. | System and method for providing secure access to an electronic device using both a screen gesture and facial biometrics |
US8774147B2 (en) | 2012-02-23 | 2014-07-08 | Dahrwin Llc | Asynchronous wireless dynamic ad-hoc network |
CN103996209A (en) * | 2014-05-21 | 2014-08-20 | 北京航空航天大学 | Infrared vessel object segmentation method based on salient region detection |
US8886222B1 (en) | 2009-10-28 | 2014-11-11 | Digimarc Corporation | Intuitive computing methods and systems |
US20140357312A1 (en) * | 2010-11-04 | 2014-12-04 | Digimarc Corporation | Smartphone-based methods and systems |
CN104267808A (en) * | 2014-09-18 | 2015-01-07 | 北京智谷睿拓技术服务有限公司 | Action recognition method and equipment |
US20150286873A1 (en) * | 2014-04-03 | 2015-10-08 | Bruce L. Davis | Smartphone-based methods and systems |
US9354778B2 (en) | 2013-12-06 | 2016-05-31 | Digimarc Corporation | Smartphone-based methods and systems |
US9367886B2 (en) | 2010-11-04 | 2016-06-14 | Digimarc Corporation | Smartphone arrangements responsive to musical artists and other content proprietors |
CN106815673A (en) * | 2016-11-29 | 2017-06-09 | 施冬冬 | An intelligent vehicle management system |
CN106896919A (en) * | 2010-12-03 | 2017-06-27 | 雷蛇(亚太)私人有限公司 | Configuration file management method |
US9763048B2 (en) | 2009-07-21 | 2017-09-12 | Waldeck Technology, Llc | Secondary indications of user locations and use thereof by a location-based service |
EP3223216A1 (en) * | 2016-03-23 | 2017-09-27 | Yokogawa Electric Corporation | Maintenance information sharing device, maintenance information sharing method, and non-transitory computer readable storage medium |
US9940118B2 (en) | 2012-02-23 | 2018-04-10 | Dahrwin Llc | Systems and methods utilizing highly dynamic wireless ad-hoc networks |
WO2019182907A1 (en) * | 2018-03-21 | 2019-09-26 | Nulman Yanir | Design, platform, and methods for personalized human interactions through digital communication devices |
US10542285B2 (en) | 2011-09-23 | 2020-01-21 | Velos Media, Llc | Decoded picture buffer management |
WO2020159386A1 (en) * | 2019-02-01 | 2020-08-06 | Andersen Terje N | Method and system for extracting metadata from an observed scene |
US10832026B2 (en) | 2012-03-01 | 2020-11-10 | Sys-Tech Solutions, Inc. | Method and system for determining whether a barcode is genuine using a gray level co-occurrence matrix |
CN112191055A (en) * | 2020-09-29 | 2021-01-08 | 广州天域科技有限公司 | Dust device with air detection structure for mining machinery |
US10922699B2 (en) | 2012-03-01 | 2021-02-16 | Sys-Tech Solutions, Inc. | Method and system for determining whether a barcode is genuine using a deviation from a nominal shape |
US10997385B2 (en) | 2012-03-01 | 2021-05-04 | Sys-Tech Solutions, Inc. | Methods and a system for verifying the authenticity of a mark using trimmed sets of metrics |
CN112819761A (en) * | 2021-01-21 | 2021-05-18 | 百度在线网络技术(北京)有限公司 | Model training method, score determination method, apparatus, device, medium, and product |
US11049094B2 (en) | 2014-02-11 | 2021-06-29 | Digimarc Corporation | Methods and arrangements for device to device communication |
US20210201018A1 (en) * | 2019-11-21 | 2021-07-01 | Tata Consultancy Services Limited | System and method for determination of label values in unstructured documents |
CN113488037A (en) * | 2020-07-10 | 2021-10-08 | 青岛海信电子产业控股股份有限公司 | Speech recognition method |
CN114697964A (en) * | 2022-05-30 | 2022-07-01 | 深圳市中电网络技术有限公司 | Information management method based on Internet and biometric authentication, and cloud service platform |
WO2023273318A1 (en) * | 2021-06-30 | 2023-01-05 | Huawei Cloud Computing Technologies Co., Ltd. | Data-sharing systems and methods, which use multi-angle incentive allocation |
US11669752B2 (en) | 2014-04-22 | 2023-06-06 | Google Llc | Automatic actions based on contextual replies |
US11682141B2 (en) | 2011-09-30 | 2023-06-20 | Ebay Inc. | Item recommendations based on image feature data |
US11716169B2 (en) | 2021-12-09 | 2023-08-01 | SK Hynix Inc. | Method for error handling of an interconnection protocol, controller, and storage device |
DE102022204996A1 (en) | 2022-05-19 | 2023-11-23 | Carl Zeiss Smt Gmbh | Method and device for determining a residual gas using a residual gas analysis method in a vacuum of a vacuum chamber |
US11861669B2 (en) * | 2019-07-29 | 2024-01-02 | Walmart Apollo, Llc | System and method for textual analysis of images |
Families Citing this family (117)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8516607B2 (en) * | 2011-05-23 | 2013-08-20 | Qualcomm Incorporated | Facilitating data access control in peer-to-peer overlay networks |
LT3595281T (en) * | 2011-05-27 | 2022-05-25 | Dolby Laboratories Licensing Corporation | Scalable systems for controlling color management comprising varying levels of metadata |
CN104067291B (en) * | 2011-10-14 | 2017-07-18 | 西门子公司 | Method and system for supplying energy to at least one moving component of a wireless communication system, in particular an RFID tag of an RFID system |
TWI455579B (en) | 2011-10-26 | 2014-10-01 | Ability Entpr Co Ltd | Image processing method and processing circuit and image processing system and image capturing device using the same |
CN103095977B (en) * | 2011-10-31 | 2016-08-10 | 佳能企业股份有限公司 | Image acquisition method, and image processing system and image capture device applying the same |
KR20130048035A (en) * | 2011-11-01 | 2013-05-09 | 엘지전자 주식회사 | Media apparatus, contents server, and method for operating the same |
JP5851046B2 (en) * | 2011-11-10 | 2016-02-03 | エンパイア テクノロジー ディベロップメント エルエルシー | Remote display |
KR101375962B1 (en) * | 2012-02-27 | 2014-03-18 | 주식회사 팬택 | Flexible terminal |
CN103458175B (en) * | 2012-05-23 | 2018-08-03 | 杭州阿尔法红外检测技术有限公司 | Thermal imagery recording device and thermal imagery recording method |
US8959092B2 (en) * | 2012-06-27 | 2015-02-17 | Google Inc. | Providing streams of filtered photographs for user consumption |
US9373182B2 (en) * | 2012-08-17 | 2016-06-21 | Intel Corporation | Memory sharing via a unified memory architecture |
CN109542849B (en) * | 2012-09-16 | 2021-09-24 | 吴东辉 | Image file format, image file generating method, image file generating device and application |
US9164552B2 (en) | 2012-09-27 | 2015-10-20 | Futurewei Technologies, Inc. | Real time visualization of network information |
US8811670B2 (en) * | 2012-09-28 | 2014-08-19 | The Boeing Company | Method and system for using fingerprints to track moving objects in video |
CN103714089B (en) * | 2012-09-29 | 2018-01-05 | 上海盛大网络发展有限公司 | A method and system for realizing cloud database rollback |
US10318308B2 (en) * | 2012-10-31 | 2019-06-11 | Mobileye Vision Technologies Ltd. | Arithmetic logic unit |
CN104823145A (en) * | 2012-11-29 | 2015-08-05 | 爱德拉株式会社 | Method for providing different content according to widget to be visually changed on screen of smart device |
US9589314B2 (en) * | 2013-04-29 | 2017-03-07 | Qualcomm Incorporated | Query processing for tile-based renderers |
US9843623B2 (en) | 2013-05-28 | 2017-12-12 | Qualcomm Incorporated | Systems and methods for selecting media items |
KR101480065B1 (en) * | 2013-05-29 | 2015-01-09 | (주)베라시스 | Object detecting method using pattern histogram |
US9443355B2 (en) | 2013-06-28 | 2016-09-13 | Microsoft Technology Licensing, Llc | Reprojection OLED display for augmented reality experiences |
CN104424485A (en) * | 2013-08-22 | 2015-03-18 | 北京卓易讯畅科技有限公司 | Method and device for obtaining specific information based on image recognition |
KR101502841B1 (en) * | 2013-08-28 | 2015-03-16 | 현대미디어 주식회사 | Outline forming method of Bitmap font, and computer-readable recording medium for the same |
AU2014321165B2 (en) | 2013-09-11 | 2020-04-09 | See-Out Pty Ltd | Image searching method and apparatus |
CN105556947A (en) * | 2013-09-16 | 2016-05-04 | 汤姆逊许可公司 | Method and apparatus for color detection to generate text color |
AT514861A3 (en) | 2013-09-20 | 2015-05-15 | Asmag Holding Gmbh | Authentication system for a mobile data terminal |
CN103530649A (en) * | 2013-10-16 | 2014-01-22 | 北京理工大学 | Visual searching method applicable to mobile terminals |
KR101801581B1 (en) * | 2013-12-19 | 2017-11-27 | 인텔 코포레이션 | Protection system including machine learning snapshot evaluation |
CN105793867A (en) * | 2013-12-20 | 2016-07-20 | 西-奥特有限公司 | Image searching method and apparatus |
CA2885874A1 (en) * | 2014-04-04 | 2015-10-04 | Bradford A. Folkens | Image processing system including image priority |
CN105303506B (en) * | 2014-06-19 | 2018-10-26 | Tcl集团股份有限公司 | A data parallel processing method and system based on HTML5 |
KR101487461B1 (en) * | 2014-06-26 | 2015-01-28 | 우원소프트 주식회사 | Security control system by face recognition with private image secure function |
CN104967790B (en) | 2014-08-06 | 2018-09-11 | 腾讯科技(北京)有限公司 | Photographing method, device and mobile terminal |
US9560465B2 (en) * | 2014-10-03 | 2017-01-31 | Dts, Inc. | Digital audio filters for variable sample rates |
KR101642602B1 (en) * | 2014-12-02 | 2016-07-26 | 서진이엔에스(주) | System and method of detecting parking by software using analog/digital closed-circuit television image |
CN105095398B (en) * | 2015-07-03 | 2018-10-19 | 北京奇虎科技有限公司 | An information providing method and device |
CN105046256B (en) * | 2015-07-22 | 2018-10-16 | 福建新大陆自动识别技术有限公司 | QR code decoding method and system based on distorted image correction |
CN108055871B (en) * | 2015-09-15 | 2021-12-31 | 佩佩尔+富克斯欧洲股份公司 | Apparatus and method for providing a graphical representation or sequence thereof for detection by a detector |
US10769155B2 (en) * | 2016-05-17 | 2020-09-08 | Google Llc | Automatically augmenting message exchange threads based on tone of message |
US10498552B2 (en) | 2016-06-12 | 2019-12-03 | Apple Inc. | Presenting accessory state |
US10310725B2 (en) * | 2016-06-12 | 2019-06-04 | Apple Inc. | Generating scenes based on accessory state |
CN106213968A (en) * | 2016-08-04 | 2016-12-14 | 轩脉家居科技(上海)有限公司 | An intelligent curtain based on human action recognition |
CN106326888B (en) * | 2016-08-16 | 2022-08-16 | 北京旷视科技有限公司 | Image recognition method and device |
EP3293648B1 (en) * | 2016-09-12 | 2024-04-03 | Dassault Systèmes | Representation of a skeleton of a mechanical part |
US10846618B2 (en) * | 2016-09-23 | 2020-11-24 | Google Llc | Smart replies using an on-device model |
CN110603400B (en) * | 2016-12-09 | 2021-03-02 | 系统科技解决方案公司 | Method and computing device for determining whether a mark is authentic |
CN108228580A (en) * | 2016-12-09 | 2018-06-29 | 阿里巴巴集团控股有限公司 | A method and apparatus for displaying business object data in an image |
KR20180070082A (en) * | 2016-12-16 | 2018-06-26 | (주)태원이노베이션 | Vr contents generating system |
EP3343445A1 (en) * | 2016-12-28 | 2018-07-04 | Thomson Licensing | Method and apparatus for encoding and decoding lists of pixels |
DE102017100622A1 (en) * | 2017-01-13 | 2018-07-19 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Apparatus and methods for correcting registration information from one or more inertial sensors |
CN108804387B (en) * | 2017-04-27 | 2021-07-23 | 腾讯科技(深圳)有限公司 | Target user determination method and device |
CN107465868B (en) * | 2017-06-21 | 2018-11-16 | 珠海格力电器股份有限公司 | Object identification method, device and electronic equipment based on terminal |
CN107231547B (en) * | 2017-07-07 | 2021-03-12 | 广东中星电子有限公司 | Video monitoring system and method |
KR102282455B1 (en) * | 2017-07-11 | 2021-07-28 | 한화테크윈 주식회사 | Image process device and image processing method |
CN110998286B (en) * | 2017-07-31 | 2023-06-13 | 史密斯探测公司 | System for determining the presence of a substance of interest in a sample |
US10565087B2 (en) * | 2017-08-03 | 2020-02-18 | Microsoft Technology Licensing, Llc | Tentative execution of code in a debugger |
US10360832B2 (en) | 2017-08-14 | 2019-07-23 | Microsoft Technology Licensing, Llc | Post-rendering image transformation using parallel image transformation pipelines |
KR102337182B1 (en) * | 2017-09-29 | 2021-12-09 | 한국전력공사 | Smartmeter installed middleware platform for function extension, smartmeter application management system and method using the same |
CN109522254B (en) * | 2017-10-30 | 2022-04-12 | 上海寒武纪信息科技有限公司 | Arithmetic device and method |
CN109724617B (en) * | 2017-10-31 | 2021-12-24 | 腾讯科技(深圳)有限公司 | Navigation route drawing method and related equipment |
KR102523672B1 (en) * | 2017-11-14 | 2023-04-20 | 삼성전자주식회사 | Display apparatus, control method thereof and recording media |
CN107992809A (en) * | 2017-11-24 | 2018-05-04 | 合肥博焱智能科技有限公司 | An image processing system for face recognition |
CN108257077B (en) * | 2018-01-02 | 2022-03-22 | 深圳云天励飞技术有限公司 | GPU-based clustering data processing method and system and computing device |
CN108446964B (en) * | 2018-03-30 | 2022-03-22 | 中南大学 | User recommendation method based on mobile traffic DPI data |
CN108960285B (en) * | 2018-05-31 | 2021-05-07 | 东软集团股份有限公司 | Classification model generation method, tongue image classification method and tongue image classification device |
CN108830594B (en) * | 2018-06-22 | 2019-05-07 | 山东高速信联支付有限公司 | Multi-mode electronic fare payment system |
WO2020065586A1 (en) * | 2018-09-26 | 2020-04-02 | Guardian Glass, LLC | Augmented reality system and method for substrates, coated articles, insulating glass units, and/or the like |
FR3088160B1 (en) * | 2018-11-06 | 2021-04-02 | Teledyne E2V Semiconductors Sas | IMAGE SENSOR FOR OPTICAL CODE (S) RECOGNITION |
CN109598668B (en) * | 2018-12-05 | 2023-03-14 | 吉林大学 | Touch form digital watermark embedding and detecting method based on electrostatic force |
CN111292245A (en) * | 2018-12-07 | 2020-06-16 | 北京字节跳动网络技术有限公司 | Image processing method and device |
CN109711361A (en) * | 2018-12-29 | 2019-05-03 | 重庆集诚汽车电子有限责任公司 | Intelligent cockpit embedded fingerprint feature extracting method based on deep learning |
US11620478B2 (en) | 2019-02-06 | 2023-04-04 | Texas Instruments Incorporated | Semantic occupancy grid management in ADAS/autonomous driving |
CN109886205B (en) * | 2019-02-25 | 2023-08-08 | 苏州清研微视电子科技有限公司 | Real-time safety belt monitoring method and system |
CN109872334A (en) * | 2019-02-26 | 2019-06-11 | 电信科学技术研究院有限公司 | An image segmentation method and device |
KR102586014B1 (en) | 2019-03-05 | 2023-10-10 | 삼성전자주식회사 | Electronic apparatus and controlling method thereof |
US10990840B2 (en) * | 2019-03-15 | 2021-04-27 | Scenera, Inc. | Configuring data pipelines with image understanding |
CN111695395B (en) * | 2019-04-22 | 2021-01-05 | 广西众焰安科技有限公司 | Method for identifying field illegal behavior |
CN110069719B (en) * | 2019-04-24 | 2023-03-31 | 西安工程大学 | Internet environment-oriented behavior prediction method and prediction system thereof |
CN113826376B (en) * | 2019-05-24 | 2023-08-15 | Oppo广东移动通信有限公司 | User equipment and strabismus correction method |
CN110276281A (en) * | 2019-06-10 | 2019-09-24 | 浙江工业大学 | A screenshot image and text recognition and extraction method and system for mobile terminals |
WO2020256718A1 (en) * | 2019-06-19 | 2020-12-24 | Google Llc | Improved image watermarking |
US11537816B2 (en) * | 2019-07-16 | 2022-12-27 | Ancestry.Com Operations Inc. | Extraction of genealogy data from obituaries |
CN110853108B (en) * | 2019-10-11 | 2020-07-10 | 中国南方电网有限责任公司超高压输电公司天生桥局 | Compression storage method of infrared chart data |
CN110762817B (en) * | 2019-10-31 | 2020-08-07 | 万秋花 | Movable air outlet height adjusting platform and method for clinical laboratory |
CN110826726B (en) * | 2019-11-08 | 2023-09-08 | 腾讯科技(深圳)有限公司 | Target processing method, target processing device, target processing apparatus, and medium |
CN110895557B (en) * | 2019-11-27 | 2022-06-21 | 广东智媒云图科技股份有限公司 | Text feature judgment method and device based on neural network and storage medium |
CN110930064B (en) * | 2019-12-09 | 2023-04-25 | 山东大学 | Mars storm space-time probability extraction and landing safety evaluation method |
CN111178946B (en) * | 2019-12-17 | 2023-07-18 | 清华大学深圳国际研究生院 | User behavior characterization method and system |
CN111145180A (en) * | 2019-12-25 | 2020-05-12 | 威创集团股份有限公司 | Map tile processing method applied to large visual screen and related device |
CN111145354B (en) * | 2019-12-31 | 2024-02-13 | 北京恒华伟业科技股份有限公司 | BIM data model identification method and device |
CN111275057B (en) * | 2020-02-13 | 2023-06-20 | 腾讯科技(深圳)有限公司 | Image processing method, device and equipment |
CN111491088B (en) * | 2020-04-23 | 2021-08-31 | 支付宝(杭州)信息技术有限公司 | Security camera, image encryption method and device and electronic equipment |
WO2022013753A1 (en) * | 2020-07-15 | 2022-01-20 | Corephotonics Ltd. | Point of view aberrations correction in a scanning folded camera |
CN111881982A (en) * | 2020-07-30 | 2020-11-03 | 北京环境特性研究所 | Unmanned aerial vehicle target identification method |
CN112148710B (en) * | 2020-09-21 | 2023-11-14 | 珠海市卓轩科技有限公司 | Micro-service library separation method, system and medium |
CN112200199A (en) * | 2020-09-30 | 2021-01-08 | 南宁学院 | Identification method of screen color identification system |
CN112256581B (en) * | 2020-10-27 | 2024-01-23 | 华泰证券股份有限公司 | Log playback test method and device for high-simulation securities trade trading system |
CN112036387B (en) * | 2020-11-06 | 2021-02-09 | 成都索贝数码科技股份有限公司 | News picture shooting angle identification method based on gated convolutional neural network |
CN112257037B (en) * | 2020-11-13 | 2024-03-19 | 北京明朝万达科技股份有限公司 | Process watermarking method, system and electronic equipment |
CN112564796B (en) * | 2020-11-25 | 2021-11-02 | 重庆邮电大学 | Equipment payment information interaction system based on visible light wireless communication |
CN112991659B (en) * | 2021-03-18 | 2023-07-28 | 浙江赛龙建设科技有限公司 | Big data security monitoring management method with early warning processing function |
CN112995329B (en) * | 2021-03-22 | 2023-06-16 | 广东一一五科技股份有限公司 | File transmission method and system |
CN113020428B (en) * | 2021-03-24 | 2022-06-28 | 北京理工大学 | Progressive die machining monitoring method, device, equipment and storage medium |
WO2022229312A1 (en) * | 2021-04-29 | 2022-11-03 | Asml Netherlands B.V. | Hierarchical clustering of fourier transform based layout patterns |
CN113297547B (en) * | 2021-05-24 | 2022-07-08 | 上海大学 | Back door watermark adding method, verification method and system for data set |
CN113255059B (en) * | 2021-05-26 | 2024-04-05 | 上海海事大学 | Cruise ship weight control method |
CN113722672B (en) * | 2021-07-20 | 2022-04-05 | 厦门微亚智能科技有限公司 | Method for detecting and calculating stray light noise of VR Lens |
CN113610194B (en) * | 2021-09-09 | 2023-08-11 | 重庆数字城市科技有限公司 | Automatic classification method for digital files |
KR102606911B1 (en) * | 2021-09-13 | 2023-11-29 | 주식회사 위버스컴퍼니 | Method and system for controlling traffic inbound to application programming interface server |
US20230154212A1 (en) * | 2021-11-12 | 2023-05-18 | Zebra Technologies Corporation | Method of identifying indicia orientation and decoding indicia for machine vision systems |
CN114374636B (en) * | 2021-12-21 | 2024-04-02 | 航天科工网络信息发展有限公司 | Intelligent routing method, device and network equipment |
TWI806299B (en) * | 2021-12-21 | 2023-06-21 | 宏碁股份有限公司 | Processing method of sound watermark and sound watermark generating apparatus |
CN114611015A (en) * | 2022-03-25 | 2022-06-10 | 阿里巴巴达摩院(杭州)科技有限公司 | Interactive information processing method and device and cloud server |
CN115115885B (en) * | 2022-06-30 | 2024-04-02 | 中国科学院南京地理与湖泊研究所 | Land classification method using Gramian angular field conversion with important extreme points preserved |
CN116309494B (en) * | 2023-03-23 | 2024-01-23 | 宁波斯年智驾科技有限公司 | Method, device, equipment and medium for determining interest point information in electronic map |
CN117275069A (en) * | 2023-09-26 | 2023-12-22 | 华中科技大学 | End-to-end head gesture estimation method based on learnable vector and attention mechanism |
CN117647826B (en) * | 2024-01-29 | 2024-04-12 | 成都星历科技有限公司 | Navigation deception jamming signal detection system and method based on jamming source positioning |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040263663A1 (en) * | 2003-06-25 | 2004-12-30 | Sunplus Technology Co., Ltd. | Digital camera image controller apparatus for a mobile phone |
US20060012677A1 (en) * | 2004-02-20 | 2006-01-19 | Neven Hartmut Sr | Image-based search engine for mobile phones with camera |
US20070019088A1 (en) * | 2005-07-19 | 2007-01-25 | Alps Electric Co., Ltd. | Camera module and mobile phone |
US20070162971A1 (en) * | 2006-01-06 | 2007-07-12 | Nokia Corporation | System and method for managing captured content |
US20080007620A1 (en) * | 2006-07-06 | 2008-01-10 | Nokia Corporation | Method, Device, Mobile Terminal and Computer Program Product for a Camera Motion Detection Based Scheme for Improving Camera Input User Interface Functionalities |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB0226014D0 (en) * | 2002-11-08 | 2002-12-18 | Nokia Corp | Camera-LSI and information device |
US20050094730A1 (en) * | 2003-10-20 | 2005-05-05 | Chang Li F. | Wireless device having a distinct hardware video accelerator to support video compression and decompression |
KR100757167B1 (en) * | 2006-06-09 | 2007-09-07 | 엘지이노텍 주식회사 | Mobile phone with image signal processor for capturing biometric images and method for operating the same |
US20080094466A1 (en) | 2006-10-18 | 2008-04-24 | Richard Eric Helvick | Target use video limit notification on wireless communication device |
US7656438B2 (en) * | 2007-01-04 | 2010-02-02 | Sharp Laboratories Of America, Inc. | Target use video limit enforcement on wireless communication device |
2009
- 2009-08-19 EP EP09808792.7A patent/EP2313847A4/en not_active Ceased
- 2009-08-19 CA CA2734613A patent/CA2734613C/en active Active
- 2009-08-19 KR KR1020117006167A patent/KR101680044B1/en active IP Right Grant
- 2009-08-19 WO PCT/US2009/054358 patent/WO2010022185A1/en active Application Filing
- 2009-08-19 CN CN200980141567.8A patent/CN102216941B/en active Active
- 2009-08-19 KR KR1020167032337A patent/KR101763132B1/en active IP Right Grant
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040263663A1 (en) * | 2003-06-25 | 2004-12-30 | Sunplus Technology Co., Ltd. | Digital camera image controller apparatus for a mobile phone |
US20060012677A1 (en) * | 2004-02-20 | 2006-01-19 | Neven Hartmut Sr | Image-based search engine for mobile phones with camera |
US20070019088A1 (en) * | 2005-07-19 | 2007-01-25 | Alps Electric Co., Ltd. | Camera module and mobile phone |
US20070162971A1 (en) * | 2006-01-06 | 2007-07-12 | Nokia Corporation | System and method for managing captured content |
US20080007620A1 (en) * | 2006-07-06 | 2008-01-10 | Nokia Corporation | Method, Device, Mobile Terminal and Computer Program Product for a Camera Motion Detection Based Scheme for Improving Camera Input User Interface Functionalities |
Non-Patent Citations (1)
Title |
---|
See also references of EP2313847A4 * |
Cited By (73)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8520979B2 (en) | 2008-08-19 | 2013-08-27 | Digimarc Corporation | Methods and systems for content processing |
US9763048B2 (en) | 2009-07-21 | 2017-09-12 | Waldeck Technology, Llc | Secondary indications of user locations and use thereof by a location-based service |
US9444924B2 (en) | 2009-10-28 | 2016-09-13 | Digimarc Corporation | Intuitive computing methods and systems |
US9118771B2 (en) | 2009-10-28 | 2015-08-25 | Digimarc Corporation | Intuitive computing methods and systems |
US8977293B2 (en) | 2009-10-28 | 2015-03-10 | Digimarc Corporation | Intuitive computing methods and systems |
US8886222B1 (en) | 2009-10-28 | 2014-11-11 | Digimarc Corporation | Intuitive computing methods and systems |
US8175617B2 (en) | 2009-10-28 | 2012-05-08 | Digimarc Corporation | Sensor-based mobile search, related methods and systems |
CN102893327A (en) * | 2010-03-19 | 2013-01-23 | 数字标记公司 | Intuitive computing methods and systems |
WO2011116309A1 (en) * | 2010-03-19 | 2011-09-22 | Digimarc Corporation | Intuitive computing methods and systems |
CN102893327B (en) * | 2010-03-19 | 2015-05-27 | 数字标记公司 | Intuitive computing methods and systems |
WO2012005955A3 (en) * | 2010-06-29 | 2012-04-26 | Microsoft Corporation | Content authoring and propagation at various fidelities |
WO2012005955A2 (en) * | 2010-06-29 | 2012-01-12 | Microsoft Corporation | Content authoring and propagation at various fidelities |
WO2012005947A2 (en) * | 2010-07-07 | 2012-01-12 | Spinella Ip Holdings, Inc. | System and method for transmission, processing, and rendering of stereoscopic and multi-view images |
WO2012005947A3 (en) * | 2010-07-07 | 2014-06-26 | Spinella Ip Holdings, Inc. | System and method for transmission, processing, and rendering of stereoscopic and multi-view images |
CN102511013A (en) * | 2010-07-23 | 2012-06-20 | 索尼公司 | Imaging device, method for controlling same, and program |
US9741141B2 (en) | 2010-09-06 | 2017-08-22 | Sony Corporation | Image processing device, program, and image processing method |
EP2426645A1 (en) * | 2010-09-06 | 2012-03-07 | Sony Corporation | Image processing device, program, and image processing method |
US9484046B2 (en) * | 2010-11-04 | 2016-11-01 | Digimarc Corporation | Smartphone-based methods and systems |
US10971171B2 (en) | 2010-11-04 | 2021-04-06 | Digimarc Corporation | Smartphone-based methods and systems |
US20170236006A1 (en) * | 2010-11-04 | 2017-08-17 | Digimarc Corporation | Smartphone-based methods and systems |
US20140357312A1 (en) * | 2010-11-04 | 2014-12-04 | Digimarc Corporation | Smartphone-based methods and systems |
KR102010221B1 (en) | 2010-11-04 | 2019-08-13 | 디지맥 코포레이션 | Smartphone-based methods and systems |
US9367886B2 (en) | 2010-11-04 | 2016-06-14 | Digimarc Corporation | Smartphone arrangements responsive to musical artists and other content proprietors |
KR20130118897A (en) * | 2010-11-04 | 2013-10-30 | 디지맥 코포레이션 | Smartphone-based methods and systems |
CN106896919A (en) * | 2010-12-03 | 2017-06-27 | 雷蛇(亚太)私人有限公司 | Configuration file management method |
CN102812497A (en) * | 2011-03-03 | 2012-12-05 | 松下电器产业株式会社 | Video provision device, video provision method, and video provision program capable of providing vicarious experience |
CN102170471A (en) * | 2011-04-14 | 2011-08-31 | 宋健 | A real-time audio and video signal transmission method and system replacing satellite network |
US8983145B2 (en) | 2011-06-30 | 2015-03-17 | Shenzhen Junshenghuichuang Technologies Co., Ltd | Method for authenticating identity of handset user |
US9537859B2 (en) | 2011-06-30 | 2017-01-03 | Dongguan Ruiteng Electronics Technologies Co., Ltd | Method for authenticating identity of handset user in a cloud-computing environment |
WO2013000142A1 (en) * | 2011-06-30 | 2013-01-03 | 深圳市君盛惠创科技有限公司 | Mobile phone user identity authentication method, cloud server and network system |
US9813909B2 (en) | 2011-06-30 | 2017-11-07 | Guangzhou Haiji Technology Co., Ltd | Cloud server for authenticating the identity of a handset user |
US8861798B2 (en) | 2011-06-30 | 2014-10-14 | Shenzhen Junshenghuichuang Technologies Co., Ltd. | Method for authenticating identity of handset user |
US8627096B2 (en) | 2011-07-14 | 2014-01-07 | Sensible Vision, Inc. | System and method for providing secure access to an electronic device using both a screen gesture and facial biometrics |
CN103018162A (en) * | 2011-09-22 | 2013-04-03 | 致茂电子股份有限公司 | System and method for processing video data for testing |
US10542285B2 (en) | 2011-09-23 | 2020-01-21 | Velos Media, Llc | Decoded picture buffer management |
US11490119B2 (en) | 2011-09-23 | 2022-11-01 | Qualcomm Incorporated | Decoded picture buffer management |
US10856007B2 (en) | 2011-09-23 | 2020-12-01 | Velos Media, Llc | Decoded picture buffer management |
US11682141B2 (en) | 2011-09-30 | 2023-06-20 | Ebay Inc. | Item recommendations based on image feature data |
US10075892B2 (en) | 2012-02-23 | 2018-09-11 | Dahrwin Llc | Mobile device for use in a dynamic and stochastic asynchronously updated wireless ad-hoc network |
US8774147B2 (en) | 2012-02-23 | 2014-07-08 | Dahrwin Llc | Asynchronous wireless dynamic ad-hoc network |
US9338725B2 (en) | 2012-02-23 | 2016-05-10 | Dahrwin Llc | Mobile device for use in a dynamic and stochastic asynchronously updated wireless ad-hoc network |
US9940118B2 (en) | 2012-02-23 | 2018-04-10 | Dahrwin Llc | Systems and methods utilizing highly dynamic wireless ad-hoc networks |
US10997385B2 (en) | 2012-03-01 | 2021-05-04 | Sys-Tech Solutions, Inc. | Methods and a system for verifying the authenticity of a mark using trimmed sets of metrics |
US10922699B2 (en) | 2012-03-01 | 2021-02-16 | Sys-Tech Solutions, Inc. | Method and system for determining whether a barcode is genuine using a deviation from a nominal shape |
US10832026B2 (en) | 2012-03-01 | 2020-11-10 | Sys-Tech Solutions, Inc. | Method and system for determining whether a barcode is genuine using a gray level co-occurrence matrix |
CN103442218A (en) * | 2013-08-27 | 2013-12-11 | 宁波海视智能系统有限公司 | Video signal pre-processing method of multi-mode behavior recognition and description |
CN103442218B (en) * | 2013-08-27 | 2016-12-28 | 宁波海视智能系统有限公司 | A video signal pre-processing method for multi-mode behavior recognition and description |
US9354778B2 (en) | 2013-12-06 | 2016-05-31 | Digimarc Corporation | Smartphone-based methods and systems |
US11049094B2 (en) | 2014-02-11 | 2021-06-29 | Digimarc Corporation | Methods and arrangements for device to device communication |
US20150286873A1 (en) * | 2014-04-03 | 2015-10-08 | Bruce L. Davis | Smartphone-based methods and systems |
US11669752B2 (en) | 2014-04-22 | 2023-06-06 | Google Llc | Automatic actions based on contextual replies |
CN103996209B (en) * | 2014-05-21 | 2017-01-11 | 北京航空航天大学 | Infrared vessel object segmentation method based on salient region detection |
CN103996209A (en) * | 2014-05-21 | 2014-08-20 | 北京航空航天大学 | Infrared vessel object segmentation method based on salient region detection |
CN104267808A (en) * | 2014-09-18 | 2015-01-07 | 北京智谷睿拓技术服务有限公司 | Action recognition method and equipment |
EP3223216A1 (en) * | 2016-03-23 | 2017-09-27 | Yokogawa Electric Corporation | Maintenance information sharing device, maintenance information sharing method, and non-transitory computer readable storage medium |
US11308108B2 (en) | 2016-03-23 | 2022-04-19 | Yokogawa Electric Corporation | Maintenance information sharing device, maintenance information sharing method, and non-transitory computer readable storage medium |
CN106815673A (en) * | 2016-11-29 | 2017-06-09 | 施冬冬 | An intelligent vehicle management system |
WO2019182907A1 (en) * | 2018-03-21 | 2019-09-26 | Nulman Yanir | Design, platform, and methods for personalized human interactions through digital communication devices |
US11676251B2 (en) | 2019-02-01 | 2023-06-13 | Terje N. Andersen | Method and system for extracting metadata from an observed scene |
WO2020159386A1 (en) * | 2019-02-01 | 2020-08-06 | Andersen Terje N | Method and system for extracting metadata from an observed scene |
US11861669B2 (en) * | 2019-07-29 | 2024-01-02 | Walmart Apollo, Llc | System and method for textual analysis of images |
US11810383B2 (en) * | 2019-11-21 | 2023-11-07 | Tata Consultancy Services Limited | System and method for determination of label values in unstructured documents |
US20210201018A1 (en) * | 2019-11-21 | 2021-07-01 | Tata Consultancy Services Limited | System and method for determination of label values in unstructured documents |
CN113488037A (en) * | 2020-07-10 | 2021-10-08 | 青岛海信电子产业控股股份有限公司 | Speech recognition method |
CN113488037B (en) * | 2020-07-10 | 2024-04-12 | 海信集团控股股份有限公司 | Speech recognition method |
CN112191055A (en) * | 2020-09-29 | 2021-01-08 | 广州天域科技有限公司 | Dust device with air detection structure for mining machinery |
CN112819761A (en) * | 2021-01-21 | 2021-05-18 | 百度在线网络技术(北京)有限公司 | Model training method, score determination method, apparatus, device, medium, and product |
CN112819761B (en) * | 2021-01-21 | 2023-09-01 | 百度在线网络技术(北京)有限公司 | Model training method, score determining method, device, equipment, medium and product |
WO2023273318A1 (en) * | 2021-06-30 | 2023-01-05 | Huawei Cloud Computing Technologies Co., Ltd. | Data-sharing systems and methods, which use multi-angle incentive allocation |
US11716169B2 (en) | 2021-12-09 | 2023-08-01 | SK Hynix Inc. | Method for error handling of an interconnection protocol, controller, and storage device |
DE102022204996A1 (en) | 2022-05-19 | 2023-11-23 | Carl Zeiss Smt Gmbh | Method and device for determining a residual gas using a residual gas analysis method in a vacuum of a vacuum chamber |
CN114697964B (en) * | 2022-05-30 | 2022-08-09 | 深圳市中电网络技术有限公司 | Information management method based on Internet and biological authentication and cloud service platform |
CN114697964A (en) * | 2022-05-30 | 2022-07-01 | 深圳市中电网络技术有限公司 | Information management method based on Internet and biological authentication and cloud service platform |
Also Published As
Publication number | Publication date |
---|---|
KR101680044B1 (en) | 2016-11-28 |
CN102216941B (en) | 2015-08-12 |
EP2313847A1 (en) | 2011-04-27 |
KR20160136467A (en) | 2016-11-29 |
CA2734613C (en) | 2020-06-09 |
CA2734613A1 (en) | 2010-02-25 |
CN102216941A (en) | 2011-10-12 |
KR101763132B1 (en) | 2017-07-31 |
KR20110043775A (en) | 2011-04-27 |
EP2313847A4 (en) | 2015-12-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11587432B2 (en) | Methods and systems for content processing | |
US9918183B2 (en) | Methods and systems for content processing | |
US9271133B2 (en) | Methods and systems for image or audio recognition processing | |
CA2734613C (en) | Methods and systems for content processing | |
US9692984B2 (en) | Methods and systems for content processing | |
US9204038B2 (en) | Mobile device and method for image frame processing using dedicated and programmable processors, applying different functions on a frame-by-frame basis | |
US9208384B2 (en) | Methods and systems for content processing |
Legal Events
Date | Code | Title | Description
---|---|---|---
| WWE | Wipo information: entry into national phase | Ref document number: 200980141567.8; Country of ref document: CN
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 09808792; Country of ref document: EP; Kind code of ref document: A1
| REEP | Request for entry into the european phase | Ref document number: 2009808792; Country of ref document: EP
| WWE | Wipo information: entry into national phase | Ref document number: 2009808792; Country of ref document: EP
| WWE | Wipo information: entry into national phase | Ref document number: 2734613; Country of ref document: CA
| WWE | Wipo information: entry into national phase | Ref document number: 816/KOLNP/2011; Country of ref document: IN
| ENP | Entry into the national phase | Ref document number: 20117006167; Country of ref document: KR; Kind code of ref document: A
| NENP | Non-entry into the national phase | Ref country code: DE