KR101680044B1 - Methods and systems for content processing - Google Patents
- Publication number
- KR101680044B1 (application KR1020117006167A)
- South Korea
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment; Cameras comprising an electronic image sensor, e.g. digital cameras, video cameras, TV cameras, camcorders, webcams, camera modules for embedding in other devices, e.g. mobile phones, computers or vehicles
- H04N5/225—Television cameras ; Cameras comprising an electronic image sensor, e.g. digital cameras, video cameras, camcorders, webcams, camera modules specially adapted for being embedded in other devices, e.g. mobile phones, computers or vehicles
- H04N5/232—Devices for controlling television cameras, e.g. remote control ; Control of cameras comprising an electronic image sensor
- H04N5/23222—Computer-aided capture of images, e.g. transfer from script file into camera, check of taken image quality, advice or proposal for image composition or decision on when to take image
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/00221—Acquiring or recognising human faces, facial parts, facial sketches, facial expressions
- G06K9/00288—Classification, e.g. identification
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
- G06K9/00577—Recognising objects characterised by unique random properties, i.e. objects having a physically unclonable function [PUF], e.g. authenticating objects based on their unclonable texture
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/00624—Recognising scenes, i.e. recognition of a whole field of perception; recognising scene-specific objects
- G06K9/00664—Recognising scenes such as could be captured by a camera operated by a pedestrian or robot, including objects at substantially different ranges from the camera
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/20—Image acquisition
- G06K9/22—Image acquisition using hand-held instruments
- G06K9/228—Hand-held scanners; Optical wands
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/36—Image preprocessing, i.e. processing the image information without deciding about the identity of the image
- G06K9/46—Extraction of features or characteristics of the image
- G06K9/4604—Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes, intersections
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/36—Image preprocessing, i.e. processing the image information without deciding about the identity of the image
- G06K9/46—Extraction of features or characteristics of the image
- G06K9/4642—Extraction of features or characteristics of the image by performing operations within image blocks or by using histograms
- G06K9/4647—Extraction of features or characteristics of the image by performing operations within image blocks or by using histograms summing image-intensity values; Projection and histogram analysis
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/46—Extraction of features or characteristics of the image
- G06K9/4652—Extraction of features or characteristics of the image related to colour
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/46—Extraction of features or characteristics of the image
- G06K9/4671—Extracting features based on salient regional features, e.g. Scale Invariant Feature Transform [SIFT] keypoints
- G06—COMPUTING; CALCULATING; COUNTING
- G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
- G06K9/62—Methods or arrangements for recognition using electronic means
- G06K9/6217—Design or setup of recognition systems and techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06K9/6253—User interactive design ; Environments; Tool boxes
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M1/00—Substation equipment, e.g. for use by subscribers; Analogous equipment at exchanges
- H04M1/02—Constructional features of telephone sets
- H04M1/0202—Portable telephone sets, e.g. cordless phones, mobile phones or bar type handsets
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M1/00—Substation equipment, e.g. for use by subscribers; Analogous equipment at exchanges
- H04M1/02—Constructional features of telephone sets
- H04M1/0202—Portable telephone sets, e.g. cordless phones, mobile phones or bar type handsets
- H04M1/026—Details of the structure or mounting of specific components
- H04M1/0264—Details of the structure or mounting of specific components for a camera module assembly
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N1/00—Scanning, transmission or reproduction of documents or the like, e.g. facsimile transmission; Details thereof
- H04N1/21—Intermediate information storage
- H04N1/2104—Intermediate information storage for one or a few pictures
- H04N1/2112—Intermediate information storage for one or a few pictures using still video cameras
- H04N1/2129—Recording in, or reproducing from, a specific memory area or areas, or recording or reproducing at a specific moment
- H04N1/2133—Recording or reproducing at a specific moment, e.g. time interval or time-lapse
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment; Cameras comprising an electronic image sensor, e.g. digital cameras, video cameras, TV cameras, camcorders, webcams, camera modules for embedding in other devices, e.g. mobile phones, computers or vehicles
- H04N5/225—Television cameras ; Cameras comprising an electronic image sensor, e.g. digital cameras, video cameras, camcorders, webcams, camera modules specially adapted for being embedded in other devices, e.g. mobile phones, computers or vehicles
- H04N5/232—Devices for controlling television cameras, e.g. remote control ; Control of cameras comprising an electronic image sensor
- H04N5/23218—Control of camera operation based on recognized objects
- H04N5/23219—Control of camera operation based on recognized objects where the recognized objects include parts of the human body, e.g. human faces, facial parts or facial expressions
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W4/00—Services specially adapted for wireless communication networks; Facilities therefor
- H04W4/50—Service provisioning or reconfiguring
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N1/00—Scanning, transmission or reproduction of documents or the like, e.g. facsimile transmission; Details thereof
- H04N1/00127—Connection or combination of a still picture apparatus with another apparatus, e.g. for storage, processing or transmission of still picture signals or of information associated with a still picture
- H04N1/00281—Connection or combination of a still picture apparatus with another apparatus, e.g. for storage, processing or transmission of still picture signals or of information associated with a still picture with a telecommunication apparatus, e.g. a switched network of teleprinters for the distribution of text-based information, a selective call terminal
- H04N1/00307—Connection or combination of a still picture apparatus with another apparatus, e.g. for storage, processing or transmission of still picture signals or of information associated with a still picture with a telecommunication apparatus, e.g. a switched network of teleprinters for the distribution of text-based information, a selective call terminal with a mobile telephone apparatus
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N1/32—Circuits or arrangements for control or supervision between transmitter and receiver or between image input and image output device
- H04N1/32101—Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N2101/00—Still video cameras
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N2201/00—Indexing scheme relating to scanning, transmission or reproduction of documents or the like, and to details thereof
- H04N2201/32—Circuits or arrangements for control or supervision between transmitter and receiver or between image input and image output device
- H04N2201/3201—Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title
- H04N2201/3225—Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title of data relating to an image, a page or a document
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N2201/00—Indexing scheme relating to scanning, transmission or reproduction of documents or the like, and to details thereof
- H04N2201/32—Circuits or arrangements for control or supervision between transmitter and receiver or between image input and image output device
- H04N2201/3201—Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title
- H04N2201/3274—Storage or retrieval of prestored additional information
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
Certain aspects of the techniques described herein are introduced in FIG. 0. The user's mobile phone captures an image (in response to a user command, or automatically), and objects in the scene are recognized. Information associated with each object is identified and made available to the user through a scene-registered, interactive visual "bauble" that is graphically overlaid on the image. A bauble may convey information on its own, or it may be a simple indicium that the user can tap at the marked location to launch a longer listing of related information, or an associated function/application.
In the illustrated scene, the camera recognized the face in the background as "Bob," and annotated the image accordingly. A billboard promoting a Godzilla movie was recognized, and a bauble reading "show times" was blitted onto the display, inviting the user to tap for screening information.
The phone recognized the user's car in the scene, and also identified other vehicles in the image by make and year; both are labeled with overlaid text. A restaurant is likewise identified, and an initial review ("Jane's review: very good!") is shown from a collection of reviews. Tapping loads more reviews.
In one particular arrangement, this scenario is implemented as a cloud-side service supported by core object recognition services on the local device. Users can leave comments on both fixed and mobile objects. Tapped baubles can trigger other applications. Social networks can keep track of these object relationships, forming a virtual "web of objects."
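The bauble mechanism just described can be sketched in code. The following Python fragment is purely illustrative and not part of the disclosed system; the `Bauble` class, its field names, and the sample scene data are all assumptions:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Bauble:
    """A scene-registered interactive overlay for one recognized object."""
    label: str                                   # e.g. "Bob", "show times"
    anchor: tuple                                # (x, y) registered to the object in the frame
    info: Optional[str] = None                   # inline information, if conveyed directly
    on_tap: Optional[Callable[[], str]] = None   # action launched when the user taps

    def tap(self) -> str:
        # Tapping launches the associated action, or expands the inline info.
        if self.on_tap is not None:
            return self.on_tap()
        return self.info or self.label

# Illustrative scene: a recognized face, and a recognized movie billboard.
baubles = [
    Bauble(label="Bob", anchor=(120, 45), info="Bob (recognized face)"),
    Bauble(label="show times", anchor=(300, 80),
           on_tap=lambda: "Godzilla: 4:15, 7:00, 9:45"),
]
```

Tapping the first bauble simply expands its inline annotation; tapping the second launches the (hypothetical) show-times lookup.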
In the initial roll-out, the class of recognizable objects will be limited but useful. Object identification events will primarily fetch social-web connections and public-domain information, and associate them with the baubles. Applications that utilize barcodes, digital watermarks, face recognition, OCR, etc., can help support the initial deployment of the technology.
Later, the arrangement is expected to develop into an auction market, in which paying companies bid to place their own baubles (or associated information) on the screens of highly targeted user demographics. User profiles, together with the incoming visual stimuli (augmented in some cases by GPS and related data), are fed into a Google-esque mix-master in the cloud, which matches them with buyers of device-screen real estate.
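The matching step can be illustrated with a toy reverse auction. Everything here (the field names, the subset-based targeting rule, and the sample bids) is an assumption for illustration only:

```python
def match_bauble_buyer(user_profile, stimulus, bids):
    """Toy matcher: among bids targeting this stimulus whose demographic
    requirements are a subset of the user's profile, the highest bid wins."""
    eligible = [b for b in bids
                if b["stimulus"] == stimulus and b["target"] <= user_profile]
    return max(eligible, key=lambda b: b["cents"]) if eligible else None

bids = [
    {"buyer": "TheaterCo", "stimulus": "movie_billboard",
     "target": {"moviegoer"}, "cents": 12},
    {"buyer": "SnackCo", "stimulus": "movie_billboard",
     "target": {"moviegoer", "teen"}, "cents": 20},
]

# SnackCo bids more, but its targeting requires "teen"; this user profile
# lacks that tag, so TheaterCo's bauble wins the screen real estate.
winner = match_bauble_buyer({"moviegoer", "adult"}, "movie_billboard", bids)
```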
Ultimately, these features may become ubiquitous enough to enter ordinary vocabulary, as in "I'll try to get a bauble on that," or "see what comes up on that scene."
Digimarc's Patent No. 6,947,571 shows a system in which a cell phone camera captures content (e.g., image data) and processes it to derive an identifier associated with the imagery. This derived identifier is looked up in a data structure (e.g., a database) that indicates corresponding data or actions. The cell phone then displays the responsive information or takes the responsive action. Such a sequence of operations is sometimes referred to as "visual search."
Related technologies are disclosed in patent publications including, for example, 2008031111 (Digimarc); 7,283,983 and WO07/130688 (Evolution Robotics); 20070175998 and 20020102966 (DSPV); 20060012677, 20060240862 and 20050185060 (Google); 20060056707 and 20050227674 (Nokia); 20060026140 (ExBiblio); 6,491,217, 20020158410 and 20050144455 (Philips); 20020072982 and 20040199387 (Shazam); 20030083098 (Canon); 20010055391 (Qualcomm); 20010001854 (AirClic); 7,251,475 (Sony); 7,174,293 (Iceberg); 7,065,559 (Organnon Wireless); (Evolution Technologies); 6,993,573 and 6,199,048 (Neomedia); 6,941,275 (Tune Hunter); 6,788,293 (Silverbrook Research); 6,766,363 and 6,675,165 (BarPoint); 6,389,055 (Alcatel-Lucent); 6,121,530 (Sonoda); and 6,002,946.
It is an object of the present invention to provide methods and systems for content processing.
Embodiments of the presently described techniques relate to improvements oriented toward the goals of intuitive computing: seeing/viewing or listening, and inferring the user's desires from the sensed context.
Figure 0 illustrates an exemplary embodiment incorporating certain aspects of the techniques described herein.
FIG. 1 is a high-level view of an embodiment incorporating aspects of the technology;
FIG. 2 illustrates some of the applications that a user may wish to run on a camera-equipped cell phone;
FIG. 3 identifies some of the commercial entities in an embodiment incorporating aspects of the present technology;
Figures 4, 4A and 4B conceptually illustrate how pixel data and its derivatives are provided to different tasks and packaged into packets.
Figure 5 illustrates how different tasks may have specific image processing operations in common.
Figure 6 illustrates how common image processing operations can be identified and used to configure cell phone processing hardware to execute these operations.
FIG. 7 illustrates how a cell phone can transmit certain pixel-related data over an internal bus for local processing and transmit other pixel-related data through a communication channel for processing in the cloud.
Figure 8 illustrates how the cloud processing of Figure 7 allows the user to apply much more "intelligence" to the desired operation.
FIG. 9 details how key vector data can be distributed to different external service providers, who execute services in exchange for compensation and return results that enhance the experience for the user.
Figure 10 shows an embodiment incorporating aspects of the technology, reflecting that cell phone-based processing is suited to simple object identification tasks, such as template matching, while cloud-based processing is suited to complex tasks, such as data association.
Figure 10A illustrates an embodiment incorporating aspects of the present technology, recognizing that the user experience is optimized by performing visual key vector processing as close to the sensor as possible, and by managing traffic to the cloud at as low a level of the communications stack as possible.
Figure 11 illustrates how tasks involving external processing may be routed to a first group of service providers that routinely perform certain tasks for the cell phone, or to a second group of service providers that compete dynamically for such tasks.
FIG. 12 is an enlarged view of the concepts of FIG. 11, showing, for example, how bid filter and broadcast agent software modules can oversee reverse auction processing.
FIG. 13 is a high-level block diagram of a processing arrangement incorporating aspects of the present technology;
Figure 14 is a high-level block diagram of another processing arrangement incorporating aspects of the present technology;
Figure 15 illustrates an exemplary range of image types that can be captured by a cell phone camera;
Figure 16 illustrates a specific hardware implementation incorporating aspects of the present technique.
Figure 17 illustrates aspects of packets used in an exemplary embodiment;
FIG. 18 is a block diagram illustrating an implementation of the SIFT technique.
FIG. 19 is a block diagram illustrating, for example, how packet header data may be changed during processing through the use of memory.
Figure 19A illustrates a prior-art architecture from a robotic player project;
Figure 19B illustrates how various factors can influence how different tasks are processed.
FIG. 20 shows an arrangement in which a cell phone camera and a cell phone projector share a lens;
Figure 20A illustrates a reference platform architecture that may be utilized in embodiments of the present technology.
Figure 21 shows an image of a desktop phone captured by a cell phone camera;
Figure 22 illustrates a collection of similar images found in a repository of public images, with reference to features identified from the image of Figure 21;
Figures 23 to 28A and 30 to 34 are flowcharts detailing methods incorporating aspects of the present technology.
FIG. 29 is an artistic shot of the Eiffel Tower captured by a cell phone user;
FIG. 35 is another image captured by a cell phone user;
Figure 36 is an image of the underside of the phone, found using methods according to aspects of the present technology;
FIG. 37 illustrates part of the physical user interface of one style of cell phone;
Figures 37A and 37B illustrate different linking topologies.
Figure 38 is an image captured by a cell phone user depicting a trail marker of the Appalachian Trail.
Figures 39-43 illustrate detailed methods incorporating aspects of the present technique.
FIG. 44 illustrates a user interface of one style of cell phone;
Figures 45A and 45B are diagrams illustrating how different dimensions of commonality can be explored through the use of cell phone user interface controls.
Figures 46A and 46B detail particular methods incorporating aspects of the technology, by which keywords such as Prometheus and Paul Manship are automatically determined from a cell phone image.
Figure 47 depicts some of the different data sources that may be consulted in processing imagery according to aspects of the present technology.
FIGS. 48A, 48B and 49 are diagrams illustrating different processing methods according to aspects of the present technology.
Figure 50 illustrates, in accordance with aspects of the present technique, a portion of different processes that may be performed on image data;
Figure 51 illustrates an exemplary tree structure that may be utilized in accordance with certain aspects of the present technique.
FIG. 52 illustrates a network of wearable computers (e.g., cell phones) that can cooperate with one another, for example in a peer-to-peer network.
Figures 53-55 detail how a glossary of symbols can be identified by a cell phone and used to trigger different actions.
FIG. 56 illustrates aspects of prior-art digital camera technology;
FIG. 57 details an embodiment incorporating aspects of the technology;
FIG. 58 shows how a cell phone can be used to set scene and display filter parameters;
FIG. 59 illustrates particular state machine aspects of the technology;
Figure 60 illustrates how temporal or motion aspects may be present even in a "still" image;
FIG. 61 illustrates some metadata that may be relevant to implementations incorporating aspects of the technology;
FIG. 62 illustrates an image that may be captured by a cell phone camera user;
FIGS. 63-66 show details of how the image of FIG. 62 can be processed to convey semantic metadata.
FIG. 67 illustrates another image that may be captured by a cell phone camera user;
FIGS. 68 and 69 detail how the image of FIG. 67 can be processed to convey semantic metadata.
FIG. 70 shows an image that may be captured by a cell phone camera user;
FIG. 71 details how the image of FIG. 70 can be processed to convey semantic metadata.
FIG. 72 is a chart showing aspects of the human visual system;
FIG. 73 shows the low, middle and high frequency components of an image;
FIG. 74 is a view showing a newspaper page.
FIG. 75 is a diagram showing the layout of FIG. 74, as defined by layout software.
FIG. 76 details how user interaction with imagery captured from printed text can be improved;
FIG. 77 illustrates how the conveyance of semantic metadata can have a progressive aspect, akin to JPEG2000 and the like.
FIG. 78 is a block diagram of a prior-art thermostat;
FIG. 79 is an external view of the thermostat of FIG. 78;
FIG. 80 is a block diagram of a thermostat utilizing certain aspects of the present technology ("ThingPipe");
FIG. 81 is a block diagram of a cell phone that utilizes certain aspects of the technology;
FIG. 82 is a block diagram illustrating certain operations of the thermostat of FIG. 80;
FIG. 83 illustrates a cell phone display depicting an image captured from a thermostat, overlaid with touch-screen targets that the user can touch to raise or lower the thermostat temperature.
FIG. 84 is a view similar to FIG. 83, but showing a graphical user interface for use on a phone without a touch screen;
FIG. 85 is a block diagram of an alarm clock utilizing aspects of the technology.
FIG. 86 illustrates a screen of an alarm clock user interface that may be provided on a cell phone, in accordance with aspects of the technology;
Figure 87 illustrates a screen of a user interface detailing nearby devices that may be controlled through the use of a cell phone.
This specification details a diversity of techniques, assembled over an extended period to serve a variety of different purposes. Yet they are interrelated in numerous ways, and so are bundled together in this single document.
These varied and interrelated aspects do not lend themselves to a straightforward linear presentation. The exposition therefore sometimes proceeds in a non-linear fashion among the assorted topics and techniques, and the reader's indulgence is requested.
Each part of this specification details technology that can advantageously be combined with technology detailed in other parts. It is thus difficult to identify a "beginning" from which this disclosure should logically start; instead, we simply dive in.
Mobile Device Object Recognition and Interaction with Distributed Network Services
There is currently a huge disconnect between the inexhaustible amount of information contained in the high quality image data streaming from a mobile device camera (e.g., in a cell phone) and the mobile device's ability to process that data into anything useful. "Off-device" processing of visual data can help handle this firehose of data, particularly when a number of visual processing operations may be desired. These issues become even more important when "real-time object recognition and interaction" is considered, in which the user of a mobile device expects virtually instantaneous results and augmented, realistic graphical feedback.
According to one aspect of the present technology, a distributed network of pixel processing engines serves such mobile device users, generally meeting the qualitative "human real-time interactivity" requirement by feeding results back in much less than one second. Implementations preferably provide certain basic capabilities on the mobile device, including a reasonably close coupling between the image sensor's output pixels and the base communication channel available to the device. Routing instructions attached to pixel data, as specified by the user's intentions and subscriptions, and following certain baseline levels of "content filtering and classification" performed on the local device, then give rise to an interactive session between the device and one or more "cloud-based" services. The word "session" denotes fast responses sent back to the mobile device; for services characterized as "real time" or "interactive," a session essentially amounts to a packet-based duplex exchange, in which multiple outgoing "pixel packets" and multiple incoming response packets (which may be updated pixel data together with processed results) can flow every second.
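Such a duplex session can be mocked up locally. The sketch below simulates the cloud side with a thread and in-memory queues; real traffic would of course cross a radio link, and the packet fields shown are invented for illustration:

```python
import queue
import threading

def cloud_service(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Stand-in for a cloud-side pixel processing engine: each incoming
    pixel packet yields one response packet on the same session."""
    while True:
        pkt = inbox.get()
        if pkt is None:          # session teardown sentinel
            break
        outbox.put({"seq": pkt["seq"], "result": "processed:" + pkt["payload"]})

inbox, outbox = queue.Queue(), queue.Queue()
worker = threading.Thread(target=cloud_service, args=(inbox, outbox))
worker.start()

# The device streams several pixel packets; responses flow back on the
# same session, modeling the sub-second interactivity budget.
for seq in range(3):
    inbox.put({"seq": seq, "payload": f"patch{seq}"})
responses = [outbox.get(timeout=2) for _ in range(3)]

inbox.put(None)
worker.join()
```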
Business factors and healthy long-term competition are at the heart of the distributed network. Users can subscribe to, or otherwise tap into, whatever external services they choose. The local device itself, and/or the carrier providing service for that device, can be configured per the user's choices to route the filtered and prepared pixel data to the specified object-interaction services. Billing mechanisms for these services can plug directly into existing cell and/or mobile device billing networks, with users charged and service providers paid.
But let's back up for a moment. The addition of camera systems to mobile devices has spawned a surge of applications. The native application was arguably simple: let ordinary people snap quick visual records of their surroundings, and share such pictures with friends and family.
The fanning-out of applications from that starting point almost certainly depends on a set of core plumbing features unique to mobile cameras. Briefly (and by no means exhaustively), these features include: a) high quality pixel capture and low-level processing; b) ever-better local CPU and GPU resources for on-device pixel processing with subsequent user feedback; c) structured connectivity with the "cloud"; and, importantly, d) an infrastructure for traffic monitoring and billing. FIG. 1 is a graphical perspective view of some of these plumbing features of what may be called a visual intelligence network. (Typical cell phone details, such as the microphone, A/D converters, modulation and demodulation systems, IF stages, cellular transceiver, etc., are omitted for clarity of illustration.)
Mobile devices will continue to gain better CPUs and GPUs, and more memory. However, cost, weight, and power considerations make it likely that much of the achievable "intelligence" will be pushed to the "cloud."
There will likely need to be a common set of "device-side" operations on visual data serving all cloud processes, including specific formatting, elemental graphics processing, and other rote operations. Likewise, there will likely need to be a standardized header and addressing scheme for the resulting back-and-forth communication traffic (typically packetized) with the cloud.
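As a thought experiment, such a standardized scheme might place a packet type, a service address, and a sequence number ahead of the key vector payload. The field layout below is an invented example, not a proposed standard:

```python
import struct

# Hypothetical fixed header: 2-byte packet type, 4-byte service address,
# 4-byte sequence number, 4-byte payload length (all big-endian).
HEADER = struct.Struct(">HIII")

def make_pixel_packet(ptype: int, service: int, seq: int, payload: bytes) -> bytes:
    return HEADER.pack(ptype, service, seq, len(payload)) + payload

def parse_pixel_packet(data: bytes):
    ptype, service, seq, n = HEADER.unpack_from(data)
    body = data[HEADER.size:HEADER.size + n]
    return ptype, service, seq, body

pkt = make_pixel_packet(ptype=7, service=0xC0FFEE, seq=42, payload=b"\x01\x02\x03")
```

A real scheme would also need versioning and integrity checks; the point is only that every device-side stage and every cloud service can parse the same fixed preamble.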
This conceptualization parallels the human visual system. The eye performs baseline operations, such as color processing, and optimizes the information for transmission to the brain along the optic nerve. The brain does the actual cognitive work. Traffic also flows in the reverse direction: the brain sends back information controlling muscle movement, such as where the eye points, how it scans the rows of a book, and how the iris adjusts for brightness.
Figure 2 shows an exemplary, though not exhaustive, list of visual processing applications for mobile devices. Again, it is not hard to find similarities between this list and the basics of how the human visual system and brain work. How the human visual system "optimizes" itself for any given object recognition task is a well-researched academic area, and it is the general consensus that the eye-retina-optic nerve-cortex system is surprisingly (pretty darn) efficient at serving a vast array of recognition needs. Aspects of the present technology concern how similarly efficient and broadly capable arrangements can be built from mobile devices, mobile device interfaces, and network services, serving both the applications shown in FIG. 2 and the new applications that will continue to appear as technology advances.
Perhaps the main difference between the human analogy and mobile device networks is the "market" concept: so long as businesses know how to profit from serving them, buyers can be counted on to seek out better offerings. It should be assumed that any technology serving the applications listed in FIG. 2 will attract hundreds, if not thousands, of business entities, that important details will emerge in particular commercial offerings, and that fortunes will be made on some of those offerings. Yes, some behemoths will dominate key lines of cash flow in the mobile industry as a whole, but it is equally certain that niche players will continue to develop niche applications and services. This disclosure therefore describes how a market for visual processing services can develop, with business interests across the whole spectrum. Figure 3 attempts a rough categorization of some of the business concerns applicable to the global ecosystem in which these applications operate.
Figure 4 illustrates the reasoning behind the aspect of the technology now being introduced. Here we find highly abstract bits of information, derived from a group of photons that struck some form of electronic image sensor, together with a large number of low-latency consumers of that information. Figure 4A quickly introduces the intuitively familiar notion that single bits of visual information have little value outside their role in both spatial and temporal groups. These key concepts are well exploited in modern video standards such as MPEG-7 and H.264.
The "visual" character of the bits may be carried quite far from the visual domain by certain processing (consider, e.g., vector strings representing eigenface data). Thus, we sometimes use the term "key vector data" (or "key vector string") to collectively denote raw sensor/stimulus data (e.g., pixel data) and/or processed information and associated derivatives. A key vector may take the form of a container (e.g., a data structure such as a packet) in which this information is conveyed. A tag or other data may be included to identify the type of information (e.g., JPEG image data, or eigenface data), or the data type may be evident from the data itself or from context. One or more instructions or operations may be associated with key vector data, either explicitly specified or implied by the key vector. For certain types of key vector data, an operation may be implied by default (e.g., for JPEG data, "store this image"; for eigenface data, "match this eigenface template"). Or the associated operations may depend on context.
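The container idea can be sketched as follows. The type tags and default operations are taken from the examples above, while the class shape itself is an assumption:

```python
from dataclasses import dataclass

# Default operation implied by the key vector type, per the examples in the text.
DEFAULT_OPS = {
    "jpeg": "store_image",
    "eigenface": "match_template",
}

@dataclass
class KeyVector:
    dtype: str           # tag identifying the kind of data carried
    data: bytes          # raw or derived data (pixels, eigenface vectors, ...)
    operation: str = ""  # explicit instruction, if any

    def effective_operation(self) -> str:
        # An explicit operation wins; otherwise fall back to the type default.
        return self.operation or DEFAULT_OPS.get(self.dtype, "unspecified")

kv = KeyVector(dtype="eigenface", data=b"...coefficients...")
```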
FIGS. 4A and 4B also introduce a central player in this disclosure: the pixel packet, an embodiment in which key vector data is packaged and address-labeled. The key vector data may be a single patch of pixels, a collection of patches, or a time series of patches/collections. The pixel packet may be less than a kilobyte, or it may be much larger. It can convey information about a discrete patch of pixels excerpted from a larger image, or deliver a large Photosynth of Notre Dame Cathedral.
(As presently conceived, the pixel packet is an application-layer construct; in actual network traffic it may be broken into smaller parts, as the transport-layer constraints of the network may require.)
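The fragmentation noted in this parenthetical can be sketched as follows. This is an illustrative assumption only (the 1400-byte fragment size is invented), not a statement of any particular network's behavior.

```python
def fragment(packet: bytes, mtu: int = 1400) -> list[bytes]:
    """Split an application-layer pixel packet into transport-sized fragments."""
    return [packet[i:i + mtu] for i in range(0, len(packet), mtu)]

def reassemble(fragments: list[bytes]) -> bytes:
    """Reverse of fragment(): concatenate in-order fragments."""
    return b"".join(fragments)
```

A 3000-byte packet, for example, would travel as three fragments (1400, 1400, and 200 bytes) and be reassembled on the far side.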
Figure 5 is still at an abstract level, but trends toward specificity in three ways. A list of user-desired applications, as in FIG. 2, will map onto a list of state-of-the-art pixel processing methods, each capable of accomplishing its application. These pixel processing methods can be divided into more-common and less-common component sub-tasks. Object recognition textbooks are populated with a wide variety of methods and terminology which, at first glance, might appear to be a bewildering array of "unique requirements" associated with the applications shown in FIG. 2. (In addition, computer vision and image processing libraries such as OpenCV and CMVision, described below, were created by identifying and implementing functional operations that can be regarded as "atomic" functions within object recognition paradigms.) However, FIG. 5 attempts to illustrate that there is in fact a set of common steps and processes shared among visual processing applications. The differently formed pie slices are intended to show that certain pixel operations may come in particular flavors, differing in low-level variables or optimizations. The overall size of each pie (in a rough sense: a pie twice the size of another might represent ten times the FLOPS) and the size ratios of the slices represent degrees of commonality.
FIG. 6 takes a step toward concreteness at the expense of simplicity. It is labeled "Native Call for Visual Processing Services," expressing that a given mobile device may awaken or fully enable any given list of applications from FIG. 2. The concept is not that all of these applications must be active at all times; rather, some sub-set of services is actually "turned on" at any given moment. As a one-time configuration activity, the turned-on applications negotiate to identify common component tasks (labeled "Common Processes Classifier" in the figure). First, elementary image processing routines (e.g., FFT, resampling, color histogramming, log-polar transformation, etc.) are sorted to generate a complete common list of pixel processing routines available on the device. Corresponding gate configuration/software programming information is then generated, suitable for loading the library elements into a field programmable gate array, or otherwise configuring a processor to perform the necessary component operations.
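The negotiation step just described, in which the turned-on applications pool their elementary routine requirements into a common list, can be sketched as follows. The application names and routine names below are invented for illustration.

```python
# Hypothetical per-application requirements: each enabled application
# declares the elementary pixel processing routines it needs.
APP_REQUIREMENTS = {
    "barcode_reader": {"resample", "edge_detect", "threshold"},
    "face_finder": {"resample", "fft", "eigenface_project"},
    "watermark_decoder": {"resample", "fft", "log_polar"},
}

def plan_services(enabled_apps):
    """Return (all routines to load, routines shared by every enabled app)."""
    required = set().union(*(APP_REQUIREMENTS[a] for a in enabled_apps))
    shared = set.intersection(*(APP_REQUIREMENTS[a] for a in enabled_apps))
    return required, shared

required, shared = plan_services(["face_finder", "watermark_decoder"])
# Both applications share "resample" and "fft"; four routines total get loaded.
```

The `required` set is what would drive the gate-array configuration; the `shared` set shows the commonality that makes such pooling worthwhile.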
Figure 6 also depicts a general-purpose pixel segmenter following the image sensor. This stage divides the incoming imagery into manageable spatial and/or temporal blobs (e.g., MPEG-like macroblocks, wavelet transform blocks, 64 x 64 pixel blocks, etc.). After the torrent of pixels is divided into digestible chunks, these are fed into the newly programmed gate array (or other hardware), which performs the elemental image processing tasks associated with the selected applications. (These arrangements are further described below in an example system utilizing "pixel packets.") The various output results are elementally processed data (e.g., key vector data) destined for other resources (internal and/or external) for further processing. This additional processing is typically more complex than what has already been done. Examples include forming associations, deriving inferences, pattern and template matching, and so on. This additional processing may be highly specialized.
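The division of the pixel torrent into blobs can be sketched as follows, here for fixed 64 x 64 pixel blocks (one of the options just mentioned). The frame is modeled as a nested list of rows purely for illustration; a real implementation would operate on sensor buffers in hardware.

```python
def split_into_blobs(frame, block=64):
    """Divide a frame (list of pixel rows) into block x block spatial blobs."""
    h, w = len(frame), len(frame[0])
    blobs = []
    for y in range(0, h, block):
        for x in range(0, w, block):
            # Each blob is a list of row-slices covering one block region.
            blobs.append([row[x:x + block] for row in frame[y:y + block]])
    return blobs
```

A 128 x 128 frame, for example, yields four 64 x 64 blobs, each of which could then be packaged as a pixel packet.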
(Consider an advertising game from Pepsi that invites the public to participate in a treasure hunt in state parks. Based on Internet-distributed clues, people try to find hidden soda cartons to earn a $500 prize. Participants need to download a specific application from the Pepsi.com website (or the Apple App Store), which serves to distribute clues (which may also be publicized on Twitter). The downloaded application also has a prize verification component: SIFT object recognition is used (as described below), and SIFT feature descriptors for the special prize package are stored with the downloaded application. When an image match is found, the cell phone immediately reports same. In the arrangement of FIG. 6, some of the component tasks of the SIFT pattern matching operation are performed as elements in the configured hardware; the rest is handled by more specialized processing, internally or externally.)
Figure 7 is a top-level diagram of a general distributed pixel services network, and exhibits a certain symmetry in how local device pixel services and "cloud based" pixel services operate. In FIG. 7, note how the router sends any given packaged pixel packet to the appropriate pixel processing location, whether local or remote (the style of the fill pattern represents different component processing functions; only a few of the processing functions required by the enabled applications are shown). Some of the data shipped to the cloud-based pixel services may first be processed by the local device pixel services. The figure indicates that the routing function can have components in cloud nodes, which serve to distribute jobs to active service providers and collect results for transmission back to the device. In some implementations these functions may be performed at the edge of the wireless network, for example by modules at wireless service towers, to ensure the fastest operation. The results collected from the active external service providers and from the active local processing stages are fed back to the pixel service manager software, which thereafter interacts with the device user interface.
Fig. 8 is an enlarged view of the bottom right of Fig. 7, depicting the moment when Dorothy's shoes turn red, i.e., when the distributed pixel services provided by the cloud (as opposed to the local device) come into their own.
Richer object recognition is based on visual associations rather than strict template matching rules. If we all agreed that the canonical letter "A" would never change from some prehistoric form, such that a universal template image could always be strictly followed, then very lean and local deterministic methods might suffice for a mobile imaging device to reliably read the basic "A." 2D and 3D barcodes likewise follow a template-like approach to object recognition in many cases, and for embedded applications involving such objects, local processing services can get the job done in bulk. However, even in the barcode example, the flexibility of growth and evolution of overt visual coding targets argues for an architecture that does not force "code upgrades" onto countless devices whenever there is some advance in the symbology field.
At the other end of the spectrum lies, for example, the task of predicting which suspicious typhoon was caused by the flutter of a butterfly's wings on the other side of the world (if some application should call for it). Oz beckons.
Figure 8 attempts to illustrate this essential additional dimensionality of pixel processing in the cloud, as opposed to on a local device. This goes without saying (or nearly so), but FIG. 8 also sets up the framework of FIG. 9, where Dorothy goes back to Kansas and is happy about it.
Figure 9 is all about cash, cash flow, and happy humans who use the cameras on their mobile devices to get highly relevant results for their visual queries, paying the bills month by month. Google's "AdWords" auction is evidence that this genie is out of the bottle. Behind the scenes of a mobile user's instant visual query there can be hundreds or thousands of micro-decisions, pixel routings, result comparisons, and so on, and a micro-auction channel through which providers vie to deliver what the user is "truly" looking for. The point is intentionally blunt: search of any kind is inherently open-ended and somewhat magical, and part of the joy of search in the first place is that surprising new associations become part of the results; search users then discover what they were really looking for. In the system shown in FIG. 9 we see the addition of a carrier-based financial tracking server and its networked pixel services module, which facilitates sending appropriate results back to the user while monitoring monthly usage of services and remitting payments to the appropriate entities.
(As described further elsewhere herein, fund distribution need not be exclusive to remote service providers. Other fund flows can arise as well, for example to users or other third parties, to induce or compensate certain actions.)
Figure 10 focuses on a functional partitioning of the processing, illustrating that template-matching-like operations can be performed on the cell phone by itself, while more complex tasks (akin to data associations) are preferably referred to the cloud for processing.
The foregoing elements are abstracted in FIG. 10A to illustrate aspects of the technology (generally) as the division of labor among software components. The two ellipses in the figure represent a symmetric pair of software component sets that handle the setup of a "human real-time" visual recognition session between the mobile device and the cloud or service providers generally, along with data associations and visual query results. The left ellipse deals with "key vectors," more precisely "visual key vectors." As noted, these terms may encompass everything from simple JPEG-compressed blocks to log-polar facial feature vectors, and beyond. The point of the key vector is that the essential raw information for any given visual recognition task is optimally preprocessed and packaged (compressed, where possible). The left ellipse assembles these packets and typically inserts some addressing information for routing (final addressing may not be possible, because the packet may ultimately be routed among remote service providers; details on this follow). Preferably, this processing is performed as close to the raw sensor data as possible, such as by processing circuitry integrated on the same substrate as the image sensor, responding to software commands; data may alternatively be provided in packet form from another stage, or retrieved from memory.
The right ellipse manages the remote processing of key vector data, for example configuring appropriate services, directing traffic flow, and so on. Preferably, such software processing is implemented on the "cloud side" (at a device, an access point, a cell tower, etc.), generally as low in the communication stack as possible. (As real-time visual key vector packets stream over a communication channel, the lower in the communication stack they are identified and routed, the smoother the "human real-time" visual experience of a given recognition task will be.) The remaining high-level processing needed to support this arrangement is included in FIG. 10a for context, and can generally be performed with basic mobile and remote hardware capabilities.
Figures 11 and 12 illustrate that some cloud-based pixel processing services may be pre-established in a quasi-static manner, while other providers may be selected dynamically, e.g., by auction, as shown in the figures. In many implementations, these latter providers compete each time a packet is available for processing.
Consider a user who snaps a cell phone image of an unfamiliar vehicle, wanting to learn its make and model. Various service providers can compete for this business. A fledgling vendor might run the recognition free of charge, to build its brand and to collect user data; images presented to this service return information simply indicating the make and model of the car. Consumer Reports might provide an alternative service that returns not only make and model data but also technical specifications for the vehicle, while charging 2 cents for the service (or charging based on bandwidth, e.g., 1 cent per megapixel). Edmunds or J.D. Power might provide data like Consumer Reports' through yet another service, for which the user pays with a privilege rather than cash: in exchange for the data, the vendor is given the right to have one of its partners send the user a text message advertising products or services. Payments can take the form of credits against the user's monthly cell phone voice/data service charges.
Using user-specified criteria, stored preferences, context, and other rules/heuristics, the query router and response manager (in the cell phone, in the cloud, distributed, etc.) determines whether a task should be handled by one of the statically waiting service providers, or put to providers on an auction basis, in the latter case adjudicating the outcome of the auction.
Static-wait services can be identified when the phone is initially programmed, and changed only when the phone is reprogrammed. (For example, Verizon might specify that all FFT operations for its phones be routed to a server it provides for this purpose.) Alternatively, the user may periodically identify preferred providers for particular tasks via a configuration menu, or specify that certain jobs should be referred for auction. For some applications, static service providers may predominate; the work may be so commonplace, or one provider's services so superior, that there is no good reason to open the provision of services to competition.
For services referred to auction, some users may rank price above all other considerations. Other users may insist on domestic data processing. Others may favor service providers that strive to meet "green," "ethical," or other standards of corporate practice. Still others may prefer richer data output. Weightings of the different criteria may be applied by the query router and response manager in making the determination.
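Such a weighted determination can be sketched as a simple scoring function. The criteria names, weights, and provider attributes below are invented for illustration; a deployed router would draw these from stored user preferences.

```python
def score(provider, weights):
    """Weighted sum of a provider's attribute scores over the user's criteria."""
    return sum(weights[k] * provider.get(k, 0.0) for k in weights)

providers = [
    {"name": "A", "cheapness": 0.9, "domestic": 1.0, "green": 0.2, "richness": 0.3},
    {"name": "B", "cheapness": 0.4, "domestic": 1.0, "green": 0.9, "richness": 0.8},
]
# A user who values green practices and rich output over price:
weights = {"cheapness": 0.1, "green": 0.5, "richness": 0.4}
best = max(providers, key=lambda p: score(p, weights))
```

With these weights provider B wins despite its higher price; shifting weight onto "cheapness" would flip the choice to A.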
In some circumstances, one input to the query router and response manager may be the user's location, so that a different service provider may be selected when the user is at home in Oregon than when the user is vacationing in Mexico. In other cases a required turnaround time is specified, which may disqualify some vendors and make others more competitive. In some instances the query router and response manager need not make any determination at all, for example if stored results identifying the service provider selected in a previous auction are still available and have not exceeded a "freshness" threshold.
Pricing offered by vendors may vary with processing load, bandwidth, time of day, and other considerations. In some embodiments, providers may be informed of the quotes offered by competitors (using known trusted mechanisms that ensure data integrity) and given the opportunity to make their own quotes more attractive. This bidding war may continue until no bidder is willing to change its offered terms. The query router and response manager (or, in some implementations, the user) then selects.
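Such a bidding war can be sketched as an iterative undercutting loop that terminates when no bidder is willing to change its quote. The cost floors, starting quotes, and one-cent decrement below are assumptions for illustration only.

```python
def run_bidding_war(floors, start_quotes, decrement=1):
    """Let each bidder undercut the best competing quote until no one moves."""
    quotes = dict(start_quotes)
    changed = True
    while changed:
        changed = False
        for bidder, floor in floors.items():
            others_best = min(q for b, q in quotes.items() if b != bidder)
            # Undercut only when currently losing (or tied), and only if the
            # new quote stays at or above this bidder's cost floor.
            if quotes[bidder] >= others_best and others_best - decrement >= floor:
                quotes[bidder] = others_best - decrement
                changed = True
    winner = min(quotes, key=quotes.get)
    return winner, quotes[winner]

# X can profitably go as low as 3 cents, Y only as low as 5 cents.
winner, price = run_bidding_war({"X": 3, "Y": 5}, {"X": 10, "Y": 8})
```

The bidder with the lower cost floor prevails, at a price just below the point where its competitor drops out, much as in the reverse auctions discussed below.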
For illustrative convenience and visual clarity, FIG. 12 shows a software module labeled "Bid Filter and Broadcast Agent." In most implementations this forms part of the query router and response manager module. The bid filter module determines which vendors, from among the many possible, should be offered an opportunity to bid on a given processing operation. (The user's preference data or historical experience may indicate that certain service providers are ineligible.) The broadcast agent module then communicates with the selected bidders, notifying them of the user's job for processing and providing the information necessary to bid.
Preferably, the bid filter and broadcast agent do their work before at least some of the data becomes available for processing. That is, as soon as predictions can be made about operations the user is likely to request in the near future, these modules begin work to identify providers to perform the expected services. Several hundred milliseconds later, the user's key vector data may become available for actual processing (if the prediction proves correct).
Sometimes, as with Google's AdWords system, service providers are not consulted on each individual user transaction. Instead, each provides bidding parameters, which are stored and consulted whenever a transaction is considered, to determine which service provider wins. These stored parameters may be updated occasionally. In some implementations the service provider pushes updated parameters to the bid filter and broadcast agent whenever available. (A bid filter and broadcast agent may serve a large demographic of users, such as all Verizon subscribers in area code 503, all subscribers of an ISP in a community, or all users of the domain well.com. Or more localized agents may be utilized, such as one per cell phone tower.)
When there is a lull in traffic, a service provider can discount its services for the immediate future. The service provider may thus send a message stating, for example, that it offers eigenvector extraction on image files of up to 10 megabytes for 2 cents until Unix time 1244754176, after which the price reverts to 3 cents. The bid filter and broadcast agent then update the table of stored bidding parameters accordingly.
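The stored bidding-parameter table with such a time-limited discount can be sketched as follows. The provider name is a hypothetical placeholder, and prices are in cents, mirroring the example above.

```python
def post_offer(table, provider, discount_price, regular_price, valid_until):
    """Record a provider's discounted price and the Unix time it expires."""
    table[provider] = (discount_price, regular_price, valid_until)

def price_at(table, provider, now):
    """Look up the effective price for a provider at Unix time `now`."""
    discount, regular, valid_until = table[provider]
    return discount if now <= valid_until else regular

offers = {}
# 2 cents until Unix time 1244754176; 3 cents thereafter.
post_offer(offers, "eigen-extractor.example", 2, 3, 1244754176)
```

A lookup during the discount window returns the lower price; after expiry, the table automatically yields the regular price without a further update from the provider.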
(The reader is assumed to be familiar with the reverse auction arrangements used by Google to place advertisers' advertisements on web search results pages. Exemplary techniques are detailed in Levy, "Secret of Googlenomics: Data-Fueled Recipe Brews Profitability," Wired Magazine, May 22, 2009.)
In other implementations, the broadcast agent polls the bidders whenever a transaction is offered for processing, communicating the relevant parameters and requesting bid responses.
Once the prevailing bidder is determined and the data is available for processing, the broadcast agent sends the key vector data (and other parameters as may be appropriate for the particular job) to the winning bidder. The bidder then executes the requested operation and returns the processed data to the query router and response manager. This module logs the processed data and attends to any necessary accounting (for example, crediting the service provider with the agreed fee). The response data is then transmitted back to the user device.
In a variant arrangement, one or more competing service providers actually perform some or all of the requested processing, but provide only partial results, to "hook" the user (or the query router and response manager). Given this taste of what is available, the user (or query router and response manager) may be induced to make a choice different from that indicated by the criteria/heuristics noted elsewhere.
Function calls sent to external service providers naturally need not deliver the ultimate result the consumer is seeking (e.g., identifying the car, or translating a menu listing from French into English). They may be component operations, such as calculating an FFT, performing a SIFT procedure or a log-polar transformation, computing a histogram or eigenvectors, identifying edges, and the like.
Sooner or later, a rich ecosystem of specialized processors can be expected to emerge, serving countless processing requests from cell phones and other thin-client devices.
The Importance of Money Flow
Additional business models for remote services are possible, involving subsidization: incentives paid by the service providers themselves in exchange for user information (e.g., audience measurement data), or in exchange for actions taken by the user, such as completing a survey, visiting a particular place, permitting location tracking in a store, and so on.
Services can likewise be paid for by a third party, such as a coffee shop, which can differentiate itself by offering consumers free or discounted use of remote services while they sit in the shop.
Over time, an economy is possible in which credits for remote processing are earned and exchanged between users and remote service providers. This can be entirely transparent to the user, managed for example as part of a service plan with the user's cell phone or data service provider, or it can be a very explicit aspect of certain embodiments of the present technology. Service providers and others can award credits to users who, as part of a frequent-user program, take actions and make commitments with particular providers.
As with other currencies, users may choose to donate, save, trade, or otherwise exchange their credits as desired.
Considering these points in more detail, services can be subsidized for users who participate in audience measurement panels. For example, the Nielsen company can provide services to the public, such as identification of television programming from audio or video samples submitted by consumers. These services can be provided free of charge to consumers who agree to share some of their media consumption data with Nielsen (such as by acting as anonymous members of a city's audience ratings panel), and on a fee basis to other consumers. For example, Nielsen can provide 100 credits (small payments or other value) to consumers for each month they participate, or provide credit each time a user submits information to Nielsen.
In another example, a consumer may be rewarded for accepting ads or ad impressions from a company. When consumers go to the Pepsi Center in Denver, they can be rewarded for their Pepsi-brand experiences. The amount of the micropayment may be proportional to the amount of time the consumer interacted with different Pepsi-branded objects (including audio and imagery) at the venue.
Large brand owners can also provide credits to individuals. Credits can be routed to friends and social/business acquaintances. To illustrate, a Facebook user can share credit (redeemable for goods/services, or exchangeable for cash) from his Facebook page, as something for others to visit and enjoy. In some cases, credit may be made available only to those navigating to the Facebook page in a particular way, such as by linking to the page from the user's business card or from another launch page.
As another example, consider a Facebook user who has received credit that can be applied to services, paid or otherwise, such as downloading songs from iTunes, music identification services, or identifying clothing that matches particular shoes. These services can be associated with a specific Facebook page so that friends can invoke services from that page and, in particular, consume the host's credit (again, with appropriate permission or invitation by the hosting user). Similarly, friends can submit images to a face recognition service accessible through an application associated with the user's Facebook page. Images submitted in this way are analyzed against the faces of the host's friends, and identification information is returned to the submitter, for example via the user interface provided on the original Facebook page. Again, the host may be charged for each of these actions, while authorized friends use the service itself for free.
Credits and payments can also be routed to charities. Spectators leaving a theater after a particularly poignant film about poverty in Bangladesh can capture images of the associated movie poster, which serves as a portal for donations to charities that help the poor in Bangladesh. On recognizing the movie poster, the cell phone can present a graphical/touch user interface that lets the user specify the amount of a donation and, at the end of the transaction, transfers that amount to a financial account associated with the charity.
More on Specific Hardware Arrangements
As noted above and in the cited patent documents, general object recognition by mobile devices is needed. Some approaches to specialized object recognition have emerged, offering enhancements to specific data processing methods. However, no architecture has been proposed that goes beyond specialized object recognition to general object recognition.
Broadly viewed, general object recognition arrangements require access to good raw visual data, preferably free of device-specific quirks, scene distortions, user idiosyncrasies, and the like. Developers of systems built around object identification will best serve their users by focusing on the object identification work itself, rather than on the myriad obstacles, resource sinks, and third-party dependencies that would otherwise be encountered.
As noted, virtually all object identification techniques may utilize, or depend on, a pipe to the "cloud."
The "cloud" may include everything outside the cell phone, for example nearby cell phones, or multiple phones on a distributed network. Unused processing power on these other devices may be made available (for payment, or free of charge) whenever needed. Cell phones in the implementations described herein may thus harvest processing power from other cell phones.
Such a cloud may be ad hoc, e.g., other cell phones within Bluetooth range of the user's phone. The ad hoc network can be extended by reaching, through those phones, still other phones within their Bluetooth range, even though these further phones cannot be reached directly by the user's phone.
"Clouds" may also comprise other computing platforms, such as set-top boxes, automotive processors, thermostats, HVAC systems, wireless routers, processing hardware at local cell phone towers and other wireless network edges, and so forth. These processors can be used together with more conventional cloud computing resources, such as those provided by Google, Amazon, and the like.
(In view of particular users' privacy interests, the phone preferably has a user-configurable option indicating whether the phone may refer its data to cloud resources for processing. In some arrangements this option has a default value of "NO," limiting functionality but conserving battery life and easing privacy concerns; in other arrangements, this option has a default value of "YES.")
Preferably, the image-responsive techniques should generate a short-term "result or response," which generally entails some level of interaction with the user: measured in fractions of a second for interactive applications, or in seconds or fractions of a minute for near-term "I am waiting patiently" applications.
Toward these ends, consider the techniques described above, such as: (1) general manuals (cues to basic searches); (2) geographic manuals (at minimum knowing where you are, and connecting to geo-specific resources); (3) "cloud-supported" manuals, such as identified/enumerated objects and their associated sites; and (4) active/controllable "ThingPipes" (WiFi-equipped thermostats, parking meters, and the like).
The object recognition platform should not be conceived as a classic "local device and local resources only" software intelligence problem, although it may be. Rather, it can be framed as a local device optimization problem: the software on the local device and its processing hardware must be designed with interaction with off-device software and hardware in mind. This involves balancing, between on-device and off-device resources, the control functions, the fast pixel processing functions, and the application software/GUI provided on the device. (In many implementations, certain databases useful for object identification/recognition will reside remote from the device.)
In a particularly preferred arrangement, such a processing platform performs image processing near the sensor (atop the same chip); at least some processing operations are preferably performed by dedicated special-purpose hardware.
Referring to FIG. 13, the architecture of a cell phone 10 is shown, in which the image sensor 12 feeds two processing paths. One, 13, is adapted for the human visual system and includes processing such as JPEG compression. The other, 14, is adapted for object recognition. As discussed, some of this processing may be performed by the mobile device, and other processing may be referred to the cloud 16.
Figure 14 takes an application-centric view of the object recognition processing path. Some applications reside entirely on the cell phone. Other applications reside entirely outside the cell phone, simply taking key vector data as stimulus. More common are hybrids, in which some of the processing is done in the cell phone and other processing is done externally, with coordinating application software residing in the cell phone.
To frame further discussion, FIG. 15 shows a range 40 of different types of images 41-46 that might be captured by a user's cell phone. A few brief (and incomplete) comments on some of the processing that can be applied to each image are provided in the following paragraphs.
Image 41 depicts a thermostat. A steganographic digital watermark 47 is textured or printed on the thermostat's case. (The watermark is shown in FIG. 15, but is ordinarily imperceptible to the viewer.) The watermark conveys information intended for the cell phone, allowing it to present a graphical user interface through which the user can interact with the thermostat. Barcodes or other data carriers may alternatively be used. This technology is further described below.
Image 42 depicts an item that includes a barcode 48. This barcode conveys Universal Product Code (UPC) data; other barcodes may convey different information. The barcode payload is not primarily intended to be read by the user's cell phone (in contrast to watermark 47), but may nevertheless be used by the cell phone to help determine an appropriate response for the user.
Image 43 shows a product that can be identified without reference to any express machine-readable information (such as a barcode or watermark). A segmentation algorithm may be applied to edge-detected image data to distinguish the apparent image subject from the background. The image subject can be identified by its shape, color, and texture. Image fingerprinting may be used to identify similar-appearing reference images, and metadata associated with those other images can then be harvested. SIFT techniques (discussed below) can be utilized for such pattern-based recognition tasks. Specular reflections in low-texture areas may indicate that the object is made of glass. Optical character recognition can contribute further information (reading visible text). All these clues can be used to identify the depicted item and to help determine an appropriate response for the user.
Additionally (or alternatively), similar-image retrieval systems, such as Google Similar Images and Microsoft Live Search, may be employed to find similar images, after which their metadata can be harvested. (As of this writing, these services do not directly support uploading of a user image to find similar web images. However, the user can post images to Flickr (e.g., using Flickr's cell phone upload functionality), where they will soon be discovered and processed by Google and Microsoft.)
Image 44 is a snapshot of friends. Face detection and recognition may be employed (i.e., identifying that there are faces in the image, and then identifying the particular faces) to annotate the image with metadata, for example by reference to user-associated data maintained by Apple's iPhoto software, Google's Picasa service, Facebook, etc. Some face recognition applications can be trained to work with non-human faces, for example cats, dogs, and animated characters including avatars. Geographic location and date/time information from the cell phone may also provide useful information.
Persons wearing sunglasses pose challenges for some face recognition algorithms. Identification of such individuals can be aided by their association with people whose identities are more easily determined (e.g., by conventional face recognition). That is, by finding other group photos in iPhoto/Picasa/Facebook/etc. that include one or more of the latter individuals, the other persons depicted in those photos become candidates for the unidentified persons in the subject image. These candidates form a far smaller universe of possibilities than is normally presented by the unconstrained contents of iPhoto/Picasa/Facebook/etc. The face vectors discernible from the sunglasses-wearing faces in the subject image can then be compared against this smaller set of possibilities to determine a best match. Whereas general face recognition might require a score of 90 to be considered a match (out of an arbitrary top score of 100), a score of 70 or 80 may suffice when matching against such a group-limited set of candidates. (If, as in image 44, two persons are depicted without sunglasses, the appearance of both of those individuals together with one or more others in another photo may increase that photo's relevance to this analysis - implemented, e.g., by increasing a weighting factor in the matching algorithm.)
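The group-limited matching idea just described can be sketched as follows. This is an illustrative sketch only: the scoring function, threshold values, names, and two-element "face vectors" are hypothetical stand-ins for a real face-recognition feature space.

```python
def best_match(face_vector, candidates, threshold):
    """Return the best-scoring candidate name at or above threshold, else None.
    Scoring is a toy similarity (100 minus mean absolute difference) -
    a real system would use a trained recognition model."""
    def score(a, b):
        return 100 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    scored = [(score(face_vector, c["vector"]), c["name"]) for c in candidates]
    s, name = max(scored)
    return name if s >= threshold else None

# Hypothetical gallery of all known users, vs. a group-limited subset
# drawn from photos in which the subject's companions also appear.
everyone = [{"name": "A", "vector": [10, 40]},
            {"name": "B", "vector": [12, 40]},
            {"name": "C", "vector": [60, 5]}]
group = [c for c in everyone if c["name"] in {"A", "B"}]  # co-occurrence prior

query = [25, 50]  # face partially occluded by sunglasses - noisy vector
print(best_match(query, everyone, threshold=90))  # None - too strict
print(best_match(query, group, threshold=75))     # B - relaxed threshold suffices
```

The unconstrained search fails at the strict threshold, while the group-limited search succeeds with the relaxed threshold, mirroring the 90-versus-70/80 example in the text.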
Image 45 shows part of Prometheus' statue at Rockefeller Center, NY. The identification may follow the disclosures set forth elsewhere in this specification.
Image 46 is a landscape depicting the Maroon Bells mountain area of Colorado. This image subject can be recognized by reference to geographic location data from the cell phone, together with geographic information services such as GeoNames or Yahoo! GeoPlanet.
(Note that while particular techniques have been described with reference to particular ones of the images 41-46 of Fig. 15, the noted techniques may be applied to the other images as well. Note, too, that some images present easier cases than others. For example, although the landscape image 46 is depicted farthest to the right, its geographic location data correlates strongly with the metadata "Maroon Bells," so this particular image presents a much easier case than many other images.)
In one embodiment, this processing of imagery occurs automatically - without express user command each time. Subject to constraints such as battery power and network connectivity, information can be gathered continuously from such processing and used in handling subsequently captured images. For example, an earlier image in the sequence that includes photograph 44 may show members of the depicted group without sunglasses - simplifying identification of those same persons when they later appear wearing sunglasses.
Fig. 16, Implementation
Fig. 16 is at the heart of one particular implementation incorporating certain of the features discussed earlier. (Other discussed features can be implemented by the artisan within this architecture, based on the present disclosure.) In this data-driven arrangement 30, operation of the cell phone camera 32 is set up by a setup module 34 and is then controlled by a control processor module 36, which controls the flow of packet data. (The control processor module 36 may be the cell phone's primary processor, or a co-processor, or this function may be distributed.) The packet data specifies the operations to be performed by a chain of processing stages 38.
In one particular implementation, the setup module 34 specifies - on a frame-by-frame basis - the parameters to be utilized by the camera 32 in gathering an exposure. The setup module 34 also specifies the type of data the camera should output. These command parameters are conveyed in a first field 55 of a header portion 56 of a data packet 57 corresponding to that frame (Fig. 17).
For example, on a frame-by-frame basis, the setup module 34 can issue a packet 57 whose first field 55 instructs the camera about, e.g., the length of exposure, the aperture size, and the lens focus. Module 34 can likewise direct modes of operation such as: binning sensor charge to reduce resolution (e.g., producing a frame of 640 x 480 data from a sensor capable of 1280 x 960); outputting data only from red-filtered sensor cells; outputting data only from a horizontal line of cells across the middle of the sensor; outputting data only from a 128 x 128 patch of cells at the center of the sensor; and the like. The camera command field 55 may further specify the exact time at which the camera is to capture data, e.g., to allow desired synchronization with ambient lighting (as described later).
Each packet 57 issued by the setup module 34 can include different camera parameters in the first header field 55. Thus, a first packet can direct the camera 32 to capture a full-frame image with an exposure time of one millisecond; a next packet can direct the camera to capture a full-frame image with an exposure time of ten milliseconds; and a third can specify an exposure time of 100 milliseconds. (These frames can later be processed in combination to yield a high dynamic range image.) A fourth packet can direct the camera to down-sample data from the image sensor - combining signals from differently color-filtered sensor cells - and output a 4 x 3 array of grayscale luminance values. A fifth packet may direct the camera to output data only from an 8 x 8 patch of pixels at the center of the frame. A sixth packet can direct the camera to output only five lines of image data: from the top, bottom, middle, middle-top, and middle-bottom rows of the sensor. A seventh packet may direct the camera to output data only from blue-filtered sensor cells. An eighth packet can direct the camera to disregard any auto-focus instruction and instead capture a full frame at infinity focus. Etc.
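The per-frame packet arrangement just described can be sketched in code. This is a minimal illustration, not the patent's actual data layout: the attribute and field names are assumptions, and a real implementation would use a compact binary header format.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    """Sketch of data packet 57: a camera-command field (55) followed by
    per-stage instruction fields (58a, 58b, ...) and a body for pixel
    data. Attribute names are illustrative, not from the specification."""
    camera_cmd: dict      # field 55: exposure, aperture, focus, etc.
    stage_fields: list    # fields 58a..58n, one per processing stage
    body: bytes = b""     # stuffed with image data by the camera

def exposure_bracket():
    """Three packets differing only in exposure time (1, 10, 100 ms),
    as in the high-dynamic-range example above."""
    return [Packet({"exposure_ms": e, "full_frame": True}, [])
            for e in (1, 10, 100)]

packets = exposure_bracket()
print([p.camera_cmd["exposure_ms"] for p in packets])  # [1, 10, 100]
```

Each packet's body starts empty; per the text, the camera fills it with the captured image data before passing the packet down the processing chain.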
Each such packet 57 is provided from the setup module 34, across a bus or other data channel 60, to a camera controller module associated with the camera. (Details of digital cameras - including arrays of photosensor cells, associated analog-to-digital converters, control circuitry, etc. - are well known to the artisan and are not belabored here.) The camera 32 reads the header field 55 of the packet, captures digital image data accordingly, and stuffs the resulting image data into the body 59 of the packet. It also deletes the camera commands 55 from the packet header (or marks the header field 55 in a manner that causes it to be ignored by subsequent processing stages).
When the packet 57 was created by the setup module 34, it also included a series of further header fields, each specifying how a corresponding, successive post-sensor stage 38 should process the captured data. As shown in Fig. 16, there are several such post-sensor processing stages 38.
The camera 32 outputs the camera-generated, image-stuffed packet (a "pixel packet") onto a bus or other data channel 61, by which it is passed to a first processing stage 38.
The stage 38 examines the header of the packet. Because the camera deleted (or marked to be ignored) the command field 55 that conveyed the camera instructions, the first header field encountered by the control portion of stage 38 is field 58a. This field details the parameters of an operation to be applied by stage 38 to the data in the body of the packet.
For example, the field 58a may specify parameters of an edge detection algorithm to be applied by stage 38 to the packet's image data (or simply that such an algorithm should be applied). It may further specify that stage 38 is to replace the original image data in the body of the packet with the resulting edge-detected set of data. (Replacement of the data, rather than appending, may be indicated by the value of a single-bit flag in the packet header.) Stage 38 performs the requested operation (which, in certain implementations, may involve configuring programmable hardware). The first stage 38 then deletes the instructions 58a from the packet header (or marks them to be ignored) and outputs the processed pixel packet for action by the next processing stage.
The control portion of the next processing stage (a stage that includes sub-stages 38a and 38b, discussed below) examines the packet's header. Since field 58a has been deleted (or marked to be ignored), the first field encountered is field 58b. In this particular packet, field 58b may direct the stage to perform no processing on the data in the body of the packet, but simply to delete field 58b from the packet header and pass the packet along to the next stage.
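The consume-and-delete header discipline described above can be sketched as follows. The dictionary layout and operation names are illustrative assumptions; the toy "edge detect" simply replaces the body with neighbor differences to show the replace-in-body behavior.

```python
def run_stage(packet, operations):
    """One post-sensor stage: read the FIRST remaining header field,
    apply the named operation to the packet body (or pass through),
    then delete the field so the next stage finds its own instructions
    first. Field and op names are illustrative, not from the patent."""
    if not packet["header"]:
        return packet
    instr = packet["header"].pop(0)      # delete-after-use, per the text
    op = instr.get("op")
    if op and op != "pass":
        packet["body"] = operations[op](packet["body"])
    return packet

# Toy 1-D "edge detection": replace the body with neighbor differences.
ops = {"edge_detect": lambda body: [abs(b - a) for a, b in zip(body, body[1:])]}

pkt = {"header": [{"op": "edge_detect"},   # field 58a, consumed by stage 38
                  {"op": "pass"}],         # field 58b, pass-through stage
       "body": [1, 4, 4, 9]}
pkt = run_stage(pkt, ops)   # applies edge detection, removes field 58a
pkt = run_stage(pkt, ops)   # pass-through, removes field 58b
print(pkt["body"])  # [3, 0, 5]
```

After both stages run, the header is empty and the body holds the edge-detected data, just as the text describes for the sequential field-consumption scheme.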
The next field of the packet header may instruct a third stage 38c to perform a 2D FFT operation on the image data found in the packet body, based on 16 x 16 blocks. It may further instruct the stage to hand off the processed FFT data to a wireless interface, for internet transmission to address 126.96.36.199, accompanied by specified data (e.g., describing an operation to be performed on the FFT data by the computer at that address, such as texture classification). It may likewise instruct the stage to hand off a single 16 x 16 block of FFT data, corresponding to the center of the captured image, on the same or a different air interface, for transmission to a second address 188.8.131.52 - again with instructions for its use (e.g., search an archive of stored FFT data for a match, return information if a match is found, and also store this 16 x 16 block in the archive with an associated identifier). Finally, the header authored by the setup module 34 can instruct stage 38c to replace the body of the packet with the single 16 x 16 block of FFT data that was dispatched to the air interface. As before, the stage also edits the packet header to remove (or mark) the instructions to which it responded, so that the header command field for the next processing stage is the first one encountered.
In other arrangements, the addresses of the remote computers are not hard-coded. For example, a packet may include a pointer to a database record or memory location (on the phone, or in the cloud) that contains the destination address. Or stage 38c may be instructed to hand off the processed pixel packet to a query router and response manager (e.g., Fig. 7). These modules then examine the pixel packet to determine what type of processing is required and route it to an appropriate provider (in the cell phone, or in the cloud if resources permit - either to an established, static provider, or to a provider identified through an auction). The provider returns the requested output data (e.g., texture classification information, and information about any matching FFT in the archive), and processing continues per the instructions of the next item in the pixel packet header.
The data flow continues in this fashion through as many stages as the particular operation requires.
In the illustrated arrangement, each processing stage 38 removes, from the packet header, the instructions on which it acted. The instructions are given in the header in the sequence of the processing stages, and this removal allows each stage to look to the first remaining instructions in the header for direction. Other arrangements can of course alternatively be utilized. (E.g., a module may insert new information into the header - at the front, the back, or elsewhere in the sequence - based on its processing results. This amended header then controls the packet's subsequent flow and processing.)
In addition to outputting data for the next stage, each stage 38 may also have an output 31 that provides data back to the control processor module 36. For example, processing by one of the local stages 38 may indicate that focus or exposure should be adjusted to optimize the suitability of an upcoming frame of captured data for a particular type of processing (e.g., object identification). This focus/exposure information can serve as predictive setup data for the camera the next time a frame of the same or similar type is to be captured. The control processor module 36 may formulate a frame request using a filtered or time-series prediction over a sequence of previous frames, or focus information from a subset of those frames.
Error and status reporting functions can also be achieved using outputs 31. Each stage may also have one or more further outputs 33, by which it can provide data to other processes or modules - within the cell phone, or remote ("in the cloud"). Data (in packet form, or otherwise) may be directed to such outputs in accordance with instructions in the packet 57, or otherwise.
For example, a processing module 38 can make data-flow decisions based on the results of processing it has performed. For instance, if an edge detection stage finds an image with strong contrast, the outgoing packet may be routed to an external service provider for FFT processing, and the provider may return the resulting FFT data for processing by further stages. If, however, the image has poor edges (e.g., it is out of focus), the system may not want an FFT to be performed on the data, and subsequent processing may take a different course. Thus, processing stages can introduce branches in the data flow, conditioned on parameters of the processing (such as identified image features).
Instructions specifying such conditional branching may be included in the header of the packet 57, or they may be provided otherwise. Fig. 19 shows one such arrangement. Instructions 58d in the original packet 57 specify a condition and identify a location in a memory 79 holding substitute instructions 58e'-58g'; if the condition is met, these substitute instructions replace the remainder of the packet header. If the condition is not met, execution proceeds per the header instructions already in the packet.
Still other arrangements can be utilized. For example, all the possible conditional instructions may be provided in the packet. Alternatively, one or more header fields may contain no explicit instructions, while the packet architecture is still used: such fields simply reference, e.g., a memory location from which the corresponding instructions (or data) are retrieved by the respective processing stage 38.
Memory 79 (which may include cloud components) can also facilitate adaptation of the process flow even when conditional branching is not utilized. For example, a processing stage may produce output data that determines the parameters of a filter or other algorithm to be applied by a later stage (e.g., a convolution kernel, a time delay, a pixel mask, etc.). These parameters can be stored in memory by the earlier processing stage (e.g., determined/computed and stored) and recalled for use by the later stage. In Fig. 19, for example, processing stage 38 produces parameters that are stored in memory 79; subsequent processing stage 38c later retrieves these parameters and uses them in executing its assigned operation. (The information in the memory may be labeled to identify the module/provider from which it originated, or for which it is intended; other addressing arrangements can also be used.) In this fashion, the operation of later stages can be adapted to circumstances and parameters not known when the setup module 34 originally authored the packet 57.
In one particular embodiment, each of the processing stages 38 comprises hardware circuitry dedicated to a particular task. The first stage 38 may be a dedicated edge-detection processor; the third stage 38c may be a dedicated FFT processor; other stages may be dedicated to other processes. These may include, without limitation, DCT, wavelet, Haar, Hough, and Fourier-Mellin transform processors; filters of different sorts (e.g., Wiener, low pass, band pass, high pass); and stages performing all or part of operations such as face recognition, feature calculation, shape extraction, extraction of color and texture feature data, bar code decoding, watermark decoding, object segmentation, pattern recognition, age and gender detection, emotion classification, orientation determination, compression, decompression, log-polar mapping, convolution, interpolation, decimation/down-sampling/anti-aliasing, correlation, square-root and squaring operations, matrix multiplication, perspective transformation, and butterfly operations (combining the results of smaller DFTs into larger DFTs, or decomposing larger DFTs into sub-transforms).
Instead of being dedicated, these hardware processors may be field-configurable. Thus, each of the processing blocks of Fig. 16 may be dynamically reconfigured as circumstances warrant. At one moment a block may be configured as an FFT processing module; at the next, as a filter stage, etc. At one moment the hardware processing chain may be configured as a bar code reader; at the next, as a face recognition system; and so on.
Such hardware reconfiguration information can be downloaded from the cloud, or from services such as Apple's App Store. Nor does the information need to reside statically on the phone once downloaded - it can be recalled from the cloud/app store whenever it is needed.
With increasing broadband availability and speed, the hardware reconfiguration data may be downloaded to the cell phone each time the phone is turned on or initialized, or each time a particular function is invoked. This eases a dilemma posed by having many different versions of an application in the marketplace at a given time - with the attendant problems faced by companies supporting products of different versions in the field, depending on when different users last downloaded updates. Instead, each time a device or application is initialized, all (or selected) features of the latest version are downloaded to the phone. This works for components such as hardware configuration data and driver software, as well as for overall system functionality. At each initialization, the hardware is freshly configured with the latest version of the applicable instructions. (Code used during initialization itself can be downloaded for use at the next initialization.) Updated code can also be downloaded only when a particular application requires it - dynamically loaded for specialized functions as well as for configuration. The instructions may further be adapted to the particular platform; for example, an iPhone device may utilize accelerometers different from those of an Android device, and the application instructions can vary accordingly.
In some embodiments, the processing stages may be concatenated in a fixed order; e.g., the edge detection processor may always be first, and the FFT processor third.
Alternatively, the processing modules may be interconnected by one or more busses (and/or crossbar or other interconnect architectures) that allow any stage to receive data from, and output data to, any other stage. Another interconnect approach is a network-on-chip (effectively a packet-based LAN - similar to a crossbar in adaptability, but programmable through network protocols). Such arrangements can also support iteration: one or more stages taking data they have already processed as input, and processing it further.
One iteration arrangement is shown by stages 38a/38b in Fig. 16. The output from stage 38a can be taken as an input to stage 38b. Stage 38b can be instructed not to process the data, but simply to apply it again to the input of stage 38a. This can be looped as many times as desired. When the iterations by stage 38a are complete, its output can be passed down the chain to the next stage 38c.
Rather than merely acting as a pass-through, stage 38b can perform its own type of processing on the data processed by stage 38a, and apply its output to the input of stage 38a. Stage 38a, in turn, can be instructed either to apply its processing again to the data produced by stage 38b, or to pass it through. Any combination of stage 38a/38b processing can thereby be achieved.
The roles of the stages 38a and 38b in the above may also be reversed.
In this manner, stages 38a and 38b can: (1) apply one or more iterations of stage 38a processing to the data; (2) apply one or more iterations of stage 38b processing to the data; (3) apply any combination and sequence of stage 38a and 38b processing to the data; or (4) simply pass the input data to the next stage without processing.
The camera stage can be incorporated into an iteration loop. For example, to achieve focus lock, a packet may be passed to a processing module that evaluates the focus of a frame from the camera (e.g., by looking for high-frequency image components, or strong edges; sample edge detection algorithms include Canny, Sobel, and differential methods; edge detection is also useful for tasks such as object tracking). The output from this processing module can be looped back to the camera's controller module, changing the focus signal. The camera captures a subsequent frame with the changed focus signal, and the resulting image is again provided to the processing module for focus evaluation. This loop continues until the processing module reports that focus is within a desired threshold range. (Parameters in the packet header, or in memory, can specify an iteration limit - e.g., specifying that if the required focus is not achieved within ten iterations, the iteration should terminate and an error signal should be output.)
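The focus-lock loop just described can be sketched as follows. The sharpness metric (mean absolute neighbor difference, a crude proxy for high-frequency content), the step size, the target value, and the fake camera are all illustrative assumptions.

```python
def sharpness(frame):
    """Crude focus metric: mean absolute neighbor difference, a proxy
    for high-frequency content / strong edges. Illustrative only."""
    return sum(abs(b - a) for a, b in zip(frame, frame[1:])) / (len(frame) - 1)

def focus_loop(capture, start=0.0, step=0.2, target=10.0, max_iters=10):
    """Capture a frame at the current focus setting, evaluate it, and
    adjust; stop when the metric reaches the target, or report an
    error after the iteration limit, per the text."""
    focus = start
    for _ in range(max_iters):
        frame = capture(focus)
        if sharpness(frame) >= target:
            return focus, None
        focus += step
    return focus, "focus not achieved within iteration limit"

# Hypothetical camera: frames show more edge contrast as focus nears 1.0.
def fake_capture(focus):
    contrast = int(20 * min(focus, 1.0))
    return [0, contrast, 0, contrast]   # alternating values = "edges"

focus, err = focus_loop(fake_capture)
print(round(focus, 1), err)  # 0.6 None - lock achieved before the limit
```

The loop terminates as soon as the evaluated frame crosses the sharpness threshold; with an unreachable target it would instead return the error string after ten iterations.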
Although this discussion has focused on serial processing of data, images or other data can also be processed along two or more parallel paths. For example, the output of stage 38d may be applied to two subsequent stages, each beginning a respective branch of a fork in the processing. The two chains may then proceed independently, or the data resulting from such processing may be combined - or otherwise used together - in a subsequent stage. (Each of these processing chains may itself branch further.)
Commonly, a fork will occur much earlier in the chain. That is, in most implementations, one parallel processing chain will be utilized to produce imagery for human consumption, and another for the machine. Thus, the parallel processing may branch immediately following the camera sensor 12, as shown by the branch point 17 of the figure. Processing for the human visual system 13 includes operations such as noise reduction, white balance, and compression; processing for object identification 14, in contrast, can include operations of the sorts detailed in this specification.
Where an architecture branches, or otherwise involves different parallel processes, different modules may complete their processing at different times. They can output data as processing completes - asynchronously - as the pipeline or other interconnect network allows; when the pipeline/network is free, the next module can deliver its completed results. Flow control can involve arbitration, such as giving one path or type of data higher priority. Packets can carry priority data that determines their ranking when arbitration is needed. For example, many image processing operations/modules consume Fourier-domain data, as produced by an FFT module. Output from the FFT module may thus be given higher priority than other traffic when arbitrating, so that Fourier data needed by other modules is made available with minimum delay.
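The priority arbitration described above can be sketched with a simple priority queue. The class, priority values, and packet labels are illustrative assumptions, not part of the specification.

```python
import heapq

class Arbiter:
    """When modules finish asynchronously and contend for the
    interconnect, deliver results in priority order (lower number =
    higher priority). FFT outputs get top priority here, since many
    downstream modules consume Fourier-domain data. Illustrative sketch."""
    def __init__(self):
        self._queue = []
        self._seq = 0   # tie-breaker preserving arrival order
    def submit(self, priority, packet):
        heapq.heappush(self._queue, (priority, self._seq, packet))
        self._seq += 1
    def next_packet(self):
        return heapq.heappop(self._queue)[2]

bus = Arbiter()
bus.submit(5, "edge-data")
bus.submit(0, "fft-data")      # highest priority
bus.submit(3, "texture-data")
print([bus.next_packet() for _ in range(3)])
# ['fft-data', 'texture-data', 'edge-data']
```

Although three results arrived in a different order, the FFT output is delivered first, minimizing the delay for modules waiting on Fourier data.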
In other implementations, some or all of the processing stages are not dedicated purpose processors but are general purpose microprocessors programmed by software. In yet other implementations, the processors are hardware-reconfigurable. For example, some or all of them may be field programmable gate arrays such as Xilinx Virtex series devices. Alternatively, they may be digital signal processing cores such as the Texas Instruments TMS320 series devices.
Still other implementations can use PicoChip devices, such as the PC302 and PC312 multicore DSPs. Their programming model allows each core to be independently coded (e.g., in C) and then to communicate with the others across an internal interconnect mesh. The associated tools particularly provide for use of such processors in cellular devices.
Other implementations can employ configurable logic on an ASIC. For example, a processor may include regions of configurable logic intermixed with dedicated logic. This allows configurable logic to be integrated into a dedicated pipeline, or into bus interface circuitry.
Implementations can also include one or more modules each having a small CPU with RAM, programmable code space for firmware, and workspace for processing - essentially a dedicated core. Such modules can execute a fairly wide range of computations and can be configured, as needed, to work in concert with the processing hardware.
All such devices may be deployed in a bus, crossbar, or other interconnect architecture that allows any stage to receive data from, and output data to, any other stage. (An FFT or other transform processor implemented in this fashion can be dynamically reconfigured to process differently sized blocks, e.g., 16 x 16, 64 x 64, 4096 x 4096, 1 x 64, 32 x 128, etc.)
In certain implementations, some processing modules are cloned - allowing parallel execution on parallel hardware. For example, multiple FFTs may be processed simultaneously.
In a variant arrangement, a packet conveys instructions serving to reconfigure the hardware of one or more processing modules. When such a packet enters a module, its header causes the module to reconfigure its hardware before any image-related data is accepted for processing. The architecture is thus configured on the fly by the packets themselves (which may or may not also convey image-related data). Packets can similarly convey firmware to be loaded into a module having a CPU core; software instructions can likewise be conveyed.
Module configuration instructions may be received over a wireless or other external network; they need not always reside on the local system. If the user requests an operation for which local instructions are not available, the system can request the appropriate reconfiguration data from a remote source.
Instead of conveying the configuration data/instructions themselves, a packet may simply convey an index number, a pointer, or other address information. This information can be used by the processing module to access a corresponding memory store from which the needed data/instructions are retrieved. As with a cache, if the local memory store is found not to contain the needed data/instructions, they can be requested from another source (e.g., accessed across the external network).
Such arrangements extend dynamic routing capability down to the hardware layer - reconfiguring a module as the data arrives.
Parallelism is widely exploited in graphics processing units (GPUs). Many computer systems utilize GPUs as co-processors to handle operations such as graphics rendering. Cell phones increasingly include GPU chips to allow the phones to serve as game platforms; these may be exploited in certain implementations of the present technology. (Without limitation, a GPU can be used to perform bilinear and bicubic interpolation, projective transforms, filtering, and the like.)
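As one concrete example of the data-parallel, per-pixel operations just mentioned, bilinear interpolation can be sketched as follows. A GPU applies this in hardware across millions of pixels at once; the plain-Python version below, with its tiny illustrative image, shows only the arithmetic.

```python
def bilinear(img, x, y):
    """Bilinear interpolation at fractional coordinates (x, y) in a
    row-major 2-D list 'img'. One of the per-pixel operations a GPU
    texture unit performs in hardware; shown here for illustration."""
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, len(img[0]) - 1)   # clamp at the image edge
    y1 = min(y0 + 1, len(img) - 1)
    tx, ty = x - x0, y - y0
    top = (1 - tx) * img[y0][x0] + tx * img[y0][x1]
    bot = (1 - tx) * img[y1][x0] + tx * img[y1][x1]
    return (1 - ty) * top + ty * bot

img = [[0, 10],
       [20, 30]]
print(bilinear(img, 0.5, 0.5))  # 15.0 - average of the four corners
```

Because each output pixel depends only on four input samples, the operation maps naturally onto the GPU's parallel execution model.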
According to another aspect of the technique, the GPU is used to correct lens aberrations and other optical distortions.
Cell phone cameras often exhibit optical nonlinearities, such as barrel distortion, focus that varies across the frame, and the like. This is particularly problematic when decoding digital watermark information from captured imagery. Using the GPU, the image can be treated as a texture map and applied to a correction surface.
Texture mapping is conventionally used, for example, to apply imagery of bricks or stone to a polygonal surface: texture memory data is referenced and mapped onto the plane or polygon as it is drawn. In the present context, it is the captured image that is applied to the surface, and the surface is shaped so that the image, as drawn, is corrected for the distortion.
The steganographic calibration signals of a digitally watermarked image can be used to identify the distortion the image has undergone. (See, e.g., Digimarc's patent 6,590,996.) Each patch of the watermarked image can be characterized by affine transform parameters, such as translation and scale. An error function for each position across the captured frame can thereby be derived. From this error information, a corresponding surface can be devised - a surface such that, when the distorted image is projected onto it by the GPU, the image appears in its original, undistorted form.
A lens can be characterized in this fashion using a reference watermark image. Once the associated correction surface has been devised, it can be reused for other images captured through the same optical system (since the associated distortion is fixed). Other images can be projected onto this correction surface by the GPU to correct the lens distortion. (Different focus depths and apertures may require different correction surfaces, as the light path through the lens differs.)
When a new image is captured, it may first be rectilinearized to remove keystone/trapezoidal perspective effects. Once linearized (i.e., re-squared relative to the camera lens), local distortions can be corrected by mapping the linearized image onto the correction surface using the GPU.
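The correction-surface idea can be illustrated in one dimension: distortion displaces samples along a scanline, and correction resamples the captured data through the inverse mapping - the 1-D analogue of drawing the image, as a texture, onto the correction surface. The scale-by-0.9 distortion and the ramp scene below are hypothetical stand-ins for a real lens characterization.

```python
def remap(signal, mapping):
    """Resample 'signal' through 'mapping' with linear interpolation.
    mapping[i] gives the (fractional) source position for output sample
    i; samples past the end are linearly extrapolated. Illustrative
    1-D analogue of GPU texture resampling."""
    out = []
    for src in mapping:
        i = min(int(src), len(signal) - 2)
        t = src - i
        out.append((1 - t) * signal[i] + t * signal[i + 1])
    return out

n = 9
scene = [float(x) for x in range(n)]        # ideal scanline (a ramp)
distort = [0.9 * j for j in range(n)]       # hypothetical lens mapping
captured = remap(scene, distort)            # what the sensor records
correct = [j / 0.9 for j in range(n)]       # inverse mapping, devised once
corrected = remap(captured, correct)        # "project onto" the correction

print([round(v, 3) for v in corrected])  # recovers [0.0, 1.0, ..., 8.0]
```

As the text notes, the correction mapping is devised once (e.g., from watermark calibration signals) and then reused for every image captured through the same optics.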
The correction model is thus essentially a polygonal surface, whose tilts and elevations correspond to the focal irregularities. Each region of the image has local transform metrics that allow correction of that fragment of the image.
A like arrangement can be used to correct for lens distortion in an image projection system. Prior to projection, the image is mapped - as a texture - onto a synthesized correction surface counteracting the lens distortion. When the so-processed image is projected through the lens, the lens distortion counteracts the previously applied correction-surface distortion, causing a corrected image to be projected from the system.
Depth of field was referenced earlier as one of the parameters that can be specified for camera 32 in gathering an exposure. Although a lens can be focused precisely at only one distance, the decrease in sharpness is gradual on each side of the focused distance. (The depth of field depends on the point spread function of the optics - including the lens focal length and aperture.) A capture is adequate so long as the captured pixels yield information useful for the intended operation.
Sometimes focus algorithms hunt for focus but fail to achieve it - wasting cycles and battery life. In some instances it is better simply to grab frames at a series of different focus settings. A search tree of focus depths (or depths of field) can be employed. This is particularly useful when the imagery includes multiple objects of potential interest, each in a different plane. The system can capture one frame focused at 6 inches and another focused at 24 inches. The different frames may reveal that there are two objects of interest within the field of view - one captured better in one frame, and the other captured better in the other. Or the 24-inch-focused frame may be found to have no useful data, while the 6-inch-focused frame has frequency content discernible enough to suggest that there are two or more object planes of interest; based on that frequency content, one or more further frames at different focus settings can then be captured. Or one area of the 24-inch-focused frame may exhibit one set of Fourier characteristics, and the same area of the 6-inch-focused frame a different set, from which a next tentative focus setting can be identified (e.g., at 10 inches) and another frame captured at that setting. Feedback is thus applied - the system need not acquire a full focus lock, but instead follows a search strategy to decide what further captures may yield additional useful detail. The search may branch, with the branching depending on the number of objects identified and the associated Fourier (or other) information, until satisfactory information about all the objects has been gathered.
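The midpoint-refinement idea above (e.g., trying 10 inches between the 6-inch and 24-inch captures) can be sketched as a simple search. The frequency-content proxy, the toy scene model with an object plane at 12 inches, and the round count are illustrative assumptions.

```python
def frame_energy(frame):
    """Proxy for discernible frequency content: sum of neighbor
    differences. Illustrative stand-in for a Fourier analysis."""
    return sum(abs(b - a) for a, b in zip(frame, frame[1:]))

def refine_focus(capture, settings, rounds=3):
    """Capture at each candidate focus setting, then repeatedly try a
    tentative midpoint between the two best-scoring settings - a
    simplified sketch of the 'search tree of focus depths' idea."""
    scored = {s: frame_energy(capture(s)) for s in settings}
    for _ in range(rounds):
        best = sorted(scored, key=scored.get, reverse=True)[:2]
        mid = sum(best) / 2
        if mid in scored:
            break
        scored[mid] = frame_energy(capture(mid))
    return max(scored, key=scored.get)

# Hypothetical scene with an object plane at 12 inches: frames gain
# contrast (energy) as the focus setting approaches 12.
capture = lambda s: [0, max(0.0, 10 - abs(s - 12)), 0]
print(refine_focus(capture, [6, 24]))  # 12.75 - converges toward 12 inches
```

Starting from only the 6-inch and 24-inch brackets, each round narrows toward the object plane without ever acquiring a conventional focus lock.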
A related scheme is to capture and buffer multiple frames while the camera lens system transitions to an intended focus setting. Analysis of the frame finally captured at the intended focus may suggest that the intermediate-focus frames contain useful information - for example, about objects not initially noticed or thought important. One or more of the earlier-captured, buffered frames may then be recalled and processed to provide information whose significance was not initially recognized.
Camera control can also be responsive to spatial coordinate information. By using geographic location data and direction data (e.g., from a magnetometer), the camera can verify that it is capturing the intended target. The camera set-up module can request images of specific objects or locations, as well as specific exposure parameters. One or more frames of image data may be captured automatically when the camera is in the correct position for capturing a particular object (which may have been identified previously, or identified by computer processing). (In some arrangements, the direction of the camera can be controlled by stepper motors or other electromechanical arrangements, so that the camera automatically sets the azimuth and elevation to capture image data from a particular direction. Electronic or fluidic adjustment of the lens direction may also be utilized.)
As noted, the camera set-up module may instruct the camera to capture a sequence of frames. In addition to advantages such as synthesis of high dynamic range images, such frames can be aligned and combined to yield super-resolution images. (As is known in the art, super-resolution can be achieved by various methods; for example, the frequency content of several images can be analyzed and correlated through linear transformations. Among other applications, this can be used in decoding digital watermark data from an image. If an object is too far from the camera to yield a generally satisfactory image resolution, the resolution can be effectively multiplied by these super-resolution techniques to obtain the higher resolution needed for successful watermark decoding.)
In an exemplary embodiment, each processing stage substitutes its processing results for the input data contained in the packet it receives. In other arrangements, the processed data may be added to the packet body while the originally-present data is retained. In this case the packet grows during processing, as more information is added. This may be disadvantageous in some contexts, but can also provide benefits. For example, it can avoid the need to fork the processing chain into two packets or two threads. Occasionally, both the original and the processed data are useful in subsequent stages. For example, an FFT stage may add frequency domain information to a packet that still contains the original pixel domain image. Both can then be used by a subsequent stage - for example, in sub-pixel alignment for super-resolution processing. Likewise, a focus metric can be extracted from the image and passed to following stages together with the image data.
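The append-rather-than-substitute packet behavior can be illustrated with a minimal sketch. The `PixelPacket` structure and the stage functions below are hypothetical names for illustration, not part of the specification:

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class PixelPacket:
    header: dict
    body: dict = field(default_factory=dict)

def fft_stage(packet):
    """A stage that ADDS frequency-domain data to the packet body,
    retaining the original pixel-domain data for later stages
    (e.g., sub-pixel alignment for super-resolution)."""
    packet.body["fft"] = np.fft.fft2(packet.body["pixels"])
    return packet

def focus_metric_stage(packet):
    """Another stage appends a scalar focus metric alongside the image,
    rather than replacing the image data."""
    packet.body["focus_metric"] = float(np.abs(packet.body["fft"]).std())
    return packet
```

Each stage leaves earlier data in place, so a downstream stage can consult both the pixel-domain and frequency-domain representations without forking the processing chain.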
It will be appreciated that the above-described arrangements can be used to control the camera to produce different types of image data on a frame-by-frame basis, and to control subsequent stages of the system to process each such frame differently. Thus, the system may capture a first frame under conditions selected to optimize green-channel watermark detection, capture a second frame under conditions selected to optimize barcode reading, and so on. Subsequent stages may be instructed to process each of these frames differently, in order to best extract the sought data. Every frame can be processed to detect illumination variations. Every other frame can be processed to evaluate focus, for example by calculating 16 x 16 pixel FFTs at nine different positions within the image frame. (Or there may be a fork by which all frames are evaluated for focus, which may be disabled or reconfigured to serve other purposes when the focus branch is not needed.)
In some implementations, frame capture may be tuned to capture the steganographic calibration signal present in a digital watermark signal, without regard for successful decoding of the watermark payload data. For example, the captured image data may be of low resolution - enough to identify the calibration signal, but insufficient to recover the payload. Or the camera may expose the image without regard for human perception - for example, overexposing so that image highlights are washed out, or underexposing so that other portions of the image are indistinguishable. Such an exposure may nonetheless be sufficient to capture the watermark orientation signal. (Feedback can, of course, be utilized to capture one or more subsequent image frames - redressing one or more shortcomings of the previous frame.)
Some digital watermarks are embedded in particular color channels (e.g., blue) rather than as a modulation of image luminance across colors (see, for example, commonly owned patent application 12/337,029 to Reed). When capturing a frame containing such a watermark, the exposure may be selected to yield the maximum dynamic range in the blue channel (e.g., 0-255 in an 8-bit sensor), regardless of the exposure of the other colors of the image. One frame may be captured to maximize the dynamic range of one color such as blue, and a later frame may be captured to maximize the dynamic range of another channel such as yellow (i.e., derived from the red and green channels). These frames can then be aligned and the blue-yellow difference determined. The frames may have quite different exposure times, depending on the illumination, subject, and the like.
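A rough sketch of this channel-specific exposure idea follows, under the assumptions (made for illustration) that frames arrive as RGB numpy arrays, that exposure can be approximated as a linear gain, and that "yellow" is taken as the mean of the red and green channels:

```python
import numpy as np

def exposure_gain_for_channel(frame, channel=2, target_max=255.0):
    """Suggest an exposure gain that stretches the chosen color channel
    (default: blue, index 2 in an RGB array) toward the sensor's full
    0-255 range, regardless of how the other channels fare."""
    peak = float(frame[..., channel].max())
    return target_max / peak if peak > 0 else 1.0

def blue_yellow_difference(blue_frame, yellow_frame):
    """Difference the blue channel of one (already aligned) frame against
    the yellow content of another, approximated as mean(red, green)."""
    b = blue_frame[..., 2].astype(float)
    y = (yellow_frame[..., 0].astype(float) + yellow_frame[..., 1]) / 2.0
    return b - y
```

A real implementation would adjust exposure time at the sensor rather than applying a digital gain, and would register the two frames before differencing.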
Desirably, the system has an operational mode in which it captures and processes imagery even when the user is not deliberately "snapping" photos. If the user presses the shutter button, such background image capture/processing operations may be suspended and the consumer picture-taking mode given precedence. In that mode, capture parameters and processing designed to optimize aspects of the image for the human visual system may be employed instead.
(Note that the particular embodiment shown in Figure 16 generates packets before any image data is collected. In contrast, Figure 10a and the associated discussion did not indicate packets existing before the camera. Both arrangements can be used in both embodiments: the packets can be established prior to the capture of image data by the camera, in which case the visual key vector processing and packaging module fills the previously-formed packets with the pixel data - or, more typically, with sub-sets or super-sets of it.) Likewise, in Figure 16, packets need not be generated until after the camera has captured the image data.
As noted at the outset, one or more processing stages may be remote from the cell phone. One or more pixel packets may be routed to the cloud (or through the cloud) for processing. The results may be returned to the cell phone, or sent on to another cloud processing node (or both). Once back at the cell phone, one or more further local operations may be performed. The data may then be sent back out to the cloud. Processing can thus alternate between the cell phone and the cloud. Eventually, the resulting data is typically provided back to the user at the cell phone.
Applicants expect different vendors to provide competing cloud services for specialized processing tasks. For example, Apple, Google and Facebook may each provide cloud-based face recognition services. The user device sends a packet of processed data off for processing. The header of the packet may indicate the user, the requested service and - optionally - micropayment instructions. (Again, the header may serve as an index used to look up, in a cloud database, a desired transaction or sequence of operations - e.g., purchase transactions, postings to Facebook, face- or object-recognition operations, and the like.) Once such an indexed transactional arrangement is initially established, it can thereafter be invoked simply by sending to the cloud a packet containing the identifier representing the desired operation, together with the image-related data.
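The header/routing arrangement might look like the following sketch. The service names, endpoint strings and field names are invented for illustration; they are not part of the specification:

```python
SERVICE_ENDPOINTS = {                        # hypothetical service registry
    "face_recognition": "cloud.example.com/face",
    "barcode_decode":   "cloud.example.com/barcode",
}

def make_packet(user_id, service, image_data, micropayment=None):
    """Assemble a service-request packet: the header identifies the user
    and the requested service; micropayment instructions are optional."""
    header = {"user": user_id, "service": service}
    if micropayment is not None:
        header["micropayment"] = micropayment
    return {"header": header, "body": image_data}

def route(packet):
    """Resolve the destination for the packet's requested service."""
    return SERVICE_ENDPOINTS[packet["header"]["service"]]
```

Once a transactional arrangement is indexed this way, invoking it is just a matter of emitting a packet whose header names the operation.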
In the Apple service, for example, the server may examine an incoming packet, look up the user's iPhoto account, access face recognition data for the user's friends from that account, apply face recognition to the incoming image data, determine the best match, and return the result information (e.g., the name of the depicted individual) back to the originating device.
At the IP address for the Google service, a server may launch a similar operation, but referring to the user's Picasa account. Likewise for Facebook.
It is easier to identify a face from among the faces of dozens or hundreds of known friends than to identify the face of a stranger. Other vendors can provide the latter kind of service. For example, L-1 Identity Solutions, Inc. maintains databases of images from government-issued credentials, such as driver's licenses. With appropriate permissions, facial recognition services drawing on these databases can be provided.
Other processing operations may similarly be performed remotely. One is a bar code processor, which takes processed image data transmitted from the cell phone and applies a decoding algorithm specific to the type of bar code present. The service may support one, several, or dozens of different types of barcodes. The decoded data may be returned to the phone, or the service provider may access other data indexed by the decoded data - such as product information, instructions, purchase options, etc. - and return such other data to the phone. (Or both can be provided.)
Another service is digital watermark reading. Another is optical character recognition (OCR). An OCR service provider may further couple recognition with translation services - e.g., converting the processed image data into ASCII symbols, and then providing the ASCII words to a translation engine to render them in a different language. Other services are sampled in FIG. (Practicality prevents listing the countless other services and component operations that may likewise be provided.)
Output from a remote service provider is often returned to the cell phone. In many cases the remote service provider returns processed image data. In some cases it may return ASCII or other such data. From time to time, however, the remote service provider may produce other types of output, including audio (e.g., MP3) and/or video (e.g., MPEG4 or Adobe Flash).
Video returned from the remote provider to the cell phone may be presented on the cell phone display. In some implementations, such video provides a user interface screen - inviting the user to select information or actions, or to make touches or gestures within the displayed presentation to issue instructions. Software on the cell phone can receive such user input and initiate responsive operations or provide responsive information.
Remote processing services can be provided under a variety of different financial models. An Apple iPhone service plan might bundle various remote services - such as iPhoto-based face recognition - at no additional cost. Other services may be billed per use, by monthly subscription, or under other usage plans.
Some services will not be particularly sophisticated, and may be marketed as commodities. Others may compete on quality; still others may compete on price.
As noted, stored data may identify preferred providers for different services. Providers may be explicitly identified (e.g., send all FFT operations to the Fraunhofer Institute service), or they may be specified by other attributes. For example, a cell phone user may specify that all remote service requests be routed to the providers ranked fastest in periodically-updated surveys of providers (e.g., by a consumers' organization). The cell phone can periodically check the published results for this information, or it can be checked dynamically when a service is requested. Other users can specify that service requests be routed to the service providers with the highest consumer satisfaction scores - again by reference to online rating resources. Another user may specify routing to the provider with the highest consumer satisfaction score - provided the service is offered free of charge; otherwise, routing to the lowest-cost provider. Combinations of these arrangements, and others, are of course possible. The user can, in particular instances, specify a particular service provider - trumping any selection indicated by the stored profile data.
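The profile-driven provider selection just described could be sketched as follows. The field names, rule identifiers and rating data are hypothetical choices for illustration:

```python
def choose_provider(providers, profile):
    """Select a remote service provider per stored profile rules.
    Each provider dict carries 'name', 'speed_rank' (1 = fastest),
    'satisfaction' and 'price' fields, as might be drawn from
    periodically-updated published ratings."""
    explicit = profile.get("preferred_provider")
    if explicit:                          # an explicit choice trumps the rules
        return next(p for p in providers if p["name"] == explicit)
    rule = profile.get("routing_rule", "lowest_cost")
    if rule == "fastest":
        return min(providers, key=lambda p: p["speed_rank"])
    if rule == "best_free_else_cheapest":
        free = [p for p in providers if p["price"] == 0]
        if free:
            return max(free, key=lambda p: p["satisfaction"])
    return min(providers, key=lambda p: p["price"])   # default: lowest cost
```

In practice the ratings would be fetched from published survey results, either periodically or dynamically at request time.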
In yet another arrangement, a user's request for service may be posted for bid, and multiple service providers may express interest in executing the requested operation. Or requests for proposals can be sent to various specific service providers (for example, to Amazon, Google and Microsoft). The responses of the different providers (pricing, other terms, etc.) can be presented to the user, who can choose among them; or the selection can be made automatically, based on previously-stored rules. In some cases, one or more competing service providers may be provided with the user data and may use it to begin - or fully execute - the subject operation even before the provider selection is finally made; the selected provider is thereby given the opportunity to shorten turnaround time, informed by the actual data. (See also the earlier discussion of remote service providers, including auction-based arrangements, e.g., in connection with Figures 7-12.)
As indicated elsewhere, certain external service requests may pass through a public hub (module), with the hub responsible for distributing those requests to appropriate service providers. Likewise, the results from certain external service requests can be routed back through the public hub. For example, payloads decoded from different digital watermarks by different service providers (or data decoded from different barcodes, or fingerprints computed from different content objects) may be referred to the public hub, which can compile statistics and aggregate information (akin to Nielsen's monitoring services - surveying consumers' encounters with different data). In addition to the decoded watermark data (barcode data, fingerprint data), the hub may also (or alternatively) be provided with a quality or reliability metric associated with each decoding/calculation operation. This can help flag packaging issues needing attention, printing problems, media wear problems, and so on.
In the implementation of Figure 16, communications to and from the cloud are facilitated by the pipe manager 51. This module (which may be realized by the query router and the cell phone-side portion of the response queue of FIG. 7) performs various functions related to communicating via the data pipe 52. (Pipe 52 will be understood to be a logical construct that can comprise various communication channels.)
One function of the pipe manager 51 is to negotiate for needed communication resources. The cell phone may have various communication networks and data carriers available to it - e.g., cellular data, WiFi, Bluetooth, etc. - some or all of which may be utilized. Each can have its own protocol stack. In one respect, the pipe manager 51 interacts with the respective interfaces for these data channels - determining the availability of bandwidth for different data payloads.
For example, the pipe manager may alert the cellular carrier's local interface and network that a payload will be ready for transmission starting in approximately 450 milliseconds. It may further specify the size of the payload (e.g., two megabits), its character (e.g., block data), and the required quality of service (e.g., data throughput rate). It can also specify a priority level for the transmission, so that the interface and the network can service it ahead of lower-priority data exchanges in the event of a conflict.
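The pipe manager's service request can be modeled with a simple data structure. The following is a toy sketch - a real negotiation would run against the carrier's protocol stack, and the field names and acceptance logic here are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class TransmissionRequest:
    start_ms: int            # transmission begins ~this many ms from now
    size_bits: int           # e.g., 2_000_000 (two megabits)
    character: str           # "block" or "burst"
    min_throughput_bps: int  # required quality of service
    priority: int            # higher beats lower when transfers conflict

def negotiate(request, channel_capacity_bps, busy_priority=0):
    """Toy channel-side negotiation: accept if the channel can meet the
    requested throughput and no equal/higher-priority transfer is in flight."""
    if channel_capacity_bps < request.min_throughput_bps:
        return {"accepted": False, "reason": "insufficient bandwidth"}
    if busy_priority >= request.priority:
        return {"accepted": False, "reason": "conflicting transfer"}
    return {"accepted": True, "start_ms": request.start_ms}
```

A rejection here corresponds to the carrier reporting that the requested transmission cannot be accommodated at the requested time or quality of service, which the pipe manager would relay to the control processor module.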
The pipe manager knows the expected size of the payload from information provided by the control processor module 36. (In the illustrated embodiment, the control processor module specifies the particular processing that produces the payload, and can thus estimate the size of the resulting data.) The control processor module can also predict the character of the data - e.g., whether it will be available as a fixed block or intermittently in bursts, the rate at which it will be provided for transmission, and so on. The control processor module 36 can likewise predict the time at which the data will be ready for transmission. Priority information is also known to the control processor module. In some instances the control processor module sets the priority level autonomously; in others, the priority level is specified by a person, or by the particular application being served.
For example, a user may explicitly signal - through the graphical user interface of the cell phone - or a particular application may routinely require, that an image-based operation be processed immediately. This may be the case, for example, where further action by the user is expected to depend on the results of the image processing. In other cases, the user may explicitly signal, or a particular application may permit, that the image-based operation be executed whenever convenient (e.g., when the necessary resources have low or idle utilization). This may be the case, for example, when a user submits a snapshot to a social networking site such as Facebook and would like the image annotated with the names of the depicted individuals, as determined by face recognition processing. Intermediate prioritizations (indicated by the user or by the application) - e.g., process within one minute, ten minutes, one hour, one day, etc. - may also be utilized.
In the illustrated arrangement, the control processor module 36 informs the pipe manager of the expected data size, character, timing and priority, so that the pipe manager can use this information in negotiating for the desired service. (In other embodiments, a lesser amount of such information may be provided.)
If the carrier and interface can satisfy the pipe manager's request, further data exchanges can prepare for the data transmission, and the remote system can likewise begin preparing for the expected operation. For example, the pipe manager can establish a secure socket connection with a particular computer in the cloud, identifying the user and the particular data payload to come. If the cloud computer is to perform a face recognition operation, it can prepare by retrieving, from Apple/Google/Facebook, the face recognition features and associated names for friends of the designated user.
Thus, in addition to preparing the channel for external communications, the pipe manager enables pre-warming of the remote computer in anticipation of the expected service request. (The service may or may not ultimately be requested. In some instances, the user may operate the shutter button and the cell phone does not yet know what operation will follow. Is the user requesting a face recognition operation? A bar code decoding operation? An image posting to Flickr or Facebook? In some cases the pipe manager - or control processor module - can pre-warm several processes. Or it can predict, based on past experience, what operation will be undertaken, and warm up the appropriate resources. For example, if face recognition operations followed the user's last three shutter operations, there is a good chance that the user will request face recognition again.) The system can begin to execute component operations for a predicted operation - particularly those operations whose results are useful to a variety of functions.
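The history-based prediction mentioned parenthetically above (e.g., face recognition following the last three shutter operations) can be sketched as follows. The operation names and the "confidence" labels are illustrative assumptions:

```python
from collections import Counter

def predict_next_operation(history, n=3):
    """If one operation followed each of the last n shutter presses,
    predict it with high confidence; otherwise fall back to the most
    common recent operation as a weaker pre-warming guess."""
    recent = history[-n:]
    if len(recent) == n and len(set(recent)) == 1:
        return recent[0], "high"
    if recent:
        return Counter(recent).most_common(1)[0][0], "low"
    return None, "none"
```

A "high" prediction might justify pre-warming remote resources for that one operation; a "low" prediction might instead warm only the component operations shared by several candidate functions.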
Pre-warming can also include resources in the cell phone: configuring processors, loading caches, etc.
The situations just reviewed assumed that the desired resources were ready to handle the expected traffic. In other situations the pipe manager may report that a carrier is unavailable (e.g., because the user is in an area with degraded wireless service). This information is reported to the control processor module 36, and can be used to reschedule the image processing, buffer results, or take other responsive action.
If other, conflicting data transmissions are in progress, the carrier or interface may respond to the pipe manager that the requested transmission cannot be accommodated - e.g., at the requested time, or at the requested quality of service. In this case the pipe manager may report this to the control processor module 36. The control processor module can then suspend the processing that would give rise to the 2 megabit data service requirement, and reschedule it for later. Alternatively, the control processor module can determine that the 2 megabit payload should be created as originally scheduled, with the results buffered locally for transmission whenever the carrier and interface can accommodate it. Or other action may be taken.
Consider a business meeting at which participants gather for a group photo before dinner. A user may want all the faces in the picture recognized immediately, so that she can quickly review the names and avoid the awkwardness of failing to associate a name with a face. Even before the user activates the shutter button on the cell phone, the control processor module may cause the system to process frames of image data to identify likely faces within the field of view (e.g., elliptical shapes, with two apparent eyes at the expected positions). These can be highlighted with rectangles on the cell phone's viewfinder (screen) display.
Current cameras often have photographic modes based on lens/exposure profiles (e.g., close-up, night, beach, landscape, snowy scenes, etc.); imaging devices according to the present technology may additionally (or alternatively) offer modes based on different image-processing operations. One mode may be selected by the user to obtain the names of persons depicted in a photograph (e.g., through face recognition). Another mode may be selected to perform optical character recognition on text found in the image frame. Another may trigger actions relating to purchasing a depicted item; another, to selling a depicted item; another, to obtaining information about a depicted object, scene, or person (e.g., from Wikipedia, a social network, or a manufacturer's web site); another, to establishing a ThinkPipe session relating to the depicted subject; and so on.
These modes can be selected by the user in advance of operating the shutter control, or afterwards. In other arrangements, a plurality of shutter controls (physical or GUI) is presented to the user - each invoking a different one of the available operations. (In still other embodiments, the device infers what operation(s) are likely desired, rather than having them explicitly indicated by the user.)
If the user at the business meeting takes a group picture depicting twelve individuals and requests the names on an "immediate" basis, the pipe manager 51 may report back to the control processor module (or application software) that the requested service cannot be provided as specified. Due to bandwidth or other constraints, the manager 51 may report that only three of the depicted faces can be identified within the quality-of-service parameters deemed to constitute an "immediate" basis. Another three faces may be recognizable within two seconds, and the full set of faces after five seconds. (The constraint may, in this instance, be due to the remote service providers rather than the carrier.)
The control processor module 36 (or the application software) may respond to such a report by reference to an algorithm, or to a set of rules stored in a local or remote data structure. The algorithm or rule set may indicate, for face recognition operations, that a delayed service should be accepted whenever it can be provided, and that the user should be alerted (via the device GUI) that there will be a delay of about N seconds before full results are available. Optionally, the reported cause of the expected delay may be exposed to the user. Other service exceptions can be handled differently - e.g., in some cases the operation is interrupted, rescheduled, or routed to a less desirable provider, and/or the user is alerted.
In addition to considering the capabilities of the local device's interface to the network, and the capabilities of the network/carrier, the pipe manager can also query resources out in the cloud - to confirm that they can handle the predicted data traffic and execute whatever services are requested (within specified parameters). These cloud resources may include, for example, data networks and remote computers. If any responds negatively, or responds with a qualified service level, this too can be reported back to the control processor module 36, so that appropriate action can be taken.
In response to communications from the pipe manager 51 indicating the likelihood of servicing the expected data flows, the control processor module 36 can issue corresponding commands to the pipe manager and/or other modules as needed.
In addition to the just-described tasks of pre-negotiating for required services and setting up appropriate data connections, the pipe manager can also act as a flow control manager - coordinating the transfer of data from different modules in the cell phone into the data pipe, and reporting errors back to module 36.
While the discussion above has focused on outgoing data traffic, there is a similar flow back into the cell phone. The pipe manager (and the control processor module) can help manage this traffic as well - providing services complementary to those discussed in connection with outgoing traffic.
In some embodiments there may be a counterpart pipe manager module 53 in the cloud - cooperating with the cell phone's pipe manager 51 in executing the functions just described.
Exemplary Control Processor and Pipe Manager Software
The field of autonomous robotics shares some challenges with the scenarios detailed herein - in particular, the problem of enabling systems of sensors to communicate data to the local and remote processes that determine locally-executed actions. In robotics, the common focus is on moving the robot appropriately through its environment; here, the more common focus is on providing a desired experience based on imagery, sounds, and so on.
Rather than performing simple operations such as obstacle avoidance, aspects of the present technology seek to provide richer experiences based on higher-level semantics derived from the sensor inputs. A user pointing a camera at a poster does not want to know the distance to the wall; the user is much more likely to want to know about the content of the poster - if the poster relates to a movie: the venues, the reviews, what friends think of it.
Despite these differences, architectural approaches from robotics toolkits can be adapted for use in the present context. One such robotics toolkit is the Player Project - a set of free software tools for robot and sensor applications, available as open source from sourceforge-dot-net.
An example of a player project architecture is shown in Figure 19A. A mobile robot (which typically has a relatively low performance processor) communicates with a fixed server (a relatively higher performance processor) using a wireless protocol. Various sensor peripherals are coupled to the mobile robot (client) processor via respective drivers and APIs. Similarly, services can be invoked by a server handler from a software library via other APIs. (The CMU CMVision library is shown in Figure 19a.)
(In addition to the basic tools for interfacing robotic devices to sensors and service libraries, the player project includes "Stage" software for simulating a population of mobile robots moving in a 2D environment, with various sensors and processes - including visual blob detection - and "Gazebo," which extends the Stage model to 3D.)
With this system architecture, new sensors can be quickly exploited - simply by providing driver software that interfaces to the robot API. Similarly, new services can be easily plugged in via the server API. The two player project APIs provide standardized abstractions, so that drivers and services need not concern themselves with the specific configuration of the robot or server (and vice versa).
(Figure 20a, discussed below, likewise provides an abstraction layer between sensors, locally available operations, and externally available operations.)
Certain embodiments of the present technology may be implemented with a local-processing/remote-processing paradigm similar to that of the player project, connected by a packet network and inter-process communications (e.g., named pipes, sockets, etc.). The communications protocol is the medium through which the different processes communicate; it can take the form of a message-passing paradigm with message queues, or more network-centric schemes in which collisions of key vectors are handled after the fact (retransmission, abandonment if no longer timely, etc.).
In such embodiments, data from the sensors on a mobile device (e.g., microphone, camera) can be packaged in key vector form, together with associated instructions. The instruction(s) associated with the data need not be express; they may be implicit, or session-specific, based on context or user needs (e.g., Bayer conversion may be implied for raw camera data; in a picture-taking mode, face recognition may be implied).
In a particular arrangement, the key vectors from each sensor are generated and packaged by device driver software processes, which abstract the hardware-specific details of the sensor and emit fully-formed key vectors conforming to the selected protocol.
The device driver software can then place each key vector on an output queue unique to that sensor, or on a common message queue shared by all sensors. In either event, local processes can consume the key vectors and perform needed operations, placing resulting key vectors back on the queue. Key vectors that are to be processed by remote services are then sent either to a router-like process - which packetizes and distributes the key vectors - or directly to the remote processes for further handling. It will be evident to the reader that instructions for initializing or configuring any of the system's sensors and processes can be distributed in a similar manner from the control processor (e.g., box 36 of Figure 16).
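The queue-based key vector flow can be sketched with Python's standard `queue` module. The key vector fields, the "bayer_convert" instruction and the stand-in local operation are illustrative assumptions, not part of the specification:

```python
import queue

shared_queue = queue.Queue()      # common message queue shared by all sensors

def camera_driver(pixels):
    """Device-driver process: abstracts the sensor hardware and emits a
    fully-formed key vector conforming to the (hypothetical) protocol."""
    shared_queue.put({"sensor": "camera", "data": pixels,
                      "instructions": ["bayer_convert"]})

def local_stage():
    """Local process: consumes a key vector, performs its operation, and
    re-queues the result for the next stage (or for a remote router)."""
    kv = shared_queue.get()
    kv["data"] = [p * 2 for p in kv["data"]]     # stand-in local operation
    kv["instructions"] = ["face_recognition"]    # work for the next stage
    shared_queue.put(kv)
    return kv
```

A router-like process at the end of the chain would pull remote-destined key vectors off the queue, packetize them, and dispatch them to the appropriate service providers.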
Branch Prediction; Commercial Incentives
The technique of branch prediction arose to meet the needs of increasingly complex processor hardware; it allows processors with long pipelines to fetch data and instructions (and, in some cases, to speculatively execute instructions) without waiting for conditional branches to be resolved.
A similar science can be applied in the present context - predicting what actions a human user will take. For example, as discussed above, the system just described may "pre-warm" certain processes or communication channels in anticipation of data or processing operations to come.
When a user takes the iPhone out of a wallet (the sensor is exposed to increased light) and lifts it toward eye level (detected by accelerometers), what is the user about to do? Reference can be made to past behavior to make a prediction. Particularly relevant may be: what the user did with the phone's camera the last time it was used; what the user did with the phone's camera at about this same time yesterday (and in the days and weeks before that); what the user last did at about this same location; and so on. Corresponding operations can be undertaken in anticipation.
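This contextual scoring might be sketched as follows. The weights, record format and field names are arbitrary choices for illustration; a real system would learn weights from the user's actual behavior:

```python
import datetime

def predict_action(history, now, location=None):
    """Score candidate actions by agreement with past behavior: the same
    hour of day, the same location, and whatever was done most recently.
    `history` is a list of (action, hour_of_day, location) records."""
    scores = {}
    for action, hour, loc in history:
        s = scores.get(action, 0)
        if hour == now.hour:
            s += 2                            # same time as prior days
        if location is not None and loc == location:
            s += 3                            # same place as before
        scores[action] = s
    if history:
        scores[history[-1][0]] += 1           # last thing done with the camera
    return max(scores, key=scores.get) if scores else None
```

The top-scoring action is a candidate for pre-warming the corresponding local and remote resources.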
If the user's latitude/longitude corresponds to a location inside a video rental store, this helps. It can be anticipated that the user will perform image recognition on the artwork of a DVD box. To speed the probable recognition operation - perhaps by SIFT or another feature recognition method - reference data for candidate DVDs can be downloaded and stored in the cell phone's cache. Recent releases are good predictions (excepting, e.g., children's (G-rated) movies or violent movies, where stored profile data indicates that the user has no history of watching such movies). So too are movies that the user has viewed in the past (as indicated by historical records - also available to the phone).
What might be of interest if the user's location corresponds to a city street, and magnetometer and other position data indicate she is facing north with the camera tilted upward? Even without image data, a quick reference to online resources such as Google Streetview suggests she is looking at business signage along Fifth Avenue. Feature recognition reference data for this geography can therefore be downloaded to the cache, for quick matching against the image data about to be acquired.
To speed execution, the cache should be loaded judiciously - so the most likely objects are considered first. Google Streetview imagery for that location includes metadata indicating signage for a Starbucks, a Nordstrom store, and a Thai restaurant. Stored profile data for the user reveals a daily visit to Starbucks (she has a branded customer card); that she shops regularly for apparel (her credit card history shows Macy's, though, rather than Nordstrom); and that she never eats at Thai restaurants. Perhaps the cache should be loaded so that the Starbucks signage is identified quickest, followed by Nordstrom's, and then the Thai restaurant's.
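By way of illustration, the profile-driven cache ordering might be sketched as follows (the business names and affinity scores are hypothetical):

```python
def order_cache(candidates, affinity):
    """Order recognition reference data so the likeliest match is tried first.

    candidates: business names gleaned from Streetview metadata.
    affinity: name -> score derived from stored profile data
    (loyalty cards, credit card history, etc.); unknown names score 0.
    """
    return sorted(candidates, key=lambda name: affinity.get(name, 0), reverse=True)

streetview_hits = ["Thai restaurant", "Nordstrom", "Starbucks"]
affinity = {"Starbucks": 9, "Nordstrom": 4}  # daily visits vs. occasional shopping
cache_order = order_cache(streetview_hits, affinity)
# cache_order: ["Starbucks", "Nordstrom", "Thai restaurant"]
```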
A low resolution image captured for display on the viewfinder fails to trigger the camera feature that highlights probable faces (e.g., for exposure optimization). This also helps: there is no need to pre-warm the complex processing associated with facial recognition.
She touches the virtual shutter button to capture a high resolution frame of imagery, and image analysis proceeds - attempting to recognize what is depicted, so the camera application can overlay graphical links associated with objects in the captured frame. (Or this can happen without user action - the camera may always be watching.)
In one particular arrangement, visual "baubles" (Fig. 0) are overlaid on the captured image. Tapping on any of the baubles brings up a screen of information, such as a ranked list of links. Unlike a Google web search, which orders results based on aggregated user data, the camera application attempts a ranking customized to the user's profile. If a Starbucks sign or logo is found in the frame, the Starbucks link is placed at the top of the list for that user.
If signage for the Starbucks, the Nordstrom, and the Thai restaurant is all found, the links are normally presented in that order (per the user preferences deduced from profile data). However, the cell phone application may have a capitalist streak, and may promote a link by one or more positions (though perhaps not to the top spot) if circumstances warrant. In this scenario, the cell phone routinely sends IP packets to web servers at addresses associated with each of the links, alerting them that an iPhone user has recognized their corporate signage from a particular latitude/longitude. (If privacy considerations and user permissions allow, other user data may also be provided.) The Thai restaurant's server responds immediately, offering a discount - 25% off any item for the next two patrons (the restaurant has only four tables seated and no orders pending; the kitchen is idle). The restaurant's server offers the phone 3 cents if it presents this discount offer with the search results, 5 cents if it additionally promotes the restaurant's link to second place in the ranked list, and 10 cents if the restaurant's discount offer is the only one presented. (Starbucks and Nordstrom may respond with incentives of their own.) The cell phone quickly accepts the restaurant's offer, and payment is swiftly made - perhaps to the user (e.g., as a credit on the monthly phone bill), or to the phone carrier (e.g., AT&T). Links to Starbucks, the Thai restaurant, and Nordstrom are presented in that order, with the restaurant's link noting the discount available to the next two patrons.
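The link-promotion mechanic just described can be sketched as follows; the bid amounts and the one-position promotion limit are illustrative assumptions:

```python
def rank_links(base_order, bids, max_promote=1):
    """Re-rank links for display.

    Start from the user-preference ordering, then let the highest-bidding
    vendor promote its link by at most max_promote positions.
    """
    ranked = list(base_order)
    if not bids:
        return ranked
    best = max(bids, key=lambda k: bids[k])
    if best in ranked and bids[best] > 0:
        i = ranked.index(best)
        ranked.insert(max(0, i - max_promote), ranked.pop(i))
    return ranked

prefs = ["Starbucks", "Nordstrom", "Thai restaurant"]
bids = {"Thai restaurant": 5}  # cents offered for a second-place listing
# rank_links(prefs, bids): ["Starbucks", "Thai restaurant", "Nordstrom"]
```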
Google's AdWords technology is well known. It determines which ads are presented as sponsored links alongside the results of a Google web search, based on factors including auction-determined payments. Google has adapted this technology in its AdSense service, placing ads on third-party websites and blogs based on the particular content of those sites.
According to another aspect of the present technology, AdWords/AdSense technology is extended to visual image search on cell phones.
Consider a user in a small bookstore who snaps an image of the Warren Buffett biography The Snowball. The book is quickly recognized. But rather than presenting a corresponding Amazon link at the top of the list (as might occur with a regular Google search), the cell phone recognizes that the user is located in an independent bookstore. A context-based rule dictates that non-commercial links be presented first. Top ranked among these is a Wall Street Journal book review, which goes to the head of the presented list of links. However, commerce operates here too. The cell phone passes the book title or ISBN (or the image itself) to Google AdSense or AdWords, which identifies the advertiser links to be associated with that object. (Google may independently execute its own image analysis of any provided image.) Advertisers may pay a premium for placement with such cell phone-presented imagery. Using data from these or other resources, Google identifies Barnes & Noble for the top advertiser position, followed by alldiscountbooks-dot-net. The cell phone application may present these advertiser links in a graphically distinct manner (e.g., in a different part of the display, or in a different color) to indicate their source, or may instead intersperse them among the non-commercial search results, e.g., at positions 2 and 4. The AdSense revenue collected by Google may again be shared with the user and/or the user's carrier.
In some embodiments, the cell phone (or Google) again pings the servers of the companies whose links will be presented - helping those companies track the online visibility of their physical-world presence. The pings include the location of the user and an identification of the object that prompted the ping. When alldiscountbooks-dot-net receives such a ping, it may check inventory and find it has a significant overstock of The Snowball. As in the example given earlier, it may offer an additional payment for some extra promotion (e.g., including "We have 732 in stock - cheap!" in the presented link).
In addition to paying incentives for a more prominent search listing (e.g., higher in the list, or expanded with additional information), a company may also pay for additional bandwidth to serve information to the user. For example, a user may capture video imagery from an electronic signboard and want to download a copy to show friends. The user's cell phone identifies the content as a popular clip on user-generated content sites (e.g., by reference to an encoded watermark), available from multiple sites - YouTube being the most popular, followed by MySpace. To steer the user toward its link, MySpace can offer to upgrade the user's baseline wireless service from 3 megabits per second to 10 megabits per second, so the video will download in a third of the time. The upgraded service may apply only to that video download, or it may be longer-lasting. The MySpace link presented on the user's cell phone screen can be amended to tout the availability of the faster service. (Again, MySpace can make an associated payment.)
Sometimes a bandwidth throttle is imposed at the cell phone end of the radio link, to mitigate network bottlenecks. Or a change in bandwidth service must be requested or authorized by the cell phone. In such cases, MySpace can instruct the cell phone application to take the steps needed for the higher bandwidth service, and MySpace will reimburse any associated costs to the user (or to the carrier, for credit to the user's account).
In arrangements in which quality of service (e.g., bandwidth) is managed by the pipe manager 51, instructions from MySpace can request that the pipe manager arrange the increased quality of service - and begin setting up the anticipated high bandwidth session even before the user selects the MySpace link.
In some scenarios, vendors may arrange preferential bandwidth for their content. MySpace can strike a deal with AT&T, for example, under which all MySpace content delivered to AT&T phone subscribers is carried at 10 megabits per second, even though most subscribers normally receive only 3 megabits per second. The higher quality of service can be touted to the user in the presented link.
From the foregoing it will be recognized that, in certain embodiments, the information presented by a mobile phone in response to visual stimuli can be a function of (1) the user's demographic profile, (2) the user's preferences, and (3) commercial considerations - possibly all of them. Users who are demographically identical but have different tastes are likely to be served different baubles, or associated information, when looking down the same restaurant-lined street. Users with identical tastes and preference information - but different demographic factors (e.g., age and gender) - may likewise be served different baubles/information, since vendors may pay differently for different eyeballs.
User behavior modelling
Aided by knowledge of a particular physical environment, of specific location and time, and of expected user behavior profiles, simulation models of the physical world and of human-computer interaction can be constructed - drawing on tools and techniques from a variety of fields. Illustrative model inputs include the number of mobile devices expected at a museum at a particular time; the particular sensors those devices may employ; and the stimuli those sensors are expected to capture (e.g., where cameras are pointed, what microphones are hearing, etc.). Additional inputs may include assumptions about the social relationships among users: are they likely to share common interests? Do they move in common social circles whose members may want to share content, share experiences, or create location-based experiences such as wiki-maps? (See, e.g., Barricelli, "Map-Based Wikis as Contextual and Cultural Mediators," MobileHCI 2009.)
Additionally, such modeling can draw on more advanced prediction models based on innate human behaviors (e.g., people are more likely to capture images of a scoreboard during breaks in a game than during play), or on generalized heuristics derived from observations at comparable events (e.g., how many people used cell phone cameras to capture images of the scoreboard during a Portland Trail Blazers basketball game, etc.).
Such models can inform many aspects of the experience delivered to users, as well as the activities of the business entities involved in preparing and measuring that experience.
These latter entities comprise the conventional value chain participants associated with an event production, together with parties to arrangements for measuring and monetizing the interactions. On the production side are event planners, producers, and technicians; on the royalty side are the associated rights-holders and their societies (ASCAP, the Directors Guild of America, etc.). In the measurement landscape, both sampling-based techniques and census techniques (drawing on the users and devices determined to have opted in) can be utilized. Metrics range from click-through rates (CTR) for more dynamic models, to revenue per unit (RPU) based on the digital traffic generated on the service provider's network (i.e., how much bandwidth is consumed) for more static environments.
For example, the Mona Lisa is likely to have a much higher CTR than the Louvre's other paintings, so content preparation can be prioritized accordingly: content related to the Mona Lisa should be preloaded onto the user's mobile device as the user approaches or enters the museum or, failing that, cached as close to the edge of the cloud as possible. (Naturally, CTR plays a similarly important role in monetizing experiences and environments.)
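The CTR-driven prioritization of edge caching might be sketched as a simple greedy selection. The artwork names, CTR values, and sizes below are invented for illustration:

```python
def cache_priority(content, budget_mb):
    """Pick which museum content to cache at the network edge.

    content: dict of name -> (expected_ctr, size_mb).
    Greedily admits the highest-CTR items that fit the cache budget.
    """
    chosen, used = [], 0.0
    for name, (ctr, size) in sorted(content.items(), key=lambda kv: -kv[1][0]):
        if used + size <= budget_mb:
            chosen.append(name)
            used += size
    return chosen

louvre = {
    "Mona Lisa": (0.90, 40),
    "Winged Victory": (0.40, 40),
    "Wedding at Cana": (0.20, 40),
}
# With an 80 MB budget, the two highest-CTR works are cached.
```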
Consider a school group that enters a sculpture museum whose garden holds a collection of Rodin works. The museum may stage content related to Rodin and his works on the servers or infrastructure serving the garden (e.g., router caches). Moreover, because the visitors comprise a pre-established social group, the museum can anticipate certain social connectivity, and can enable sharing capabilities that might not otherwise be activated (e.g., ad hoc networking). If one student queries the museum's online content to learn more about a particular Rodin sculpture, the system can follow delivery of the detailed information with an invitation for the student to share it with the rest of the group. The museum's server can even suggest the particular "friends" with whom the information might be shared - if such information is publicly accessible from Facebook or another social networking data source. In addition to friends' names, such data sources can also yield device identifiers, IP addresses, profile information, etc., for the student's friends - which can be leveraged to help disseminate the educational material to the rest of the group. Since the information was of interest to one member of the group, the other students may well find it relevant too - even if the originating student is not identified by name. If the originating student is identified along with the delivered information, this can further increase the information's interest to the rest of the group.
(The presence of a socially linked group can be inferred, for example, from a review of the museum's network traffic. When one device sends data packets to another device, and the museum's network handles both ends of the communication - origination and delivery - an association between the two devices can be inferred. If the devices are not ones with a historical pattern of network usage at the museum (e.g., employees' devices), the system can conclude that two socially-related visitors are present. When a web of such communications is detected - associating several unfamiliar devices - a visiting social group can be identified. The size of the group is indicated by the number of distinct participants in this network traffic. Demographic information about the group may also be inferred: middle school students may show a high frequency of MySpace traffic; college students may communicate with addresses in college domains; senior citizens may exhibit different traffic profiles still. All of this information can be used to automatically tailor the information and services offered to the visitors - and can also yield useful information for museum management (or, analogously, for marketing in department stores).)
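The traffic-based group inference described in the parenthetical could be sketched as a connected-components computation over the network's packet log. The log format and the rule excluding familiar (e.g., staff) devices are illustrative assumptions:

```python
from collections import defaultdict

def find_social_groups(packet_log, familiar_devices):
    """Cluster unfamiliar devices that exchange traffic on the local network.

    packet_log: (src_id, dst_id) pairs where the network handled both ends.
    familiar_devices: ids with a historical usage pattern (e.g., staff),
    which are excluded from the inference.
    Returns the connected components with more than one member.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for src, dst in packet_log:
        if src not in familiar_devices and dst not in familiar_devices:
            parent[find(src)] = find(dst)  # union the two endpoints

    groups = defaultdict(set)
    for dev in parent:
        groups[find(dev)].add(dev)
    return [g for g in groups.values() if len(g) > 1]
```

The number of components and their sizes correspond to the group-size estimate discussed above.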
Consider other situations with predictable crowd behavior. One is the halftime show of a football Super Bowl, featuring a headline performer (e.g., Bruce Springsteen, or Prince). The show can prompt huge numbers of fans to capture image or audio-video records of the event. Another is the conclusion of an NBA championship basketball game, when fans may want to commemorate the excitement of the final buzzer: the scoreboard, and streamers and confetti falling from the ceiling. In these cases, actions can be taken in advance to prepare, optimize, or deliver content and experiences. Examples include pre-clearing rights to associated content; rendering virtual worlds and other synthesized content; shifting time-insensitive network traffic to off-peak periods; queuing advertising resources to be invoked when a person orders commemorative books/music from Amazon; caching web pages (some pre-written/edited and ready to go); caching star players' Twitter feeds; buffering video from the champion's home city, showing hometown crowds watching on a Jumbotron display and erupting in joy at the buzzer; and pre-staging/caching, wherever possible, anything else related to the experience or to likely follow-on activities.
Stimuli (audio, visual, tactile, olfactory, etc.) that are likely to trigger user behavior and attention are more valuable, from an advertising standpoint, than stimuli less likely to trigger such behavior (akin to the economic principles underlying Google's AdWords ad-serving system). These factors and metrics can feed directly into advertising models, through auction mechanisms well understood by artisans.
Dynamic environments, in which the stimuli presented to users and their mobile devices are controllable (e.g., video displays, in contrast to static posters), offer new opportunities for the measurement and exploitation of metrics such as CTR.
Background music, content on digital displays, lighting, etc., can all be modified to maximize CTR and to shape traffic. For example, illumination of a particular venue area may be increased - or made to flash - when a targeted individual passes by. Similarly, when a flight from Japan lands at an airport, digital signage, music, etc., can be modified to maximize CTR - whether overtly (changing advertising to suit the anticipated audience's interests) or covertly (changing the linked experiences that the stimuli evoke).
Mechanisms can likewise be introduced to counteract erroneous or unauthorized sensor stimuli. Within the confines of a business park, for example, stimuli (posters, music, digital signage, etc.) that are not faithful to the intentions or policies of the property owner - or of the entity responsible for the domain - may need to be managed. This can be accomplished through geography-specific blocking mechanisms (not unlike the region coding used on DVDs), so that any attempt, within particular GPS coordinates, to route a key vector to a particular destination in the cloud is subject to routing services managed by the domain owner.
Other options include filtering the resulting experience. Is it age-appropriate? Does it run afoul of pre-existing advertising or branding arrangements - such as Coca-Cola advertising being delivered to users inside the Pepsi Center during a Denver Nuggets game?
This may be achieved in the cloud - e.g., by adherence to content recognition rules associated with the conflicting media content (c.f., www.movielabs-dot-com/CRR), or by DMCA-style automated take-down mechanisms. It can also be achieved on the device, e.g., through the use of parental controls or content rules provided by the carriers.
Under various rights management paradigms, licenses play a key role in determining how content can be consumed, shared, modified, and so forth. One result of extracting meaning from the stimuli presented to a user (and the user's mobile device), and/or from the location where those stimuli are presented, can be the issuance of a license for desired content or experiences (games, etc.). To illustrate, consider a user at a rock concert venue. The user may be granted a provisional license to preview and listen, on iTunes, to all music tracks by the performing artists (and/or others). However, such a license may persist only for the duration of the concert, only from the time the doors open until the headline act begins to perform, or only while the user remains at the venue. The license then terminates.
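A venue- and time-limited license of the kind just described could be represented minimally as follows; the field names are illustrative and not drawn from any particular rights-management scheme:

```python
from datetime import datetime

def license_valid(lic, when, at_venue):
    """Check a provisional, venue-bound content license.

    lic: dict with 'start' and 'end' datetimes, plus a 'venue_only' flag.
    The license is good only inside its time window and, if venue-bound,
    only while the user remains at the venue.
    """
    in_window = lic["start"] <= when <= lic["end"]
    return in_window and (at_venue or not lic["venue_only"])

concert_license = {
    "start": datetime(2009, 7, 4, 19, 0),  # doors open
    "end": datetime(2009, 7, 4, 23, 0),    # show ends
    "venue_only": True,
}
```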
Similarly, passengers arriving on international flights might be granted location- or time-limited licenses to translation services, or to navigation services, for their mobile devices (e.g., an augmented reality system that overlays directions to baggage claim, ground transportation, toilets, etc., on camera-captured scenes) - valid while they remain at the airport, or for 90 minutes after arrival.
Such arrangements can also serve as triggers and filtering mechanisms for experiences. One embodiment in which the sharing of experiences is triggered by sensor stimuli operates through broadcast social networks (e.g., Twitter) and syndication protocols (e.g., RSS web feeds/channels). Other users, entities, or devices can subscribe to such broadcasts/feeds as a basis for subsequent communications (social, information search, etc.), or for logging measurements (e.g., audience ratings) or activities (e.g., a personal daily journal). Traffic associated with such networks/feeds can also be measured by devices at a particular location - allowing users to "travel back in time" to learn who was communicating at a particular moment. This makes it possible to retrieve and collect additional information, e.g.: Was my friend here last weekend? Was anyone from my peer group here? What content was consumed? Such traffic also makes it possible to monitor, in real time, which users are sharing experiences. Monitoring "tweets" about a performer's song selections during a concert can allow the performer to change the set list for the remainder of the show. The same applies to brand management: if users share opinions about a vehicle during an auto show, live keyword filtering of the traffic can allow the brand owner to reposition certain products for maximum effect (e.g., the new Corvette model spends more time on the rotating platform, etc.).
More on Optimization
Predicting a user's behavior or intent is one form of optimization. Other forms involve configuring processes to improve performance.
To illustrate with one particular arrangement, consider again the common services classifier of FIG. Which key vector operations should be executed locally, which remotely, and which in hybrid fashion? In what order should key vector operations be executed? Etc. The scheduling of the expected operations, and their mix, should be configured in a manner suited to the processing architecture, environment, and context.
One step in the process is to determine what operations should occur. This determination may be based on explicit requests from the user, historical patterns of use, context and status, and so forth.
Many operations are high-level functions that involve the actions of a number of component operations, performed in a particular order. For example, optical character recognition may require edge detection, followed by detection of edge patterns, followed by template matching on candidate regions. Facial recognition may involve skin tone detection, Hough transforms (to identify elliptical regions), identification of feature locations (pupils, nose, mouth), eigenface computation, and template matching.
The system can identify the component operations that may need to be executed, and the order in which their respective results are required. Rules and heuristics can be applied to help determine whether these actions should be performed locally or remotely.
For example, at one extreme, rules may specify that simple operations such as color histogramming and thresholding should generally be performed locally. At the other extreme, complex operations may generally default to external providers.
Scheduling may be determined based on which operations are prerequisites for others. This can also affect whether an operation should be performed locally or remotely (local execution can provide faster results - allowing subsequent operations to start with less delay). The rules may seek to identify the operation whose output(s) are used by the greatest number of subsequent operations, and execute it first (its own prerequisites permitting). Operations that are prerequisites for fewer subsequent operations are executed later. The operations and their sequence can be plotted as a tree structure - with the most globally important executed first, and operations on which fewer others depend executed afterward.
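The prerequisite-driven ordering just described can be sketched as a topological ordering in which, among the operations whose prerequisites are satisfied, the one whose output feeds the most later operations runs first. The component-operation names below are illustrative:

```python
from collections import defaultdict

def schedule(prereqs):
    """Order operations so prerequisites run first; among ready operations,
    prefer those whose outputs are consumed by the most later operations.

    prereqs: dict of op -> list of prerequisite ops.
    """
    consumers = defaultdict(int)
    for op, reqs in prereqs.items():
        for r in reqs:
            consumers[r] += 1
    order, done, pending = [], set(), set(prereqs)
    while pending:
        ready = [op for op in pending if set(prereqs[op]) <= done]
        ready.sort(key=lambda op: (-consumers[op], op))  # most-consumed first
        order.append(ready[0])
        done.add(ready[0])
        pending.remove(ready[0])
    return order

ocr = {
    "edge_detect": [],
    "segment": ["edge_detect"],
    "histogram": ["edge_detect"],
    "template_match": ["segment"],
}
# edge_detect feeds two later operations, so it runs first.
```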
However, these determinations can be made (or influenced) by other factors as well. One is power. If the cell phone's battery is low, or of limited capacity, and an operation would impose a significant drain, this can tip the balance toward performing the operation remotely.
Another factor is response time. In some instances, the limited processing power of a cell phone means local processing is slower than remote processing (e.g., a more powerful, parallel architecture may be available remotely to execute the operation). In other instances, the delays of establishing communication and setting up a session with a remote server make local execution faster. Depending on the needs of the user and of other operation(s), the speed with which results are returned may or may not be significant.
Another factor is user preference. As noted elsewhere, the user can set parameters that influence where and when operations are performed. For example, a user may specify that an operation should be referred to a remote service provider in the user's home country when one is available, but performed locally otherwise.
Routing constraints are another factor. Sometimes the cell phone will be in a WiFi or other service area (e.g., at a concert venue) where the local network provider places restrictions or conditions on the remote services that can be accessed through the network. At a concert where photography is prohibited, for example, the local network may be configured to block access to external image processing service providers for the duration of the concert. Services normally routed for external execution must then be run locally.
Another factor is the particular hardware with which the cell phone is equipped. If dedicated FFT processing is available on the phone, intensive FFT operations will tend to be executed locally, exploiting it. If only a modest general-purpose CPU is available, intensive FFT operations will more likely be referred out for external execution.
A related factor is current hardware utilization. Even if the cell phone is equipped with hardware well suited to a particular task, that hardware may be busy - leading the system to refer the next task of that type to an external source.
Other factors are the length of the local processing chain and the risk of stalling. Pipelined processing architectures can stall for intervals while waiting for data needed to complete an operation, and such a stall can delay all subsequent operations. If the assessed risk of a stall is sufficiently large (judged, e.g., from historical patterns, or from knowledge that an operation requires other data - such as results from an external process - that is not guaranteed to be available in time), the operation may be referred to an external process, to avoid stalling the local processing chain.
Another factor is connection state. Has a reliable high-speed network connection been established? Or are packets being dropped, or is the network slow (or entirely unavailable)?
Geographic considerations of different types can also be factors. One is the network proximity of the service provider. Another is whether the cell phone has unmetered access to the network (as in its home area), or a pay-per-use arrangement (as when roaming in another country).
Information about the remote service provider(s) may also be a factor. Does the provider offer immediate turnaround, or are requested operations placed in a queue behind other users waiting for service? Once the provider is ready to handle the job, how fast is execution expected to be? Cost can also be a key factor, along with other attributes of importance to the user (e.g., whether the service provider meets "green" standards of environmental responsibility). Many other factors can likewise be considered, as appropriate in particular contexts. The sources for such data may include the various elements shown in the exemplary block diagrams, as well as external resources.
A conceptual diagram of what has been described above is provided in FIG. 19B.
Based on various factors, a determination is made as to whether the operations should be performed locally or remotely. (The same factors can be evaluated to determine the order in which operations should be performed.)
In some embodiments, the different factors are quantified as scores, which may be combined in polynomial fashion to yield an overall score indicating how the operation should be handled. This overall score serves as a metric of the operation's relative fitness for remote (external) processing. (Similar scoring schemes can be used to choose among different service providers.)
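A minimal sketch of such factor scoring follows, using a weighted linear combination (the simplest polynomial form); the factor names, weights, and threshold are illustrative assumptions:

```python
def remote_fitness(factors, weights):
    """Combine per-factor scores (each 0..1) into an overall score
    indicating the operation's fitness for remote processing."""
    return sum(weights[name] * factors.get(name, 0.0) for name in weights)

def placement(factors, weights, threshold=0.5):
    """Decide where an operation runs, given its factor scores."""
    return "remote" if remote_fitness(factors, weights) > threshold else "local"

weights = {"battery_drain": 0.4, "task_complexity": 0.4, "link_quality": 0.2}
```

A similar score, computed per provider, could drive the choice among competing remote service providers.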
As circumstances change, a given operation may be executed locally at one time and remotely at another (or vice versa). Or the same operation can be performed simultaneously on two sets of key vector data - one locally and one remotely.
Although described in the context of determining whether an operation should be performed locally or remotely, the same factors can similarly influence other determinations. For example, they may be used to determine what information is conveyed in the key vectors.
Consider a scenario in which a cell phone executes OCR on a captured image. Under one set of factors, unprocessed pixel data from the captured image may be transmitted to a remote service provider for the entire determination. Under a different set of factors, the cell phone may perform initial processing, such as edge detection, then package the edge-detected data in key vector form and route it to an external provider to complete the OCR operation. Under yet another set of factors, the cell phone performs all component OCR operations up to the final one (template matching), and transmits data only for that final operation. (Under still other sets of factors, the OCR operation may be completed entirely by the cell phone, or different component operations may be performed alternately by the cell phone and remote service provider(s), etc.)
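The varying division of the OCR work between phone and cloud can be pictured as choosing a split point in a list of pipeline stages; the stage names and the predicate are illustrative:

```python
def split_pipeline(stages, keep_local):
    """Split a recognition pipeline at the first stage the phone declines.

    Stages before the split run on the phone; the intermediate key vector
    and the remaining stages are referred to a remote provider.
    keep_local: predicate judging each stage under the current factors.
    """
    local = []
    for stage in stages:
        if not keep_local(stage):
            break
        local.append(stage)
    return local, stages[len(local):]

OCR_STAGES = ["edge_detect", "segment", "template_match"]
```

Evaluating the predicate anew for each capture lets the split point shift as battery, network, and other factors change.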
Reference was made above to routing constraints as one possible factor. This is a specific example of a more general factor: external business rules. Consider again the example of a user attending an event at the Pepsi Center in Denver. The Pepsi Center may provide wireless service to its patrons through its own WiFi or other network. Naturally, the Pepsi Center would not be happy to see its network resources used for the benefit of competitors such as Coca-Cola. The host network may thus influence which cloud services its patrons can utilize (e.g., by making some inaccessible, or by giving low priority to data traffic of certain types or to certain destinations). The domain owner can thereby exercise control over what operations a mobile device can execute. This control can affect the type of data placed in key vector packets, as well as the local/remote decision.
Another example is a venue whose proprietor may wish to discourage cell phone camera use by hindering access to remote image-processing service providers, as well as to photo-sharing sites such as Flickr and Picasa. Another is a school, which may want to prevent facial recognition of students and staff for privacy reasons; access to facial recognition service providers may be blocked, or granted only on a case-by-case basis. At such venues it may be difficult to stop individuals from using cell phone cameras - or from using them for particular purposes - but a variety of actions can be taken to hinder such use (e.g., by denying the services that enable or facilitate it).
The following outline identifies other factors that may bear on determining which operations should be performed, in what sequence, and where:
1. Scheduling optimization of key vector processing units, based on a number of factors:
o Operation mix - whether operations consist of similar atomic instructions (micro-ops, as in the Pentium II, etc.)
o Stall states - operations may generate stalls for reasons such as:
ㆍ Waiting for external key vector processing
ㆍ Waiting for user input
ㆍ A change of user focus
o Cost of an operation, based on:
ㆍ Published costs
ㆍ Estimated costs based on auction status
ㆍ Battery status and power mode
ㆍ The power profile of the operation (is it expensive?)
ㆍ Past history of power consumption
o Opportunity cost, given the current status of the device - for example, whether other processes, such as voice calls or GPS navigation, should take precedence
o User preferences, e.g., a desire for a "green" provider, or an open source provider
o Legal uncertainties (for example, certain providers may face greater risk of patent infringement liability, e.g., due to use of prominently patented methods)
o Domain owner influence:
ㆍ Privacy concerns of certain physical venues, such as no face recognition in schools
ㆍ Rules prohibiting certain operations on particular stimuli
ㆍ Voiceprint matching for broadcast songs lip-synced by performers other than the actual singers (Milli Vanilli's Grammy was revoked when it became known that the actual vocals on the recording were performed by others)
o Ability to execute key vector operations out of order, based on how all of the above affect scheduling and the optimal path to the desired result
o Uncertainty in long chains of operations (similar to deep pipelines and branch prediction in processors), which makes it difficult to predict the need for subsequent key vector operations - a difficulty aggravated by weak metrics on the key vectors, such as:
ㆍ Past behavior
ㆍ Location (e.g., GPS indicating that the device is moving quickly) and patterns of GPS movement
ㆍ Patterns of exposure to stimuli, as when a user walking through an airport terminal is repeatedly exposed to the CNN feed playing at each gate
ㆍ A proximity sensor indicating that the device is in a pocket
ㆍ Other schemes, such as Least Recently Used (LRU), that track how often a given key vector operation appears in, or contributes to, a desired outcome (recognition of a song, etc.)
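The LRU idea in the last bullet can be sketched minimally. The class and operation names below are illustrative inventions, not terms from the text:

```python
from collections import OrderedDict

class KeyvectorOpTracker:
    """Track which key vector operations recently contributed to a
    recognized result, evicting the least recently used entries.
    (Hypothetical helper; names are not from the specification.)"""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.ops = OrderedDict()  # operation name -> contribution count

    def record_use(self, op_name):
        # Move the operation to the most-recently-used position.
        count = self.ops.pop(op_name, 0)
        self.ops[op_name] = count + 1
        if len(self.ops) > self.capacity:
            self.ops.popitem(last=False)  # evict the least recently used

    def hot_ops(self):
        # Operations ordered most-recently-used first.
        return list(reversed(self.ops))

tracker = KeyvectorOpTracker(capacity=3)
for op in ["fft", "sift", "fft", "hough", "watermark"]:
    tracker.record_use(op)
print(tracker.hot_ops())
```

A scheduler could consult `hot_ops()` to favor operations that have recently paid off, and deprioritize those evicted from the cache.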
Relatedly, for pipelined and other time-consuming operations, certain embodiments may run quick conformance tests before committing processing resources for what may be more than a threshold number of clock cycles. A simple conformance test is to confirm that the image data is potentially useful for its intended purpose, in contrast to data that can be quickly disqualified from analysis - for example, a frame that is essentially all black (e.g., captured in the user's pocket). Adequate focus can likewise be checked quickly, before committing to extended operations.
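A quick-disqualification conformance test of this sort might be sketched as follows; the threshold values are illustrative assumptions, not figures from the text:

```python
import numpy as np

def passes_conformance(frame, dark_threshold=12, blur_threshold=5.0):
    """Cheap screening before committing heavy processing to a frame.

    Returns False for frames that can be disqualified immediately:
    near-black captures (camera in a pocket), or frames with almost
    no pixel-to-pixel contrast (badly out of focus)."""
    if frame.mean() < dark_threshold:          # essentially all black
        return False
    # Rough focus proxy: mean absolute difference of adjacent pixels.
    contrast = np.abs(np.diff(frame.astype(float), axis=1)).mean()
    return bool(contrast >= blur_threshold)

dark = np.zeros((240, 320), dtype=np.uint8)            # pocket shot
textured = (np.indices((240, 320)).sum(0) % 2) * 255   # high-contrast test card
print(passes_conformance(dark), passes_conformance(textured))
```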
(It will be appreciated that certain aspects of the techniques discussed above have at least partial historical precedents. For example, considerable work has been devoted to optimizing instruction scheduling for pipelined processors. And some systems allow user configuration of power settings, such as the user-selectable deactivation of power-hungry GPUs in certain Apple notebooks to extend battery life.)
The above-discussed determinations of the proper instruction mix (e.g., by the service classifier of FIG. 6) specifically considered problems that arise in pipelined architectures. Different principles may apply in embodiments where one or more GPUs are available. Such devices typically have hundreds or thousands of scalar processors adapted for parallel execution, so the costs of speculative execution (time, stall risk, etc.) are small. Branching can be handled without prediction: instead, the GPU processes all potential outcomes of a branch in parallel, and the system then uses whichever corresponds to the actual branch condition, once it is known.
To illustrate, consider face recognition. A GPU-equipped cell phone can issue instructions - when the camera is active in the user's photo-shoot mode - to configure 20 clusters of scalar processors in the GPU. Each cluster is configured to perform a Hough transform on a small tile of a captured image frame, looking for one or more elliptical shapes that could be candidate faces. Thus, the GPU processes the entire frame in parallel, with 20 simultaneous Hough transforms. (Many of the stream processors will probably find nothing, but processing speed does not suffer.)
Once these GPU Hough transform operations are complete, the GPU can be reconfigured into a smaller number of stream processor clusters - each devoted to analyzing one candidate elliptical shape to determine, e.g., eye pupil positions, their spacing, and mouth location. For any ellipse that yields plausible candidate face information, the associated parameters are packaged in key vector form and sent to a cloud service, which checks the key vectors of analyzed face parameters against known templates of, for example, facial features. (Or this confirmation may be executed by another processor in the GPU or cell phone.)
(It is interesting to note that such face recognition - as described elsewhere in this document - can distill, from, e.g., millions of pixels (bytes) in the original captured image, a key vector of just dozens, hundreds, or at most thousands of bytes. This much smaller, more information-dense extract can be routed more quickly for processing - sometimes externally. The correspondingly reduced bandwidth needed to communicate the extracted key vector information helps keep such channels cost-effective and practical to implement.)
Contrast the GPU implementation of face detection just described with how such an operation might be implemented on a scalar processor. Performing Hough-transform-based ellipse detection over an entire image frame is prohibitive in processing time - the effort is not worthwhile, and it delays other tasks assigned to the processor. Instead, such an implementation typically has the processor examine pixels as they come off the camera, looking for those with colors within an expected "skin tone" range. Only when an area of skin tone pixels is identified is a Hough transform attempted on that excerpt of image data. In similarly serial fashion, attempts to extract facial parameters from the detected ellipses then proceed step by step - often running long before producing a useful result.
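The skin-tone gating step described above might be sketched as follows. The RGB bounds are a common published heuristic, not values from the text, and the function name is an invention:

```python
import numpy as np

def skin_tone_mask(rgb):
    """Flag pixels whose color falls in a rough "skin tone" range.

    Sketch of the scalar-processor strategy: only regions passing this
    cheap per-pixel test would be handed to the far more expensive
    Hough ellipse search. Bounds are illustrative only."""
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    return ((r > 95) & (g > 40) & (b > 20) &
            (r > g) & (r > b) &
            ((r - np.minimum(g, b)) > 15))

frame = np.zeros((4, 4, 3), dtype=np.uint8)
frame[1, 1] = (200, 120, 90)   # plausibly skin
frame[2, 2] = (30, 120, 200)   # sky blue
mask = skin_tone_mask(frame)
print(mask[1, 1], mask[2, 2])
```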
Many artificial light sources do not provide constant illumination. Most exhibit temporal variations in intensity (brightness) and/or color. These variations generally track the AC power frequency (appearing at 50/60 or 100/120 Hz), but not always. For example, fluorescent tubes can emit light that varies at roughly a 40 KHz rate. The emitted spectra depend on the specific illumination technology. LEDs for household and industrial lighting sometimes use individual color blends (e.g., blue and amber) to create white light. Others use more conventional red/green/blue clusters, or blue/UV LEDs with phosphors.
In one particular implementation, a processing stage 38 monitors the average intensity of the image data - or of its red, green, or other color components - e.g., in the bodies of the packets. This intensity data may be applied to the output 33 of that stage. Along with the image data, each packet can carry a timestamp indicating the specific time (based on an absolute or local clock) at which the image data was captured. This time data may also be provided to the output 33.
A synchronization processor 35 coupled to such output 33 may examine variations in frame-to-frame intensity (or color) as a function of the timestamp data, to identify their periodicity. Moreover, such a module can predict the next instant at which the intensity (or color) will be at a maximum, minimum, or other particular state. A phase-locked loop can control an oscillator that is synchronized to reflect the periodicity of that aspect of the illumination. More typically, a digital filter computes time intervals that are used to set timers, or are compared against them - optionally with software interrupts. A digital phase-locked loop or a delay-locked loop can also be used. (Kalman filters are commonly used for this type of phase locking.)
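One simple way such a synchronization processor might identify the illumination's periodicity from timestamped intensity data is a spectral estimate over the frame intensities. This is a hedged sketch assuming uniformly spaced timestamps; in practice a PLL, DLL, or Kalman filter, as the text notes, would track the phase continuously:

```python
import numpy as np

def dominant_flicker_hz(timestamps, intensities):
    """Estimate the dominant illumination flicker frequency from
    per-frame mean intensities and their capture timestamps.
    Assumes (for simplicity) uniformly spaced samples."""
    dt = np.mean(np.diff(timestamps))
    spectrum = np.abs(np.fft.rfft(intensities - np.mean(intensities)))
    freqs = np.fft.rfftfreq(len(intensities), d=dt)
    return freqs[np.argmax(spectrum)]

# Simulated capture: 500 intensity samples at 1 kHz, lit by a source
# flickering at 100 Hz (i.e., twice a 50 Hz mains frequency).
t = np.arange(500) / 1000.0
intensity = 128 + 20 * np.sin(2 * np.pi * 100.0 * t)
print(round(dominant_flicker_hz(t, intensity), 1))
```

Having the flicker frequency (and phase) then lets the control processor schedule capture at an intensity maximum or minimum, as described next.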
The control processor module 36 may poll the synchronization module 35 to determine when the illumination is expected to have a desired state. Using this information, the control processor module 36 may instruct the setup module 34 to capture a frame of data under the lighting conditions preferred for a particular application. For example, if the camera is imaging an object presumed to bear a digital watermark encoded in the green channel, the processor 36 can cause the camera 32 to capture a frame at the instant the green illumination is expected to be at a maximum, and direct the processing stages 38 to process that frame for detection of such a watermark.
The camera phone may incorporate a plurality of LED light sources, which are typically operated together to produce a flash of white illumination on the object. However, they can also be operated individually, or in different combinations, so as to cast light of different colors onto the object. The phone processor can control the component LED sources individually to capture frames under non-white illumination. If an image is being captured in order to decode a green-channel watermark, then only green illumination may be applied when the frame is captured. Or the camera can capture a plurality of consecutive frames, with different LEDs illuminating the object in each. One frame may be captured with a 1/250th second exposure and a corresponding period of red-only illumination; a subsequent frame may be captured with a 1/100th second exposure and a corresponding period of green-only illumination. These frames may be analyzed individually, or combined for analysis. Alternatively, a single image frame may be captured over an interval of 1/100th second, with one LED (e.g., green) activated for the full interval and another (e.g., red) activated for a 1/250th second portion of that interval. The instantaneous ambient illumination can be sensed (or predicted, as above), and the component LED color sources operated accordingly (e.g., adding blue light from a blue LED to counteract an orange ambient cast).
Other Remarks; Projectors
Although a packet-based, data-driven architecture is shown in FIG. 16, various other implementations are of course possible. Such alternative architectures will be readily apparent to those skilled in the art from the details given.
Skilled artisans will appreciate that the arrangements and details described above are illustrative only. Actual choices of arrangements and details will depend on the particular application being served, and most will differ from those given. (To cite but one example, FFTs can be performed on 64 x 64 blocks, 256 x 256 blocks, whole images, etc., rather than on 16 x 16 blocks.)
Similarly, it will be appreciated that the body of a packet may carry an entire frame of data, or only an excerpt (e.g., a 128 x 128 block). Image data from a single captured frame may thus span a series of several packets. Different excerpts within a common frame may be processed differently, depending on the packets in which they are conveyed.
Moreover, a processing stage 38 can be instructed to divide one packet into several, such as by splitting the image data into 16 tiled smaller sub-images. Thus, more packets may emerge at the end of the system than were introduced at the start.
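Such a one-packet-to-sixteen split might look like the following sketch; the packet header bookkeeping is omitted, and the function name is an invention for illustration:

```python
import numpy as np

def split_packet(image, tiles_per_side=4):
    """Split one image-bearing packet body into tiled sub-image
    payloads, as a stage might when told to fan one packet out
    into 16. Returns (tile_row, tile_col, tile_data) records."""
    h, w = image.shape[:2]
    th, tw = h // tiles_per_side, w // tiles_per_side
    return [
        (r, c, image[r * th:(r + 1) * th, c * tw:(c + 1) * tw])
        for r in range(tiles_per_side)
        for c in range(tiles_per_side)
    ]

frame = np.arange(64 * 64).reshape(64, 64)
tiles = split_packet(frame)
print(len(tiles), tiles[0][2].shape)
```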
In like fashion, a single packet may convey a series of different images (e.g., images taken sequentially with different focus, aperture, or shutter settings; a specific example is a depth-of-field bracket, i.e., a set of excerpts from five images taken at bracketed focus settings). This set of data can then be processed by later stages - either as a set, or through processing that selects one or more excerpts of the packet payload meeting specified criteria (e.g., a focal sharpness metric).
In the particular example described above, each processing stage 38 generally replaces the data originally received in the body of the packet with the results of its processing. In other arrangements, this need not be the case. For example, a stage may output its processing result to a module external to the described process chain, e.g., via the output 33. (Alternatively, as noted, a stage may keep the originally received data in the body of the output packet and augment it with other data, such as its processing result(s).)
Reference was made to determining focus from DCT frequency spectra or edge-detected data. Many consumer cameras implement a simpler form of focus assessment - simply measuring the intensity difference (contrast) between pairs of adjacent pixels. The settings that maximize these differences are taken as correct focus. Such an arrangement can naturally be used in the arrangements described above. (Again, benefits can accrue from executing this process on the sensor chip.)
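A sketch of this adjacent-pixel contrast metric; the blur model and function name are illustrative:

```python
import numpy as np

def contrast_focus_score(gray):
    """Simple focus metric used by many consumer cameras: sum of
    intensity differences between horizontally adjacent pixel pairs.
    The frame (or lens position) with the highest score is taken
    as best focused."""
    return np.abs(np.diff(gray.astype(float), axis=1)).sum()

rng = np.random.default_rng(0)
sharp = rng.integers(0, 256, (64, 64)).astype(float)
# Crude defocus model: average each pixel with its right neighbor.
blurred = (sharp[:, :-1] + sharp[:, 1:]) / 2.0
print(contrast_focus_score(sharp) > contrast_focus_score(blurred))
```

A focus-bracketed packet payload, as discussed above, could be reduced to its sharpest excerpt by ranking the excerpts with this score.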
Each stage typically performs a handshaking exchange with adjacent stages - each time data is passed to, or received from, a neighbor. Such handshaking is routine to those skilled in digital system design and is therefore not detailed here.
The above-described arrangements have contemplated a single image sensor. In other embodiments, however, multiple image sensors may be used. In addition to enabling conventional stereoscopic processing, two or more image sensors enable or improve many other operations.
One function that benefits from multiple cameras is object identification. To cite a simple example, a single camera cannot distinguish a live human face from a photograph of a face (e.g., as may be found in a magazine, on a billboard, or on an electronic display screen). Using spaced-apart sensors, in contrast, the 3D character of an actual face can readily be discerned, allowing a mere image of a face to be identified as such. (Depending on the implementation, the 3D aspect may also help identify the person actually present.)
Another function that benefits from multiple cameras is refinement of geographic location. From the differences between the two images, the processor can determine the device's distance from landmarks whose positions are accurately known. This allows refinement of other geolocation data available to the device (e.g., from WiFi node identification, GPS, etc.).
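The distance determination rests on the standard stereo disparity relation; the numbers below are purely illustrative assumptions:

```python
def distance_from_disparity(focal_px, baseline_m, disparity_px):
    """Classic two-camera range equation: an object imaged by two
    spaced sensors appears shifted by a disparity inversely
    proportional to its distance. Knowing the range to a landmark
    at a precisely known position then tightens the device's own
    position estimate."""
    if disparity_px <= 0:
        raise ValueError("object at infinity or beyond matching range")
    return focal_px * baseline_m / disparity_px

# Illustrative numbers: 1000 px focal length, sensors 5 cm apart,
# landmark shifted 10 px between the two views -> 5 m away.
print(distance_from_disparity(1000.0, 0.05, 10.0))
```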
Just as a cell phone can have one, two, or more sensors, such a device can also have one, two, or more projectors. Individual projectors have been deployed in cell phones by CKing (the N70 model, distributed by ChinaVision) and Samsung (the MBP200). LG and others have shown prototypes. (These projectors are understood to use Texas Instruments' electronically steerable digital micro-mirror arrays, with LED or laser illumination.) Microvision offers its PicoP display engine, which uses a micro-electro-mechanical scanning mirror with laser sources and an optical combiner, to equip mobile devices with projector capability. Other suitable projection technologies include 3M's liquid crystal on silicon (LCOS) and Displaytech's ferroelectric LCOS systems.
The use of two projectors or two cameras provides different vantage points for projection or viewing, yielding additional information about the object. In addition to stereo functions, it also allows local image correction. For example, consider two cameras imaging a digitally watermarked object. One camera's view of the object provides a measure of the apparent distortion of the object's surface (e.g., discerned from encoded calibration signals). This information can be used by the other camera to correct its view of the object - and vice versa. The two cameras can iterate in this fashion, yielding a comprehensive characterization of the object surface. (One camera may see an area or edge of the surface better than the other can; thus each view can contribute information the other lacks.)
When a reference pattern (e.g., a grid) is projected onto a surface, the shape of the surface is revealed by the distortions of the pattern. The FIG. 16 arrangement can be expanded to include a projector, which projects a pattern onto an object for capture by the camera system. (The operation of the projector can be synchronized with the operation of the camera, e.g., by the control processor module 36 - since projection imposes a significant battery drain, it is activated only when needed.) Processing of the resulting image, by modules 38 or by remote processors, provides information about the surface topology of the object. This 3D topology information can serve as a clue in identifying the object.
In addition to providing information about an object's 3D configuration, such configuration information allows the imaged surface to be virtually remapped to any other configuration, e.g., a plane. Such remapping serves as a kind of normalization operation.
In one particular arrangement, the system 30 operates the projector to project the reference pattern into the camera's field of view. While the pattern is projected, the camera captures a frame of image data. The resulting image is then processed to detect the reference pattern and to characterize, from it, the 3D shape of the imaged object. Subsequent processing can then proceed based on the 3D shape data.
(In connection with these arrangements, the reader is referred to Google's book-scanning patent 7,508,978, which utilizes related principles. Among other disclosures, that patent details especially useful reference patterns.)
If the projector uses scanned laser illumination (as in the PicoP display engine), the pattern will be in focus regardless of the distance to the object onto which it is projected. This can be used to help focus the cell phone camera on any object. Since the projected pattern is known in advance to the camera, the captured image data can be processed to optimize detection of the pattern, such as by correlation. (Alternatively, the pattern can be selected to facilitate detection - such as a checkerboard, which appears strongly at a single frequency in the image frequency domain when properly focused.) Once the camera has been adjusted to best focus on the known projected pattern, the projection can be discontinued, and the camera can then capture a properly focused image of the subject onto which the pattern was projected.
Synchronous detection can also be exploited. The pattern can be projected during the capture of one frame and then turned off for the next. The two frames can then be subtracted. The image content common to the two frames largely cancels - leaving the projected pattern with a much higher signal-to-noise ratio.
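The on/off frame subtraction can be sketched directly; the scene and pattern values here are toy data:

```python
import numpy as np

def isolate_pattern(frame_with_pattern, frame_without):
    """Subtract a pattern-off frame from a pattern-on frame: scene
    content common to both cancels, leaving the projected pattern
    with a much better signal-to-noise ratio."""
    return frame_with_pattern.astype(int) - frame_without.astype(int)

scene = np.full((8, 8), 100)          # static scene content
pattern = np.zeros((8, 8), dtype=int)
pattern[::2, ::2] = 50                # projected grid of dots
diff = isolate_pattern(scene + pattern, scene)
print((diff == pattern).all())
```

(In practice the two frames also differ by sensor noise and any scene motion between captures, so the cancellation is approximate rather than exact.)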
The projected pattern can also be used to determine the correct focus for multiple objects in the camera's field of view. A child may pose in front of the Grand Canyon. The laser-projected pattern allows the camera to focus on the child in a first frame, and on the background in a second frame. These frames can then be composited - taking the properly focused portion from each.
The lens arrangement used in a cell phone's projector system can also serve the cell phone's camera system. A mirror can be controllably moved to couple either the camera or the projector to the lens. Or a beam splitter arrangement 80 may be used (FIG. 20), in which the body of the cell phone 81 incorporates a lens 82 that provides light to a beam splitter 84. A portion of the illumination is routed to the camera sensor 12. The other portion of the optical path leads to the micro-mirror projector system 86.
Because the lenses used in cell phone projectors are typically larger in aperture than those used in cell phone cameras, the camera can thereby gain significant performance benefits (e.g., in low-light capture). Alternatively, the beam splitter 84 may be asymmetric - not favoring the two optical paths equally. For example, the beam splitter may be a partially silvered element that couples a smaller fraction (e.g., 2%, 8%, or 25%) of incident light to the sensor path 83. The beam splitter can then couple a correspondingly larger fraction (e.g., 98%, 92%, or 75%) of the illumination from the micro-mirror projector out for projection. With this arrangement, the camera sensor 12 receives an amount of light typical for a cell phone camera (despite the larger aperture lens), while the light output from the projector is only slightly dimmed.
In other arrangements, the camera head is separate from the cell phone body - or is removable. While the cell phone body is carried in the user's pocket or purse, the camera head is positioned to see beyond the pocket (e.g., in a pen-like form factor, with a pocket clip). The two communicate by Bluetooth or another wireless arrangement, with captured image data transmitted from the camera head and commands transmitted from the phone body. This configuration allows the camera to continually survey the scene in front of the user - without the cell phone having to be removed from the user's pocket or purse.
In a related arrangement, the strobe light for the camera is separate from the cell phone body - or is removable. The light (which may comprise LEDs) can be placed near the subject, thereby providing illumination from the desired distance and direction. The strobe may be triggered by a wireless command issued by the cell phone camera system.
(Those skilled in the art of optical system design will recognize many alternatives to the particular arrangements described.)
Some of the benefits of having two cameras can be realized instead with two projectors (and a single camera). For example, the two projectors may project alternating or otherwise distinguishable patterns (e.g., simultaneous, but differing in color, pattern, polarization, etc.) into the camera's field of view. By knowing how the two patterns - projected from different vantage points - fall on the object, and how they differ as viewed by the camera, stereoscopic information can again be discerned.
Many usage models are made possible through the use of projectors, including novel sharing models (see, e.g., Greaves, "View & Share: Exploring Co-Present Viewing and Sharing of Pictures Using Personal Projection," 2009 Mobile Interaction with the Real World). Such models can use imagery generated by the projector itself as a trigger to initiate a sharing session - explicitly, through a commonly understood symbol (an "open" code), or through hidden machine-readable triggers. Sharing can also occur through ad hoc networks, using peer-to-peer or server-hosted applications.
Other outputs from mobile devices may be similarly shared. Consider key vectors. One user's phone can process imagery with Hough transform, eigenface, or other feature extraction techniques, and then share the resulting key vectors of eigenface data with others in the user's social circle (or allow them to be pooled). One or more of these socially subscribed devices may then perform face template matching, yielding an identification of a face the original user's device could not recognize in the captured image. These arrangements take a personal experience and make it a shared one. Moreover, the experience can go viral, with key vector data - essentially without boundaries - shared with great numbers of others.
Selected Other Arrangements
In addition to the arrangements detailed earlier, another hardware arrangement suited to certain implementations of the present technology uses the ARM Mali-400 graphics multiprocessor architecture, whose component processors can be dedicated to fragments of the different image processing tasks referenced in this document.
The standards group Khronos has issued OpenGL ES 2.0, which specifies hundreds of standardized graphics function calls for systems containing multiple CPUs and multiple GPUs (the direction in which cell phones are gradually migrating). OpenGL ES 2.0 undertakes to route different operations to different processing units - with those details transparent to the application software. It thus accommodates all manner of GPU/CPU hardware behind a matching software API.
In accordance with another aspect of the present technology, the OpenGL ES 2.0 standard is extended to provide a standardized graphics processing library not just across different CPU/GPU hardware, but also across different cloud processing hardware - again with such details transparent to the calling application.
Relatedly, Java Specification Requests (JSRs) have been issued to standardize certain Java-implemented operations. Increasingly, JSRs are being designed for efficient execution on top of OpenGL ES 2.0-grade hardware.
According to another aspect of the present technology, some or all of the image processing operations detailed in this document (face recognition, SIFT processing, watermark detection, histogram processing, etc.) are implemented as JSRs - making standardized implementations available across varied platforms.
In addition to supporting cloud-based JSRs, the extended standards specification may also support the query router and response manager functions detailed earlier - including both static and auction-based service providers.
Akin to OpenGL is OpenCV - a computer vision library, available under an open source license, that allows programmers to invoke various vision functions regardless of the specific hardware on which they execute. (The O'Reilly book Learning OpenCV documents the library at length.) NokiaCV provides similar standardized functionality for the Symbian operating system (e.g., on Nokia cell phones).
OpenCV supports a wide range of operations, including high-level tasks such as face recognition, gesture recognition, motion tracking/understanding, and segmentation, as well as a large collection of more atomic, elemental vision/image processing operations.
CMVision - a computer vision toolkit compiled by researchers at Carnegie Mellon University - is another package that can be used in certain embodiments of the present technology.
Another hardware architecture uses a Field Programmable Object Array (FPOA) arrangement, in which hundreds of different 16-bit "objects" are arranged in a grid, each able to exchange data with neighboring devices over very high bandwidth channels. (The earlier-referenced PicoChip devices are of this class.) The function of each object is programmable, as with FPGAs. Again, different image processing tasks can be performed by different FPOA objects, and the objects can be redefined during operation as circumstances warrant (e.g., an object may execute SIFT processing in one state, FFT processing in another, log-polar processing in yet another, etc.).
(Although many grid arrangements of logic devices are based on "nearest neighbor" interconnects, additional flexibility can be achieved by using "partial crossbar" interconnects; see, e.g., patent 5,448,496 to Quickturn Design Systems.)
Also in the area of hardware, certain embodiments of the present technology utilize "extended depth of field" imaging systems (see, e.g., patents 7,218,448, 7,031,054 and 5,748,371). Such arrangements include a mask in the imaging path that modifies the optical transfer function of the system so as to be insensitive to the distance between the object and the imaging system. Image quality is then uniformly degraded over an extended depth of field; digital post-processing compensates for the mask's modifications, restoring image quality while retaining the increased depth of field. Using this technique, a cell phone camera can capture imagery in which all objects, near and far, are in focus (i.e., with higher frequency detail), without requiring the longer exposures usually needed. (Longer exposures aggravate problems such as hand jitter and subject motion.) In the arrangements described herein, such shorter exposures also avoid the transitory delays introduced by optical/mechanical focusing elements, allowing higher quality imagery to be provided to the image processing functions without requiring input from the user as to which elements of the image should be in focus. The user experience is likewise more intuitive: the user can simply point the imaging device at the desired subject, without worrying about focus or depth-of-field settings. Similarly, since the entire image is expected to be in focus, all pixels of the captured frame can be leveraged by the image processing functions. In addition, such systems can simply generate "depth map" information - new metadata relating groups of pixels, or identified objects, to their depth within the frame - and can set the stage for 3D video capture and storage of such video streams.
In some implementations, the cell phone may have the ability to execute a given operation locally, but may decide instead to have it performed by a cloud resource. The determination of whether to process locally or remotely can depend on a number of factors, including bandwidth costs, external service provider fees, power costs to the cell phone battery, and the intangible cost to consumer satisfaction of delayed processing. For example, if the user's battery is low and the phone is remote from the cell tower (so that the phone must run its RF amplifier at full power on transmission), sending a large block of data for remote processing could consume a significant portion of the battery's remaining life. In this case, the phone may decide to process the data locally, or to defer sending it for remote processing until the phone is close to a cell site, or until the battery has been recharged. A set of stored rules may be applied to the relevant variables to establish a net "cost function" for the different options (e.g., process locally, process remotely, defer processing); depending on the states of these variables, different outcomes result.
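The stored-rules cost function might be sketched as follows. The weights, the linear combination, and all function names are invented for illustration; a real implementation would calibrate these terms against measured energy and tariff data:

```python
def processing_cost(route, battery_pct, near_tower, bandwidth_cost, provider_fee):
    """Toy net 'cost function' over the factors listed in the text.
    All weights are illustrative assumptions."""
    battery_scarcity = (100 - battery_pct) / 100.0   # 0 = full, 1 = empty
    if route == "local":
        return 10.0 * (1 + battery_scarcity)          # local compute energy
    # Remote: bandwidth and provider fees, plus an RF energy term that
    # is large when far from the tower (amplifier at full power) and
    # weighs more heavily when the battery is scarce.
    rf_energy = 5.0 if near_tower else 40.0
    return bandwidth_cost + provider_fee + rf_energy * (1 + battery_scarcity)

def choose_route(**kw):
    costs = {r: processing_cost(r, **kw) for r in ("local", "remote")}
    return min(costs, key=costs.get)

# Low battery, far from the tower: keep the work on the handset.
print(choose_route(battery_pct=15, near_tower=False, bandwidth_cost=3.0, provider_fee=2.0))
# Healthy battery, next to the tower: ship the work to the cloud.
print(choose_route(battery_pct=90, near_tower=True, bandwidth_cost=3.0, provider_fee=2.0))
```

A third option, deferring processing until conditions improve, could be added as another route with a cost term for the delay.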
Attractive "cloud" resources include the processing capabilities found at the edges of wireless networks. For example, cellular networks increasingly employ software-defined tower stations, in which processors digitally execute some or all of the operations historically performed by analog transmit/receive radio circuits, such as mixers, filters, and demodulators. Smaller cell stations - so-called "femtocells" - typically have powerful signal processing hardware for such processing. The earlier-noted PicoChip processors and other field programmable object arrays are widely deployed in these applications.
Wireless signal processing and image signal processing have much in common - for example, use of FFT processing to convert sampled data into the frequency domain, and application of various filtering operations. Cell station equipment, including its processors, is designed to meet peak consumer demand. This means that considerable processing power is often left unused.
In accordance with another aspect of the present technology, this surplus radio signal processing capability of cellular tower stations (and of other wireless network edge devices) is diverted to image processing (and/or audio or other) purposes. Since the FFT operations are the same whether sampled radio signals or image pixels are being processed, the change of use is often straightforward; frequently the configuration data for the hardware processing cores need change little, if at all. And because 3G/4G networks are so fast, processing tasks can be dispatched quickly from the consumer device to the cell station processor, with results returned at a similar rate. In addition to the speed and computational capability such repurposing of cell station processors provides, another advantage is reduced power consumption in the consumer device.
Before transmitting image data for processing, the cell phone may briefly query the cell tower station with which it is communicating, to confirm that it has sufficient spare capacity to undertake the intended image processing operation. Such a query may be issued by the packager/router of FIG. 10, the local/remote router of FIG. 10A, the query router and response manager of FIG. 7, the pipe manager 51, or the like.
Alerting the cell tower/base station to upcoming processing requests and/or bandwidth requirements allows the cell site to better allocate its processing and bandwidth resources in anticipation of those needs.
Cell sites that launch service operations risk exhausting their processing or bandwidth capacity and becoming bottlenecks. When this occurs, they may unexpectedly throttle the processing/bandwidth provided to one or more users - degrading their quality of service - so that others can be served. Such sudden service changes are undesirable, because the channel departs from the parameters originally established (e.g., the bit rate at which video can be delivered), forcing data services to reconfigure their respective parameters mid-stream (e.g., requiring ESPN to provide a lower quality video feed). Re-negotiating these details after channels and services have been set up commonly results in perceptible defects, such as stuttering video, dropped syllables in phone calls, and the like.
To avoid the need for such unexpected bandwidth cutbacks and the resulting service impairments, cell sites tend to adopt a conservative strategy - parceling out bandwidth/processing resources sparingly, to preserve capacity for possible peak demands. However, this approach worsens the quality of service normally provided: routine services suffer for the sake of demands that may never materialize.
According to this aspect of the technology, the cell phone sends an alert to the cell tower station that a bandwidth or processing need is expected. In effect, the cell phone asks the station to reserve some future service capacity. The tower station still has a fixed capacity. However, knowing the bandwidth required by a particular user - for example, 8 Mbit/s for 3 seconds, starting in 200 milliseconds - allows the cell site to take this expected demand into account when serving other users.
Consider a cell site with 15 Mbit/s of excess (unallocated) channel capacity, which normally allocates a 10 Mbit/s channel to a new video service user. If the site knows that a cell phone camera user has reserved an 8 Mbit/s channel starting in 200 milliseconds, and a new video service user then requests service, the site can allocate the new video service user a 7 Mbit/s channel instead of the typical 10 Mbit/s. By initially setting the new video service user's channel to a slower bit rate, the service impairments associated with cutting back bandwidth during an ongoing channel session are avoided. The capacity of the cell site is the same, but it is now allocated in a manner that reduces the need to cut the bandwidth of existing channels mid-transmission.
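The allocation arithmetic just described can be sketched as follows (a minimal illustration; the function name and parameter names are hypothetical, not part of the disclosure):

```python
def allocate_new_channel(excess_capacity_mbps, reserved_mbps, default_mbps):
    """Grant a channel to a new user, leaving room for pending reservations.

    The grant never exceeds the default allocation, and never dips into
    capacity already promised to reservation holders.
    """
    available = excess_capacity_mbps - reserved_mbps
    return max(0, min(default_mbps, available))

# The scenario from the text: 15 Mbit/s excess capacity, an 8 Mbit/s
# reservation pending, and a default 10 Mbit/s grant for new video users.
grant = allocate_new_channel(15, 8, 10)  # -> 7
```

With no reservation pending, the same call would return the full default grant of 10 Mbit/s.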
In other situations, the cell site may determine that while its capacity is not currently exceeded, it is expected to be heavily burdened in half a second. In this case, the current excess capacity can be used to increase throughput to one or more video subscribers - for example, those with several packets of video data buffered in memory, ready for delivery. These packets can be transmitted over a temporarily widened channel, in anticipation of the video channel being narrowed half a second later. Again, this is practical because the cell site has useful information about future bandwidth needs.
The service reservation message sent from the cell phone may also include a priority indicator. If arbitration between conflicting service needs is required, this indicator can be used by the cell site to determine the relative importance of meeting the stated reservation.
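A reservation message of the sort just described might be represented as below. This is a sketch only; the disclosure specifies no wire format, and all field names are assumptions (the values follow the 8 Mbit/s, 200 ms, 3 second example in the text):

```python
# Hypothetical service-reservation message from phone to cell site.
service_reservation = {
    "type": "service_reservation",
    "bandwidth_mbps": 8,      # channel bandwidth being reserved
    "start_offset_ms": 200,   # expected start: 200 ms from now
    "duration_s": 3,          # expected duration of the demand
    "priority": 2,            # relative importance, for arbitration
}
```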
Expected-service requests from cell phones may also allow the cell site to provide higher quality, more consistent service than would normally be allocated.
Cell sites, it will be understood, utilize statistical models of usage patterns in allocating bandwidth. Allocations are typically set conservatively, in anticipation of a realistic worst case - covering, for example, the usage scenarios that occur 99.99% of the time. (Some theoretically possible scenarios are so unlikely that they can be neglected in bandwidth allocation; in the rare instances where such an unlikely scenario does occur - as when thousands of subscribers in Washington DC tried to send cell phone images during the Obama inauguration - some subscribers may simply not receive service.)
The statistical models on which site bandwidth allocations are based treat subscribers as - partially - unpredictable actors. Whether a particular subscriber will request service in the next few seconds (and what specific service will be requested) has a random aspect.
The greater the randomness in the statistical model, the more extreme the worst case tends to be. If reservations or predictions of future needs are routinely presented by, say, 15% of the subscribers, then the behavior of those subscribers is no longer random. The worst-case peak bandwidth demand on the cell site then involves only 85%, not 100%, of the subscribers acting randomly; actual reservation information can be relied on for the other 15%. The hypothetical extremes of peak bandwidth utilization are correspondingly moderated.
With lower peak-usage scenarios, more generous grants of current bandwidth may be made to all subscribers. That is, if a portion of the users send alerts asking the site to preserve future capacity, the site can predict the actual peak demand likely to come - and may find that it still has unused capacity. In that case, a 12 Mbit/s channel may be granted to the camera cell phone user instead of the 8 Mbit/s channel stated in the reservation request, and/or a 15 Mbit/s channel may be authorized for the video user instead of the usual 10 Mbit/s. Such usage forecasts thus may allow a site to grant higher quality service than usual, since less bandwidth headroom need be held in reserve for the (now smaller number of) unpredictable actors.
The expected-service requests can also be communicated (from the cell phone or the cell site) to other cloud processes expected to be involved in the requested services, allowing them similarly to allocate their resources in anticipation. Such requests can also serve to "pre-warm" the related cloud processing. Additional information may be provided for this purpose, such as encryption keys, image dimensions (e.g., configuring an FPOA to act as an FFT processor for a 1024 x 768 image to be processed in 16 x 16 tiles, outputting coefficients for 32 spectral frequency bands), or the like.
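Such a pre-warming hint might look like the sketch below. The field names are assumptions; the 1024 x 768 image, 16 x 16 tiles, and 32 spectral bands come from the example in the text, and the key identifier is invented for illustration:

```python
# Hypothetical pre-warm hint sent ahead of the actual processing request,
# so the cloud resource can configure its hardware (e.g., an FPOA) early.
prewarm_request = {
    "operation": "fft_image_tiles",
    "image_size": (1024, 768),     # pixels
    "tile_size": (16, 16),         # process the image in 16 x 16 tiles
    "spectral_bands": 32,          # coefficients to output per tile
    "encryption_key_id": "k-1234", # hypothetical key reference
}

# Number of tiles the FFT cores would be configured to handle:
tiles = (1024 // 16) * (768 // 16)  # -> 3072
```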
The cloud resource, in turn, can alert the cell phone to any information it expects to request from the phone when the anticipated action executes, or to any action it may ask the cell phone to perform, so that the phone can similarly anticipate and prepare. For example, the cloud processing may, under certain conditions, request another set of input data - such as after evaluating whether the originally provided data is adequate for the intended use (e.g., whether the input image has sufficient focus, resolution, or contrast). Knowing in advance that the cloud processing may request such additional data, the cell phone can allow for this possibility in its own operation - for example, by keeping processing modules configured when it would not otherwise do so, by reserving an interval of sensor time to capture a possible alternate image, and the like.
Expected-service requests (or the likelihood of conditional service requests) typically relate to events that will start within several tens or hundreds of milliseconds - sometimes within a few seconds. Situations in which the operation begins tens or hundreds of seconds later will be rare. But although the pre-warning period can be short, significant benefits can accrue: if the randomness of each next second is reduced, overall system randomness can be reduced significantly. Moreover, the events to which the requests relate can themselves be of longer duration - such as sending a large image file, which can take more than 10 seconds.
With regard to such pre-set-up (pre-warming), any operation that may take more than a threshold time interval to complete (e.g., a millisecond, 10 milliseconds, hundreds of milliseconds, etc.) is desirably prepared for in advance, when possible. (In some instances, of course, the expected service is never requested, in which case such preparation goes unused.)
In other hardware arrangements, the cell phone processor can selectively activate a Peltier device or other thermoelectric cooler coupled to the image sensor, in circumstances where thermal image noise (Johnson noise) is a potential problem. For example, if the cell phone detects low light conditions, it can activate the cooler on the sensor to try to improve the image signal-to-noise ratio. Or image processing stages may examine the captured image for artifacts associated with thermal noise and, if such artifacts exceed a threshold, activate the cooling device. (One method captures a patch of the image - such as a 16 x 16 pixel area - twice in fast sequence. Absent random factors, the two patches should be identical, or highly correlated. The variation between them is a measure of noise - likely thermal noise.) An alternate image can be captured a short interval after the cooling device is activated - the interval depending on the thermal response time of the chiller/sensor. Likewise, once the cell phone captures video, the cooler can be activated, since the increased switching activity of the sensor circuitry raises its temperature, and thus its thermal noise. (Whether to activate the chiller may also be application dependent; for example, the chiller may be activated when capturing an image from which watermark data is to be read, but not when capturing an image from which bar code data is to be read.)
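The two-capture noise check described in the parenthetical can be sketched as follows. The threshold value and function names are illustrative assumptions, not specified in the disclosure:

```python
def noise_estimate(patch_a, patch_b):
    """Mean absolute difference between two rapid captures of the same
    (e.g., 16 x 16 pixel) patch. Absent random factors the patches would
    be identical, so any variation is a measure of (likely thermal) noise."""
    n = len(patch_a)
    return sum(abs(a - b) for a, b in zip(patch_a, patch_b)) / n

NOISE_THRESHOLD = 2.0  # hypothetical threshold, in raw sensor counts

def should_activate_cooler(patch_a, patch_b):
    """Activate the thermoelectric cooler when noise artifacts exceed
    the threshold."""
    return noise_estimate(patch_a, patch_b) > NOISE_THRESHOLD
```

Identical patches yield an estimate of zero (cooler stays off); a large pixel-to-pixel discrepancy between the two captures triggers cooling.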
As noted, packets in the FIG. 16 arrangement can carry various commands and data in the header and packet body. In other arrangements, the packet may additionally or alternatively include pointers to database records or cloud objects. The cloud object/database record may contain information such as object attributes useful for object recognition (e.g., fingerprint or watermark attributes for a particular object).
If the system has read a watermark, the packet may include the watermark payload, and the header (or body) may include one or more references to databases in which related information can be looked up by that payload. The watermark payload read from a business card may be looked up in one database; the watermark decoded from a photograph may be looked up in another. The system may apply a number of different watermark decoding algorithms to a single image (e.g., MediaSec, Digimarc ImageBridge, Civolution, etc.). Depending on which algorithm yielded a particular successful decoding, the resulting watermark payload may be transmitted to the corresponding destination database. (Likewise with different bar codes, fingerprint algorithms, eigenface technologies, etc.) The destination database address may be included in the application, or in a configuration database. (In general, addressing can be performed indirectly, with an intermediate data store holding the address of the end database - allowing relocation of the database without changing each cell phone application.)
The system may perform an FFT on the captured image data to obtain frequency domain information, and then supply that information to several watermark decoders operating in parallel - each applying a different decoding algorithm. When one of them extracts valid watermark data (e.g., as indicated by ECC information calculated from the payload), the data is transmitted to the database corresponding to that watermark's format/type. A plurality of such database pointers may be included in the packet and used conditionally - depending on which watermark decoding operation (or bar code reading operation, or fingerprint computation, etc.) yields useful data.
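The conditional routing of a decoded payload to its format-specific database can be sketched as follows. The decoder callables and database addresses are placeholders, not real MediaSec/Digimarc/Civolution APIs:

```python
def route_watermark(freq_domain_data, decoders):
    """Try each (decoder, destination_db) pair in turn; return the
    destination database and payload for the first decoder whose ECC
    check validates a payload."""
    for decode, destination_db in decoders:
        payload = decode(freq_domain_data)
        if payload is not None:  # ECC validated inside the decoder
            return destination_db, payload
    return None, None

# Placeholder decoders: the first fails ECC on this image, the second
# yields a (fictitious) valid payload.
decoders = [
    (lambda data: None, "db://format-a"),
    (lambda data: "WM-1234", "db://format-b"),
]
db, payload = route_watermark(b"...", decoders)  # -> ("db://format-b", "WM-1234")
```

In practice the decoders would run in parallel rather than sequentially; the routing decision is the same either way.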
Similarly, the system may send a face image to an intermediate cloud service, in a packet containing an identifier of the user (but not the user's Apple iPhoto, Picasa, or Facebook user name). The intermediate cloud service can take the provided user identifier and use it to access database records from which the user's names on those other services can be obtained. The intermediate cloud service then sends the face image data to Apple's server, in the name of the user's iPhoto account; to Picasa's service, in the name of the user's Google account; and to Facebook's server, in the name of the user's Facebook account. Each of these services can then perform face recognition on the image and return the names of the people identified from the user's iPhoto/Picasa/Facebook accounts (either directly to the user, or via the intermediary service). The intermediate cloud service, which can serve a great many users, relieves each phone of the need to maintain these account details and server addresses itself (and, if the user is away from home, an alternative nearby server address may be used).
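The intermediary's identity mapping can be illustrated as below. The user identifiers, account names and record layout are invented for illustration; the disclosure does not detail the records:

```python
# Hypothetical account-mapping records held by the intermediary service.
ACCOUNTS = {
    "user-42": {
        "iphoto": "jsmith",
        "picasa": "jsmith99",
        "facebook": "john.smith",
    },
}

def fan_out_face_query(user_id, face_image):
    """Look up the user's per-service account names and build one request
    per service - so the phone never has to reveal those names itself."""
    names = ACCOUNTS[user_id]
    return [
        {"service": service, "username": username, "image": face_image}
        for service, username in names.items()
    ]

requests = fan_out_face_query("user-42", b"jpeg-bytes")  # three requests
```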
Face recognition applications can be used not only to identify people, but also to identify relationships among the individuals depicted in images. For example, the data maintained by iPhoto/Picasa/Facebook may include not only face recognition features and associated names, but also relationship terms (e.g., father, boyfriend, sibling, pet, roommate, etc.). Thus, instead of simply searching the user's image collection for all photos of "David Smith," for example, the collection may also be searched for all photos depicting a "sibling."
The application software in which the pictures are reviewed can provide different colored frames around the different recognized faces, according to the associated relationship data (e.g., blue for siblings, red for boyfriends, etc.).
In some arrangements, the system may have access to this information as stored in accounts maintained by the user's network "friends." A face that cannot be recognized from the face recognition data associated with the user's own Picasa account may be recognized by referring to the Picasa face recognition data associated with the account of the user's friend "David Smith." The relationship data indicated by the David Smith account can similarly be used in presenting and organizing the user's pictures. An initially unrecognized face might thus be labeled with an indicator that the person is David Smith's roommate - essentially remapping the relationship information (e.g., mapping "roommate," as indicated in David Smith's account, to "David Smith's roommate").
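The relationship remapping may be as simple as the following sketch (hypothetical function; the "roommate"/"David Smith" example comes from the text):

```python
def remap_relationship(label, friend_name):
    """Re-express a relationship term taken from a friend's account
    relative to that friend - e.g., their 'roommate' becomes
    "David Smith's roommate" in the user's own photo collection."""
    return "{}'s {}".format(friend_name, label)

tag = remap_relationship("roommate", "David Smith")  # -> "David Smith's roommate"
```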
The embodiments described above have generally been presented in the context of a single network. However, multiple networks may be available to the user's phone (e.g., WiFi, Bluetooth, possibly different cellular networks, etc.). The user can choose among these alternatives, or the system can apply stored rules to do so automatically. In some instances, a service request may be issued (or results returned) across multiple networks in parallel.
Reference Platform Architecture
Cell phone hardware was originally provided for specific applications. For example, the microphone was used only for voice transmission over the cellular network; an A/D converter was supplied to feed the modulator of the phone's wireless transceiver. The camera was used only to capture snapshots. Etc.
Additional applications came from leveraging this hardware, and each application had to be developed to talk to the hardware in its own way. Different software stacks arose - each specialized so that a particular application could interact with a particular piece of hardware. This burdens application development.
This problem is exacerbated when cloud services and / or specialized processors are added to the mix.
To mitigate these difficulties, some embodiments of the present technology utilize an intermediate software layer that provides a standard interface through which hardware and software can interact. This arrangement is shown in FIG. 20A, where the intermediate software layer is labeled "reference platform."
In this figure, the hardware elements are shown in dashed boxes - the processing hardware at the bottom, and peripherals on the left. The box "IC HW" is the "intuitive computing hardware" discussed earlier, which supports distinct processing of image-related data (e.g., the modules 38 of FIG. 16). The DSP is a general purpose digital signal processor, which can be configured to perform specialized operations; the CPU is the phone's main processor; the GPU is a graphics processing unit. OpenCL and OpenGL are APIs by which graphics processing services (running on the CPU and/or GPU) can be invoked.
Different specialized technologies are shown in the middle - one or more digital watermark decoders (and/or encoders), bar code reading software, optical character recognition software, and the like. Cloud services are shown on the right, and applications at the top.
The reference platform establishes a standard interface (e.g., a set of API calls) by which different applications can interact with the hardware, exchange information, and request services. Similarly, the platform establishes a standard interface through which the different technologies can be accessed, and can send data to and receive data from other system components. As with cloud services, the reference platform can also handle the details of identifying service providers - by reverse auction, heuristics, and so on. In instances where a service is available both from the cell phone's own technology and from a remote service provider, the reference platform may weigh the costs and benefits of the different options, and determine which should handle a particular service request.
With such an arrangement, different system components need not concern themselves with the details of other parts of the system. An application may request the system to read text from an object in front of the cell phone; it need not concern itself with particular control parameters of the image sensor, or the image format requirements of the OCR engine. An application may request a reading of the emotions of the person in front of the cell phone; the corresponding call is passed to whatever part of the phone supports this function, and the results are returned in a standardized form. When an improved technology becomes available, it can be added to the phone, and - through the reference platform - the system takes advantage of its enhanced capabilities. Thus, through this adaptive architecture, growing/changing collections of sensors, and growing/evolving sets of service providers, can be applied to tasks that derive meaning from input stimuli (visual as well as audio, e.g., speech recognition).
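The cost/benefit routing such a reference platform might perform can be sketched as below. The provider names, capability key, and cost figures are all assumptions; the point is only that applications name a capability, and the platform picks among registered local and remote providers:

```python
# Hypothetical provider registry: the reference platform weighs each
# provider's estimated cost (latency, power, fees, etc.) for a capability.
PROVIDERS = {
    "ocr": [
        {"name": "local_ocr_engine", "cost": 5.0},
        {"name": "cloud_ocr_service", "cost": 2.5},
    ],
}

def dispatch(capability):
    """Pick the lowest-cost registered provider for a service request.
    Applications see only the capability name, never the provider details."""
    candidates = PROVIDERS.get(capability, [])
    if not candidates:
        raise LookupError("no provider supports " + capability)
    return min(candidates, key=lambda p: p["cost"])["name"]

choice = dispatch("ocr")  # -> "cloud_ocr_service"
```

Adding an improved provider to the registry makes it available to every application, with no application-side changes, which is the point of the intermediate layer.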
Arasan Chip Systems, Inc. provides a kernel-level software stack for the Mobile Industry Processor Interface UniPro standard, enabling integration of particular technologies into cell phones. That arrangement can be extended to provide the functions described above. (Although the Arasan stack is primarily focused on transport layer issues, it also involves the layers below, down to the hardware drivers.) The Mobile Industry Processor Interface Alliance is a large industry organization that works on advanced cell phone technologies.
Leveraging Existing Image Collections for Metadata
Publicly available collections of imagery and other content are increasingly common. Flickr, YouTube, Photobucket (MySpace), Picasa, Zooomr, Facebook, Webshots and Google Images are just a few. Often, these resources can also serve as sources of metadata - either expressed in the files (e.g., as file names and descriptions) or inferred from them. Sometimes geo-location data is also available.
An exemplary embodiment in accordance with an aspect of the present technology operates as follows. A cell phone captures an image of an object or scene - a desk telephone, as shown in FIG. 21. (The image may also be obtained in other ways, such as being transferred from another user, or downloaded from a remote computer.)
As preliminary operations, known image processing steps may be applied to the captured image - for example, to correct color or contrast, to perform ortho-normalization, etc. Known image segmentation or classification techniques may also be used to identify the apparent subject region of the image, and isolate it for further processing.
The image data is then processed to determine characterizing features useful for pattern matching and recognition. Color, shape, and texture metrics are commonly used for this purpose. Images may also be grouped based on layout and eigenvectors (the latter being particularly popular for face recognition). As noted elsewhere in this specification, many other techniques may also be utilized.
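As a concrete example of one such feature metric, a coarse luminance histogram and a distance measure between two feature vectors might be computed as follows. The bin count and similarity threshold are illustrative assumptions, not values from the disclosure:

```python
def histogram_feature(pixels, bins=4):
    """Coarse histogram of 8-bit pixel values (0-255), normalized to sum
    to 1 - one of the simple color/texture metrics the text mentions."""
    hist = [0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1
    total = float(len(pixels))
    return [h / total for h in hist]

def l1_distance(f, g):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(f, g))

h1 = histogram_feature([0, 10, 200, 250])
h2 = histogram_feature([0, 12, 205, 251])
similar = l1_distance(h1, h2) < 0.25  # hypothetical similarity threshold
```

Real systems would use richer descriptors (eigenvectors, key-point descriptors, etc.), but the match-by-distance pattern is the same.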
(The use of such feature vectors, classifications and other image/video/audio metrics in recognizing faces, images, video, audio and other patterns is well known, and is suitable for use with certain embodiments of the present technology. Reference is made to the academic references cited at the end of this disclosure, incorporated herein by reference in their entirety for all purposes. When used for recognition of entertainment content, such features are sometimes referred to as content "fingerprints" or "hashes.")
After the feature metrics for the image are determined, a search can be made through one or more publicly accessible image repositories for images with similar metrics, thereby identifying apparently similar images. (As part of the image ingest process, Flickr and other such repositories may compute and store such metrics - eigenvectors, color histograms, key-point descriptors, FFTs, etc. - for their images.) The search may yield the collection of apparently similar telephone images found on Flickr, shown in FIG.
Metadata is then obtained from Flickr for each of these images; the descriptive terms are parsed and ranked by frequency of occurrence. For the depicted set of images, for example, the descriptors obtained from such an operation, and their frequencies of occurrence, may include the following:
Telephone (3)
Best Buy (1)
Telecommunications (1)
From such an aggregated set of inferred metadata, it can be assumed that the terms with the highest counts (i.e., the most frequently occurring terms) are the terms that most accurately characterize the user's FIG. 21 image.
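The aggregation and ranking of harvested terms can be sketched as follows (the tag lists are hypothetical stand-ins for the metadata of visually similar Flickr images):

```python
from collections import Counter

def rank_inferred_metadata(tag_lists):
    """Pool the descriptive tags of all visually similar images found in
    the public collection, and rank terms by frequency of occurrence."""
    counts = Counter()
    for tags in tag_lists:
        counts.update(tag.lower() for tag in tags)
    return counts.most_common()

# Tags from three hypothetical visually similar images:
ranked = rank_inferred_metadata([
    ["Telephone", "Cisco"],
    ["Telephone", "VOIP"],
    ["Telephone", "Best Buy"],
])
# ranked[0] -> ('telephone', 3); the remaining terms each occur once
```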
The inferred metadata can be augmented or refined, if desired, by known image recognition/classification techniques. Such techniques seek to provide automatic recognition of the objects depicted in an image. For example, by recognizing the touch-tone keypad layout and the coiled cord, such a classifier may label the FIG. 21 image with the terms telephone and facsimile machine.
If not already among the inferred metadata, the terms returned by the image classifier may be added to the list and given a count value. (An arbitrary number, for example 2, may be used, or a value dependent on the classifier's reported confidence in its identification may be utilized.)
If the classifier yields one or more terms that already appear in the list, the position of those term(s) may be raised. One way to raise a term's position is to increase its count value by a percentage (e.g., 30%). Another is to increase its count value to one more than that of the next-higher term not also produced by the image classifier. (Since the classifier returned the term "telephone" but did not return the term "Cisco," this latter method would give the term telephone a count of 19 - one more than Cisco's 18.) Various other techniques for augmenting/refining the inferred metadata with image classifier results can readily be implemented.
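The boosting rule in the parenthetical can be sketched as below. The function is an illustration, not the patent's algorithm, and the counts follow the text's telephone/Cisco example (telephone raised to 19, one more than Cisco's 18):

```python
def boost_terms(counts, classifier_terms, new_term_count=2):
    """Augment inferred-metadata counts with image-classifier output.

    Terms the classifier also produced are raised to one above the
    highest-counted term the classifier did NOT produce; terms new to
    the list enter with a nominal count.
    """
    counts = dict(counts)
    others = [c for t, c in counts.items() if t not in classifier_terms]
    for term in classifier_terms:
        if term in counts:
            if others:
                counts[term] = max(counts[term], max(others) + 1)
        else:
            counts[term] = new_term_count
    return counts

revised = boost_terms({"telephone": 3, "cisco": 18, "voip": 7}, ["telephone"])
# -> {"telephone": 19, "cisco": 18, "voip": 7}
```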
The revised list of metadata resulting from the above may be as follows:
Telephone (3)
Facsimile machine (2)
Best Buy (1)
Telecommunications (1)
The list of inferred metadata may be limited to those terms with the highest apparent reliability, e.g., the highest count values. For example, a subset comprising the top N terms, or the terms in the top Mth percentile of the ranked list, may be used. This subset may be associated with the FIG. 21 image, as inferred metadata, in a metadata store for that image.
In this example, if N = 4, the terms telephone, Cisco, phone, and VOIP are associated with the image of FIG. 21.
Once the list of metadata is assembled for the image of FIG. 21 (by the above-described procedures or otherwise), various operations may be undertaken.
One option is to present the image - together with data derived from the captured content (e.g., eigenvectors, color histograms, key-point descriptors, FFTs, machine readable data decoded from the image, or the inferred metadata) - to a service provider that acts on the presented data and provides a response to the user. Shazam, Snapnow (now LinkMe Mobile), ClusterMedia Labs, SnapTell (now part of Amazon's A9 search service), Mobot, Mobile Acuity, Nokia Point & Find, and Kooaba are a few of many such service providers.
The service provider - or the user's device - may present the metadata descriptors to a web search engine, such as Google, to obtain a richer set of auxiliary information that may help better distinguish/identify the image contents. Information so obtained from Google (or another such database resource) may be used by the service provider to augment/refine the response delivered to the user. (In some cases, the metadata - possibly accompanied by the auxiliary information received from Google - may allow the service provider to make an appropriate response to the user without requiring any image data at all.)
In some cases, one or more images obtained from Flickr may be substituted for the user's image. This can be done, for example, if a Flickr image appears to be of higher quality (as judged by sharpness, illumination histogram or other measures), and its image metrics are sufficiently similar. (Similarity can be judged by an appropriate distance measure on the metrics in use; one embodiment checks whether the distance measure is below a threshold value. When several alternate images pass this screen, the closest may be used, or an alternate may be chosen on other grounds.) The substituted image may then be used in place of (or in addition to) the captured image in the arrangements detailed herein.
In one such arrangement, the alternative image data is presented to the service provider. In another, data for several alternative images is presented. In yet another, the original image data - along with one or more sets of alternative image data - is presented. In the latter two cases, the service provider may use the redundancy to help reduce the chance of error - increasing confidence that the appropriate response is provided to the user. Client software on the cell phone can then evaluate the different responses and pick among them (e.g., by a voting arrangement), or combine them.
Instead of substitution, one or more of the associated public images may be combined or merged with the user's cell phone image. The resulting hybrid image may then be used in the different contexts detailed in this disclosure.
Another option is to use the apparently similar images collected from Flickr to inform enhancement of the user's image. Examples include color correction/matching, contrast correction, flash reduction, removal of foreground/background objects, and the like. By this arrangement, for example, the system may determine that the FIG. 21 image has a foreground component (apparently a Post-it note) on the phone that should be masked or ignored. The user's image data can thus be improved, and the enhanced image data then used.
Relatedly, the user's image may suffer from some infirmity - for example, depicting the subject from an odd perspective, or under poor lighting. Such a fault may prevent the user's image from being recognized by the service provider (i.e., the image data presented by the user may fail to match any image data in the database being searched). In response to such a failure (or in anticipation of it), data from the similar images identified on Flickr can be presented to the service provider as alternatives, in the hope that they will work better.
Another way - one that opens many further possibilities - is to search Flickr for one or more images with similar image metrics, and to collect their metadata as described herein (e.g., telephone, Cisco, phone, VOIP). Flickr is then searched a second time, on the basis of this metadata. Numerous images with similar metadata - including images from various different perspectives, under different lighting, etc. - can thereby be identified. Data for these different images may then be presented to the service provider, even though they may "look" different from the user's cell phone image.
In conducting metadata-based searches, identity of metadata may not be required. For example, in the second Flickr search just referenced, four metadata terms are associated with the user's image: telephone, Cisco, phone and VOIP. A match may be deemed to occur when a subset of these terms (e.g., three) is found.
Another approach is to rank matches based on the rankings of the shared metadata terms. Thus, images tagged with telephone and Cisco would rank as better matches than images tagged with phone and VOIP. One adaptive way of ranking a "match" is to sum the counts of the metadata descriptors for the user's image (e.g., 19 + 18 + 10 + 7 = 54), and to likewise sum the counts of those terms that are also associated with a candidate Flickr image (e.g., 35, if the Flickr image is tagged with Cisco, phone, and VOIP). The ratio can then be computed (35/54) and compared against a threshold (e.g., 60%). In this case, a "match" is found. Various other adaptive matching techniques can be devised by the skilled artisan.
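The ratio computation in this example can be written out as follows, using the counts given in the text (the function name and dictionary layout are illustrative):

```python
def metadata_match(user_counts, candidate_tags, threshold=0.60):
    """Score a candidate image by the ranked weight of metadata terms it
    shares with the user's image, as a fraction of the total weight."""
    total = sum(user_counts.values())
    shared = sum(c for t, c in user_counts.items() if t in candidate_tags)
    ratio = shared / float(total)
    return ratio, ratio >= threshold

# The text's example: user-term counts 19 + 18 + 10 + 7 = 54; a Flickr
# image tagged Cisco, phone and VOIP shares 18 + 10 + 7 = 35 of that weight.
ratio, matched = metadata_match(
    {"telephone": 19, "cisco": 18, "phone": 10, "voip": 7},
    {"cisco", "phone", "voip"},
)
# ratio = 35/54, about 0.648 - a "match" at the 60% threshold
```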
The examples just given searched Flickr for images based on similarity of image metrics, and optionally on similarity of the (semantic) text metadata. Geo-location data (e.g., GPS tags) can also be used to gain a metadata toe-hold.
From image metrics alone, an artistic, abstract shot of the Eiffel Tower (e.g., Figure 29) - captured from amid its metalwork, or from another unusual vantage point - may not be recognizable as the Eiffel Tower. However, the GPS information captured with the image identifies the location of the image subject. Public databases (including Flickr) can be used to retrieve textual metadata based on GPS descriptors: entering the GPS descriptors for the picture yields textual descriptors such as "eiffel" and "tower."
Google Images or another database can then be queried with the terms eiffel and tower, to find other, more conventional images of the Eiffel Tower. One or more of these images may be presented to the service provider to drive its processing. (Alternatively, the GPS information from the user's image can be used to retrieve images from Flickr taken at the same location, yielding images of the Eiffel Tower that can be presented to the service provider.)
While GPS-derived camera metadata is increasingly deployed, most images currently in Flickr and other public databases lack geo-location information. However, GPS information can be propagated automatically across a collection of images that share visual features (as judged by image metrics such as eigenvectors, color histograms, keypoint descriptors, FFTs, or other classification techniques), or that have matching metadata.
To illustrate, consider a process in which the user takes a cell phone image of a city fountain, the image is tagged with GPS information, and matching Flickr/Google images of the fountain are identified based on appearance features. To each of these matching images, the process can add the GPS information from the user's image.
A second level of search can also be employed. From the set of fountain images identified in the first, appearance-based search, metadata can be harvested and ranked as above. Flickr can then be searched a second time (e.g., as discussed above) for images whose metadata matches within a certain threshold. To these images, too, GPS information from the user's image can be added.
Alternatively, or additionally, a first set of images in Flickr/Google similar to the user's image of the fountain can be identified by pattern matching, by GPS, or by both. Metadata can be harvested and ranked from these matched images. Flickr can then be searched a second time for a second set of images with similar metadata. To this second set of images, GPS information from the user's image can be added.
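The two-stage propagation just described (appearance match, harvest tags, tag-based second search, then back-fill GPS) can be sketched with a toy corpus. The corpus layout, the crude feature-distance measure, and all thresholds below are invented for this sketch; a real system would use Flickr queries and robust image metrics.

```python
from collections import Counter

def appearance_similar(corpus, query_features, max_dist=2):
    # Toy similarity: count of differing feature values (stand-in for
    # eigenvector / color-histogram / keypoint comparison).
    def dist(f):
        return sum(a != b for a, b in zip(f, query_features))
    return [img for img in corpus if dist(img["features"]) <= max_dist]

def top_tags(images, n=3):
    """Harvest and rank metadata terms from a set of images."""
    counts = Counter(t for img in images for t in img["tags"])
    return {t for t, _ in counts.most_common(n)}

def tag_similar(corpus, tags, min_shared=2):
    """Second-level search: images sharing enough of the harvested tags."""
    return [img for img in corpus if len(tags & img["tags"]) >= min_shared]

def propagate_gps(images, gps):
    for img in images:
        img.setdefault("gps", gps)   # only fill in missing geo-tags

corpus = [
    {"features": (1, 1, 1, 0), "tags": {"fountain", "city"}},
    {"features": (1, 1, 0, 0), "tags": {"fountain", "water"}},
    {"features": (0, 0, 0, 0), "tags": {"fountain", "city", "plaza"}},
]
first = appearance_similar(corpus, (1, 1, 1, 1))       # appearance match
second = tag_similar(corpus, top_tags(first))          # metadata match
propagate_gps(second, (45.52, -122.68))                # back-fill user GPS
```

Note that the third image, which looks nothing like the query, is still reached through the second (metadata) stage and so receives the propagated GPS tag.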
Another approach to geo-locating images is to search Flickr for images with similar image characteristics (e.g., gist, eigenvectors, color histograms, keypoint descriptors, FFTs, etc.), and then evaluate the geo-location data of the identified images to infer the location of the query image. See, e.g., Hays et al., "IM2GPS: Estimating Geographic Information from a Single Image," Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 2008. The techniques described in the Hays paper (including the use of probability functions to quantify uncertainty in the estimates) are suitable for use with embodiments of the present technology.
Geo-location data captured by the camera itself is generally very reliable, as is metadata (location, etc.) authored by the image's proprietor. However, uncertainty and other issues arise when metadata descriptors (geographic or semantic) are inferred or estimated, or are authored by a stranger.
Preferably, this inherent uncertainty is memorialized in a manner that allows later users (human or machine) to take it into account.
One approach is to segregate uncertain metadata from device-authored or proprietor-authored metadata. For example, different data structures can be used, or different tags can distinguish the two classes of information. Or each metadata descriptor can have its own sub-metadata, indicating the author, creation date, and source of the data. The author or source field of the sub-metadata can carry a data string indicating that the descriptor was inferred, estimated, deduced, etc., or such information can be a separate sub-metadata tag.
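One possible shape for such a descriptor-with-provenance record is shown below. The field names (`submeta`, `derivation`, etc.) and the helper function are invented for illustration; the specification does not prescribe a concrete schema.

```python
def make_descriptor(value, author, source, created,
                    inferred=False, confidence=None):
    """Build a metadata descriptor with provenance sub-metadata, so that
    inferred descriptors stay distinguishable from authored ones."""
    sub = {"author": author, "source": source, "created": created}
    if inferred:
        sub["derivation"] = "inferred"   # flags uncertain metadata
    if confidence is not None:
        sub["confidence"] = confidence   # optional confidence metric
    return {"value": value, "submeta": sub}

# A device-authored geotag vs. an inferred textual descriptor:
authored = make_descriptor((48.858, 2.294), "camera", "GPS receiver",
                           "2008-06-18")
inferred = make_descriptor("Eiffel Tower", "geo-lookup-service",
                           "GPS reverse lookup", "2008-06-18",
                           inferred=True, confidence=0.90)
```

A later consumer of the metadata can then filter on the `derivation` sub-field before relying on a descriptor.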
Each uncertain descriptor can be given a confidence metric or rank. Such data can be determined explicitly or inferentially, e.g., by the public. For example, when a user views an image on Flickr and believes it depicts Yellowstone, she may add a "Yellowstone" location tag together with a 95% confidence tag (her estimate of certainty about the contributed location metadata). She may also add an alternate location tag of "Montana," with a corresponding 50% confidence tag. (Confidence tags need not sum to 100%: a single tag can be contributed with less than 100% confidence, or multiple tags can be contributed whose confidences total more than 100% — as with Yellowstone and Montana, since the two regions may overlap.)
Where multiple users contribute the same type of metadata for an image (e.g., location metadata), the contributions can be combined to yield aggregate data. For example, five of six contributing users may have tagged an image with Yellowstone, with an average 93% confidence; one of the six may have tagged it Montana, with 50% confidence; and two of the six may have tagged it Glacier National Park, with an average 15% confidence.
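Aggregating such contributions is a straightforward grouping operation; a minimal sketch follows. The data and function names are hypothetical, and the individual confidences below are chosen merely so that the averages reproduce the 93%/50%/15% figures of the example.

```python
from collections import defaultdict

def aggregate_tags(contributions):
    """contributions: (tag, confidence) pairs from different users.
    Returns {tag: (num_contributors, mean_confidence)}."""
    buckets = defaultdict(list)
    for tag, conf in contributions:
        buckets[tag].append(conf)
    return {t: (len(v), sum(v) / len(v)) for t, v in buckets.items()}

contribs = [
    ("yellowstone", 0.95), ("yellowstone", 0.90), ("yellowstone", 0.95),
    ("yellowstone", 0.95), ("yellowstone", 0.90),     # avg 0.93
    ("montana", 0.50),                                # avg 0.50
    ("glacier", 0.10), ("glacier", 0.20),             # avg 0.15
]
summary = aggregate_tags(contribs)
```

The resulting summary (five Yellowstone votes at 93% mean confidence, etc.) is the kind of distilled record an automated system could store alongside the raw contributions.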
An inferential determination of metadata reliability can be performed routinely, or when explicit estimates from contributors are unavailable. The Figure 21 case is an example: metadata occurrence counts are used to determine the relative merit of each item of metadata (e.g., telephone = 19 or 7, depending on the math used). Similar methods can be used to rank reliability when several metadata contributors provide descriptors for a given image.
Crowd-sourcing techniques are known for distributing image-identification tasks to online workers and collecting the results. However, prior art arrangements are understood to seek simple consensus on an identification. Preferably, the full diversity of opinion gathered about the image content is quantified (optionally with information about its temporal variation and its respective sources), yielding richer data that allows automated systems to make more nuanced decisions about the image, its value, and its relevance.
To illustrate, a known crowd-sourcing image-identification technique might label the image of Figure 35 with the identifiers "soccer ball" and "dog" — the terms agreed upon by one or several viewers. Ignored, however, is the long tail of alternative descriptors, e.g., summer, Labrador, football, tongue, afternoon, evening, morning, fescue, etc. Also ignored are demographics and other information about the people (or processes) that contributed the metadata identifiers — information that gives context to their assessments. A richer body of metadata can associate, with each descriptor, a set of sub-metadata describing such other information.
The sub-metadata may indicate, for example, that the tag "football" was contributed by a 21-year-old male in Brazil on June 18, 2008. It may likewise indicate that the tags "afternoon," "evening" and "morning" were contributed by an automated classifier, based on the angle of illumination on the depicted objects. These three descriptors can also carry the probabilities assigned by the classifier — e.g., 50% for afternoon, 30% for evening, and 20% for morning — with each percentage stored as sub-metadata associated with its descriptor. The metadata terms contributed by the classifier may have further sub-tags identifying an online glossary that aids understanding of the assigned terms. For example, such a sub-tag may associate the term "afternoon" with the URL of a computer resource indicating that the term means noon to 7 pm. The glossary may also include a probability density function, e.g., indicating that the mean time represented by "afternoon" is 3:30 pm, that the median time is 4:15 pm, and that the term follows a Gaussian distribution.
The expertise of the metadata contributors can also be reflected in the sub-metadata. The term "fescue" may have sub-metadata indicating that it was contributed by a 45-year-old grass seed farmer in Oregon. An automated system can conclude that this term was contributed by someone with rare expertise in the relevant knowledge domain, and can therefore treat the descriptor as highly reliable (though perhaps not highly relevant). This reliability conclusion can be added to the metadata collection, so that later reviewers of the metadata benefit from the automated system's assessment.
Assessments of a contributor's expertise can be made by the contributor herself, or by reputational rankings based on collected third-party assessments of the contributor's previous metadata contributions. (Such reputational ranks are familiar, e.g., from public evaluations of eBay sellers and Amazon book reviewers.) Assessments can be field-specific, so a person may be judged (or self-judged) knowledgeable about plant species but not about photography. Again, all such information is desirably stored in sub-meta tags (including sub-sub-meta tags, when the information concerns a sub-meta tag).
More information about crowd-sourcing, including the use of contributor expertise, is found in Digimarc's published patent application 20070162761.
Returning to geo-location descriptors (which may be numeric, e.g., latitude/longitude, or textual), an image may accumulate a long catalog of contributed geo-descriptors over time. An automated system (e.g., a server at Flickr) can periodically review the contributed geo-tag information and distill it to facilitate public use. For numeric information, the process can apply known clustering algorithms to identify clusters of similar coordinates, and average each cluster to produce a representative position. For example, a geothermal feature may be tagged by some with latitude/longitude coordinates in Yellowstone, and by others with latitude/longitude coordinates in New Zealand's Hells Gate park. These coordinates form two distinct clusters that can be averaged separately. If 70% of the contributors placed the coordinates in Yellowstone, the distilled (averaged) value can be given a 70% confidence. Outlier data can be maintained, but with a low probability corresponding to its outlier status. Such distilled data may be stored by the proprietor in metadata fields that are publicly readable but not publicly writable.
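A naive version of this distillation — single-linkage clustering of contributed coordinates, a per-cluster average, and a confidence equal to each cluster's share of contributors — can be sketched as follows. The 1-degree linkage threshold and all names are assumptions for this sketch; a real system would use a proper geodesic clustering algorithm.

```python
def cluster_coords(coords, max_deg=1.0):
    """Greedy single-linkage clustering: a point joins the first cluster
    containing a point within max_deg in both latitude and longitude."""
    clusters = []
    for pt in coords:
        for cl in clusters:
            if any(abs(pt[0] - q[0]) <= max_deg and abs(pt[1] - q[1]) <= max_deg
                   for q in cl):
                cl.append(pt)
                break
        else:
            clusters.append([pt])
    return clusters

def distill(coords):
    """Average each cluster; confidence = cluster's share of contributors."""
    total = len(coords)
    out = []
    for cl in cluster_coords(coords):
        lat = sum(p[0] for p in cl) / len(cl)
        lon = sum(p[1] for p in cl) / len(cl)
        out.append(((lat, lon), len(cl) / total))
    return sorted(out, key=lambda x: -x[1])

# 7 of 10 tags near Yellowstone, 3 near Hells Gate (Rotorua, NZ):
tags = ([(44.46, -110.83), (44.47, -110.82), (44.45, -110.84)]
        + [(44.46, -110.83)] * 4 + [(-38.06, 176.36)] * 3)
distilled = distill(tags)   # Yellowstone cluster first, at 70% confidence
```

The minority cluster is retained with its lower (30%) confidence rather than discarded, per the outlier-handling discussed above.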
The same or a different approach can be applied to contributed textual metadata — e.g., occurrence frequencies can be tallied, and the terms ranked on this basis to provide a sense of relative trust.
The techniques detailed in this specification find numerous applications in contexts involving watermarking, bar-coding, fingerprinting, OCR-decoding, and other means of obtaining information from imagery. Consider again the cell phone picture of the desk phone of FIG. 21. Flickr can be searched, based on image metrics, to obtain a collection of similar-appearing images (e.g., as described above). A data extraction process (e.g., watermark decoding, fingerprint calculation, barcode- or OCR-reading) can be applied to some or all of the resulting images, and any extracted data can be added to the metadata for the FIG. 21 image and/or presented to the service provider.
From the collection of images found in a first search, textual or GPS metadata can be harvested, and a second search can be conducted for similarly-tagged images. From the text tags Cisco and VOIP, for example, a search of Flickr may find a picture of the underside of the user's phone, bearing OCR-readable data, as shown in FIG. Again, the extracted information can be added to the metadata for the FIG. 21 image and/or presented to the service provider to improve the response provided to the user.
As just shown, the cell phone user can thus be given the ability to see under objects and around corners — by using one image as a portal to a large collection of related images.
Referring to Figures 44 and 45A, cell phones and related portable devices 110 typically include a display 111 and a keypad 112. In addition to the numeric (or alphanumeric) keypad, a multi-function controller 114 is common. One popular controller has a center button 118 and four surrounding buttons 116a, 116b, 116c, and 116d (also shown in Figure 37).
An exemplary usage model is as follows. The system responds to an image 128 (captured, or received wirelessly) by presenting a collection of related images to the user on the cell phone display. For example, the user captures an image and submits it to a remote service. The service determines image metrics for the submitted image (after pre-processing as described above, where appropriate), and searches (e.g., Flickr) for visually similar images. These images are transmitted to the cell phone (by the service, or directly from Flickr) and buffered for display. By instructions presented on the display (FIG. 45A, 130), the service may prompt the user to repeatedly press (or press and hold) the right-arrow button 116b of the four-way controller to view the sequence of similar-appearing images. Each press of the button displays another of the buffered similar-appearing images.
By techniques such as those described earlier, or otherwise, the remote service can also retrieve images that are geographically proximate to the submitted image. These too can be transmitted to the cell phone and buffered. The instructions may direct the user to press the left-arrow button 116d of the controller to review these GPS-similar images (FIG. 45A, 132).
Similarly, the service can retrieve images whose metadata is similar to that of the submitted image (identified, e.g., from textual metadata inferred via pattern-matched or GPS-matched images). Again, these images can be transmitted to the phone and buffered for immediate display. The instructions may direct pressing the up-arrow button 116a of the controller to view these metadata-similar images (FIG. 45A, 134).
Thus, by pressing the right, left, and up buttons, the user can view images similar to the captured image in appearance, location, or metadata descriptors.
Whenever such a review reveals an image of particular interest, the user can press the down button 116c. This identifies the currently-reviewed image to the service provider, and the process then repeats with the user-selected image as the new base image: button presses again enable review of images similar in appearance (116b), location (116d), or metadata (116a) to that base image.
This process can continue indefinitely. At some point the user may press the center button 118 of the four-way controller. This action submits the currently displayed image to a service provider for another operation (e.g., triggering a corresponding response, as described in the earlier-cited documents). The service provider involved may be different from the one that provided the alternative images, or the same. (In the latter case, the finally-selected image need not be transmitted to the service provider, since the provider knows all the images buffered by the cell phone and can track which is currently displayed.)
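The browsing loop just described can be modeled as a small state machine. Everything here — the class, the `fetch` callback, the button names — is invented to illustrate the control flow, not an implementation of any particular handset.

```python
class SimilarityBrowser:
    """Right/left/up step through appearance-, location-, and metadata-
    similar image buffers; down re-seeds on the current image; center
    submits the current image to the service provider."""

    def __init__(self, fetch, base):
        self.fetch = fetch   # fetch(base, dimension) -> list of images
        self.reseed(base)

    def reseed(self, base):
        self.base = base
        self.current = base
        self.buffers = {d: self.fetch(base, d)
                        for d in ("appearance", "location", "metadata")}
        self.pos = {d: -1 for d in self.buffers}

    def press(self, button):
        dim = {"right": "appearance", "left": "location",
               "up": "metadata"}.get(button)
        if dim:                          # step through one similarity buffer
            self.pos[dim] = (self.pos[dim] + 1) % len(self.buffers[dim])
            self.current = self.buffers[dim][self.pos[dim]]
        elif button == "down":           # make current image the new base
            self.reseed(self.current)
        elif button == "center":         # submit for a full response
            return ("submitted", self.current)

# Stub service: names the base and dimension in each buffered image.
fetch = lambda base, dim: [f"{base}/{dim}{i}" for i in (1, 2, 3)]
browser = SimilarityBrowser(fetch, "img0")
```

Note that, as the specification points out, "right then left" does not undo: each arrow advances its own independent buffer.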
The dimensions of information browsing just described (similar-appearance images, similar-location images, similar-metadata images) can be different in other embodiments. Consider, for example, an embodiment that takes an image of a house (or a latitude/longitude) as input and returns sequences of the following images: (a) for-sale houses geographically nearest the input house; (b) for-sale houses closest in price to the input house; and (c) for-sale houses closest in features to the input house (e.g., bedrooms/bathrooms). (The universe of displayed houses may be limited, e.g., by zip code, metropolitan area, school district, or other criteria.)
Another example of this user interface technique is the presentation of search results from eBay for auctions listing Xbox 360 game consoles. One dimension can be price (e.g., pressing button 116b produces a sequence of screens showing Xbox 360 auctions, starting with the lowest-priced); another can be the seller's geographic proximity to the user (nearest to farthest, by pressing button 116d); another can be the time remaining in the auction (shortest to longest, by pressing button 116a). Pressing the center button 118 loads the full web page of the auction being displayed.
A related example uses image features and associated database(s) to identify a vehicle from a user-captured image, searches eBay and Craigslist for similar vehicles, and presents the results on screen. Pressing button 116b displays a sequence of screens with information (e.g., an image, seller location, and price) about vehicles offered for sale nationwide, ordered by similarity (same model year/same color first, then nearest model years/colors). Pressing button 116d produces a like sequence of screens, but limited to the user's state (or metropolitan area, or a 50-mile radius of the user's location, etc.). Pressing button 116a produces a similarly geographically-limited sequence, but presented in ascending price order (rather than by closest model year/color). Again, pressing the center button loads the full web page (eBay or Craigslist) of the finally-displayed vehicle.
Another embodiment is an application that helps people recall names. A user sees a familiar person at a party but cannot remember his name. Discreetly, the user snaps a picture of the person, and the image is sent to a remote service provider. The service provider extracts face recognition parameters and searches for similar-appearing faces in one or more databases containing face recognition parameters for images on social network sites (e.g., Facebook, MySpace, LinkedIn). (The service can provide the user's sign-on credentials to such sites, permitting retrieval of information not otherwise publicly accessible.) The names and other information about similar-appearing persons so found are returned to the user's cell phone — to help jog the user's memory.
Various UI procedures can be employed. When data is returned from the remote service, the user can press button 116b to scroll through the matches in most-similar-first order, regardless of geography. Thumbnails of the matched individuals may be displayed with associated names and other profile information, or full-screen images of the persons may be presented with the names overlaid. Once the familiar person is recognized, the user can press button 118 to load that person's full Facebook/MySpace/LinkedIn page. Alternatively, instead of providing images with names, only a textual list of names may be presented — perhaps all on a single screen — in order of facial similarity; SMS text messaging can suffice for this final arrangement.
Button 116d can be pressed to scroll through the matches in closest-similarity order, but limited to persons whose listed residence is within a specified geographic proximity of the user's current location, or of a reference location such as the user's home (e.g., same metropolitan area, same state, same campus, etc.). Pressing button 116a can produce a similar display, but limited to persons who are "friends" of the user within a social network (or friends of friends, or persons within another specified degree of separation from the user).
A related arrangement is a law enforcement tool in which an officer captures an image of a person and submits it to a database containing facial portrait/eigenvalue information from government driver's license records and/or other sources. Pressing button 116b causes the screen to display a sequence of images/electronic records for the people nationwide with the closest facial matches. Pressing button 116d causes the screen to display a similar sequence, but limited to people within the officer's state. Button 116a yields such a sequence limited to people within the metropolitan area in which the officer is working.
Instead of three dimensions of information browsing (e.g., buttons 116b, 116d, 116a for similar-appearing images / similarly-located images / similarly metadata-tagged images), fewer can be employed. Figure 45B shows browsing screens in two dimensions: pressing the right button produces a first sequence 140 of information screens; pressing the left button produces a different sequence 142 of information screens.
Instead of two or more individual buttons, a single UI control can be used to navigate the available dimensions of information. A joystick is one such device; a roller wheel (or scroll wheel) is another. The portable device 110 of Figure 44 has a roller wheel 124 on its side that can be rolled up or down, and can also be pressed in to make a selection (e.g., akin to buttons 116c or 118 of the controller discussed earlier). Similar controls are found on many mice.
In most user interfaces, opposing buttons (e.g., right button 116b and left button 116d) navigate the same dimension of information, just in opposite directions (e.g., forward/reverse). It will be recognized that this is not the case in the particular interface discussed above (although it can be so in other implementations): pressing the right button 116b and then the left button 116d does not return the system to its original state. Instead, pressing the right button presents, e.g., a first similar-appearing image, and pressing the left button presents, e.g., a first similarly-located image.
Sometimes it is desirable to navigate backwards through a sequence of screens, in the reverse of the order just reviewed. Various interface controls can serve this purpose.
One is a "reverse" button. The device 110 of Figure 44 includes various buttons not yet discussed, e.g., buttons 120a-120f around the perimeter of controller 114. One of these, if pressed, may reverse the direction of scrolling associated with a nearby button. For example, if button 116b normally presents items in order of increasing cost, activation of button 120a may switch button 116b's function to present items in order of decreasing cost. If during a review the user "overshoots" and wants to reverse direction, she can press button 120a and then press button 116b again: the previously-presented screen(s) are then shown in reverse order, beginning with the current screen.
Alternatively, operation of such a button (e.g., 120a or 120f) may cause the opposite button 116d to scroll backwards, in reverse order, through the screens presented by activation of button 116b.
In all these embodiments, a textual or symbolic prompt can be overlaid on the display screen, informing the user of the dimension and direction of the information being browsed (e.g., "browsing: increasing cost").
In yet another arrangement, a single button can perform multiple functions. For example, pressing button 116b once may cause the system to begin presenting a sequence of screens showing images of for-sale homes nearest the user's location — advancing, e.g., every 800 milliseconds, per preference data entered by the user. Pressing button 116b a second time may stop the sequence, leaving a static screen showing one for-sale house. Pressing button 116b a third time may cause the system to start from that static screen and present the screens in reverse order, stepping backwards through those previously presented. Repeated presses of buttons 116a, 116c, etc., can operate similarly (but controlling different information sequences, e.g., houses of closest price and houses of closest features).
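The press-once/press-again/press-a-third-time behavior is a simple mode cycle; a sketch follows. The mode names, the `tick` timer hook, and the class itself are assumptions made for illustration.

```python
class Slideshow:
    """Single-button control: 1st press starts auto-advance, 2nd press
    freezes on the current screen, 3rd press steps backwards."""

    def __init__(self, screens):
        self.screens = screens
        self.i = 0
        self.mode = "stopped"

    def press(self):
        # Repeated presses of the one button cycle through the modes.
        order = {"stopped": "forward", "forward": "paused",
                 "paused": "reverse"}
        self.mode = order.get(self.mode, "forward")

    def tick(self):
        # Called by a timer, e.g., every 800 ms.
        if self.mode == "forward":
            self.i = min(self.i + 1, len(self.screens) - 1)
        elif self.mode == "reverse":
            self.i = max(self.i - 1, 0)
        return self.screens[self.i]
```

A parallel instance per button (116a, 116c, etc.) would drive the other information sequences.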
In arrangements in which the presented information relates back to a base image (e.g., an image snapped by the user), this base image can persist on the display — e.g., as a thumbnail. Or a button on the device (e.g., 126a or 120b) can be actuated to immediately recall the base image to the display.
Touch interfaces are growing in popularity, as in products available from Apple and Microsoft (see, e.g., Apple's patent publications 20060026535, 20060026536, 20060250377, 20080158169, 20080158172, 20080204426, 20080174570 and Microsoft's patent publications 20060033701, 20070236470, 20080001924). Such technologies can be used to enhance and extend the user interface concepts just reviewed — allowing greater degrees of flexibility and control. Each of the noted button presses can have a counterpart gesture in the touch-screen system's vocabulary.
For example, different touch-screen gestures can invoke display of the different image sequences just reviewed. A rightward brushing gesture may present the right-scrolling series 130 of image frames having similar visual content (with the initial scrolling speed depending on the speed of the gesture, and optionally decelerating thereafter). A leftward brushing gesture may present a similar left-scrolling display of images 132 having similar GPS information. An upward brushing gesture may present an up-scrolling display of images 134 that are similar in metadata. At any point, the user can tap one of the displayed images to make it the new base image, and the process repeats.
Other gestures can invoke other actions. One such action is to display an overhead image corresponding to the GPS location associated with the selected image. Images can be zoomed in/out with other gestures. The user may view image data, map data, data from different times of day or different dates/seasons, and/or data with various overlays (topography, places of interest, and other data layers known from Google Earth). Icons or other graphics can be presented on the display, depending on the contents of the particular image. One such arrangement is detailed in Digimarc's published application 20080300011.
Rather than an overhead image, a "curbside" or street-level image can be displayed.
It will be appreciated that certain embodiments of the present technology share a common structure. An initial set of data is provided (e.g., an image, or metadata such as descriptors or geocode information, or image metrics such as eigenvalues). From this, a second set of data is obtained (e.g., images, or image metrics, or metadata). From the second set, a third set is compiled (e.g., images with similar image metrics, or with similar metadata). Items from the third set can be used as the result of the process, or the third set can be used to determine further data (e.g., a fourth set compiled from images in the third set). Processing can continue likewise: a fifth set of data can be determined from the fourth (e.g., identifying a collection of images having metadata terms from the fourth set); a sixth set can be obtained from the fifth (e.g., identifying clusters of GPS coordinates with which images in the fifth set are tagged); and so on.
The sets of data may be sets of images, or sets of data of other varieties (e.g., image metrics, textual metadata, geo-location data, decoded OCR-, barcode-, or watermark-data, etc.).
Any such data can serve as the seed. The process may begin with image data, or with image metrics, textual (semantic) metadata, geo-location information (e.g., GPS coordinates), decoded OCR/barcode/watermark information, etc. From the first type of information, a first set of similar-information images can be obtained. From that first set, a second, different type of information (image metrics / semantic metadata / GPS / decoded information, etc.) can be harvested. From the second type of information, a second set of similar-information images can be obtained. From that second set, a third, different type of information can be harvested, from which a third set of similar-information images can be obtained. And so forth.
Thus, while the illustrated embodiments generally start with an image and proceed by reference to image metrics, entirely different sequences of operations are possible. The seed may be a payload decoded from a product barcode. This may yield a first collection of images depicting the same barcode; that collection may yield a set of common metadata; that metadata may yield a second collection of images. Image metrics can be computed from this second collection, and the most prevalent metrics used to search for and identify a third collection of images. The images so identified can then be presented to the user using the arrangements detailed herein.
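This seed-and-chain structure reduces to repeatedly applying a stage function to the previous stage's output. In the sketch below the stage functions are mere stand-ins (real ones would query Flickr, compute image metrics, decode barcodes, etc.), and the barcode payload and image names are hypothetical.

```python
def chain(seed, stages):
    """Apply each stage to the output of the previous one,
    keeping the whole trail of intermediate data sets."""
    trail = [seed]
    for stage in stages:
        trail.append(stage(trail[-1]))
    return trail

# Stand-in stages for the barcode-seeded example in the text:
stages = [
    lambda payload: ["imgA", "imgB"],          # images bearing the same barcode
    lambda imgs: {"soup", "tomato"},           # metadata common to those images
    lambda terms: ["imgC", "imgD", "imgE"],    # images tagged with those terms
]
trail = chain("012345678905", stages)          # barcode payload as seed
```

Further stages (image metrics from the second collection, a third collection from those metrics, etc.) are just additional entries in the `stages` list.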
Certain embodiments of the present technology can thus be regarded as employing an iterative, recursive process, in which information about one set of images (in many cases, a single initial image) is used to identify a second set of images, which may in turn be used to identify a third set of images, and so on. The relationship between each set of images and the next is a function of a particular class of image information, e.g., image metrics, semantic metadata, GPS data, decoded information, and the like.
In other contexts, the relationship between one set of images and the next is a function not of a single class of information, but of two or more. For example, a seed user image may be examined for both image metrics and GPS data; from these two classes of information, a collection of images can be determined — images similar in both visual appearance and location. Other pairs, triplets, etc., of relationships can naturally be used in determining any of the successive sets of images.
Some embodiments of the present technology analyze a consumer cell phone image and heuristically determine information about its subject: for example, is it a person, a place, or a thing? From such a high-level determination, the system can better decide what type of response the consumer is likely to want — making its operation more intuitive.
For example, if the subject of the photo is a person, the consumer may be interested in adding the depicted person as a Facebook "friend." Or in sending that person a text message. Or in publishing an annotated version of the photo on a website. Or in simply learning who the person is.
If the subject is a place (e.g., Times Square), the consumer may be interested in the local geography, maps, and popular nearby attractions.
If the subject is a thing (e.g., a tree species, or a beer bottle), the consumer may be interested in information about the object (e.g., its history, other uses for it), or in buying or selling such an object.
Based on the determined image type, an exemplary system/service can identify one or more operations expected to give the consumer the most appropriate response to the cell phone image. One or all of these can be undertaken, with the results cached on the consumer's cell phone for review. For example, scrolling a thumbwheel on the side of the cell phone can present a series of different screens, each responding to the image subject with different information. (Or a screen may be presented querying the consumer as to which of several possible operations is desired.)
In use, the system can monitor which of the offered responses the consumer selects. This usage history can be employed to refine a model of the consumer's interests and desires, so that future responses can be tailored more closely to that user.
These concepts will become clearer by example (e.g., the aspects depicted in Figures 46 and 47).
Processing a set of sample images
Suppose a traveler uses a cell phone or other mobile device to snap a photo of the Prometheus statue at Rockefeller Center in New York. Initially, it is just a bunch of pixels. What to do?
Assume the image is geocoded with location information (e.g., latitude/longitude in XMP or EXIF metadata).
Using this geocode data, a search of Flickr can be undertaken for a first set of images taken from the same (or a nearby) location. Perhaps there will be 5, or 500, images in this first set.
Metadata from the images of this set is collected. There are various types of metadata. One is the words / phrases from the title given to the image. The other is the information of the meta tags assigned to the image - generally by the photographer (e.g., naming the subject of photography and certain properties / keywords), but additionally a capture device , Identifying the date / time, location, etc. of the photo). The other are words / phrases in the narrative description of the photographs authored by the photographer.
Some metadata terms may be repeated across different images. Descriptors common to two or more images can be identified (clustered), and the most popular terms ranked. (This listing is shown at "A" in Figure 46A.) Here, and in the other metadata listings, only partial results are given, for clarity of illustration.
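The collection and ranking of shared descriptors just described can be sketched as follows. This is a minimal illustration, not the patent's implementation; the image metadata shown is hypothetical.

```python
from collections import Counter

def rank_shared_terms(image_metadata):
    """Count metadata terms across images; keep terms shared by
    two or more images, ranked by how many images carry them."""
    counts = Counter()
    for terms in image_metadata:
        # Count each term at most once per image.
        counts.update(set(terms))
    return [(term, n) for term, n in counts.most_common() if n >= 2]

# Hypothetical metadata drawn from three co-located images.
images = [
    {"rockefeller center", "prometheus", "nyc"},
    {"rockefeller center", "skating rink", "nyc"},
    {"prometheus", "statue", "rockefeller center"},
]
print(rank_shared_terms(images))
```

Terms appearing in only one image ("statue", "skating rink") drop out, mirroring the clustering of descriptors common to two or more images.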
From the metadata, and from other analyses, it may be possible to determine which images in the first set are person-centric, which are place-centric, and which are object-centric.
Consider the metadata tagged to a set of 50 images: some of the terms relate to the place. Some relate to people depicted in the images. Some relate to things.
Terms relating to the place can be identified using various techniques. One is to look up place descriptors near a given geographic location using a database of geographic information. Yahoo's GeoPlanet service, for example, when queried with the latitude/longitude of Rockefeller Center, returns descriptors such as "Rockefeller Center", "10024" (postal code), "Midtown Manhattan", "New York", "Manhattan", and "US".
The same service can be queried for the names of neighboring/sibling locales and features, returning results such as "10017", "10020", "10036", "Theater District", "Carnegie Hall", "Grand Central Station", and "American Folk Art Museum".
Given a set of latitude / longitude coordinates or other location information, nearby street names can be obtained from various mapping programs.
A dictionary of nearby place-descriptor terms can be compiled in this way. The metadata obtained from the set of Flickr images can then be analyzed against this term dictionary to identify place-related terms (e.g., terms matching entries in the dictionary).
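The dictionary lookup can be sketched as a simple membership test. The place dictionary below is a hypothetical fragment of what a GeoPlanet-style query might yield; the fraction returned corresponds to one of the metrics discussed in the paragraphs that follow.

```python
def place_relatedness(metadata_terms, place_dictionary):
    """Fraction of an image's metadata terms that appear in the
    compiled dictionary of nearby place descriptors."""
    if not metadata_terms:
        return 0.0
    hits = [t for t in metadata_terms if t.lower() in place_dictionary]
    return len(hits) / len(metadata_terms)

# Hypothetical dictionary compiled for the Rockefeller Center vicinity.
place_dict = {"rockefeller center", "midtown manhattan", "new york",
              "manhattan", "10020", "5th avenue"}
score = place_relatedness(["Rockefeller Center", "Prometheus", "New York"],
                          place_dict)
print(score)  # 2 of 3 terms are place-related
```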
Consideration then turns to the use of this place-related metadata in the reference set of images collected from Flickr.
Some images may have no place-related metadata at all. These images are likely to be person-centric or object-centric rather than place-centric.
Other images may have exclusively place-related metadata. These images are likely to be place-centric rather than person-centric or object-centric.
Between the two are images with both place-related metadata and other metadata. Various rules can be devised and used to assess the relative relevance of such images to the place.
One rule looks at the number of metadata descriptors associated with an image, and determines the fraction found in the dictionary of place-related terms. This is a first metric.
Another looks at where the place-related descriptors appear in the metadata. Descriptors appearing in an image title are more likely to be relevant than descriptors appearing at the end of a long narrative description of the photograph. The position of the place-related metadata is a second metric.
The specificity of the place-related descriptors can also be considered. A descriptor such as "New York" or "USA" is less indicative of a place-centric image than a more specific descriptor such as "Rockefeller Center" or "Grand Central Station". This can yield a third metric.
A related fourth metric takes into account the frequency (or unusualness) of the term, within the collected metadata or within a superset of that data. By this measure, "RCA Building" is more relevant than "Rockefeller Center", because it occurs much less frequently.
These and other metrics can be combined so that each image in the collection is assigned a place score indicating its likely place-centricity.
The combination may simply sum the four factors, each in the range 0-100. More typically, however, some metrics will be weighted more heavily. The following equation, using metrics M1, M2, M3 and M4, with factors A, B, C, D and exponents W, X, Y, Z (which can be determined experimentally), can be used to compute the score S:

S = A*M1^W + B*M2^X + C*M3^Y + D*M4^Z

Analogous techniques can be utilized to estimate the person-centricity of each image of the set obtained from Flickr.
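The weighted combination can be sketched as below. The equation in the source text is garbled, so the weighted power-sum form, and all coefficient and metric values shown, are assumptions for illustration.

```python
def place_score(m1, m2, m3, m4,
                A=1.0, B=1.0, C=1.0, D=1.0,
                W=1.0, X=1.0, Y=1.0, Z=1.0):
    """Combine four metrics (each 0-100) into a single place-centricity
    score. Factors A-D and exponents W-Z would be tuned experimentally."""
    return A * m1**W + B * m2**X + C * m3**Y + D * m4**Z

# With unit factors and exponents, the score is just the sum of metrics.
print(place_score(40, 20, 10, 30))        # 100.0
# Weighting the first metric twice as heavily raises the score.
print(place_score(40, 20, 10, 30, A=2.0)) # 140.0
```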
As in the example just given, a dictionary of related terms can be compiled, in this case terms associated with people. In contrast to the place-name dictionary, the person-name dictionary can be global, rather than tied to a particular site. (However, different dictionaries may be appropriate for different countries.)
Such dictionaries can be compiled from a variety of sources, including phone directories, lists of most popular names, and other references in which names appear. The list can start as follows: "Aaron, Abigail, Adam, Addison, Adrian, Aidan, Aiden, Alex, Alexa, Alexander, Alexandra, Alexis, Allison, Alyssa, Amelia, Andrea, Andrew, Angelina, Anna, Anthony, Antonio, Ariana, Arianna, Ashley, Aubrey, Audrey, Austin, Autumn, Ava, Avery..."
First names may be considered alone, or last names may also be considered. (Some names can be either place names or person names. Searching for adjacent first/last names and/or adjacent place names can help distinguish ambiguous cases: for example, Elizabeth Smith is a person; Elizabeth, NJ is a place.)
Personal pronouns can also be included among the dictionary terms (e.g., he, she, him, her, his, hers, we, us, our, me, my, mine, they, them, theirs). Nouns identifying people and personal relationships can also be included (e.g., uncle, sister, daughter, grandfather, boss, student, employee, wedding, etc.).
Adjectives and adverbs generally applied to people can also be included in the person-term dictionary (e.g., happy, bored, blond, etc.), as can nouns for attire and accessories worn by people (e.g., t-shirt, backpack, sunglasses). Person-related verbs can also be used (e.g., surfing, drinking).
Within this final group, as in some of the others, there are certain terms that can also apply to object-centric (rather than person-centric) images. The term "sunglasses" may appear in metadata for an image depicting sunglasses alone; "happy" can appear in metadata for an image depicting a dog. There are also cases where a person-term can also be a place-term (e.g., Boring, Oregon). (In more sophisticated embodiments, the dictionary terms can be associated with respective confidence metrics, whereby any results based on those terms can be discounted, or acknowledged to have different degrees of uncertainty.)
As before, if an image is associated with no person-related metadata, it can be determined that the image is not likely person-centric. Conversely, if all the metadata is person-related, the image is most probably person-centric.
For intermediate cases, metrics such as those reviewed above, based on the number, position, specificity and/or frequency/unusualness of the person-related metadata associated with the image, can be evaluated and combined to yield a score representing the image's likely person-centricity.
Analysis of the metadata provides useful information on whether an image is person-centric, but other techniques can also be employed, either alternatively, or together with metadata analysis.
One technique is to analyze the image for contiguous areas of skin-tone colors. Such features characterize many person-centric images, but are less frequently found in images of places and objects.
A related technique is face recognition. This science has advanced to the point where even inexpensive point-and-shoot digital cameras can quickly and reliably identify faces within an image frame (e.g., in order to focus or expose the image based on such subjects).
(Facial detection techniques are described, for example, in patents 5,781,650 (Central Florida University), 6,633,655 (Sharp), 6,597,801 (Hewlett-Packard) and 6,430,306 (L-1 Corp.), and in Yang et al., Detecting Faces in Images: A Survey, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 24, No. 1, January 2002, pp. 34-58, and Zhao et al., Face Recognition: A Literature Survey, ACM Computing Surveys.)
Face detection algorithms can be applied to the set of reference images obtained from Flickr, to identify those with apparent faces, and to identify the positions of the faces within the image frames.
Naturally, many of the photos incidentally depict faces somewhere within the image frame. While all images with faces could be identified as person-centric, most embodiments employ further processing to provide a more refined assessment.
One such further process determines the percentage of the image frame occupied by the identified face(s). The higher the percentage, the more likely the image is person-centric. This is another metric that can be used in determining the person-centric score of an image.
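The face-area metric is a simple geometric computation. The sketch below assumes faces are reported as bounding boxes and, for simplicity, that the boxes do not overlap; the coordinates are hypothetical.

```python
def face_area_metric(face_boxes, image_w, image_h):
    """Percentage of the image frame occupied by detected face
    bounding boxes (x1, y1, x2, y2), assumed non-overlapping."""
    face_area = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in face_boxes)
    return 100.0 * face_area / (image_w * image_h)

# One 200x300-pixel face in a 1000x750 frame occupies 8% of the image.
pct = face_area_metric([(100, 100, 300, 400)], 1000, 750)
print(pct)  # 8.0
```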
Another form of further processing considers the co-occurrence of (1) one or more faces in an image, and (2) person-descriptors in the metadata associated with the image. In such cases, the face detection data can serve as a "plus" factor, increasing the person-centric score that is based on metadata or other analysis. (For example, a score on a scale of 0-100 can be increased by 10 points, or increased by 10%, or increased by half the remaining distance to 100, etc.)
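The three illustrative "plus" rules just listed can be sketched directly; which rule (if any) a real system would use is left open by the text.

```python
def apply_plus_factor(score, mode="half_to_100"):
    """Boost a 0-100 person-centricity score when a face is detected,
    using one of the illustrative rules from the text."""
    if mode == "add_10":
        boosted = score + 10
    elif mode == "grow_10pct":
        boosted = score * 1.10
    else:  # move the score half the remaining distance to 100
        boosted = score + (100 - score) / 2
    return min(boosted, 100.0)

print(apply_plus_factor(60))             # 80.0
print(apply_plus_factor(60, "add_10"))   # 70
```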
Thus, for example, a photograph tagged with "Elizabeth" metadata is more likely a person-centric image if a face recognition algorithm finds a face in the image than if no face is found.
(Conversely, the absence of any face in an image can serve as a "plus" factor increasing confidence that the image subject is of a different type, e.g., a place or thing. Thus, an "Elizabeth"-tagged image lacking a face increases the likelihood that the image relates to a place named Elizabeth, or a thing named Elizabeth, e.g., a pet.)
Confidence may be further increased when multiple determinations corroborate, e.g., if a face recognition algorithm judges a face to be female, and the metadata includes a female name. Of course, such an arrangement requires that the term dictionary (or another data structure) contain data that at least associates names with genders.
Further checks can also be made. For example, the age of the depicted person(s) can be estimated using automated techniques (e.g., as described in U.S. Patent 5,781,650, Central Florida University). Names found in the image metadata can likewise be processed to estimate the age of the person(s) so named. This can be done using public-domain information about the statistical distribution of a name as a function of age, e.g., from public Social Security Administration data, and from web sites detailing the most popular names in birth records over the years. Thus, the names Mildred and Gertrude may be associated with an age distribution peaking at age 80, while Madison and Alexis may be associated with an age distribution peaking at age 8. If a statistically probable correspondence between a metadata name and the estimated age of a depicted person is found, the person-centric score for the image can be further increased; a statistically improbable correspondence can be used to reduce the score. (Estimated information about the age of the subject in the consumer's image, like information about the gender of the subject, can also be used in tailoring the intuited response(s).)
Just as the detection of a face in an image can serve as a "plus" factor for a metadata-based score, the presence of person-related metadata can serve as a "plus" factor for a score based on face detection.
Of course, if no face is found in an image, this information can be used to moderate the person-centric score for the image (perhaps reducing it to zero).
Object-centric processing
An object-centric image is the third type of image that may be found in the set of images obtained from Flickr in this example. A variety of techniques can be used to determine an object-centric score for an image.
One technique relies on metadata analysis, using the same principles described above. A dictionary of noun terms can be compiled (from bulk Flickr metadata, or from some other corpus, e.g., WordNet), ranked by frequency of occurrence. Place nouns and nouns associated with people can be removed from the dictionary. The resulting dictionary can be used, in the ways identified above, to analyze each image's metadata and yield a score.
Another approach uses pattern matching, comparing each image against a library of known object-related images to identify object-centric images.
Another approach builds on the previously determined person-centric and place-centric scores. The object-centric score can be assigned in inverse relationship to the other two (i.e., if an image scores low for person-centricity and low for place-centricity, a high score is assigned for object-centricity).
These techniques can be combined, or used separately. In any event, a score is generated for each image, tending to indicate whether the image is more or less likely to be object-centric.
Other processing of the sample set of images
The techniques described above can generate, for each image in the set, three scores representing the likelihood that the image is (1) person-centric, (2) place-centric, or (3) object-centric. These scores need not sum to 100% (though they may). Occasionally an image may score high in two or more categories; in such cases the image may be regarded as depicting multiple subjects, e.g., both a person and an object.
The set of images downloaded from Flickr could then be divided into three groups, e.g., A, B and C, depending on whether each is identified primarily as person-centric, place-centric or object-centric. However, since some images present mixed probabilities (e.g., an image may have some place-centric indicators and some person-centric indicators), identifying each image exclusively with a single category ignores useful information. For the set of images, it is thus preferable to compute weighted scores, taking into account each image's respective scores in the three categories.
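The weighted scoring across the set might be sketched as below; the per-image scores are hypothetical, and the normalization (so the three fractions sum to 1) is an assumption for illustration.

```python
def category_distribution(image_scores):
    """Average per-image (person, place, thing) scores across a
    reference set, normalized so the three fractions sum to 1."""
    n = len(image_scores)
    totals = [sum(s[i] for s in image_scores) / n for i in range(3)]
    denom = sum(totals)
    return [t / denom for t in totals]

# Hypothetical per-image scores (person, place, thing), each 0-100.
scores = [(10, 80, 10), (30, 60, 10), (35, 40, 25)]
person, place, thing = category_distribution(scores)
print(person, place, thing)  # 0.25 0.6 0.15
```

With these illustrative numbers the set comes out 25% person-centric, 60% place-centric and 15% object-centric, matching the example fractions in the text.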
The sample of images from Flickr, all taken near Rockefeller Center, might thereby suggest that 60% of the subject matter is place-centric, 25% person-centric, and 15% object-centric.
This information provides preliminary insight into the traveler's cell phone image (beyond its geocoding), even without any reference to the content of that image itself. That is, chances are good that the image is place-centric; it is less likely person-centric; and least likely object-centric. (This ranking can be used to determine the order of subsequent stages of processing, allowing the system to provide the most probably appropriate responses more quickly.)
This type-assessment of the cell phone photograph can be used, alone, to help determine the automated operations provided to the traveler in response to the image. However, further processing can better assess the content of the image, allowing more specifically tailored operations to be intuited.
Similarity evaluations and metadata weighting
Within the set of co-located images collected from Flickr, the place-centric images will tend to have a different appearance than the person-centric or object-centric images, while sharing some similarity within the place-centric group. For example, place-centric images may be characterized by straight lines (e.g., architectural edges). Or by repetitive patterns (windows). Or by large areas of similar color and uniform texture near the top of the image (sky).
The person-centric images will likewise have attributes different from the other two classes, but common within the person-centric class. For example, person-centric images will typically include faces, characterized by an elliptical shape containing two eyes and a nose, areas of skin tones, and the like.
Although the object-centric images may be the most varied class, images from any given geography may tend toward unifying attributes or features. Geocoded photographs from a horse track will depict horses with some frequency; geocoded photographs from Independence National Historical Park in Philadelphia will regularly depict the Liberty Bell.
By determining whether the cell phone image is more similar to the place-centric, the person-centric, or the object-centric images in the set of Flickr images, more confidence in the subject of the cell phone image can be achieved (and, moreover, a more accurate response can be intuited and provided to the consumer).
A fixed set of image assessment criteria could be applied to distinguish images in these categories. Preferably, however, the present embodiment determines such criteria adaptively. In particular, the embodiment examines the set of images to determine which image features/attributes/metrics most reliably (1) group same-categorized images together (similarity), and (2) distinguish differently-categorized images from one another (difference). Among the attributes that can be measured and checked for such similarity/difference behavior within the set of images are: dominant color; color diversity; color histogram; dominant texture; texture diversity; texture histogram; edginess; wavelet-domain transform coefficient histograms and dominant wavelet coefficients; frequency-domain transform coefficient histograms and dominant frequency coefficients (which may be computed on different color channels); eigenvalues; keypoint descriptors; geometric class probabilities; symmetry; percentage of image area identified as faces; image autocorrelation; low-dimensional "gists" of the image; etc. (Combinations of such metrics are generally more reliable than individual attributes.)
One way to determine which metrics are most salient for these purposes is to compute a variety of different image metrics on the reference images. If, for one particular metric, the results within a category of images cluster (e.g., the color histograms of place-centric images cluster near particular output values), while images in the other categories have few results near those output values, then that metric appears well suited for use as an image assessment criterion. (Clustering is commonly performed using an implementation of the k-means algorithm.)
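A crude gauge of a metric's suitability can be sketched without full k-means: compare the spread between category means to the spread within each category. This separability ratio is an illustrative stand-in, not the patent's method, and the "edginess" values below are hypothetical.

```python
def metric_separability(values_by_category):
    """Ratio of between-category spread (range of category means) to
    the average within-category spread (mean absolute deviation).
    Higher values suggest a metric better suited as a criterion."""
    means = {c: sum(v) / len(v) for c, v in values_by_category.items()}
    within = sum(
        sum(abs(x - means[c]) for x in v) / len(v)
        for c, v in values_by_category.items()
    ) / len(values_by_category)
    m = list(means.values())
    between = max(m) - min(m)
    return between / within if within else float("inf")

# Hypothetical "edginess" scores: high and tight for place images,
# low and tight for person images -- a usable criterion.
vals = {"place": [45, 50, 55], "person": [5, 10, 15]}
print(metric_separability(vals) > 1)  # True
```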
In the set of images from Rockefeller Center, the analysis may determine that an edginess score above 40 is reliably associated with images scored as highly place-centric; that a face-area percentage above 15% is reliably associated with images scored as highly person-centric; and that a color histogram with a local peak in gold tones, together with frequency content for yellow peaking at low image frequencies, is somewhat associated with images scored as object-centric.
The analysis techniques found most useful in grouping/distinguishing the different categories of images can then be applied to the user's cell phone image. The results can then be analyzed, from a proximity/distance-measure perspective (e.g., in a multidimensional space), against the characteristic features associated with images of the different categories. (This is the first processing of the cell phone image in this particular embodiment.)
Using these techniques, the cell phone image may score (on a scale of 0-100) 60 for object-centricity, 15 for place-centricity, and 0 for person-centricity. This is a second, better set of scores by which the cell phone image can be classified (the first being the statistical distribution of the co-located pictures found on Flickr).
The similarity of the user's cell phone image to the individual images in the reference set can then be assessed. The similarity metrics identified earlier may be used, or different measures can be applied. The time or processing budget devoted to this task can be apportioned across the three image categories based on the scores just determined. For example, the process may not spend time judging similarity with reference images classified as 100% person-centric, but instead concentrate on similarity with reference images classified as object- or place-centric (with more effort applied to the former than the latter, e.g., four times as much). Similarity scores are thus generated for most of the images in the reference set (excepting, perhaps, those classified as 100% person-centric).
Consideration then returns to the metadata. The metadata from the reference images is re-aggregated, this time weighted according to each image's similarity to the cell phone image. (The weighting may be linear or exponential.) Since metadata from similar images is weighted more heavily than metadata from dissimilar images, the resulting set of metadata is tailored to more likely correspond to the cell phone image.
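The similarity-weighted re-aggregation can be sketched as follows, here with a linear weighting; the similarity values and tags are hypothetical.

```python
from collections import defaultdict

def similarity_weighted_metadata(reference_images):
    """Re-aggregate reference-set metadata, weighting each image's
    terms by that image's similarity to the query (cell phone) image."""
    weights = defaultdict(float)
    for similarity, terms in reference_images:
        for term in terms:
            weights[term] += similarity
    return sorted(weights.items(), key=lambda kv: -kv[1])

# Hypothetical (similarity, metadata) pairs from the reference set.
refs = [
    (0.9, ["rockefeller center", "prometheus"]),
    (0.5, ["rockefeller center", "skating rink"]),
    (0.1, ["new york"]),
]
print(similarity_weighted_metadata(refs)[0])
```

A term carried by several highly similar images ("rockefeller center") outranks a term carried only by a dissimilar one ("new york"), which is the tailoring effect described above.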
From the resulting set, the top N (e.g., three) metadata descriptors may be used. Or all descriptors whose weights together comprise M% of the total weighted metadata set can be used.
In the example given, the metadata thus identified may comprise "Rockefeller Center", "Prometheus", and "Skating Rink", with respective scores of 19, 12, and 5 (see "B" in the figure).
Using this weighted set of metadata, the system could begin determining which responses are most appropriate for the consumer. In the exemplary embodiment, however, the system continues by further refining its assessment of the cell phone image. (The system can also begin determining appropriate responses while this further processing is undertaken.)
Processing the second set of reference images
At this point, the system is better informed about the cell phone image: not only its location, but also its probable type (object-centric) and its most likely relevant metadata. This metadata can be used to obtain a second set of reference images from Flickr.
In the exemplary embodiment, Flickr is queried for images having the identified metadata. The query may be geographically limited to the location of the cell phone, or a wider (or unlimited) geography may be searched. (Or the query can be run both ways, so that half the images are co-located with the cell phone image, and the rest come from remote locations, etc.)
The search may first seek images tagged with all of the identified metadata. In this case, perhaps 60 such images are found. If more images are desired, Flickr can be searched for the metadata terms in different pairings, or individually. (In these latter cases, the distribution of the selected images may be chosen so that the metadata occurrence in the results corresponds to the respective scores of the different metadata terms, e.g., 19/12/5.)
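Apportioning the follow-up queries according to the metadata scores might be sketched as below; the term scores come from the running example, while the total image count is an arbitrary illustrative choice.

```python
def allocate_queries(term_scores, total_images):
    """Apportion a desired number of result images across metadata
    terms in proportion to their respective scores."""
    total = sum(term_scores.values())
    return {t: round(total_images * s / total)
            for t, s in term_scores.items()}

# Scores 19/12/5 from the running example, spread over 72 images.
alloc = allocate_queries({"Rockefeller Center": 19,
                          "Prometheus": 12,
                          "Skating Rink": 5}, 72)
print(alloc)
```

(Rounding can make the allocations sum to slightly more or less than the requested total; a production sketch would reconcile the remainder.)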
The metadata from this second set of images can be collected, clustered, and ranked ("C" in Figure 46B). Metadata relating to the camera can be ignored (e.g., "Nikon", "D80", "HDR", "black and white", etc.). Month names can also be removed.
The analysis performed initially, categorizing each image in the first set as person-centric, place-centric or object-centric, is repeated for the images in the second set. Appropriate image metrics for determining similarity/difference within and between the classes of this second set can be identified (or the earlier measures can be reused). These measures are then applied, as before, to generate refined person-centric, place-centric and object-centric scores for the user's cell phone image. By reference to the second set of images, the cell phone image may score 65 for object-centricity, 12 for place-centricity, and 0 for person-centricity. (These scores can, if desired, be combined with the scores determined earlier, e.g., by averaging.)
As before, the similarity between the user's cell phone image and each image in the second set can be determined. The metadata from each image can then be weighted according to the corresponding similarity measure. The results can be combined to yield a set of metadata weighted in accordance with image similarity.
Some of the metadata, often including some highly ranked terms, will be of relatively low value in determining image-appropriate responses to provide to the consumer; "New York" and "Manhattan" are examples. Generally speaking, relatively uncommon metadata descriptors are the most useful.
A measurement of "unusualness" can be made by determining the frequency of different metadata terms within an associated corpus, such as Flickr image tags (either globally, or within a geolocated region), or the image tags of the photographers whose images are represented, or words in an index of a text corpus. The terms in the weighted metadata list can then be further weighted (i.e., a second weighting) in accordance with their unusualness.
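This second weighting resembles the inverse-document-frequency idea from information retrieval. A minimal sketch, with hypothetical tag counts, might look like this (the text does not specify the exact weighting function, so the logarithmic form is an assumption):

```python
import math

def unusualness_weight(term_count, corpus_size):
    """IDF-style weight: terms that are rarer in the reference corpus
    receive a heavier weight."""
    return math.log(corpus_size / (1 + term_count))

# Hypothetical tag counts in a corpus of 1,000,000 Flickr tags.
common = unusualness_weight(99_999, 1_000_000)   # e.g., "new york"
rare = unusualness_weight(999, 1_000_000)        # e.g., "prometheus"
print(rare > common)  # True
```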
This successive processing can yield the list of metadata shown at "D" in Figure 46B (each term shown with its respective score). This information (optionally together with tags indicating the person/place/thing determination) allows responses to the consumer to be well correlated with the cell phone photo.
It will be recognized that this set of inferred metadata for the user's cell phone photo was compiled entirely by automated processing of other images obtained from public sources such as Flickr, together with other public resources (e.g., listings of names and places). The inferred metadata can naturally be associated with the user's image. More important for the present application, however, it can help a service provider decide how best to respond to the presentation of the user's image.
Determining Appropriate Responses to the Consumer
Referring to Figure 50, the system just described may be viewed as one particular application of an "image juicer" that receives image data from a user and applies different types of processing to extract or infer information that may be associated with the image.
As information is identified, it can be forwarded by a router to different service providers. These providers may specialize in handling different types of information (e.g., semantic descriptors, image texture data, keypoint descriptors, eigenvalues, color histograms, etc.), or different classes of images (e.g., a photo of a person, a photo of a soda can, and the like). Outputs from these service providers are sent to one or more devices (e.g., the user's cell phone) for presentation or later reference. The present discussion now considers how these service providers determine which responses may be appropriate for a given set of input information.
One approach is to establish a taxonomy of image subjects and corresponding responses. A tree structure can be used, with an image first classified into one of a few high-level groupings (e.g., person/place/thing), and each group then divided into different subgroups. In use, the image is assessed down different branches of the tree until the limits of the available information allow no further progress. Actions associated with the terminal leaf or node of the tree are then taken.
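The descend-until-information-runs-out traversal might be sketched as below. The tree contents, field names, and selection functions are all hypothetical; they merely mirror the person/place/thing example with a food-item subgroup discussed next.

```python
def classify(image_info, node):
    """Walk a response taxonomy: descend while the available
    information selects a known branch, then return the actions
    of the node where progress stops."""
    while True:
        branch = node.get("select", lambda info: None)(image_info)
        if branch is None or branch not in node.get("children", {}):
            return node["actions"]  # limits of available information
        node = node["children"][branch]

# Hypothetical two-level tree: person/place/thing, then food items.
tree = {
    "select": lambda info: info.get("category"),
    "actions": ["generic search"],
    "children": {
        "thing": {
            "select": lambda info: "food" if info.get("food") else None,
            "actions": ["object info"],
            "children": {
                "food": {"actions": ["buy online", "nutrition",
                                     "nearby shops"]},
            },
        },
    },
}
print(classify({"category": "thing", "food": True}, tree))
```

An image known only to be a "thing" stops one level up and receives the generic "object info" action; richer information descends further.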
A portion of a simple tree structure is shown in the figure. (Each node spawns three branches, but this is for illustration only; more or fewer branches can of course be used.)
If the subject of the image is inferred to be a food item (e.g., if the image is associated with food-related metadata), then information for three different screens may be cached on the user's phone. One initiates an online purchase of the depicted item from an online vendor (vendor selection and payment/shipping details can be obtained from user profile data). A second screen shows nutritional information about the product. A third screen presents a map of nearby shops selling the depicted product. The user switches between these responses using the roller wheel 124 on the side of the phone (Figure 44).
If the subject is inferred to be a photograph of a family member or friend, one screen presented to the user provides the option of posting a copy of the photo to the user's Facebook page, annotated with the likely name(s) of the person(s) depicted. (Determining the names of the persons depicted in the photo can be done by submitting the photo to the user's account at Picasa. Picasa performs facial recognition operations on submitted user images and correlates facial eigenvectors with the individual names provided by the user, thereby compiling a user-specific database of facial recognition information for friends and others depicted in the user's prior images.) Other screens include one that initiates a text message to the depicted individual, whose address information may be retrieved from the user's address book, indexed by the Picasa-determined identity. The user can pursue any or all of the presented options by switching between the associated screens.
If the subject appears to be a stranger (e.g., not recognized by Picasa), the system may initially attempt recognition of the person using publicly available facial recognition information. (Such information can be extracted from photos of known persons; VideoSurf is one vendor with a database of facial recognition features for actors and others. L-1 Identity Solutions maintains databases of driver's license photos and associated data that may be similarly exploited.) The screen(s) presented to the user can display reference photos of the matched persons (together with match scores), along with related information compiled from the web and other databases. Another screen gives the user the option of sending a "friend" invitation to the recognized person on MySpace, or on another social networking site where the recognized person is found. Still another screen details the degrees of separation between the user and the recognized person. (For example, my brother David has a classmate Steve, who is the son of the person depicted.) Such relationships can be determined from relevant information published on social networking sites.
Of course, while each of the options discussed for different sub-groups of image objects may satisfy most user needs, some users may want others. Thus, at least one alternative response to each image may be unlimited - for example allowing the user to navigate different information, or allowing the user to specify the desired response - using the image / metadata processed information Anything that is available is available.
One such unconstrained approach is to present the known double-weighted metadata (e.g., "D" in FIG. 46B) to the general search engine. Google essentially does not need to be the best for this feature, because current Google searches require all search terms to be found in the results. Search engines that perform fuzzy searches and respond to differently-weighted keywords without having to find everything are better. The results may show a different relevance depending on where the keywords are found, where they are found, and so on. (Results containing "Prometheus" but lacking "RCA Building" include the latter, but are ranked more relevant than results lacking electrons.)
The results from such a search may be clustered by other concepts. For example, some results may be clustered because they share the theme "Art Deco". Others can be clustered because they deal with RCA and GE's collaborative history. Others can be clustered because they relate the works of architect Raymond Hood. Others may be clustered because they are related to 20th century American sculptures or Paul Mann. Other concepts discovered to create individual clusters may include John Rockefeller, Mitsubishi Group, Columbia University, Radio City Music Hall, Rainbow Room Restaurant, and the like.
The information from these clusters can be provided to the user on subsequent UI screens, e.g., after the prescribed information/actions have been presented. The ordering of these screens can be determined by keyword-based relevance, or by the sizes of the information clusters.
Another approach is to present the user with a Google search screen - pre-populated with the twice-weighted metadata as search terms. The user can then delete terms not relevant to his or her particular interest, and add other terms, to expedite a web search leading to the action or information the user desires.
In some embodiments, the system response may depend on the user having a "friend" relationship in a social network, or some other indicator of trust. For example, if little is known about user Ted, but an abundant set of information is available about Ted's friend Alice, that richer set of information can be employed in determining how to respond to Ted in connection with a given content stimulus.
Similarly, if user Ted is a friend of user Alice, and user Bob is also a friend of Alice, information relating to Bob can be used in determining an appropriate response to Ted.
The same principles can be applied even when Ted and Alice are strangers, provided there is some other basis for implicit trust. Basic profile similarity is one possible basis; shared possession of a rare attribute (or, better, several) is a stronger one. Thus, for example, if Ted and Alice both share the characteristics of being enthusiastic supporters of Dennis Kucinich for President, and of being ginger pickle enthusiasts, information relating to one can be used in determining an appropriate response to be provided to the other.
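The idea that shared rare attributes imply more trust than shared common ones can be sketched by weighting each shared trait by its rarity. The trait names and prevalence figures below are invented for illustration:

```python
import math

# Hypothetical fraction of the population exhibiting each trait.
PREVALENCE = {
    "likes_music": 0.9,                 # very common: weak signal
    "kucinich_supporter": 0.01,         # rare: strong signal
    "ginger_pickle_enthusiast": 0.02,   # rare: strong signal
}

def affinity(profile_a, profile_b):
    """Sum -log(prevalence) over shared traits: rarer traits count more."""
    shared = profile_a & profile_b
    return sum(-math.log(PREVALENCE[t]) for t in shared)

ted = {"likes_music", "kucinich_supporter", "ginger_pickle_enthusiast"}
alice = {"kucinich_supporter", "ginger_pickle_enthusiast"}
bob = {"likes_music"}
```

Ted and Alice, sharing two rare traits, score a far higher affinity than Ted and Bob, who share only a common one - so Alice's information would be weighted more heavily in tailoring responses for Ted.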
The arrangements just described provide powerful new functionality. However, the "intuiting" of responses that the user may desire relies heavily on system designers, who must consider the different types of images likely to be encountered and dictate the responses (or selections of responses) they believe will best satisfy the user's probable wishes.
In this regard, the above-described arrangements are akin to early indexes of the web - such as that of Yahoo! - which relied on human-generated taxonomies of the information people might search for, and on teams that manually located web resources capable of satisfying different search requests.
In the end, the web overwhelmed those manual efforts at organization. The founders of Google were among those who realized that an untapped richness of information about the web could be gained by examining the links between pages, and users' behaviors in navigating those links. Understanding of the system thus came from data within the system, rather than from an external perspective.
In the same way, manually crafted trees of image classifications/responses may come to be seen as an early stage in the development of image-response technologies. Ultimately, such schemes will be eclipsed by arrangements that rely on machine understanding derived from the system itself and its use.
One such technique simply examines which response screen(s) users select in particular contexts. As these usage patterns become evident, the most popular responses can be moved earlier in the sequence of screens presented to the user.
Likewise, as patterns become apparent in the use of the open-ended search query option, such an operation can become a standard response and be moved higher in the presentation queue.
Usage patterns can be sliced along various demographic dimensions. After capturing a snapshot of a sculpture by a 20th-century sculptor, men in New York aged 40 to 60 may demonstrate interest in different responses than women in Beijing aged 13 to 16. Most people snapping a photo of a food processor in the weeks before Christmas may be interested in finding the lowest-priced online vendor of the product; most people snapping photos of the same object in the week after Christmas may be interested in listing it for sale on eBay or Craigslist. Desirably, usage patterns are tracked against as many demographic and other descriptors as possible, so as to best predict user behavior.
More sophisticated techniques can also be applied, drawing from the abundant sources of expressly available and inferentially linked data now accessible. These include not only web and personal profile information, but also cell phone billing records, credit card statements, shopping data from Amazon and eBay, Google search histories, browsing histories, cached web pages, cookies, email archives, phone message archives from Google Voice, travel reservations on Expedia and Orbitz, music collections on iTunes, cable television subscriptions, Netflix movie selections, GPS tracking information, social network data and activities, activities and postings on photo sites such as Flickr and Picasa and on video sites such as YouTube, the dates and times of all these records (our "digital life log"), and other digital data. Moreover, such information is potentially available not just for the user, but also for the user's friends and family, for others having demographic similarities with the user, and ultimately for everyone else (with suitable anonymization and/or privacy safeguards).
The network of correlations among these data sources is smaller than the network of web links analyzed by Google, but is perhaps richer in the diversity and types of its links. From it can be mined a wealth of inferences and insights that can help inform what a particular user might want to do with a particular snapped image.
Artificial intelligence techniques can be applied to this data-mining task. One class of such technologies is natural language processing (NLP) - a science that has advanced dramatically in recent years.
An example is the Semantic Map compiled by Cognition Technologies, Inc. - a database that can be used to analyze words in context, so as to distinguish among their possible meanings. Such a capability can be used, for example, to resolve homonym ambiguities in the analysis of image metadata (e.g., whether "bow" refers to part of a ship, a ribbon decoration, a performer's gesture, or archery equipment; the proximity of terms such as "Carnival cruise," "satin," "Carnegie Hall" or "hunting" can indicate the likely answer). Patent 5,794,050 (FRCD Corp.) details the underlying technologies.
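The proximity-based homonym resolution just described might be sketched as follows. The sense inventory and context associations here are invented for illustration; a deployed system would draw on a compiled semantic database such as the one referenced above:

```python
# Toy sense inventory: each candidate sense of "bow" is associated
# with context terms suggesting that sense (all entries illustrative).
SENSES = {
    "bow": {
        "ship":    {"cruise", "deck", "stern", "carnival"},
        "ribbon":  {"satin", "gift", "dress"},
        "gesture": {"stage", "applause", "carnegie"},
        "archery": {"arrow", "hunting", "quiver"},
    }
}

def disambiguate(word, context_terms):
    """Pick the sense whose associated terms best overlap the metadata."""
    ctx = {t.lower() for t in context_terms}
    scores = {sense: len(assoc & ctx)
              for sense, assoc in SENSES[word].items()}
    return max(scores, key=scores.get)
```

Given metadata including "Carnival" and "cruise," the nautical sense wins; given "hunting" and "arrow," the archery sense wins.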
An understanding of semantics obtained through NLP techniques can be used to enrich image metadata with other associated descriptors - which can serve as additional metadata in the embodiments described herein. For example, a close-up image tagged with the descriptor "hibiscus stamen" can be further tagged with the term "flower" through NLP techniques. (As of this writing, Flickr has 460 images tagged with "hibiscus" and "stamen," but without "flower.")
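Such enrichment amounts to following "broader term" (hypernym) links. A minimal sketch follows; the tiny hypernym table is a stand-in for a resource such as WordNet, and its entries are illustrative:

```python
# Toy hypernym table: specific term -> broader term (illustrative).
HYPERNYMS = {
    "hibiscus": "flower",
    "stamen":   "flower",
    "flower":   "plant",
}

def enrich(tags):
    """Return the tags plus every broader term reachable via hypernyms."""
    result = set(tags)
    frontier = list(tags)
    while frontier:
        term = frontier.pop()
        broader = HYPERNYMS.get(term)
        if broader and broader not in result:
            result.add(broader)
            frontier.append(broader)   # follow the chain transitively
    return result
```

An image tagged "hibiscus"/"stamen" thus also becomes findable under "flower" (and "plant").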
Patent 7,383,169 (Microsoft) details how dictionaries and other large works of language can be processed by NLP techniques to compile a lexical knowledge base serving as a source of such "common sense" information about the world. This common-sense knowledge can be applied in the metadata processing described herein. (Wikipedia is another reference source that can serve as the basis for such a knowledge base. Our digital life logs are yet another - yielding insights unique to each of us.)
Applied to our digital life logs, NLP techniques can be used to understand the subtleties of our historical interests and actions - information from which our present interests and forthcoming actions can be modeled (predicted). Such understanding can then be used to dynamically decide what information should be provided, or what action should be undertaken, responsive to a particular user capturing a particular image (or other stimulus). Then we will truly have arrived at intuitive computing.
Although the image/metadata processing described above takes many words to describe, it need not take much time to execute. Indeed, much of the processing of reference data, compilation of glossaries, and the like can be done off-line - before any input image is presented to the system. Flickr, Yahoo!, or another service provider can periodically pre-process reference sets of data for various locales, so that responses to image queries can be compiled quickly when needed.
In some embodiments, other processing activities can be launched in parallel with those described above. For example, if initial processing of the first set of reference images suggests that the snapshot is location-centric, the system can request possibly useful information from other resources before processing of the user's image concludes. To illustrate, the system can immediately request a street map of the surrounding area, together with a satellite view, a street view, a mass transit map, and the like. Likewise, pages of information about nearby restaurants can be compiled, along with other pages detailing nearby movies and show times, and another page with the local weather forecast. All of these can be sent to the user's phone and cached for possible later display (e.g., by scrolling a thumbwheel on the side of the phone).
Such operations can likewise be undertaken before any image processing occurs - based simply on the geocode data accompanying the cell phone image.
Although the particularly described arrangement used geocode data accompanying the cell phone image, this is not necessary. Other embodiments can select sets of reference images based on other criteria, such as image similarity. (This can be determined by various metrics, as described above and below. Known image classification techniques can also be used to determine one of several classes into which the input image falls, and similarly classed images can then be retrieved.) Another criterion is the IP address from which the input image is uploaded; other images uploaded from the same - or a geographically proximate - IP address can be sampled to form the reference sets.
Even in the absence of geocode data for the input image, reference sets of images can still be compiled based on location, since location information for the input image can be inferred through various indirect techniques. The wireless service provider through which a cell phone image is relayed can identify the particular cell tower from which the transmission was received. (If the transmission originated via another wireless link, such as WiFi, the location may likewise be known.) The user may have used a credit card an hour earlier at a Manhattan hotel, allowing the system (with appropriate privacy safeguards) to infer that the photo was taken somewhere near Manhattan. Sometimes the features depicted in an image are iconic, so that a quick search for visually similar images on Flickr can locate the user (e.g., at the Eiffel Tower, or at the Statue of Liberty).
GeoPlanet was cited as one source of geographic information, but a number of other geographic information databases can alternatively be used. One is GeoNames-dot-org. (The "-dot-" convention and the omission of the usual http preamble are used here to prevent this text, when reproduced by the Patent Office, from being rendered as a live hyperlink.) In addition to providing place names for a given latitude/longitude, and parent, child, and sibling information for geographic divisions, GeoNames' free data (available as web services) also supports functions such as finding the nearest intersection, finding the nearest post office, and finding the surface elevation. Another option is Google's geo search API, which allows interaction with, and retrieval of data from, Google Earth and Google Maps.
Archives of aerial imagery are growing exponentially. While part of this imagery comprises straight-down views, an increasing portion is taken obliquely, off-axis. From two or more different oblique views of a location, a 3D model can be created. As the resolution of such imagery increases, a sufficiently rich set of data becomes available - for some locations - that views of a scene can be synthesized as if taken from ground level. These views can be matched with street-level photos, and metadata from one can augment metadata for the other.
As shown in FIG. 47, the embodiments particularly described above made use of various resources, including Flickr, a database of person names, a word frequency database, and the like. Countless other sources of information can be employed in such arrangements. Other social networking sites, shopping sites (e.g., Amazon, eBay), weather and traffic sites, online thesauri, caches of recently visited web pages, browsing histories, cookie collections, digital life logs (as described elsewhere herein), and so on, can all provide a wealth of additional information applicable to the tasks at hand. Some of this data reveals information about user interests, habits, and preferences - data that can be used both to better infer the content of a snapshot and to better tailor the intuited response(s).
Likewise, while FIG. 47 shows several lines interconnecting the different items, these are illustrative only; different interconnections can naturally be employed.
The arrangements detailed in this specification are but a few of the myriad that may be employed. Most embodiments will differ from those described above: some operations will be omitted, some will be performed in different orders, some will be performed in parallel rather than serially (and vice versa), additional operations may be included, and so on.
One additional operation is to refine the just-described process by soliciting user input, e.g., after processing the first set of Flickr images. For example, the system may have identified "Rockefeller Center," "Prometheus," and "skating rink" as metadata relating to the user's snapshot. The system can query the user as to which of these terms is most (or least) relevant to his or her particular interest. Subsequent processing (e.g., further searches) can then be focused accordingly.
Within an image presented on a touch screen, the user can touch a region to indicate an object of particular relevance within the image frame. Image analysis and subsequent operations can then focus on the identified object.
Some of the database searches can be iterative/recursive. For example, results from one database search can be combined with the original search inputs and used as inputs to a further search.
It will be recognized that much of the processing described above is fuzzy in its bounds. Much of the data has no absolute meaning, but is significant only in relation to other data - metrics that are relevant only relative to ranges of other metrics. Many such probabilistic factors can be assessed and then combined - a statistical stew. The skilled artisan will recognize that the particular implementation suitable for a given situation may be largely arbitrary. With experience and feedback, however, more informed ways of weighting and combining the different factors can be identified and eventually employed.
If the Flickr archive is large enough, the first set of images in the above-described arrangement can be selected so as to be more likely similar to the subject image. For example, Flickr can be searched for images taken at about the same time of day; lighting conditions will then be roughly similar (avoiding, e.g., matching night scenes to daytime scenes), and shadow/shading conditions will be similar. Likewise, Flickr can be searched for images taken in the same season/month; problems such as the seasonal presence of the ice skating rink at Rockefeller Center, or of snow in winter landscapes, can thereby be mitigated. Similarly, if the camera/phone is equipped with a magnetometer, inertial sensor, or other technology permitting its bearing (and/or azimuth/elevation) to be determined, Flickr can be searched for shots similar in these respects as well.
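A minimal sketch of such condition-matched selection follows. The record structure and tolerances are hypothetical; note that both hour-of-day and month comparisons must wrap around (11 p.m. is near 1 a.m.; December is near January):

```python
from datetime import datetime

def similar_conditions(query, candidate, hour_tolerance=2, month_tolerance=1):
    """True if capture hour and month are close, modulo wrap-around."""
    dh = abs(query["taken"].hour - candidate["taken"].hour)
    dh = min(dh, 24 - dh)                  # 23:00 is near 01:00
    dm = abs(query["taken"].month - candidate["taken"].month)
    dm = min(dm, 12 - dm)                  # December is near January
    return dh <= hour_tolerance and dm <= month_tolerance

# Illustrative records: a late-night December query image, plus a
# similar night/winter candidate and a dissimilar midday candidate.
query = {"taken": datetime(2008, 12, 20, 23, 0)}
night_scene = {"taken": datetime(2006, 1, 3, 0, 30)}
day_scene = {"taken": datetime(2007, 12, 28, 12, 0)}
```

The January night shot passes the filter; the midday shot is excluded despite matching the month.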
Moreover, it is desirable that the sets of reference images collected from Flickr include images from many different sources (photographers), so that they do not all tend to use the same metadata descriptors.
The images collected from Flickr can be screened for suitable metadata. For example, images without any metadata (excluding, perhaps, a significant fraction of images) can be removed from the reference set(s). Likewise, images having fewer than two (or twenty) metadata terms, or lacking a narrative description, can be disregarded.
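This screening step can be sketched as a simple filter. The record format and threshold are illustrative:

```python
def screen(images, min_terms=2):
    """Keep only images carrying at least min_terms metadata terms."""
    return [img for img in images
            if len(img.get("tags", [])) >= min_terms]

# Illustrative candidates: one well-tagged, one sparsely tagged,
# one with no metadata at all.
candidates = [
    {"id": 1, "tags": ["rockefeller", "prometheus", "statue"]},
    {"id": 2, "tags": ["misc"]},
    {"id": 3},
]
```

With `min_terms=2`, only the first candidate survives the screen.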
Flickr is often referenced in this specification, but other collections of content can of course be used. Images on Flickr generally have license rights designated for each image. These include "All Rights Reserved," as well as various Creative Commons licenses by which the public can make use of the images on different terms. The systems described herein can limit their searches of Flickr to images meeting specified license criteria (e.g., disregarding images marked "All Rights Reserved").
Other image collections are preferable in certain respects. For example, the database at images.google-dot-com appears better than Flickr at ranking images based on metadata relevance.
Flickr and Google maintain publicly accessible image archives; many other image archives are private. The present technology finds application with both - including hybrid contexts in which both public and proprietary image collections are used. (For example, Flickr may be used to find images and metadata based on a user's image, and a resulting Flickr image may then be submitted to a private database to find a match and determine a corresponding response for the user.)
Similarly, while reference has been made to services such as Flickr as providers of data (e.g., images and metadata), other sources can of course be used.
One alternative source is an ad hoc peer-to-peer (P2P) network. In one such P2P arrangement, there may optionally be a central index with which peers can communicate when searching for desired content, and which details the content that the peers make available for sharing. The index may contain metadata and metrics for images, together with pointers to the nodes where the images themselves are stored.
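A minimal sketch of such a central index follows - metadata is indexed together with pointers to the peers holding the images themselves. The identifiers and structure are invented for illustration:

```python
class CentralIndex:
    """Toy P2P index: metadata plus pointers to the hosting peers."""

    def __init__(self):
        self.entries = []   # list of (image_id, node, metadata_set)

    def register(self, image_id, node, metadata):
        """A peer advertises an image it is willing to share."""
        self.entries.append((image_id, node, set(metadata)))

    def locate(self, query_terms):
        """Return (image_id, node) pairs whose metadata match a term."""
        q = set(query_terms)
        return [(img, node) for img, node, meta in self.entries
                if meta & q]

index = CentralIndex()
index.register("img-1", "peer-A", ["skyline", "night"])
index.register("img-2", "peer-B", ["statue", "prometheus"])
```

A query for "prometheus" returns a pointer to peer-B, from which the image itself would then be fetched.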
The peers can include cameras, PDAs, and other portable devices, from which image information may be available almost immediately after capture.
In the course of the methods described herein, certain relationships are discovered between images (e.g., similar geolocation; similar image metrics; similar metadata; etc.). These relationships are generally reciprocal, so if the system finds - during processing of image A - that its color histogram is similar to that of image B, this information can be stored for later use. If later processing involves image B, the earlier-stored information can be consulted to discover that image A has a similar histogram - without analyzing image A anew. Such relationships are akin to virtual links between the images.
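This reciprocal caching of discovered relationships might be sketched as follows. The 4-bin histograms are toy stand-ins for real color histograms:

```python
similarity_cache = {}

def histogram_similarity(hist_a, hist_b):
    """Overlap of two normalized histograms (1.0 = identical)."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))

def relate(id_a, hist_a, id_b, hist_b, cache=similarity_cache):
    """Compute (or recall) the similarity between two images.

    The key is sorted because the relationship is reciprocal: once
    A~B is recorded while processing A, later processing of B finds
    the link without recomputation.
    """
    key = tuple(sorted((id_a, id_b)))
    if key not in cache:
        cache[key] = histogram_similarity(hist_a, hist_b)
    return cache[key]

s1 = relate("A", [0.5, 0.5, 0.0, 0.0], "B", [0.4, 0.6, 0.0, 0.0])
s2 = relate("B", [0.4, 0.6, 0.0, 0.0], "A", [0.5, 0.5, 0.0, 0.0])
```

The second call, made with the images in the opposite order, is served from the cache - illustrating the "virtual link" between A and B.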
For such relationship information to maintain its utility over time, it is desirable that the images be identified in a persistent manner. If a relationship is established while image A is on the user's PDA and image B is on a desktop somewhere, a means should be provided for identifying image A even after it is transferred to the user's MySpace account, and for tracking image B even after it is moved to an anonymous computer in a cloud network.
Digital object identifiers (DOIs) can be assigned to images for this purpose. The International DOI Foundation has implemented the CNRI Handle System so that such resources can be resolved to their current location through the web site at doi-dot-org. Another alternative is for images to be assigned identifiers, and digitally watermarked with those identifiers, which are then tracked by a service such as Digimarc For Images.
If images or other information are to be retrieved from several different repositories, it is often desirable to tailor each query to the particular database being used. For example, different facial recognition databases may employ different facial recognition parameters. Technology such as that detailed in Digimarc's published patent applications 20040243567 and 20060020630 can be employed to search across multiple databases, assuring that each database is interrogated with a suitably tailored query.
Frequent reference has been made to images, but in many cases other information can be used in lieu of the image data itself. In different applications, an image identifier, characteristic eigenvectors, a color histogram, keypoint descriptors, an FFT, associated metadata, decoded barcode or watermark data, etc., can essentially serve as a proxy for the image.
Although early examples described geocoding with latitude/longitude data, in other arrangements the cell phone/camera can provide location data in one or more other reference systems, such as Yahoo's GeoPlanet ID - the Where On Earth ID (WOEID).
Location metadata can be used to identify other resources beyond similarly-located imagery. Web pages, for example, can have geographic associations (e.g., a blog may relate to the author's neighborhood; a restaurant's web page is associated with a particular physical address). The web service GeoURL-dot-org is a location-to-URL reverse directory that can be used to identify web sites associated with particular geographies.
GeoURL supports a variety of location tags, including its own ICBM meta tags as well as Geo Tags. Other systems supporting geotagging include RDF, Geo microformats, and the GPSLongitude/GPSLatitude tags commonly found in XMP- and EXIF-format camera metadata. Flickr uses the syntax established by Geobloggers, for example:
geo:lat=57.64911
geo:lon=10.40744
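Parsing such Geobloggers-style tags into usable coordinates is straightforward; a minimal sketch (tolerant of stray whitespace in the tags):

```python
def parse_geo_tags(tags):
    """Extract (lat, lon) from tags like 'geo:lat=57.64911', or None."""
    coords = {}
    for tag in tags:
        tag = tag.replace(" ", "")          # tolerate 'geo: lat = ...'
        if tag.startswith("geo:lat=") or tag.startswith("geo:lon="):
            key, value = tag.split("=", 1)
            coords[key.split(":")[1]] = float(value)
    if "lat" in coords and "lon" in coords:
        return coords["lat"], coords["lon"]
    return None                             # incomplete geotag data
```

Images whose tag list yields coordinates can then be included in location-based reference sets; others fall back to the indirect location-inference techniques discussed earlier.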
In metadata processing, it is sometimes helpful to clean up the data prior to analysis, as referenced above. The metadata can also be examined for its predominant language; if that language is not English (or another particular language of a given implementation), the metadata - and the associated image - can be removed from consideration.
While the detailed embodiment initially sought to identify the image subject as one of person/place/thing, and took different actions accordingly, analysis/identification of images into other classes can naturally be used. A few examples of the countless possible class groupings include: animal/vegetable/mineral; golf/tennis/football/baseball; male/female; wedding ring detected/no wedding ring detected; urban/rural; rainy/clear; daytime/nighttime; child/adult; summer/autumn/winter/spring; car/truck; consumer product/non-consumer product; can/box/bag; natural/manmade; suitable for all ages/parental guidance for under 13/parental guidance for under 17/adults only; and so on.
In some cases, several different analysis engines can be applied to the user's image data, operating sequentially or in parallel. FIG. 48A, for example, shows an arrangement in which an image - once identified as person-centric - is next referred to two further engines. One identifies the person as family, friend, or stranger; the other identifies the person as a child or an adult. These latter two engines operate in parallel, after the first has completed its task.
In other cases, engines can be employed without any certainty that they are applicable. For example, FIG. 48B shows the family/friend/stranger and child/adult engines operating at the same time the person/place/thing analysis is undertaken. If the latter engine determines that the image depicts a place or a thing, the results of the first two engines will likely go unused.
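The FIG. 48B arrangement can be sketched as follows: the person-specific engines run speculatively alongside the person/place/thing classifier, and their results are simply discarded if the image turns out not to depict a person. All three engine implementations below are stand-ins that read pre-labeled fields from a toy record:

```python
def classify_ppt(image):            # person / place / thing (stand-in)
    return image.get("kind", "thing")

def classify_relationship(image):   # family / friend / stranger (stand-in)
    return image.get("relationship", "stranger")

def classify_age(image):            # child / adult (stand-in)
    return image.get("age_class", "adult")

def analyze(image):
    # In a real system these three would run concurrently; here they
    # are sequential for clarity.
    kind = classify_ppt(image)
    relationship = classify_relationship(image)
    age = classify_age(image)
    if kind != "person":
        return {"kind": kind}       # speculative results go unused
    return {"kind": kind, "relationship": relationship, "age": age}
```

The speculative work is wasted for place/thing images, but when the subject is a person, the extra classifications are available with no added latency.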
(For example, one web site can provide an aircraft recognition service: when an image of an aircraft is uploaded to the site, an identification of the aircraft is returned. Such technologies are detailed, e.g., by Sun in the JCIS-2008 Proceedings, and by Tien in Optical Engineering, Vol. 42, No. 1, 2003. The arrangements described herein can refer images that appear to be aircraft to such a site, and make use of the returned identification information. Or all input images can be referred to such a site; most of the returned results will simply be ambiguous and go unused.)
FIG. 49 shows that different analysis engines can provide their outputs to different response engines. Often the different analysis engines and response engines are operated by different service providers. The outputs from these response engines can then be integrated/coordinated for presentation to the consumer. (This integration can be performed by the user's cell phone - assembling inputs from the different data sources; or such work can be performed by a processor elsewhere.)
One example of the technology detailed herein is a home builder who takes a cell phone image of a drill that needs a replacement part. The image is analyzed, the drill is identified by the system as a Black & Decker DR250B, and the user is presented with various information/action options. These may include reviewing photos of drills with similar appearances, reviewing photos of drills with similar descriptors/features, reviewing the user's manual for the drill, viewing a parts list for the drill, buying the drill new or used from a vendor, listing the builder's own drill on eBay, purchasing parts for the drill, and so on. The builder selects the "purchase parts" option and proceeds to order the required part (FIG. 41).
Another example involves a person shopping for a house. She snaps a photo of a house. The system refers the image both to a private database of MLS information and to a public database such as Google. Responsive screens may include photos of nearby houses offered for sale; photos of for-sale listings within the same zip code that are closest in price to the pictured house; photos of for-sale listings within the same zip code that are most similar to the pictured house in features; neighborhood and school information; and so on (FIG. 43).
In another example, a first user snaps an image of Paul Simon at a concert. The system automatically posts the image to the user's Flickr account - together with metadata inferred by the procedures described above. (The artist's name can be discovered from a Google search based on the user's geolocation; a Ticketmaster web page, for example, indicates that Paul Simon is performing on that stage that night.) The first user's photo is later encountered by the system while processing a photo from a second concert-goer at the same event, shot from a different vantage point. The second user is shown the first user's photo as one of the system's responses to the second photo. The system can also alert the first user that another photo of the same event - taken from a different vantage point - is available for review on his cell phone, e.g., if he presses a particular button twice.
In many such arrangements, it will be recognized that "content is the network." Each photo (or other item of digital content, or information represented therein), and each object depicted in a photo, serves as an explicit or implicit link to associated actions, content, data, and attributes. The user can navigate from one node to the next - traversing the network.
Television shows are rated by their number of viewers; scholarly papers are judged by their number of later citations. Expressed more generally, an analogous "rating" for an item of physical or virtual content is the number of links associating it with other physical or virtual content.
Whereas Google is limited to analyzing and exploiting links among digital content, the technology described herein also permits the analysis and exploitation of links among physical content (and between physical and electronic content).
Known cell phone cameras and other imaging devices typically have a single "shutter" button. However, a device can be provided with several different actuator buttons - each invoking a different operation with the captured image information. With such an arrangement, the user can signal the type of action intended at capture time (e.g., identify the faces in the image by reference to Picasa or VideoSurf information, and post the image to my Facebook page; or attempt to identify the person depicted, and send a "friend request" to his or her MySpace account).
Rather than multiple actuator buttons, the function of a single actuator button can be controlled through other UI controls on the device. For example, repeated pressing of a function-select button can cause different intended actions to be displayed on the UI screen (just as familiar consumer cameras offer different photo modes, such as close-up, beach, nighttime, portrait, etc.). When the user then presses the shutter button, the selected operation is invoked.
One common response (which need not be exclusive of others) is to post the image to Flickr or a social network site. The metadata inferred by the processes described herein can be stored with the image (perhaps qualified according to its reliability).
In the past, the mouse "click" served to trigger a user-desired action: it identified an X-Y location on a virtual landscape (e.g., a desktop screen) from which the user's intention was evident. Increasingly, this role will be served by the "snap" of a shutter - capturing a real landscape from which the user's intention can be inferred.
Business rules can dictate responses appropriate to given situations. Such rules and responses can be determined with reference to data collected by web indexers such as Google.
Crowdsourcing is generally unsuitable for real-time implementations. However, inputs that stump the system and fail to generate satisfactory responses (or for which the user selects none of the actions offered) can be referred offline for crowdsourced analysis - so that a suitable response is available the next time such an input is encountered.
Image-based navigation systems present a topology different from the familiar one of web-page-based navigation. FIG. 57A shows that web pages on the internet are related to one another in point-to-point fashion: web page 1 may link to web pages 2 and 3; web page 3 may link to page 2; web page 2 may link to page 4; and so on. FIG. 57B shows the contrasting network topology associated with image-based navigation: individual images link to a central node (e.g., a router), which in turn links to further nodes (e.g., response engines) in accordance with the image information.
The "router" here does not simply route an input packet to a destination determined by address information conveyed with the packet, as familiar internet traffic routers do. Rather, the router examines the image information and decides what to do with it - e.g., to which response engine the image information should be referred.
Routers may be stand-alone nodes on a network, or may be integrated with other devices. A wearable computer, for example, may include a router component (e.g., a set of software instructions) that takes image information from the computer and determines how it should be handled. (For example, if the image information is recognized as depicting a business card, it may be OCR-processed for name, phone number, and other data, which are then entered into a contacts database.) The particular responses for different types of input image information can be determined, e.g., by a registry database of the kind maintained by a computer's operating system.
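Such a registry-based router can be sketched as a mapping from recognized content types to handlers, in the manner of an OS file-type registry. The types and handlers below are illustrative:

```python
REGISTRY = {}

def register(content_type, handler):
    """Record which handler serves a given recognized content type."""
    REGISTRY[content_type] = handler

def route(content_type, image_info):
    """Dispatch image information to the registered handler, if any."""
    handler = REGISTRY.get(content_type, lambda info: ("unhandled", info))
    return handler(image_info)

def handle_business_card(info):
    # A real handler would OCR the card image; here we assume the text
    # has already been extracted, and tag it for the contacts database.
    return ("contacts", info["text"])

register("business_card", handle_business_card)
```

New response engines are added simply by registering further handlers - e.g., for "aircraft" or "house" content types - without modifying the router itself.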
Likewise, response engines may be stand-alone nodes on a network, or may be integrated with other devices (or have their functions distributed). A wearable computer may include one or several different response engines that take action on information provided by the router component.
FIG. 52 shows an arrangement employing several computers (A-E), some of which may be wearable computers (e.g., cell phones). The computers include conventional components such as processors, memories, storage, and input/output. The storage or memory can contain content, such as images, audio, and video. The computers can also include one or more routers and/or response engines. Stand-alone routers and response engines can likewise be coupled to the network.
The computers are networked, as schematically illustrated by link 150. The connection may be by any known networking arrangement, including the Internet and/or wireless links (WiFi, WiMax, Bluetooth, etc.). At least certain of the computers may include peer-to-peer (P2P) client software, making their contents available to other computers and allowing each computer to interoperate with particular resources of the others.
Through the P2P client, computer A can obtain image, video, and audio content from computer B. Sharing parameters on computer B can be set to determine what content is shared, and with whom. For example, some content on computer B may be kept private; some may be shared with acquaintances (e.g., "friends" on a social network); and the remainder may be shared freely. (Other information, such as geographic location data, may also be shared under these parameters.)
In addition to setting sharing parameters based on the party, the sharing parameters may also specify sharing based on the age of the content. For example, content/information older than a year may be freely shared, and content older than a month may be shared with friends (subject to other rule-based constraints). In other arrangements, fresher content is the most freely shared. For example, content captured or stored within the past hour, day, or week may be freely shared, while content from past months or years may be shared only with friends.
An exclusion list may identify content - or one or more classes of content - that is treated differently from the rules just described (e.g., never shared, or always shared).
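One way the age-based sharing rules above could be expressed is sketched below, modeling the variant in which fresher content is shared most freely. The thresholds and level names are illustrative assumptions; the specification leaves them open to rule-based configuration.

```python
from datetime import datetime, timedelta

def sharing_level(captured_at, now=None):
    """Illustrative policy: fresher content is shared more freely.
    Thresholds and level names are assumptions, not from the specification."""
    now = now or datetime.now()
    age = now - captured_at
    if age <= timedelta(weeks=1):
        return "public"    # captured within the past week: freely shared
    if age <= timedelta(days=30):
        return "friends"   # captured within the past month: shared with friends
    return "private"       # older content kept private

now = datetime(2009, 6, 1)
print(sharing_level(datetime(2009, 5, 30), now))  # public
print(sharing_level(datetime(2009, 5, 10), now))  # friends
print(sharing_level(datetime(2008, 1, 1), now))   # private
```

An exclusion list, as described above, would simply be checked before these rules, overriding the age-based result for listed items.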
In addition to sharing content, the computers can also share their respective router and response engine resources across the network. Thus, for example, if computer A does not have a response engine suited to a particular type of image information, it can pass the information to computer B for processing by that computer's response engine.
It will be appreciated that such a distributed architecture has a number of advantages, in terms of reduced cost and increased reliability. "Peer" groups may also be geographically defined, e.g., computers that find themselves within a particular spatial environment (such as an area served by a particular WiFi system). A peer can thereby dynamically establish ad hoc subscriptions to content and services from nearby computers. When the computer leaves the environment, the session ends.
Some researchers foresee that all of our experiences will be captured in digital form. Indeed, Gordon Bell at Microsoft has compiled a digital archive of his recent existence through his technologies CyberAll, SenseCam, and MyLifeBits. Bell's archive includes recordings of all telephone calls, video of daily life, captures of all TV and video consumed, archives of all web pages visited, map data of all places visited, polysomnograms of his sleep, etc. (For further information see, e.g., Bell, A Digital Life, Scientific American, March 2007; Gemmell, MyLifeBits: A Personal Database for Everything, Microsoft Research Technical Report MSR-TR-2006-23; Gemmell, Passive Capture and Ensuing Issues for a Personal Lifetime Store, Proceedings of the First ACM Workshop on Continuous Archival and Retrieval of Personal Experiences (CARPE '04), pp. 48-55; Wilkinson, Remember This, The New Yorker, May 27, 2007. See also Gordon Bell's Microsoft Research web pages, and the other references cited on the web page for the ACM special interest group on CARPE (Capture, Archival & Retrieval of Personal Experiences).)
Certain embodiments incorporating aspects of the present technology are well suited for use with such experiential digital content - as input to the system (i.e., the system responds to the user's current experience), or as a resource from which metadata, habits, and other attributes can be mined (including serving in the role of the Flickr archive in the embodiments described earlier).
In embodiments that employ personal experience as input, it may be desirable to have the system respond only when triggered, i.e., when the user wishes, rather than in an unlimited, free-running fashion (which is presently prohibitive from the standpoint of processing, memory, and bandwidth).
The user's wish can be indicated by a deliberate action, for example pressing a button or gesturing with the head or a hand. The system then takes data from the current experiential environment and provides candidate responses.
Perhaps of more interest are systems that discern the user's interest through biological sensors. Electroencephalography can be used, for example, to generate a signal that triggers the system's response (or triggers one of several different responses, e.g., corresponding to different stimuli in the current environment). Skin conductivity, pupil dilation, and other autonomic physiological responses can also be sensed, alternatively or additionally, and can provide a triggering signal to the system.
Eye tracking technology can be employed to identify which object in the field of view captured by the experiential-video sensor is of interest to the user. If Tony is sitting at a bar and his gaze falls on an unusual beer bottle in front of a nearby woman, the system can identify his focus of interest and concentrate its processing efforts on the pixels corresponding to the bottle. Upon a signal from Tony, such as two quick eye-blinks, the system can undertake efforts to provide candidate responses based on the beer bottle - informed not only by Tony's own profile data, but perhaps also by other data (date, time, ambient audio, etc.). (Such recognition and related technologies are disclosed, e.g., in Apple's patent publication 20080211766.)
The system may quickly identify the beer as a Doppelbock, e.g., by pattern matching from the imagery (and/or OCR). Using that identifier as a query, the system finds other resources indicating that the beer comes from Bavaria, brewed by monks of the Paulaner order. Its 9% alcohol content is also distinctive.
Consulting the personal experiential archives that friends have made available to Tony, the system learns that his friend Geoff is fond of Doppelbock, having most recently drunk one at a pub in Dublin. Tony's encounter with the bottle is logged into his own experiential archive, where Geoff may later come across it. Alternatively, the encounter may be relayed to Geoff in Prague in real time, contributing ongoing data about his friends' activities.
The bar may also operate an experiential data server, to which Tony has wirelessly granted access. The server maintains archives of digital data captured within the bar and contributed by customers. The server may also be primed with related metadata and information that management believes customers may find of interest - e.g., which brews will be on special in coming weeks, or Wikipedia pages about the brewing methods of the Paulaner monks. (Per individual preference, some users require that their data be erased when they leave the bar; others allow the data to be retained.) Tony's system routinely checks the experiential data server of whatever local environment he is in. This time, it shows that the woman with the Doppelbock in chair 3 counts a Tom <encoded last name> among her friends. Tony's system recognizes that Geoff's circle of friends (which Geoff has made available) includes the same Tom.
A few seconds after his double blink, Tony's cell phone vibrates on his belt. Flipping it open and turning the scroll wheel on its side, Tony reviews a series of screens presenting the information the system has gathered - with the information judged most useful to Tony appearing first.
Armed with knowledge of the Tony-Geoff-Tom connection (rather closer than the usual six degrees of separation), and primed with trivia about her Doppelbock beer, Tony walks over with his glass. (Additional details that can be employed in such arrangements, including user interfaces and visualization techniques, can be found in Dunekacke, Localized Communication with Mobile Devices, MobileHCI 2009.)
P2P networks such as BitTorrent allow audio, image, and video content to be shared, but arrangements like that of Figure 52 allow networks to share a richer set of experiential content. A basic premise of peer-to-peer networks is that, despite technologies for capturing the long tail of content, most users are interested in similar content (tonight's NBA score, the current episode of a hit show, etc.); combining requests based on which "neighbors" on the network already have the content, rather than sending individual streams, thus becomes the most efficient delivery mechanism. This same mechanism can be used to provide metadata that enhances an experience - such as drinking a Doppelbock at a bar, or watching highlights of tonight's NBA game on a phone while at that bar. The protocols used in the ad hoc network described above can leverage P2P protocols, either in true P2P form or with experience servers providing peer-registry services (similar to early peer-to-peer networks), where every device advertises what experiences (metadata, content, social connections, etc.) it can make available - whether free, paid, or bartered for information of the same kind. Apple's Bonjour software is well suited to this kind of application.
Within this basic structure, Tony's cell phone can retrieve information about Doppelbock simply by posting a query to the peer network, and can receive rich information from various devices or experience servers in the bar without needing to know its source. Similarly, the experience servers can act as data recorders, logging the experiences in the ad hoc network and providing continuity of experience across times and places. Geoff can visit the same bar at some future point and discover that his friend Tony was there two weeks earlier, see what communications or connections Tony made, and possibly leave a marker of his own.
The ability to collect the social threads represented by traffic on the network also allows the bar's owners to enhance the experience by coordinating interactions or introductions. This can take the form of games, or of introducing people with shared interests, singles, etc. - for example, inviting customers to participate in themed games in which they unlock secrets (similar to board game clues) or assume identities. Ultimately, demographic information gleaned from such interactions will be a valuable asset for the owners when deciding which beer to stock next, and where to advertise.
Some handheld devices, such as the Apple iPhone, provide single-button access to predefined functions: reviewing prices of favorite stocks, the weather forecast, or a general map of the current location. Additional functions are available, but the user must undertake a series of further operations, e.g., to reach a favorite web site.
Embodiments employing certain aspects of the present technology allow other operations to be invoked by capturing a particular image. Capturing an image of the user's hand can link the user to a baby-cam back home, delivering real-time video of the baby to the device. Capturing an image of a wristwatch can load a map showing traffic conditions along a portion of the user's drive home, etc. This functionality is illustrated in the accompanying figures.
The user interface for the portable device includes a set-up/training phase that allows the user to associate different functions with different visual codes. The user is prompted to capture an image and to enter the name of the function and the URL associated with the depicted object. (The URL is one type of response; others can also be used, such as launching a Java application.)
The system then characterizes the snapped image (e.g., through pattern/template matching), deriving a set of feature vectors by which similar images can be recognized. The feature vectors are stored in a data structure (Figure 55) in association with the function's name and the associated URL.
In this initial training phase, the user may capture multiple images of the same visual code - perhaps from different distances and perspectives, and with different lighting and backgrounds. The feature extraction algorithm processes the collection to extract a feature set that captures the shared similarities of all the training images.
Extraction of the image features and storage of the data structure may be performed in the portable device, in a remote device, or in distributed fashion.
In later operation, the device checks each image it captures for correspondence with one of the stored visual codes. If one is recognized, the corresponding action can be undertaken. Otherwise, the device responds with the other functions normally available to the user upon capturing a new image.
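The train-then-match flow just described (feature vectors stored with a function name and URL, then nearest-match lookup against a threshold) might be sketched as follows. The feature extractor here is a trivial stand-in, and the threshold, names, and URLs are illustrative assumptions; a real engine would derive robust features via pattern/template matching.

```python
import math

def features(image):
    # Stand-in feature extractor: assume the "image" is already a
    # feature vector. A real system would compute one from pixels.
    return image

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class VisualGlossary:
    def __init__(self, threshold=0.95):
        self.codes = []  # (feature vector, function name, URL)
        self.threshold = threshold

    def train(self, image, name, url):
        self.codes.append((features(image), name, url))

    def match(self, image):
        f = features(image)
        best = max(self.codes, key=lambda c: cosine(c[0], f), default=None)
        if best and cosine(best[0], f) >= self.threshold:
            return best[1], best[2]   # recognized: launch this action
        return None                   # unrecognized: default camera behavior

g = VisualGlossary()
g.train([1.0, 0.1, 0.0], "baby_cam", "http://example.com/babycam")
g.train([0.0, 0.2, 1.0], "traffic_map", "http://example.com/traffic")
print(g.match([0.9, 0.15, 0.05]))  # close enough to the baby_cam code
```

Because the glossary holds only a small universe of codes, even poor-quality captures can be classified reliably, as the text later notes.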
In another embodiment, the portable device is equipped with two or more shutter buttons. Operating one button captures an image and launches an action, based on the closest match between the captured image and the stored visual codes. Operating the other button captures an image without initiating any such action.
The device UI may include a control for presenting the user with a glossary of visual codes, as shown in the figure. When activated, thumbnails of the different visual codes are presented on the device display in association with the names of the functions stored earlier - reminding the user of the defined vocabulary of codes.
The control that launches this glossary of codes can itself be an image. One image well suited to this function is a generally featureless frame. An all-dark frame can be achieved by covering the lens and operating the shutter. An all-bright frame can be achieved by operating the shutter with the lens pointed at a light source. Another substantially featureless frame (of intermediate density) can be achieved by imaging a patch of skin, wall, or sky. (To be substantially featureless, the frame should not match any of the other stored visual codes more closely. In other embodiments, an image may be deemed "featureless" if it has a texture metric lower than a threshold.)
(The concept of triggering an operation by capturing an all-bright frame may be extended to any device function. In some embodiments, repeated all-bright exposures alternately toggle a function on and off.) A threshold may be set - by the user through a UI control, or by the manufacturer - establishing how "bright" or "dark" such a frame must be in order to be interpreted as an instruction. For example, the 8-bit (0-255) pixel values from a million-pixel sensor can be summed. If the sum is less than 900,000, the frame can be regarded as all-dark; if it is greater than a much larger threshold, the frame can be regarded as all-bright.
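The summing test described above can be sketched directly. The all-dark threshold of 900,000 follows the example in the text; the all-bright threshold is an assumed value, since the text leaves that number open.

```python
def classify_frame(pixels, dark_sum=900_000, bright_sum=200_000_000):
    """Classify a frame from summed 8-bit pixel values.
    dark_sum follows the text's million-pixel example; bright_sum is
    an assumption - the specification gives no figure for it."""
    total = sum(pixels)
    if total < dark_sum:
        return "all dark"    # e.g., lens covered by a finger
    if total > bright_sum:
        return "all bright"  # e.g., lens aimed at a light source
    return "other"

million = 1_000_000
print(classify_frame([0] * million))    # all dark
print(classify_frame([255] * million))  # all bright
print(classify_frame([128] * million))  # other
```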
Another of the featureless frames can trigger a further special response: it may cause the portable device to launch all of the stored functions/URLs in the glossary (or, e.g., some particular five or ten of them). The device caches the resulting frames of information and presents them in succession as the user operates one of the phone's controls - such as button 116b or scroll wheel 124 of Figure 44 - or makes a particular gesture on the touch screen. (This function can also be invoked by other controls.)
The third type of featureless frame (i.e., dark, bright, or mid-density) can send the device's location to a map server, which can return multiple map views of the user's position. These views may include aerial views and street map views at different zoom levels, along with nearby street-level imagery. Each of these frames can be cached in the device and quickly reviewed by turning the scroll wheel or operating other UI controls.
The user interface desirably includes controls for deleting visual codes and editing the name/function assigned to each. URLs may be defined by typing on a keypad, or by otherwise navigating to a desired destination and then saving that destination as the response corresponding to a particular image.
Training of the pattern recognition engine can continue through use, with successive images of the different visual codes serving to refine the template models by which each visual code is defined.
It will be appreciated that a great variety of visual codes may be defined, generally using resources readily available to the user. The hand alone can define many different codes, using fingers arranged in different positions (fist, one through five fingers extended, thumb-and-index-finger OK sign, palm up, thumbs-up, American Sign Language signs, etc.). Apparel and its components (e.g., shoes, buttons) can also be used, as can jewelry. Features of common surroundings (e.g., a telephone) may also be used.
In addition to launching favorite operations, these techniques can serve as user interface techniques in other contexts. For example, a software program or web service may present the user with a list of options. Rather than manipulating the keypad to enter, say, selection #3, the user can capture an image of three fingers - visually symbolizing the selection. The software recognizes the three-finger symbol as meaning the digit 3, and enters that value into the process.
If desired, visual codes may form part of authentication procedures, e.g., for accessing a social-networking web site or a bank. For example, after entering a name, signature, or password at a site, the user may be shown a previously stored image (confirming to the user that the site is authentic), and may then be prompted to submit an image of a particular visual code type (one earlier defined by the user, but not explicitly prompted by the site). The web site checks features extracted from the just-captured image for correspondence with the expected response before granting the user access.
Other embodiments can respond to a sequence of snapshots taken within a certain period (e.g., 10 seconds) - a grammar of images. An image sequence of "wristwatch," "four fingers," "three fingers" can be set to make the portable device's alarm clock sound at seven o'clock.
In yet other embodiments, the visual codes may be gestures that include motion - captured as a sequence of frames (e.g., video) by the portable device.
Context data (e.g., indicating the user's geographic location, the day, the time, etc.) can also be used to tailor the response. For example, when the user is at work, the response to a particular visual code may be to fetch imagery from a security camera at the user's home; at home, the response to the same code may be to fetch imagery from the security camera at work.
In this embodiment, as in others, the response need not be visual. Audio or other output (e.g., tactile, olfactory, etc.) can of course be utilized.
The techniques just described allow a user to define a glossary of visual codes and corresponding customized responses. An intended response can be invoked quickly by imaging a readily available object. The captured image can be of poor quality (e.g., overexposed, blurred), since it need only be classified from among a relatively small universe of alternatives.
Visual intelligence pre-processing
Another aspect of the present technology is to perform one or more visual intelligence pre-processing operations on image information captured by a camera sensor. These operations may be performed without user request, and before other image processing operations that the camera customarily performs.
Figure 56 is a diagram illustrating certain processing performed in an exemplary camera, such as a cell phone camera. Illumination impinges on an image sensor comprising an array of photodiodes (CCD or CMOS sensor technologies are commonly used). The resulting analog electrical signals are amplified and converted to digital form by analog-to-digital (A/D) converters. The outputs of these A/D converters provide the image data in its most raw, or "native," form.
The operations just described are generally performed by circuitry formed on a common substrate, i.e., "on-chip." One or more further processes are generally executed before other processes can access the image data.
One such further operation is Bayer interpolation (de-mosaicing). The photodiodes of the sensor array typically each capture only a single color of light - red, green, or blue (R/G/B) - owing to a color filter array. This array is composed of a tiled 2x2 pattern of filter elements: one red, one blue diagonally opposite, and two greens. Bayer interpolation effectively "fills in the blanks" of the resulting mosaiced R/G/B image data, e.g., estimating a red value at a pixel site covered by a blue filter.
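A minimal de-mosaicing step for one pixel site can be sketched as below. This is a simplified bilinear sketch (real pipelines use more sophisticated interpolation, and handle image borders): in an RGGB pattern, the red value at a blue site is estimated by averaging the four diagonal red neighbors.

```python
def red_at_blue_site(mosaic, r, c):
    """Estimate the red value at a blue-filtered pixel (RGGB tiling,
    so blue sites fall at odd row, odd column) by averaging the four
    diagonal red neighbors. Simplified sketch - borders ignored."""
    diagonals = [mosaic[r - 1][c - 1], mosaic[r - 1][c + 1],
                 mosaic[r + 1][c - 1], mosaic[r + 1][c + 1]]
    return sum(diagonals) / 4

# Tiny RGGB mosaic: R G R / G B G / R G R (values are raw sensor reads)
mosaic = [
    [100, 60, 104],
    [ 58, 30,  62],
    [ 96, 64, 108],
]
print(red_at_blue_site(mosaic, 1, 1))  # (100+104+96+108)/4 = 102.0
```

Analogous averaging of the nearest same-color neighbors fills in the green and blue planes at the other sites.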
Another common operation is white balance correction. This process adjusts the relative intensities of the component R/G/B colors so that certain colors (especially neutral tones) are rendered accurately.
Other operations that may be performed include gamma correction and edge enhancement.
Finally, the processed image data is typically compressed to reduce storage requirements. JPEG compression is most commonly used.
The processed, compressed image data is then stored in a buffer memory. Only at this point is the image information generally available (e.g., via system API calls) to services and other processes on the cell phone.
One process commonly invoked with this processed image data is to present the image to the user on the camera's screen. The user can then evaluate the image and decide, for example, whether to (1) store it on the camera's memory card, (2) transmit it in a picture message, or (3) take some other action.
The image sits in the buffer memory until the user instructs the camera otherwise (e.g., via a button-based user interface or a graphical control). Absent other instruction, the only use made of the processed image data is its display on the cell phone's screen.
Figure 57 illustrates an exemplary embodiment of the aspect of the technology presently being discussed. After the analog signals are converted to their native digital form, one or more further processes are performed.
One such process is to perform a Fourier transform (e.g., an FFT) on the native image data. This converts the spatial-domain representation of the image into a frequency-domain representation.
The Fourier-domain representation of the native image data can be useful in various ways. One is to screen the image for likely barcode data.
One familiar 2D barcode is a checkerboard-like array of light and dark squares. The component squares are of uniform size, so their repetitive spacing produces a pair of pronounced peaks in the Fourier-domain representation of the image at a corresponding frequency. (The peaks may be 90 degrees apart in the UV plane if the pattern recurs at the same frequency in both the vertical and horizontal directions.) These peaks stand out markedly above other image components at nearby image frequencies; the peaks often have magnitudes two to five times - or ten times (or more) - those of the surrounding image frequencies. If a Fourier transform is performed on tiles from the image (e.g., patches of 16 x 16 pixels, or 128 x 128 pixels), certain patches lying entirely within the barcode portion of the image frame may be found to have essentially no signal energy except at these characteristic frequencies.
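The peak-prominence screen just described could be sketched with a discrete Fourier transform of an image tile. For brevity, this uses a 1D row rather than a 2D tile, and a naive O(n²) DFT; the prominence factor follows the 2x-5x figure given above, and the sample rows are invented.

```python
import cmath

def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform (naive O(n^2) DFT)."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

def looks_like_barcode(row, prominence=2.0):
    """Screen one image row: does the strongest non-DC frequency stand out
    by 'prominence' times the mean of the remaining magnitudes?
    A 1D stand-in for the 2D tile test described in the text."""
    mags = sorted(dft_magnitudes(row)[1:len(row) // 2])  # skip DC term
    peak, rest = mags[-1], mags[:-1]
    return peak > prominence * (sum(rest) / len(rest))

checker = [0, 0, 255, 255] * 4           # repetitive dark/bright squares
stripe = [0] * 6 + [255] * 2 + [0] * 8   # a single edge, no repetition
print(looks_like_barcode(checker))  # True - energy piles up at one frequency
print(looks_like_barcode(stripe))   # False - energy spread across spectrum
```

A 2D version would test both frequency axes of each tile, flagging tiles whose energy concentrates at the characteristic spacing of the barcode squares.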
As shown in Figure 57, the Fourier transform information can be analyzed for telltale signs associated with an image of a barcode. A template-like approach can be used: the template may comprise a set of parameters against which the Fourier transform information is tested - to determine whether the data exhibits the hallmarks of a barcode-like pattern.
If the Fourier data is consistent with an image depicting a 2D barcode, corresponding information can be routed for further processing (e.g., sent from the cell phone to a barcode-responsive service). This information may include the Fourier transform information and/or the image data from which it was derived.
In the latter case, the full image data need not be transmitted. In some embodiments, a down-sampled version of the image data - e.g., one-quarter resolution in both the horizontal and vertical directions - can be sent. Or the patches of image data most likely to depict portions of the barcode pattern can be sent. Alternatively, the patches least likely to depict the barcode can be omitted from the transmission. (These may be patches that lack a peak at the characteristic frequency, or whose amplitude there is lower than at surrounding frequencies.)
The transmission may be prompted by the user. For example, the camera UI may ask the user whether the information should be directed for barcode processing. In other arrangements, the transmission is dispatched immediately upon a determination that the image frame matches the template indicating likely barcode data; no user action is involved.
Fourier transform data can likewise be tested for hallmarks of other image objects. A 1D barcode, for example, can be characterized by high-amplitude spikes along one frequency axis, with comparatively little energy along the other. Other image contents can similarly be characterized by reference to their Fourier-domain representations, and corresponding templates devised. Fourier transform data is also commonly used in computing the fingerprints employed for automated recognition of media content.
The Fourier-Mellin (F-M) transform is also useful in characterizing various image objects/components, including the barcodes just noted. The F-M transform has the advantage of being robust to scaling and rotation of the image object (scale/rotation invariance). In an exemplary embodiment, if the scale of an object increases (as by moving the camera closer), the F-M pattern shifts up; if the scale decreases, the F-M pattern shifts down. Similarly, if the object rotates clockwise, the F-M pattern shifts left. (The particular directions of these shifts depend on the implementation.) These attributes make F-M data important in recognizing patterns that may be affine-transformed, e.g., in face recognition, character recognition, object recognition, etc.
The arrangement shown in Figure 57 applies a Mellin transform to the output of the Fourier transform process, yielding F-M data. The F-M data can then be screened for attributes associated with different image objects.
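The scale-to-shift property behind the F-M transform comes from resampling the Fourier magnitude spectrum on a log-polar grid. The coordinate mapping alone can be sketched; this is a simplification (a full implementation would interpolate a 2D magnitude spectrum over such a grid), and the example frequency points are invented.

```python
import math

def log_polar_coords(u, v):
    """Map a frequency-plane point (u, v) to log-polar (log r, theta).
    Scaling an image object rescales its frequency components radially,
    which changes log r by a constant - a pure translation in the
    log-polar plane. Rotation likewise becomes a shift in theta."""
    r = math.hypot(u, v)
    return math.log(r), math.atan2(v, u)

lr1, th1 = log_polar_coords(4.0, 3.0)
lr2, th2 = log_polar_coords(8.0, 6.0)   # same component, scale doubled
print(round(lr2 - lr1, 6), round(th2 - th1, 6))  # log(2) shift, angle unchanged
```

Because scaling and rotation become translations, a simple correlation against a template in the log-polar domain suffices to recognize the pattern regardless of camera distance or orientation.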
For example, text is characterized by a multitude of symbols of roughly similar size, composed of strokes in a foreground color contrasting with a larger background field. Vertical edges tend to predominate (albeit slightly slanted in italics), though considerable energy is also found in the horizontal direction. The spacing between strokes generally falls within a fairly narrow range.
These attributes manifest themselves as features that tend to fall within certain bounds in the F-M transform space. Again, tests can be specified by which the F-M data is screened for indications of the likely presence of text in the captured native image data. If the image likely contains text, it can be dispatched to a service that handles this type of data (e.g., an optical character recognition, or OCR, engine). Again, the image (or a variant of it) may be transmitted, the transformed data may be transmitted, or some other data may be transmitted.
Just as text manifests itself through a characteristic set of attributes in the F-M data, so do faces. The F-M data output from the Mellin transform can be tested against different templates to determine the likely presence of a face within the captured image.
Likewise, the F-M data can be examined for telltale signs that the image data conveys a digital watermark. A watermark orientation signal is a distinctive signal present in some watermarks that can serve as a flag that a watermark is present.
In the example just given, as in others, the templates can be compiled by testing with known images (e.g., "training"). By capturing images of many different text presentations, the resulting transformed data can be examined for attributes that are consistent across the sample set, or that (more likely) fall within bounded ranges. Those attributes can then serve as the template by which images likely containing text are identified. (Likewise for faces, barcodes, and other classes of image objects.)
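The training procedure described above (derive attributes from known samples, keep per-attribute bounded ranges) might be sketched as follows. The feature names, sample values, and slack margin are illustrative assumptions.

```python
def train_template(feature_vectors, slack=0.1):
    """From feature vectors of known samples (e.g., text images),
    record per-feature [min, max] bounds, widened by a slack margin.
    A minimal sketch of the 'bounded ranges' idea in the text."""
    template = []
    for feature in zip(*feature_vectors):
        lo, hi = min(feature), max(feature)
        margin = slack * (hi - lo)
        template.append((lo - margin, hi + margin))
    return template

def matches(template, vector):
    """Screen a new image's features against the trained bounds."""
    return all(lo <= x <= hi for (lo, hi), x in zip(template, vector))

# Hypothetical training samples: [vertical-edge energy, stroke spacing]
samples = [[0.80, 0.31], [0.75, 0.29], [0.85, 0.33]]
text_template = train_template(samples)
print(matches(text_template, [0.78, 0.30]))  # True - consistent with text
print(matches(text_template, [0.10, 0.90]))  # False - outside trained bounds
```

The same procedure, run over face or barcode samples, yields the corresponding templates mentioned in the text.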
Figure 57 shows that a variety of different transforms can be applied to the image data. Although they are generally shown as executing in parallel, some may execute sequentially - either all operating on the same input image data, or with one transform using the output of a previous transform. Although not all are shown (for clarity of illustration), the outputs from each of the other transform processes can similarly be examined for features suggesting the presence of a particular image type; if found, the associated data is sent to a service appropriate to that type of image information.
In addition to the Fourier and Mellin transform processes, eigenface (eigenvector) computation, image compression, cropping, affine distortion, filtering, DCT transforms, wavelet transforms, Gabor transforms, and other signal processing operations can be applied (all regarded here as transforms). Others are noted elsewhere in this specification. The outputs from these processes are then tested for features indicating that the likelihood the image depicts a particular class of information is greater than random chance.
Outputs from some processes may serve as inputs to others. For example, the output from one of the boxes labeled ETC in Figure 57 is provided as input to the Fourier transform process. This ETC box may be, e.g., a filtering operation. Sample filtering operations include median, Laplacian, Wiener, Sobel, high-pass, low-pass, bandpass, Gabor, signum, and the like. (Digimarc's patents 6,442,284, 6,483,927, 6,516,079, 6,614,914, 6,631,198, 6,724,914, 6,988,202, 7,013,021 and 7,076,082 show various such filters.)
Sometimes a single service can handle data of different types, or data that has passed through different screens. In Figure 57, for example, the face recognition service can receive F-M transform data or eigenface data. Or it may receive image information that has passed through any of several different screens (e.g., its F-M transform passed one screen, or its eigenface representation passed a different screen).
In some cases, data may be sent to two or more different services.
Although not essential, it is desirable for some or all of the processing shown in Figure 57 to be performed by circuitry integrated on the same substrate as the image sensor. (Some of the operations may instead be performed by programmable hardware - on- or off-chip - in response to software instructions.)
Although the operations above are described as occurring immediately after conversion of the analog sensor signals to digital form, in other embodiments such operations can be performed after other processing operations (e.g., Bayer interpolation, white balance correction, JPEG compression, etc.).
Some of the services to which information is sent may be provided locally in the cell phone. Others may be provided by a remote device, with which the cell phone establishes a link that is at least partly wireless. Or such processing may be distributed among various devices.
(Although described in the context of conventional CCD and CMOS sensors, this technology is applicable regardless of sensor type. Thus, for example, Foveon full-color image sensors can alternatively be used, as can sensors employing Kodak's TrueSense color filter pattern (which adds panchromatic sensor pixels to the familiar Bayer array of red/green/blue sensor pixels). Sensors that output infrared data can be used to advantage as well; e.g., infrared image data (in addition to visible image data) can help identify faces and other image objects by their temperature differentials, aiding segmentation of image objects within the frame.)
Devices utilizing the architecture of Figure 57 can be viewed, in essence, as having two parallel processing chains. One chain produces data to be rendered in perceptual form for human viewers; it typically includes at least one of a demosaic processor, a white balance module, and a JPEG image compressor. The second chain produces data for analysis by one or more machine-implemented algorithms; illustrative examples include Fourier transform processors, eigenface processors, and the like.
These processing architectures are further detailed in the earlier-cited application 61/176,739.
By arrangements such as the foregoing, one or more appropriate image-responsive services can begin formulating candidate responses to visual stimuli even before the user decides what to do with the captured image.
Additional comments on visual intelligence pre-processing
Although the pre-processing discussed in connection with Figure 57 (and Figure 50) has dealt with static imagery, such processing can also encompass temporal aspects, such as motion.
Motion is most commonly associated with video, and the techniques described herein can be used when capturing video content. However, motion/temporal cues also arise with "still" imagery.
For example, some image sensors are read sequentially, top row to bottom row. During the read operation, the imaged subject may move within the image frame (i.e., due to camera motion or subject motion). An exaggerated view of this effect is shown in Figure 60, depicting a captured letter "E" as the sensor is moved to the left: the vertical stroke of the character is further from the left edge of the image frame at the bottom than at the top, due to movement of the sensor while the pixel data was being clocked out.
This phenomenon also arises when a camera assembles data from multiple frames to produce a single "still" image. Although not widely known to users, many consumer imaging devices rapidly capture several frames of image data and composite different aspects of them together (e.g., using software provided by FotoNation, Inc., now Tessera Technologies, Inc.). For example, the device may take three exposures: one optimized for the appearance of faces depicted in the image frame, another optimized for the background, and a third optimized for the foreground. These are blended together to create a pleasing montage. (In another example, the camera captures a burst of frames and, in each, determines whether persons are smiling or blinking; different faces can then be selected from different frames to yield the final image.)
Thus, the distinction between video and still imagery is increasingly a matter of usage, rather than of device type.
Motion detection can be accomplished in the spatial domain (e.g., by reference to the movement of feature pixels between frames) or in a transform domain. Fourier transform and DCT data are exemplary: the system can extract a transform-domain signature for an image component and track its appearance across different frames, thereby identifying its motion. One exemplary technique deletes the lowest N frequency coefficients (discarding very low-frequency content and leaving edges and the like). The highest M frequency coefficients may likewise be disregarded. A thresholding operation is then performed on the magnitudes of the remaining coefficients, zeroing those below some value (such as 30% of the mean). The resulting coefficients serve as the signature for that image region. (The transform may be based, e.g., on tiles of 8 x 8 pixels.) When a pattern corresponding to this signature is found at a nearby location in another (or the same) image frame, the motion of that image region can be identified.
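The coefficient-pruning and thresholding steps just described can be sketched as follows. This is a minimal, numpy-only illustration; the function names, the particular values of N and M, and the 30% threshold are assumptions for illustration, not taken from the source.

```python
import numpy as np

def dct2(block):
    """2D DCT-II of a square block via an orthonormal basis matrix."""
    n = block.shape[0]
    k = np.arange(n)
    c = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c *= np.sqrt(2.0 / n)
    c[0, :] = np.sqrt(1.0 / n)
    return c @ block @ c.T

def tile_signature(tile, drop_low=3, drop_high=10, frac=0.3):
    """Signature of an 8x8 tile: zero the lowest-frequency coefficients,
    disregard the highest-frequency ones, then zero any remaining
    coefficient whose magnitude is below frac * mean of the survivors."""
    coeffs = dct2(tile.astype(float))
    yy, xx = np.indices(coeffs.shape)
    order = np.argsort(yy + xx, axis=None)   # flat indices by frequency
    flat = coeffs.flatten()
    flat[order[:drop_low]] = 0.0             # delete lowest N coefficients
    flat[order[-drop_high:]] = 0.0           # ignore highest M coefficients
    mags = np.abs(flat)
    thresh = frac * mags[mags > 0].mean() if (mags > 0).any() else 0.0
    flat[mags < thresh] = 0.0                # threshold the remainder
    return flat.reshape(coeffs.shape)
```

A signature so computed can then be correlated against tiles at nearby locations in another frame to identify motion of the region.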
Image transfer of semantic information
In many systems it is desirable to implement a set of processing steps, such as those described above, that extract information from incoming content (e.g., image data) in a scalable (e.g., distributed) fashion. The extracted information (metadata) is then desirably packaged in a form that facilitates subsequent processing (which may be application-specific, more computationally intense, and performed within the originating device or by a remote system).
A rough analogy is user interaction with Google. Bare search terms are not sent to Google's servers as they were from dumb terminals of old. Instead, the user's computer formats a query as an HTTP request that includes the Internet Protocol address of the originating computer (indicating location), and makes available cookie information by which user language preferences, desired safe-search filtering, etc., can be determined. This structuring of related information serves as a precursor to Google's search process, allowing Google to perform the search more intelligently, providing faster and better results to the user.
Figure 61 identifies some of the metadata that may be involved in an exemplary system. The information types in the left-most column can be computed directly from the raw image data signals taken from the image sensor. (As noted, some or all of these may be computed using processing arrangements integrated with the sensor on a common substrate.) Additional information, illustrated by the information types in the second column, can be derived by reference to these basic data types. Such further information may be produced by processing in the cell phone, or an external service may be employed (e.g., the OCR service shown in Fig. 57 may be in the cell phone, or may be a remote server, etc.; similarly with the operations shown in Fig. 50).
How can this information be packaged to facilitate subsequent processing? One alternative is to convey it in the "alpha" channel of popular image formats.
Most image formats represent imagery by data conveyed in several channels, or byte-planes. In RGB, for example, one channel conveys red luminance, a second conveys green luminance, and a third conveys blue. Similarly with CMYK (whose channels convey cyan, magenta, yellow, and black information, respectively), YUV (common in video: a luma, or brightness, channel Y, plus two color channels U and V), and LAB (again, brightness together with two color channels).
These imagery formats are commonly extended to include an additional channel: alpha. The alpha channel is provided to convey opacity information, indicating the extent to which background subjects are visible through the imagery.
Although generally supported by image processing file structures, software, and systems, the alpha channel is not much used (the most notable exceptions being computer-generated imagery and radiology). Certain implementations of the present technology use the alpha channel to convey information derived from the image data.
The different channels of an image format commonly have the same size and bit-depth. In RGB, for example, the red channel may convey 8-bit data (allowing values of 0-255 to be represented) for each pixel in a 640 x 480 array; likewise the green and blue channels. The alpha channel in such an arrangement is typically also 8 bits, co-extensive with the image size (i.e., 8 bits x 640 x 480). Every pixel thus has a red value, a green value, a blue value, and an alpha value. (This composite image representation is commonly known as RGBA.)
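The RGBA arrangement just described, and the idea of treating the 8-bit alpha value at each pixel as 8 independent one-bit planes, can be sketched as follows (the fill values and the edge mask are hypothetical placeholders):

```python
import numpy as np

H, W = 480, 640

# Four co-extensive 8-bit planes per pixel: red, green, blue, alpha.
rgba = np.zeros((H, W, 4), dtype=np.uint8)
rgba[..., 0] = 200    # red luminance for every pixel
rgba[..., 1] = 120    # green luminance
rgba[..., 2] = 40     # blue luminance

# The 8-bit alpha value can be viewed as 8 one-bit planes. Here,
# bit plane #1 (the least significant bit) is set wherever a
# hypothetical edge detector flagged an edge pixel:
edge_mask = np.zeros((H, W), dtype=bool)
edge_mask[100, 100:200] = True

bit = np.uint8(0x01)
rgba[..., 3] = np.where(edge_mask, rgba[..., 3] | bit, rgba[..., 3] & ~bit)
```

The remaining seven alpha bit planes stay available for other derived information.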
Some of the many ways the alpha channel can be used to convey information derived from image data are shown in Figures 62-71 and discussed below.
Figure 62 shows an image a user may snap with a cell phone. The cell phone's processor (on the sensor substrate or elsewhere) can apply an edge detection filter (e.g., a Sobel filter) to the image data, producing an edge map: a determination, for each pixel of the image, of whether it forms part of an edge. This edge information can be conveyed in just one of the 8 bit planes available in the alpha channel. Such an alpha channel payload is shown in Figure 63.
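A minimal sketch of such an edge-map computation follows: a hand-rolled Sobel filter whose binarized output occupies a single bit per pixel. The threshold value is an assumption for illustration.

```python
import numpy as np

def sobel_edge_bitplane(gray, thresh=128):
    """Sobel gradient magnitude, binarized to one bit per pixel
    (1 = edge), suitable for a single bit plane of the alpha channel."""
    kx = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)
    ky = kx.T
    g = gray.astype(float)
    h, w = g.shape

    def filt3(img, k):
        # 3x3 filtering via shifted slices; border pixels left at zero.
        out = np.zeros_like(img)
        for dy in range(3):
            for dx in range(3):
                out[1:-1, 1:-1] += k[dy, dx] * img[dy:dy + h - 2, dx:dx + w - 2]
        return out

    mag = np.hypot(filt3(g, kx), filt3(g, ky))
    return (mag > thresh).astype(np.uint8)
```

The returned array can be OR'd into one bit plane of the alpha channel, as sketched earlier.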
The cell phone camera may also apply known techniques to identify faces within the image frame. The red, green, and blue image data corresponding to the facial regions can be combined to yield a gray-scale representation, which can be included in the alpha channel, e.g., in aligned correspondence with the faces identified in the RGB image data. An alpha channel conveying both the edge information and the gray-scale faces is shown in Figure 64. (An 8-bit gray scale is used for the faces in the illustrated embodiment; however, a shallower bit-depth, such as 6 or 7 bits, can alternatively be used, freeing other bit planes for other information.)
The camera may also perform operations culminating in locating the positions of the eyes and mouth in each detected face. Markers can then be transmitted in the alpha channel, indicating the scale and positions of these detected features. A simple form of marker is a "smiley face" bit-mapped icon, with the iconic eyes and mouth located at the detected eye and mouth positions. The scale of the face can be indicated by the length of the iconic mouth, or by the size of a surrounding ellipse (or by the spacing of the eye markers). The tilt of the face can be indicated by the angle of the mouth (or the angle of the line between the eyes, or the tilt of a surrounding ellipse).
If the cell phone's processing yields a guess as to the gender of a person depicted in the image, this too can be represented in the auxiliary image channel. For example, an oval line tracing the periphery of a woman's face can be rendered dashed, or in another distinctive pattern. The ages of depicted persons can also be estimated and similarly indicated. The processing may likewise classify each person's emotional state from visual clues, and a representation such as surprise/happiness/sadness/anger/neutrality can be conveyed (see, e.g., "A Simple Approach to Facial Expression Recognition," Proceedings of the 2007 Int'l Conf. on Computer Engineering and Applications, Gold Coast, Queensland, Australia, p. 476; see also patent publications 20080218472 (Emotiv Systems, Pty) and 20040207720 (NTT DoCoMo)).
Where a determination has some uncertainty (as with guesses at gender, age range, or emotion), a confidence metric output by the analysis process can also be represented in iconic fashion, such as by the width or pattern of the line.
Figure 65 shows different pattern elements that can be used to denote different information, including gender and confidence, in an auxiliary image plane.
The portable device may also perform operations culminating in optical character recognition of alphanumeric symbols and strings depicted in the image data. In the illustrated example, the device recognizes the string "LAS VEGAS" in the imagery. This determination can be memorialized by a PDF417 2D barcode added to the alpha channel. The barcode can be placed at the location of the OCR'd text within the image frame, or elsewhere.
(PDF417 is exemplary only; other machine-readable data symbologies, such as 1D barcodes, Aztec, Datamatrix, High Capacity Color Barcode, Maxicode, QR Code, Semacode, and ShotCode, may also be used, as may OCR fonts and data glyphs. Glyphs can be used both to convey arbitrary data and to form halftone image representations; in this regard see Xerox patent 6,419,162, and "Printed Embedded Data Graphical User Interfaces" by Hecht, IEEE Computer Magazine, Vol. 34, No. 3, 2001, pp. 47-55.)
Figure 66 shows an alpha channel representation of some of the information determined by the device. All of this information is structured so that it can be conveyed within a single bit plane (of the 8 bit planes) of the alpha channel. Information derived from other processing operations (e.g., the analyses shown in Figures 50 and 61) can be conveyed in this same bit plane, or in different bit planes.
While Figures 62-66 illustrate various types of information that can be conveyed in the alpha channel, and different representations thereof, still more are shown in the example of Figures 67-69, which involves a cell phone snapshot of a new GMC truck and its owner.
Among other processing, the cell phone in this example processed the image data to recognize the text on the truck grill and on the owner's t-shirt, to recognize the owner's face, to recognize the pool and sky regions, and to identify the truck's model.
The sky was recognized by its position at the top of the frame, by a color histogram within a threshold distance of expected norms, and by weak spectral content at certain frequency coefficients (i.e., being substantially "flat"). The pool was recognized by texture and color. (Other techniques for recognizing such features are detailed, e.g., in Batlle, "A review on strategies for recognizing natural objects in color images," Image and Vision Computing, Vol. 18, Issue 6, May 2000, pp. 515-530; Hayashi, "Fast Labelling of Natural Scenes Using Enhanced Knowledge," Pattern Analysis and Applications, Vol. 4, pp. 20-27, March 2001; and Boutell, "Improved semantic region labeling based on scene context," IEEE Int'l Conf. on Multimedia and Expo; see also patent publications 20050105776 and 20050105775 (Kodak).) Trees could be similarly recognized.
The person's face in the image was detected using arrangements like those commonly employed in consumer cameras. Optical character recognition was performed on a data set produced by applying an edge detection algorithm to the input image, followed by Fourier and Mellin transforms. (The text GMC and LSU TIGERS was found, but the algorithm did not identify the text on the tires, nor other text on the t-shirt. With extra processing time, some of this missed text might have been decoded.)
The truck was first classed as a vehicle, then identified as a truck, and then finally identified, by pattern matching, as a Dark Crimson Metallic 2007 GMC Sierra Z-71 with extended cab. (This identification was aided by known reference truck images from resources such as the GM trucks website, Flickr, and fan sites devoted to identifying vehicles in Hollywood motion pictures, e.g., IMCDB-dot-com. Methods for vehicle make and model recognition are further detailed in "Localized Contourlet Features in Vehicle Make and Model Recognition," Proc. SPIE, Vol. 7251, 725105.)
Figure 68 shows an exemplary graphical, bitonal representation of this discerned information, as added to the alpha channel of the Figure 67 image. (Figure 69 shows the different planes of the composite image: red, green, blue, and alpha.)
The portion of the image detected as pool is represented by a uniform pattern of dots; the region depicting sky is represented as a grid of lines. (If the trees had been particularly identified, they could be labeled using one of these same patterns, but with different size/spacing/etc., or an entirely different pattern could be employed.)
The identification of the truck as a Dark Crimson Metallic 2007 GMC Sierra Z-71 with extended cab is encoded in a PDF417 2D barcode, scaled to the size of the truck and masked by its shape. Since PDF417 encodes information redundantly, with error-correction features, truncating the rectangular barcode in this fashion does not prevent the encoded information from being recovered.
The facial information is encoded in a second PDF417 barcode. This second barcode is oriented at 90 degrees relative to the truck barcode, and is scaled differently, to help downstream decoders distinguish the two separate symbols. (Other orientations could alternatively be used, e.g., 30 degrees, 45 degrees, etc.)
The face barcode is elliptical in shape, and may be bounded by an oval border (not depicted). The center of the barcode is located at the midpoint of the subject's eyes. The width of the barcode is twice the distance between the eyes, and its height is four times the distance between the line joining the eyes and the mouth.
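The geometry rules just stated can be sketched in a few lines (the function name is illustrative, and a roughly upright face is assumed so that the eye-to-mouth distance reduces to a vertical offset):

```python
def face_barcode_box(left_eye, right_eye, mouth):
    """Barcode placement per the stated rules: centered at the midpoint
    of the eyes; width = 2x the inter-eye distance; height = 4x the
    distance from the eye line to the mouth. Points are (x, y) tuples."""
    cx = (left_eye[0] + right_eye[0]) / 2.0
    cy = (left_eye[1] + right_eye[1]) / 2.0
    eye_dist = ((right_eye[0] - left_eye[0]) ** 2 +
                (right_eye[1] - left_eye[1]) ** 2) ** 0.5
    mouth_drop = abs(mouth[1] - cy)   # upright-face simplification
    return {"center": (cx, cy),
            "width": 2.0 * eye_dist,
            "height": 4.0 * mouth_drop}
```

For a tilted face, the mouth offset would instead be measured perpendicular to the eye line.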
The payload of the face barcode conveys information discerned from the face. In one embodiment, the barcode simply indicates the apparent presence of a face. In more sophisticated embodiments, eigenvectors computed from the facial image can be encoded. If a particular face is recognized, information identifying the person can be encoded. If the processor makes a determination of the subject's likely gender, this information can also be conveyed in the barcode.
The people appearing in images captured by consumer cameras and cell phones are not random: a significant proportion are recurring subjects, e.g., the owner's children, spouse, friends, the user him/herself, etc. There are often many previous images of these recurring subjects distributed among devices owned or used by the owner, such as PDAs, cell phones, home computers, network storage, etc. Many of these images are annotated with the names of the persons depicted. From such reference images, sets of characterizing facial vectors can be computed and used to identify subjects in new photos. (As is familiar, Google's Picasa service operates on this principle to identify persons in a user's photo collection; Facebook and iPhoto do likewise.) Such a library of reference facial vectors can be checked to try to identify the person depicted in the Figure 67 photo, and the identification can be represented in the barcode. (The identification can comprise the person's name and/or other identifier(s) by which the matched face is known, e.g., an index number in a database or contact list, a telephone number, a Facebook user name, etc.)
The text recognized from regions of the Figure 67 image is added to corresponding regions of the alpha channel frame, rendered in a reliably decodable OCR font. (OCR-A is depicted, although other fonts may be used.)
Various other information can be included in the Figure 68 alpha channel. For example, positions in the frame where the processor suspects text is present, but where the OCR could not successfully decode alphanumeric symbols (perhaps on the tires, or other characters on the person's shirt), can be identified by a corresponding pattern (e.g., a pattern of diagonal lines). The outline of the person (rather than just the representation of his face) can likewise be discerned by the processor and indicated by a corresponding border or fill pattern.
While the examples of Figures 62-66 and 67-69 illustrate various ways semantic metadata can be represented in an alpha channel, further techniques are shown in the example of Figures 70 and 71. Here, the user has captured a snapshot of a child at play (Figure 70).
The child's face is turned away from the camera and is captured with poor contrast. Yet even with this limited information, the processor can make reference to previous images of the user and make a probable identification: the user's first-born child, Matthew Doe (who is likely depicted in countless of the user's archived photos).
As shown in Figure 71, the alpha channel in this example conveys an edge-detected version of the user's image. Superimposed on the head of the child is a substitute image of the child's face. Such a substitute image can be selected for its composition (e.g., depicting two eyes, nose, and mouth) and its better contrast.
In some embodiments, each person known to the system has an iconic facial image that serves as a visual proxy for that person in various contexts. For example, some PDAs store contact lists that include facial images of the contacts; the user (or the contacts) may provide easily recognized iconic facial images. These iconic facial images can be scaled to match the head of the person depicted in the image, and added to the alpha channel at the corresponding facial location.
The alpha channel depicted in Figure 71 also includes a 2D barcode. This barcode can convey some or all of the other information discerned from processing of the image data (e.g., the child's name, a color histogram, exposure metadata, how many faces were detected in the image, transform coefficients, etc.).
To make the 2D barcode as robust as possible to compression and other image processing operations, its size is not fixed; rather, it is dynamically scaled based on circumstances, such as image characteristics. In the depicted embodiment, the processor analyzes the edge map to identify regions with uniform edginess (i.e., within a threshold range). The largest such region is selected, and the barcode is then scaled and positioned to occupy the central area of that region. (In subsequent processing, the edginess replaced by the barcode can be largely restored by averaging the edginess at points adjacent the barcode's four sides.)
In another embodiment, region size is balanced against edginess in determining where to place the barcode: low edginess is favored. In this alternative, a smaller region of lower edginess may be selected over a larger region of higher edginess. The size of each candidate region, minus a scaled value of its edginess, can serve as a metric for determining which region should host the barcode. This is the arrangement used in Figure 71, leading to placement of the barcode in the region to the left of Matthew's head, rather than in the larger, but edgier, region to the right.
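The region-scoring metric just described (area minus scaled edginess) can be sketched as follows; the weighting constant is a hypothetical tuning value, not from the source.

```python
def best_barcode_region(regions, edginess_weight=100000.0):
    """Pick the region to host the barcode. Each region is an
    (area_in_pixels, mean_edginess) pair; the score is area minus a
    scaled edginess value, so a smaller but smoother region can beat
    a larger but edgier one."""
    scores = [area - edginess_weight * edginess
              for area, edginess in regions]
    return max(range(len(regions)), key=scores.__getitem__)
```

With this weighting, a 60,000-pixel region of edginess 0.1 outscores a 100,000-pixel region of edginess 0.9, mirroring the Figure 71 placement choice.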
Although Figure 70 is relatively "edgy" (as contrasted, e.g., with Figure 62), much of the edginess may be irrelevant. In some embodiments the edge data is filtered, so that only the principal edges (e.g., those characterized by long, continuous line contours) are retained. Within the resulting empty areas of the filtered edge map, the processor can convey additional data. In one arrangement, the processor inserts a pattern indicating the particular color histogram bin in which the person's image colors predominate. (In a 64-bin histogram, requiring 64 different patterns, bin 2 may indicate that the red channel has a value of 0-63, the green channel a value of 0-63, and the blue channel a value of 64-127.) Other image metrics can be similarly conveyed.
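The 64-bin histogram scheme just described amounts to taking two bits (the quartile) from each channel. A minimal sketch, with bins numbered 1-64 so that bin 2 matches the parenthetical example:

```python
def histogram_bin(r, g, b):
    """Coarse 64-bin color bin: two bits (the quartile of 0-255) per
    channel. Numbered 1-64, so bin 2 means red 0-63, green 0-63,
    blue 64-127, matching the example in the text."""
    return (r // 64) * 16 + (g // 64) * 4 + (b // 64) + 1
```

The bin holding the plurality of a region's pixels would then select which of the 64 patterns to render into the alpha channel.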
Instead of using different patterns to represent different data, the empty regions of the filtered edge map can be filled with a noise-like signal that steganographically conveys the information (e.g., as a digital watermark). (A suitable watermarking technique is detailed in Digimarc's patent 6,590,996.)
It will be noted that some of the information in the alpha channel, when presented visually to a human in graphical form, conveys useful information. From Figure 63, a person can identify a man embracing a woman in front of a sign reading "WELCOME TO LAS VEGAS NEVADA." From Figure 64, one can additionally see the gray-scale faces and the outline of the scene. From Figure 66, one can further identify a barcode evidently conveying certain information, and two smiley icons showing the locations of the faces.
Likewise, from a rendering of the Figure 68 frame of graphical information, a viewer can identify the outline of a person, read LSU TIGERS from the person's t-shirt, and discern what appears to be the outline of a truck (aided by the clue of the GMC text).
From a rendering of the Figure 71 alpha channel data, a person can identify a child sitting on a floor, playing with toys.
The barcode in Figure 71, like the barcode in Figure 66, is conspicuous to a person examining the data, revealing the existence of encoded information, though not its content.
Other of the graphical content in the alpha channel may not be informative to a person on inspection. For example, if the child's name is steganographically encoded as a digital watermark in a noise-like signal in Figure 71, even the existence of such information may escape a human observer.
The examples described above illustrate the diversity of semantic information that can be conveyed in the alpha channel, and the variety of representational constructs that can be employed. Naturally, this is just a small sampling; the artisan can quickly adapt these teachings to the needs of particular applications, yielding many other and different embodiments. Thus, for example, any information that can be extracted from an image can be memorialized in the alpha channel using arrangements akin to those described herein.
It will be appreciated that image-related information can be added to the alpha channel at different times, by different processors, at different locations. For example, a sensor chip in a portable device may have on-chip processing that performs certain analyses and adds the resulting data to the alpha channel. The device may have another processor that performs additional processing, on the image data and/or on the results of the earlier analyses, and adds representations of these further results to the alpha channel. (These further results may be based in part on data obtained wirelessly from a remote source; for example, a consumer camera may link by Bluetooth to the user's PDA to obtain facial information from the user's contact files.)
The composite image file may then be transmitted from the portable device to an intermediate network node (e.g., at a carrier such as Verizon, AT&T, or T-Mobile, or at another service provider), which performs additional processing and adds its results to the alpha channel. (With more capable processing hardware, such intermediate network nodes can perform more complex, resource-intensive operations, such as more sophisticated facial recognition and pattern matching. And with higher-bandwidth network access, they can draw on a variety of remote resources to augment the alpha channel with additional data, e.g., links to Wikipedia entries, or information from the Wikipedia content itself, telephone database lookups, image database lookups, etc.) The image may thereafter be forwarded to an image query service provider (e.g., SnapNow, MobileAcuity, etc.), which can continue the processing and/or launch a responsive action based on the information thus provided.
The alpha channel can thus convey an iconic view of everything that the preceding processing has discerned and learned about the image. Each subsequent processor can readily access this information and contribute to it further, all within the constraints of existing workflow channels and long-established file formats.
In some embodiments, the provenance of some or all of the discerned/inferred data is indicated. For example, stored data may indicate that the OCR that produced certain text was performed at 8:35 pm on August 28 by a processor with a MAC address of 01-50-F3-83-AB-CC, or by a Verizon server with a unique identifier, such as the network identifier PDX-LA002290.corp.verizon-dot-com. This information can be stored in the alpha channel, in header data, in a remote repository to which a pointer is provided, etc.
Different processors may contribute to different bit planes of the alpha channel. The capture device may record its information in bit plane #1; an intermediate node may store its contributions in bit plane #2. Certain bit planes may be available for shared use.
Or different bit planes may be allocated to different classes or types of semantic information. Information relating to faces or persons in the image may always be written to bit plane #1. Information relating to places may always be written to bit plane #2. Edge map data may always be found in bit plane #3, together with color histogram data (e.g., in the form of a 2D barcode). Other content labeling (e.g., grass, sand, sky) may be found in bit plane #4, together with OCR'd text. Textual information, such as text content obtained from the web, or related links, may be found in bit plane #5. (ASCII symbols may be included as bit patterns, e.g., with each symbol occupying 8 bits in the plane. Robustness to subsequent processing can be improved by assigning two or more bits in the image plane to each bit of ASCII data; convolutional coding and other error-correction techniques can also be employed for some or all of the image plane information. Again, error-correcting barcodes may be used.)
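Such a per-class bit-plane allocation can be sketched as follows. The assignment table mirrors the scheme above; the dictionary keys and function names are illustrative assumptions.

```python
import numpy as np

# Hypothetical plane assignments following the scheme in the text
# (bit plane #1 is taken as the least significant bit here):
PLANE = {"faces": 0,            # bit plane #1: faces / persons
         "places": 1,           # bit plane #2: place-related data
         "edges_histogram": 2,  # bit plane #3: edge map + histogram
         "labels_ocr": 3,       # bit plane #4: content labels, OCR text
         "web_text": 4}         # bit plane #5: web text / links

def write_plane(alpha, kind, mask):
    """Set/clear the bit plane assigned to 'kind' according to mask."""
    bit = np.uint8(1 << PLANE[kind])
    return np.where(mask, alpha | bit, alpha & ~bit)

def read_plane(alpha, kind):
    """Recover the one-bit plane assigned to 'kind'."""
    return (alpha >> np.uint8(PLANE[kind])) & 1
```

A downstream processor interested only in, say, place-related data can thus mask off a single bit plane without parsing the rest.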
An index to the information conveyed in the alpha channel can be compiled, e.g., in the EXIF header associated with the image, allowing subsequent systems to speed their interpretation and processing of such data. The index can employ XML-like tags specifying the types of data conveyed in the alpha channel, and optionally other information (e.g., their locations).
Locations may be specified as the position of the upper-most bit (or the upper-left-most bit) of the data in the bit-plane array, e.g., by X-, Y-coordinates. Or a rectangular bounding box may be specified by reference to two corner points (each given by X-, Y-coordinates), detailing the region in which the information is represented.
In the example of FIG. 66, the index may convey information such as:
<MaleFace1>AlphaBitPlane1(637,938)</MaleFace1>
<FemaleFace1>AlphaBitPlane1(750,1012)</FemaleFace1>
<OCRTextPDF417>AlphaBitPlane1(75,450)-(1425,980)</OCRTextPDF417>
<EdgeMap>AlphaBitPlane1</EdgeMap>
This index thus indicates that a male face is found in bit plane #1 of the alpha channel, with its upper-most pixel at position (637,938); that a female face is similarly represented, with upper-most pixel at (750,1012); that OCR'd text encoded as a PDF417 barcode is found in a rectangular region of bit plane #1 with corner points (75,450) and (1425,980); and that bit plane #1 also includes an edge map of the image.
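A downstream system might parse such index entries along the following lines. This is a sketch only; the regular expression and function name are illustrative, not part of any standard.

```python
import re

INDEX_ENTRY = re.compile(
    r"<(?P<tag>\w+)>\s*AlphaBitPlane(?P<plane>\d+)\s*"
    r"(?:\((?P<x1>\d+),\s*(?P<y1>\d+)\)\s*"
    r"(?:-\s*\((?P<x2>\d+),\s*(?P<y2>\d+)\))?)?"
    r"\s*</\s*(?P=tag)\s*>")

def parse_index(lines):
    """Return (tag, plane, point1, point2) tuples for each index entry;
    the points are None when an entry carries no location data."""
    entries = []
    for line in lines:
        m = INDEX_ENTRY.search(line)
        if not m:
            continue
        d = m.groupdict()
        p1 = (int(d["x1"]), int(d["y1"])) if d["x1"] else None
        p2 = (int(d["x2"]), int(d["y2"])) if d["x2"] else None
        entries.append((d["tag"], int(d["plane"]), p1, p2))
    return entries
```

Single points identify an upper-most pixel; a point pair identifies the bounding box of a region, as described above.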
Indexes of differing granularity can be provided. A different type of index, with less information, may be specified, for example, as follows:
<AlphaBitPlane1>Face, Face, PDF417, EdgeMap</AlphaBitPlane1>
This type of index simply indicates that bit plane #1 of the alpha channel includes two faces, a PDF417 barcode, and an edge map.
An index with more information might include, e.g., the rotation angle and scale factor for each face; the LAS VEGAS payload of the PDF417 barcode; the angle of that barcode; confidence factors for subjective determinations; the names of recognized persons; a key to the graphical glyphs and labels used (e.g., the patterns of Figure 65, and the labels applied to the sky and pool in Figure 68); the sources of auxiliary data employed (e.g., the reference image data that served as a basis for the conclusion that the truck in Figure 67 is a Sierra Z71); and the like.
As can be seen, the index can mirror information conveyed in the bit planes of the alpha channel. Generally, different types of representation are used in the graphical depictions of the alpha channel versus the index. For example, in the alpha channel the femaleness of the second face is denoted by "+" symbols representing the eyes; in the index, femaleness is denoted by the XML tag <FemaleFace1>. This redundant representation of information can serve as a check on data integrity.
Sometimes header information, such as EXIF data, becomes separated from the image data (e.g., when the image is converted to a different format). Instead of conveying the index in the header, a bit plane of the alpha channel, e.g., bit plane #1, can serve to convey the index information. One such arrangement encodes the index information as a 2D barcode. The barcode can be scaled to fill the frame, providing maximum robustness against possible image degradation.
In some embodiments, some or all of the index information is replicated in different data stores, e.g., conveyed both in the EXIF header and as a barcode in bit plane #1. Some or all of the data may also be maintained remotely, such as at Google or another web storage "cloud," with address information conveyed with the image serving as a pointer to that remote store. The pointer (which may be a URL, but is more generally a UID or an index into a database from which the current address of the sought data is returned on request) can be included in the index and/or in one or more bit planes of the alpha channel. Or the pointer can be steganographically encoded within the pixels of the image data (in some or all of the composite image planes) using digital watermarking technology.
In still other embodiments, some or all of the information described above as stored in the alpha channel may additionally or alternatively be stored remotely, or encoded in the image pixels as a digital watermark. (The image itself may or may not be duplicated in the remote storage, with or without the alpha channel, by any device in the processing chain.)
Some image formats contain more than the four planes described above. Geospatial imagery and other mapping technologies commonly represent data in formats extending to half a dozen or more information planes. For example, a multispectral space-based image may have individual image planes devoted to (1) red, (2) green, (3) blue, (4) near-infrared, (5) mid-infrared, and (6) far-infrared. The techniques described above can convey derived/inferred image information using one or more of the ancillary data planes available in such formats.
As an image moves between processing nodes, some nodes may overwrite data inserted by earlier processing. Although not required, the overwriting processor may copy the overwritten information to a remote storage device, and include in the image, index, or alpha channel a link or other reference to it.
When representing information in an alpha channel, consideration can be given to the degradations this channel may experience. JPEG compression, for example, generally discards high frequency details that do not contribute significantly to human perception of an image. However, this discarding of information based on the human visual system can be disadvantageous when applied to information present for other purposes (although human viewing of the alpha channel is certainly possible, and in some cases useful).
To mitigate such degradation, the information in the alpha channel may be represented by features that are unlikely to be regarded as visually irrelevant. Different types of information can be represented by different features, chosen in view of their expected robustness to compression. Thus, for example, the presence of faces in Figure 66 is indicated by thick ellipses. The positions of the eyes are less critical, and are represented by smaller features. The fine patterns shown in Figure 65 may not be reliably distinguishable after compression, so they may be reserved for less important information whose loss can be tolerated. With JPEG compression, the most significant bit planes are best preserved, while the less significant bit planes suffer progressively more errors. Thus, the most important metadata is desirably conveyed in the most significant bit planes of the alpha channel, to improve its survivability.
If descriptions of the kind shown by Figures 62-71 become a common language for conveying metadata, image compression will likely evolve to take their existence into account. For example, JPEG compression can be applied to the red, green, and blue image channels, while lossless (or low loss) compression is applied to the alpha channel. Since the various bit planes of the alpha channel can carry different information, they can be compressed separately, rather than as a byte of 8-bit depth. (If compressed separately, lossy compression may be more readily accommodated.) Since each bit plane carries bi-level information, compression schemes known from facsimile technology (modified Huffman, modified READ, run-length encoding, and ITU-T T.6) can be used. Hybrid compression techniques are thus well suited for such files.
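The per-bit-plane idea can be illustrated with a simple run-length code, standing in here for the fax-era schemes (modified Huffman, modified READ, ITU-T T.6) named above:

```python
# Sketch of the hybrid compression idea: each alpha-channel bit plane is
# a bi-level image, so a run-length code can be applied to each plane
# separately, rather than compressing the alpha channel as 8-bit bytes.

def rle_encode(bits):
    """Encode a bi-level sequence as (value, run_length) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return [tuple(r) for r in runs]

def rle_decode(runs):
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out

plane = [0] * 40 + [1] * 8 + [0] * 16       # one row of a sparse bit plane
runs = rle_encode(plane)
assert runs == [(0, 40), (1, 8), (0, 16)]
assert rle_decode(runs) == plane
```

A sparse plane (mostly background, a few symbols) compresses to a handful of runs, illustrating why per-plane coding can beat byte-oriented coding of the whole channel.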
The alpha channel conveyance of metadata may be configured so that it progressively transmits and decodes in general correspondence with the associated image features, using compression arrangements such as JPEG 2000. That is, since the alpha channel provides semantic information in the visual domain (e.g., as iconography), it can be authored to decompress through layers of semantic detail at the same rate as the image.
In JPEG 2000, a wavelet transform is used to generate data representing the image. JPEG 2000 packages and processes this transform data in a manner that yields progressive transmission and decoding. For example, when rendering a JPEG 2000 image, the gross details of the image appear first, followed successively by finer details. Transmission is similar.
Consider the truck and man image of Figure 67. Its JPEG 2000 version renders the low frequency, bold-line representation of the truck first. Then the shape of the man appears. Next, features such as the GMC letters on the truck grill and the logo on the man's T-shirt become distinguishable. Finally, the facial features of the man, the details of the grass, the details of the trees, and other high frequency minutiae complete the rendering of the image. Transmission is similar.
This progression is shown in the pyramid of Figure 77A. Initially, a relatively small amount of information is presented, providing the gross details of shape. Gradually the image is filled in, finally ending with a relatively large amount of fine detail data.
The information in the alpha channel may be similarly configured (Figure 77B). The information about the truck can be represented by large, low frequency (shape-dominant) symbols. The presence and location of the man can be encoded in a next-most-dominant representation. The information corresponding to the GMC letters on the truck grill and the lettering on the man's shirt can be represented in fine detail in the alpha channel. The finest details in the image, e.g., the facial minutiae of the man, can be represented in the finest detail in the alpha channel. (As will be appreciated, the exemplary alpha channel of Figure 68 does not closely follow this model.)
If the alpha channel conveys its information in the form of machine-readable symbologies (e.g., bar codes, digital watermarks, glyphs, etc.), the order of alpha channel decoding can be deterministically controlled: symbols with the largest features are decoded first; symbols with the finest features are decoded later. Thus, the alpha channel may convey bar codes at several different sizes (all in the same bit frame, e.g., distributed side by side, or distributed between bit frames). Or the alpha channel may convey a plurality of digital watermark signals, e.g., one at a gross resolution (e.g., 10 watermark elements, or "waxels," to the inch) and others at progressively finer resolutions. The same is true for data glyphs: a range of larger and smaller glyph sizes may be used, and these will be decoded relatively earlier or later accordingly.
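A minimal sketch of this coarse-to-fine decode ordering follows; the symbol records and their feature sizes are hypothetical stand-ins for decoded alpha channel symbology:

```python
# Sketch: if the alpha channel carries machine-readable symbols at
# several scales, decoding can proceed coarse-to-fine, mirroring
# JPEG 2000's progressive rendering. Records here are illustrative.

symbols = [
    {"payload": "facial minutiae", "feature_px": 2},
    {"payload": "truck outline",   "feature_px": 64},
    {"payload": "GMC grill text",  "feature_px": 8},
    {"payload": "man present",     "feature_px": 32},
]

# Largest (lowest frequency) features are decoded first.
order = sorted(symbols, key=lambda s: s["feature_px"], reverse=True)
assert [s["payload"] for s in order] == [
    "truck outline", "man present", "GMC grill text", "facial minutiae"]
```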
(JPEG 2000 is the most common compression scheme exhibiting this progressive behavior, but others exist. JPEG can, with some effort, behave similarly. These concepts are applicable whenever such progressivity is available.)
With such arrangements, corresponding metadata becomes available as the image features are decoded for presentation, or as they are transmitted (e.g., in streaming media delivery).
It will be appreciated that the results contributed to the alpha channel by the various distributed processing nodes are immediately available to each subsequent recipient of the image. Thus, a service provider receiving a processed image can quickly understand, for example, that Figure 62 depicts a man and woman in Las Vegas; that Figure 63 depicts the man and his GMC truck; and that Figure 70 depicts a child named Matthew Doe. The edge map, color histogram, and other information conveyed with these images give the service provider a head start in processing the image, e.g., to augment it, recognize its contents, and initiate an appropriate response.
Recipient nodes can also use the conveyed data to enhance stored profile information associated with the user. A node receiving the metadata of Figure 66 may note Las Vegas as a location of potential interest. A system receiving the metadata of Figure 68 may deduce that GMC Z71 trucks are associated with the user and/or with the person depicted in the photo. These associations can serve as launch points for tailored user experiences.
The metadata also allows images with certain attributes to be quickly identified in response to user queries (e.g., find pictures showing GMC Sierra Z71 trucks). Desirably, web-indexing crawlers examine the alpha channels of images found on the web, and add information from the alpha channel to the compiled index, making the images more readily identifiable to searchers.
As will be recognized, use of the alpha channel is not necessary to practice the techniques described in this specification. One alternative is a data structure indexed by the coordinates of image pixels. The data structure may be conveyed with the image file (e.g., as EXIF header data), or stored at a remote server.
For example, one entry in the data structure corresponding to pixel (637, 938) in Figure 66 may indicate that the pixel forms part of the man's face. A second entry for this pixel may point to a shared sub-data structure in which eigenface values for this face are stored. (The shared sub-data structure may also list all the pixels associated with that face.) A data record corresponding to pixel (622, 970) may indicate that the pixel corresponds to the left eye of the man's face. A data record indexed by pixel (155, 780) may indicate that the pixel forms part of text recognized (by OCR) as the letter "L," and also falls in color histogram bin 49. The source of each datum of information may also be recorded.
(Instead of identifying each pixel by its X- and Y-coordinates, each pixel can be assigned a sequential number by which it is referenced.)
Instead of multiple pointers from different pixels' data records to a common sub-data structure, the entries may form a linked list, in which each pixel sharing a common attribute (e.g., association with the same face) includes a pointer to the next such pixel. A record for a pixel may include pointers to a plurality of different sub-data structures, or to a plurality of different pixels, so as to associate the pixel with a plurality of different image features or data.
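A sketch of such a pixel-indexed store follows, using a shared sub-data structure for a detected face; all field names and the eigenface values are illustrative, not drawn from any standard:

```python
# Sketch of the pixel-indexed data structure described above: per-pixel
# records point into shared sub-structures (e.g., one per detected face),
# and the sub-structure lists back all of its member pixels.

faces = {"face1": {"eigenface": [0.12, -0.45, 0.07], "pixels": []}}

pixel_meta = {}   # (x, y) -> list of annotations for that pixel

def annotate(x, y, kind, ref=None):
    pixel_meta.setdefault((x, y), []).append({"kind": kind, "ref": ref})
    if ref in faces:
        faces[ref]["pixels"].append((x, y))

annotate(637, 938, "face", ref="face1")       # part of the man's face
annotate(622, 970, "left-eye", ref="face1")   # left eye of that face
annotate(155, 780, "ocr-text-L")              # OCR'd letter "L"

assert (637, 938) in faces["face1"]["pixels"]
assert pixel_meta[(622, 970)][0]["kind"] == "left-eye"
```

A record can hold several annotations, associating one pixel with multiple image features, as the text notes.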
If the data structure is stored remotely, a pointer to the remote store can be included with the image file, e.g., steganographically encoded in the image data, or expressed as EXIF data. If a watermarking arrangement is used, the origin of the watermark (see Digimarc patent 6,307,949) can serve as the base from which pixel references are offset (e.g., instead of using the upper left corner of the image). Such an arrangement allows pixels to be correctly identified despite corruptions such as cropping or rotation.
As with the alpha channel data, the metadata recorded in the remote store is desirably available for search. A web crawler that encounters an image can use the steganographically encoded watermark, or a pointer in the EXIF data, to identify the corresponding repository of metadata, and can add metadata from that repository to its index terms for the image (even if the image is found at different locations).
It will be appreciated that, with the arrangements described above, existing image standards, workflows, and ecosystems, originally designed to support pixel image data, are leveraged in support of the metadata detailed in this disclosure as well.
(Of course, the alpha channel and the other schemes described in this section are not essential to other aspects of the present technology. Information derived or inferred from processes such as those shown in Figures 50, 57, and 61 may, for example, be dispatched as packetized data using WiFi or WiMax, transmitted from the device using Bluetooth, sent as SMS short text or MMS multimedia messages, shared with other nodes in a low power peer-to-peer wireless network, conveyed by wireless cellular transmission, carried by other wired or wireless data services, etc.)
U.S. patents 5,602,566 (Hitachi), 6,115,028 (Silicon Graphics), 6,201,554 (Ericsson), 6,466,198 (Innoventions), 6,573,883 (Hewlett-Packard), 6,624,824 (Sun) and 6,956,564 (British Telecom), and published PCT application WO9814863, disclose that portable computers can be equipped with tilt sensors, whose output can be used for various purposes (e.g., scrolling through menus).
In accordance with another aspect of the present technology, such a tip/tilt interface is employed in connection with typing operations, such as composing text messages sent by the short message service (SMS) protocol from a PDA, cell phone, or other portable wireless device.
In one particular embodiment, the user activates a tip/tilt text entry mode by any of various known means (e.g., pushing a button, entering a gesture, etc.). A scrollable user interface appears on the device screen, presenting a series of icons. Each icon has the appearance of a cell phone key, such as a button depicting the numeral "2" and the letters "abc." The user tilts the device left or right to scroll backward or forward through the series of icons to reach the desired button. The user then tips the device toward or away from themselves to navigate among the three letters associated with that icon (e.g., tipping away to navigate toward "a"; no tipping, corresponding to "b"; tipping toward the user to navigate toward "c"). After navigating to the desired letter, the user takes an action to select it. This action may be pressing a button on the device (e.g., with the user's thumb), or another action may signal the selection. The user then proceeds in like fashion to select subsequent letters. By such an arrangement, the user enters a series of characters without the constraints of large fingers on small buttons or UI features.
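The entry scheme just described can be sketched as a small state machine; the event names and the default middle-letter slot are assumptions made for illustration:

```python
# Minimal sketch of the tip/tilt text-entry state machine: left/right
# tilt scrolls among phone-key icons; forward/back tip picks among the
# letters on the current key; "select" commits the highlighted letter.

KEYS = ["abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz"]

class TipTiltEntry:
    def __init__(self):
        self.key = 0        # which icon is highlighted
        self.slot = 1       # middle letter by default (no tipping)
        self.text = []

    def handle(self, event):
        if event == "tilt-right":
            self.key = (self.key + 1) % len(KEYS)
        elif event == "tilt-left":
            self.key = (self.key - 1) % len(KEYS)
        elif event == "tip-away":
            self.slot = max(0, self.slot - 1)
        elif event == "tip-toward":
            self.slot = min(len(KEYS[self.key]) - 1, self.slot + 1)
        elif event == "select":
            self.text.append(KEYS[self.key][self.slot])
            self.slot = 1   # reset to the middle letter

e = TipTiltEntry()
for ev in ["tilt-right", "tip-away", "select"]:   # key "def", letter "d"
    e.handle(ev)
assert "".join(e.text) == "d"
```

In a real device the events would come from an accelerometer or, per the camera-based variants below, from optical flow in captured frames.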
Many variations are possible. The device need not be a phone; it may be a wristwatch, a key fob, or another small form factor device.
The device may have a touch screen. After navigating to a desired character, the user can tap the touch screen to effect the selection. As the device is tipped/tilted, the corresponding letters can be displayed on the screen in an enlarged fashion (e.g., overlaid on the button icon, or elsewhere) to indicate the user's progress in navigation.
While accelerometers or other physical sensors are used in certain embodiments, others use 2D optical sensors (e.g., cameras). The user can point the sensor at the floor, a knee, or another object, and the device then senses relative physical movement by sensing movement (up/down, left/right) of features within the image frame. In such embodiments, the image frame captured by the camera need not be presented on the screen; the symbol selection UI alone can be displayed. (Alternatively, the UI can be presented as an overlay on the background image captured by the camera.)
In camera-based embodiments, other dimensions of motion can also be sensed (e.g., up/down, as in embodiments using physical sensors). This can provide additional degrees of control (e.g., shifting to capital letters, shifting from letters to numbers, or effecting selection of the current symbol).
In some embodiments, the device has several modes: one for entering text; another for entering numbers; another for entering symbols; etc. The user can switch between these modes using a mechanical control (e.g., a button) or through a user interface control (e.g., touches, gestures, or voice commands). For example, tapping a first area of the screen may select the currently displayed symbol, while tapping a second area of the screen may toggle the mode between character entry and numeric entry. Or one tap in this second area may switch to character entry (the default); two taps in this area may switch to numeric entry; and three taps in this area may switch to entry of other symbols.
Instead of selecting between individual symbols, such an interface can also include common words or phrases (e.g., signature blocks) to which the user can tip/tilt navigate, and then select. There may be several lists of words/phrases. For example, a first list can be standardized (pre-programmed by the device vendor) and include statistically common words. A second list may comprise words and/or phrases associated with a particular user (or a particular class of users). The user can enter words into this list, or the device can compile the list during operation, noting which words are most commonly entered by the user. (The second list may or may not exclude words found on the first list.) Again, the user can switch between these lists as described above.
Desirably, the sensitivity of the tip/tilt interface is adjustable by the user, to accommodate different user preferences and skills.
Although the embodiments just described contemplate a limited grammar of tilts/tips, more extensive grammars can be devised. For example, a relatively slow tilt to the left can cause the icons to scroll in a given direction (left or right, depending on the implementation), while a sudden tilt in that direction can insert a line (or paragraph) break. A sharp tilt in the other direction can cause the device to send the message.
Instead of the speed of the tilt, the angle of the tilt can correspond to different actions. For example, tilting the device between 5 and 25 degrees can cause the icons to scroll, while tilting the device beyond 30 degrees can effect a different action, such as inserting a line break (if tilted left).
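The angle thresholds in this example can be sketched directly; the threshold values (5, 25, 30 degrees) come from the text, while the action names are illustrative:

```python
# Sketch mapping tilt angle (degrees, signed: negative = left) to
# actions per the angle-based grammar above: a moderate tilt scrolls,
# a sharp tilt triggers a distinct action such as a line break.

def action_for_tilt(angle_deg):
    a = abs(angle_deg)
    if 5 <= a <= 25:
        return "scroll-left" if angle_deg < 0 else "scroll-right"
    if a >= 30:
        return "line-break"
    return "none"   # dead zone below 5 degrees, or between 25 and 30

assert action_for_tilt(-12) == "scroll-left"
assert action_for_tilt(35) == "line-break"
assert action_for_tilt(2) == "none"
```

The dead zone between 25 and 30 degrees keeps the two gestures from being confused at their boundary.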
Different tip gestures can trigger different actions as well.
The arrangements just described are but a few of the many possibilities. Artisans adopting such technology are expected to modify and adapt these teachings as suited for particular applications.
Affine Capture parameters
According to another aspect of the present technology, a portable device captures, and may present, geometric information relating to the position of the device (or the position of a subject).
Digimarc's published patent application 20080300011 discloses various arrangements by which cell phones can be made responsive to what they "see," including overlaying graphical features atop certain imaged objects. The overlay can be warped in accordance with the object's perceived affine distortion.
Steganographic calibration signals by which affine distortion of an imaged object can be accurately quantified are detailed, for example, in Digimarc's patents 6,614,914 and 6,580,809, and in patent publications 20040105569, 20040101157, and 20060031684. Digimarc's patent 6,959,098 teaches how distortion can be characterized by such watermark calibration signals in conjunction with visible image features (e.g., the edges of a rectangular object). From such affine distortion information, the 6D location of a watermarked object relative to the imager of a cell phone can be determined.
There are various ways in which 6D location can be described. One is by three position parameters (x, y, z) and three angle parameters (tip, tilt, rotation). Another is by a 2x2 matrix of four elements defining a linear transformation (e.g., rotation, scaling, shear mapping), together with a pair of translation parameters. The matrix maps the position of any pixel (x, y) to its resulting position after the linear transformation has occurred. (Readers unfamiliar with matrix math and shear mapping may consult references such as Wikipedia for more information.)
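The matrix description can be made concrete with a short sketch; here the 2x2 matrix encodes a 90-degree rotation with uniform scale 2, plus a translation, and the specific numbers are purely illustrative:

```python
# Sketch of the second 6-parameter description above: a 2x2 matrix
# (rotation, scale, shear) plus a translation (tx, ty) maps each pixel
# (x, y) to its transformed position.

import math

def affine_apply(m, t, x, y):
    """m = [[a, b], [c, d]]; t = (tx, ty); returns the mapped point."""
    return (m[0][0] * x + m[0][1] * y + t[0],
            m[1][0] * x + m[1][1] * y + t[1])

theta, s = math.pi / 2, 2.0          # 90-degree rotation, scale 2
m = [[s * math.cos(theta), -s * math.sin(theta)],
     [s * math.sin(theta),  s * math.cos(theta)]]

x2, y2 = affine_apply(m, (10, 0), 1, 0)
assert abs(x2 - 10) < 1e-9 and abs(y2 - 2) < 1e-9
```

Recovering the six parameters from a watermark calibration signal amounts to inverting this mapping from observed point correspondences.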
FIG. 58 illustrates how a cell phone can display affine parameters (e.g., derived from watermark data or otherwise). The camera can be placed in this mode through a UI control (e.g., tapping a physical button, making a touch screen gesture, etc.).
In the depicted arrangement, the rotation of the device from the (apparent) horizon is presented at the top of the cell phone screen. The cell phone processor can make this determination by analyzing the image data for one or more generally parallel, elongated straight edge features, averaging their angles, and assuming the average is horizontal. If the camera is aligned with the horizon, this average line will be horizontal; divergence of this line from horizontal indicates the camera's rotation. This information can be presented textually (e.g., "12 degrees right"), and/or a graphical representation showing divergence from horizontal can be used.
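The edge-averaging determination can be sketched as follows; the edge endpoints stand in for the output of a hypothetical edge detector:

```python
# Sketch: estimating camera rotation by averaging the angles of long,
# roughly parallel edge features and assuming their mean is horizontal.

import math

def rotation_from_edges(edges):
    """edges: list of ((x1, y1), (x2, y2)) near-horizontal segments.
    Returns the mean deviation from horizontal, in degrees."""
    angles = [math.degrees(math.atan2(y2 - y1, x2 - x1))
              for (x1, y1), (x2, y2) in edges]
    return sum(angles) / len(angles)

# Two parallel edges, each rising 12 pixels over a 100-pixel run.
edges = [((0, 0), (100, 12)), ((0, 50), (100, 62))]
deg = rotation_from_edges(edges)
assert abs(deg - 6.84) < 0.01    # camera rotated roughly 7 degrees
```

A signed result could drive the "N degrees left/right" text readout described above.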
(Other methods can also be employed. For example, many cell phones include accelerometers or other tilt detectors, which output data from which the cell phone processor can discern the angular orientation of the device.)
In the illustrated embodiment, when the camera is in this mode of operation, it captures a sequence of image frames (e.g., video). A second datum indicates the angle by which features in the image frame have rotated since the start of image capture. Again, this information can be gleaned by analysis of the image data, and presented textually and/or graphically. (The graphic may comprise a circle, with an arrow or line through the center, showing the real-time angular movement of the camera to the left or right.)
In like fashion, the device can track changes in the apparent size of edges, objects, and/or other features in the image, to determine how much the scale has changed since image capture began. This indicates whether the camera has moved toward or away from the subject, and by how much. Again, this information can be presented textually and graphically. The graphical representation can comprise two lines: a fixed reference line, and a second, parallel line whose length changes in real time in accordance with the scale change (growing longer as the camera moves toward the subject; shrinking as it moves away).
Other geometric data, such as translation, differential scaling, and tip angle (i.e., forward/backward), can also be derived and/or presented, although these are not particularly shown in the exemplary embodiment of FIG. 58.
The determinations described above are simplified when the camera's field of view includes a digital watermark bearing steganographic calibration/orientation data of the sort detailed in the referenced patent documents. However, the information can also be derived from other features in the imagery.
Of course, in other embodiments, data from one or more accelerometers or other position-sensing arrangements in the device can be used in generating the presented information, either alone or in conjunction with the image data.
In addition to presenting such geometric information on the device screen, the information can also be used otherwise, e.g., as context by which a remote system can customize its response, or in detecting gestures made with the device by the user.
Camera-based environment and behavior machine
In accordance with another aspect of the present technology, a cell phone functions as a state machine, changing aspects of its functioning based on, e.g., image-related information previously obtained. The image-related information can concern the natural behavior of the camera user, the typical conditions in which the camera is operated, the inherent physical properties of the camera itself, the structure and dynamic properties of the scenes being imaged, etc. The resulting changes to the camera's functioning can be directed toward improving image analysis programs, whether resident on the camera device or located remotely on an image-analysis server. Image analysis is here construed very broadly, covering the full range of analyses from digital watermark reading to object and face recognition, 2D and 3D bar code reading, optical character recognition, and scene categorization.
Some simple examples will show what is expected to be an important aspect of future mobile devices.
Consider the problem of object recognition. Most objects have different appearances depending on the angle from which they are viewed. Machine vision object-recognition algorithms can make more accurate (and faster) guesses about what an object is if given some information about the perspective from which the object is viewed.
People are creatures of habit, including in their use of cell phone cameras. This extends to how they typically hold and tilt the phone while taking pictures. After a user establishes a history with a phone, such usage patterns can be discerned from the captured images. For example, a user may tend to photograph subjects not head-on, but from a bit to the right. This rightward bias in perspective may stem from the user holding the camera in the right hand, so that exposures tend to be taken from slightly right of center.
(This rightward bias can be sensed in various ways, for example, by the lengths of near-vertical parallel edges in the image frames: if edges toward the right side of the frame are longer, this suggests the images were taken from a right-oblique view. Differences in illumination across foreground subjects can also be used: brighter illumination on the right side of subjects suggests the right side was closer to the lens.)
Similarly, in order to easily operate the phone's shutter button while holding the device, this particular user may habitually adopt a grip that tilts the top of the camera about five degrees to the left. In the captured images, this generally manifests as an apparent rotation of the photographed objects by five degrees.
Such recurring biases can be discerned by examining a collection of images captured by the cell phone and its user. Once identified, data memorializing these tendencies can be stored in memory and used to optimize image recognition processes performed by the device.
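A minimal sketch of how such a bias could be accumulated and consulted follows; the sample threshold and rotation figures are illustrative:

```python
# Sketch of the state-machine idea: the device accumulates per-image
# rotation estimates over time, and once a stable bias emerges (e.g.,
# the habitual five-degree tilt described above), recognition can be
# run on a counter-rotated frame first.

class BiasModel:
    def __init__(self, min_samples=20):
        self.samples = []
        self.min_samples = min_samples

    def observe(self, rotation_deg):
        """Record one image's estimated rotation."""
        self.samples.append(rotation_deg)

    def bias(self):
        """Habitual rotation, or 0.0 until enough history exists."""
        if len(self.samples) < self.min_samples:
            return 0.0
        return sum(self.samples) / len(self.samples)

m = BiasModel(min_samples=3)
for r in [4.8, 5.3, 4.9]:
    m.observe(r)
assert abs(m.bias() - 5.0) < 0.2   # ~5-degree habitual tilt discerned
```

A recognition pipeline would de-rotate incoming frames by `bias()` degrees before attempting identification, improving its first-guess accuracy.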
Thus, a device may produce a first output (e.g., a tentative object identification) from a given image frame at one time, yet produce a second, different output from the same image frame at a later time, due to intervening use of the camera.
The characteristic pattern of the user's hand jitter can also be inferred by examination of multiple images. For example, by examining images of different exposure durations, it may emerge that the user exhibits jitter with a frequency of about 4 Hertz, predominantly in the left-right (horizontal) direction. Sharpening filters tuned to this jitter behavior (and dependent on the length of the exposure) can then be applied to enhance the resulting imagery.
In like fashion, through use, the device may determine that images captured by this user during weekday hours of 9:00 to 5:00 are generally illuminated with light having the spectral characteristics of fluorescent lighting, for which a rather large white balance compensation must be applied. With foreknowledge of this tendency, the device can expose photographs captured during those hours differently than with baseline exposure parameters, anticipating the fluorescent lighting so that better white balance is achieved.
Over time, the device thus derives information that models aspects of the user's habitual behavior, or of environmental variables. Thereafter, it adapts aspects of its operation accordingly.
A device can also adapt to its own characteristics or degradations. These include non-uniformities among the photodiodes of the image sensor, dust on the image sensor, scratches on the lens, and the like.
Again, over time, the device may detect recurring patterns, e.g.: (a) one pixel gives an average output signal that is 2 percent lower than adjacent pixels; (b) a contiguous group of pixels tends to output signals about three digital numbers lower than the image average would indicate; (c) a certain region of the photosensor does not seem to capture high frequency detail, with imagery in that region being consistently a bit blurred. From such recurring phenomena, the device can infer, e.g., that (a) the amplifier serving that pixel has a low gain; (b) dust or another foreign object is occluding those pixels; or (c) a lens flaw prevents light in that region of the photosensor from being properly focused. Appropriate compensations can thereafter be applied to mitigate these shortcomings.
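One way the weak-pixel compensation might be sketched, using a 1D "sensor row" standing in for the full 2D array; the frame values and neighborhood size are illustrative:

```python
# Sketch: detecting a weak photosite by averaging many frames, then
# applying a per-pixel gain lifting each pixel toward the local average
# of its neighborhood, as the compensation described above.

def mean_frame(frames):
    n = len(frames)
    return [sum(f[i] for f in frames) / n for i in range(len(frames[0]))]

def gain_map(avg, neighborhood=1):
    """Per-pixel gain relative to the local neighborhood average."""
    gains = []
    for i, v in enumerate(avg):
        lo, hi = max(0, i - neighborhood), min(len(avg), i + neighborhood + 1)
        local = sum(avg[lo:hi]) / (hi - lo)
        gains.append(local / v if v else 1.0)
    return gains

# Pixel 2 consistently reads ~2 percent low across many frames.
frames = [[100, 100, 98, 100, 100]] * 10
gains = gain_map(mean_frame(frames))
corrected = [round(v * g) for v, g in zip(frames[0], gains)]
assert corrected[2] > 98    # weak pixel lifted toward its neighbors
```

A real device would build the average over frames of varied scenes, so persistent deviations reflect the sensor rather than the subject matter.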
Common aspects of the imaged scenes or objects can likewise inform early-stage image processing steps that assist later-stage image analysis routines, e.g., by optimally filtering and/or transforming the pixel data into richer source information. For example, it may become evident over days and weeks of use that a given user applies the camera to only three basic tasks: digital watermark reading, bar code reading, and visual logging of experimental setups in a laboratory. A histogram relating camera usage to "end result" operations can be evolved over time, and the number of processing cycles concentrated on initial detection of the two basic indicia, watermark and bar code, can be increased accordingly. Drilling a bit deeper, a Fourier-transformed set of image data might be preferentially routed to a fast 2D bar code detection function, or that routing might be deprioritized; likewise for digital watermark reading, where the Fourier-transformed data can be shipped to a specialized pattern recognition routine. One summary way of viewing this state-machine adaptation is that a camera device has only a fixed budget of CPU and image-processing cycles, and the device learns which modes of analysis should receive which portions of those cycles.
An oversimplified representation of such embodiments is shown in FIG.
By the arrangements just discussed, the operation of the imager-equipped device evolves through its continued use.
Focus issues, and improved print-to-web linking based on page layout
The cameras in most cell phones and other portable PDA-type devices generally have no adjustable focus. Rather, the optics are a compromise, aimed at acquiring acceptable images under typical portrait snapshot and landscape conditions. Imaging at close distances generally yields inferior results, losing high frequency detail. (This is improved by the "extended depth of field" image sensors noted earlier, but widespread deployment of such devices has not yet occurred.)
The human visual system has different sensitivities to imagery at different spatial frequencies. Different image frequencies convey different impressions. Low frequencies provide gross information about the image, such as orientation and general shape. High frequencies provide fine detail and edges. As shown in Figure 72, the sensitivity of the human visual system peaks at frequencies of about 10 cycles/mm on the retina, and falls off sharply to either side. (Perception also depends on the contrast between the features to be distinguished, the vertical axis of the figure.) Image features with spatial frequencies and contrast falling in the shaded region of the figure are generally not perceptible to humans. Figure 73 shows an image with its low and high frequencies depicted separately (left and right).
Digital watermarking of print media, such as newspapers, can proceed by tinting the page (before, during, or after printing) with an inoffensive background pattern that steganographically conveys auxiliary payload data. Different columns of text can be encoded with different payload data, e.g., allowing each news story to link to a different electronic resource (see, e.g., Digimarc's patents 6,985,600, 6,947,571 and 6,724,912).
In accordance with another aspect of the present technology, the close-focus shortcomings of portable imaging devices are overcome by embedding a lower frequency digital watermark (e.g., with a spectral composition centered toward the left side of the curve of Figure 72). Instead of encoding different watermarks in different columns, the page is marked with a single watermark spanning the page, which encodes an identifier for that page.
When the user snaps an image of a newspaper story of interest (the image can capture text/graphics from the desired story/advertisement, or from other nearby content), the page's watermark is decoded (locally, remotely by a different device, or in a distributed manner).
The decoded watermark serves to index a data structure that returns information to the device, for presentation on its display screen. The display presents a map of the newspaper page layout, with the different articles/advertisements shown in different colors.
Figures 74 and 75 show one particular embodiment. The original page is shown in Figure 74; the layout map displayed on the user device screen is shown in Figure 75.
To link to additional information about any of the stories, the user simply touches the portion of the displayed map corresponding to the story of interest. (If the device is not equipped with a touch screen, the map of Figure 75 may be presented with indicia, e.g., 1, 2, 3... or A, B, C..., identifying the different map regions. The user can then operate the device's numeric or alphabetic user interface (e.g., keypad) to identify the article of interest.)
The user's selection is transmitted to a remote server (which may be the same server that provided the layout map data to the portable device, or another), which then consults stored data to identify information responsive to the user's selection. For example, if the user touches the region at the lower right of the page map, the system can instruct a server at buick-dot-com to send a page with further information about the Buick Lucerne, for presentation on the user device. Or the remote system may send a link to that page to the user device, which then loads the page. Or the remote system may provide to the user device a menu of options, e.g., to listen to an associated podcast; to see earlier stories on the same topic; to download the article as a Word file; etc. Or the remote system can e-mail a link to the relevant menu or web page to the user, for later review. (A variety of these and other responses to user-expressed selections can be provided, as known in the art.)
Instead of the map of Figure 75, the system can cause the user device to display a screen showing a reduced-scale version of the newspaper page itself, as shown in Figure 74. Again, the user can simply touch an article of interest to trigger an associated response.
(E.g., "Banks Owe Billions ...", "McCain Pins Hopes ...", "Buick Lucerne") of all content on that page, instead of providing a graphical layout of pages Can be returned. These titles are provided in the form of a menu on the device screen, and the user touches the desired item (or enters the corresponding number / letter selection).
Layout maps for printed newspaper and magazine pages are typically generated by the publishing company as part of its layout process, using automated software from vendors such as Quark, Impress and Adobe. This conventional software thus knows which articles and advertisements appear in which spaces on each printed page. These same software tools, or others, can be adapted to take this layout map information, associate corresponding links or other data with each story/advertisement, and store the resulting data structure in a web-accessible server.
The layout of newspaper and magazine pages provides orientation information that can be useful in watermark decoding. Columns are vertical. Lines of headlines and text are horizontal. Even at very low spatial image frequencies, such shape orientation can be distinguished. A user capturing an image of a printed page may not capture the content "squarely." However, the strong vertical and horizontal components of the image are easily determined by algorithmic analysis of the captured image data, allowing the rotation of the captured image to be discerned. This knowledge simplifies and speeds the watermark decoding process (since a first step in many watermark decoding operations is to determine the rotation of the image from its originally encoded state).
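One illustrative way to perform such an algorithmic analysis (a sketch only; the text does not prescribe a particular algorithm, and the function name and parameters are invented) is to histogram gradient orientations of the captured image and take the dominant angle, modulo 90 degrees, as the page rotation:

```python
import numpy as np

def estimate_page_rotation(gray):
    """Estimate the small rotation (degrees, in [-45, 45)) of a captured
    page image from the dominant orientation of its text rows/columns."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)                       # edge strength
    ang = np.degrees(np.arctan2(gy, gx))         # edge direction, -180..180
    # Fold angles into [0, 90): horizontal and vertical page structure
    # (text lines, column rules) then vote for the same orientation.
    folded = np.mod(ang, 90.0)
    hist, edges = np.histogram(folded, bins=90, range=(0.0, 90.0), weights=mag)
    peak = edges[np.argmax(hist)] + 0.5          # bin center of dominant angle
    # Report the rotation as a correction relative to the nearest axis.
    return peak - 90.0 if peak >= 45.0 else peak

page = np.zeros((64, 64))
page[::4, :] = 1.0                 # synthetic horizontal "text lines"
rotation = estimate_page_rotation(page)
```

A watermark decoder could use such an estimate to de-rotate the captured image before synchronization, narrowing the rotation search it must perform.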
In another embodiment, transfer of the page map from the remote server to the user device is unnecessary. Again, an area of the page spanning multiple items of content is encoded with a single watermark payload. Again, the user captures an image including the content of interest, and the watermark identifying the page is decoded.
In this embodiment, the captured image is displayed on the device screen, and the user touches the content area of particular interest. The coordinates of the user selection in the captured image area are recorded.
Figure 76 is exemplary. The user has employed a touch-screen phone (e.g., an Apple iPhone or T-Mobile Android phone) to capture an image of an excerpt of a watermarked newspaper page, and has then touched an article of interest (the touch is represented by the ellipse). The location of the touch within the image frame is known to the touch-screen software as an offset from the top left corner, measured, e.g., in pixels. (The display may have a resolution of 480 x 320 pixels.) The touch may be at pixel location 200,160.
The watermark tiling is indicated in Figure 76 by the dashed diagonal lines over the page. The watermark (as described, for example, in Digimarc's patent 6,590,996) has an origin, but the origin is not within the image frame captured by the user. From the watermark, however, the watermark decoder software knows the rotation and scale of the image. It also knows the offset of the captured image frame from the watermark's origin. Based on this information, and on information about the scale at which the original watermark was encoded (which information may be conveyed with the watermark, accessed from a remote repository, hard-coded into the detector, etc.), it may be determined that the top left corner of the captured image frame corresponds to a point 1.6 inches below, and 2.3 inches to the right of, the top left corner of the original printed page (assuming the watermark origin is at the top left corner of the page). From the decoded scale information, the software can also determine that the 480-pixel width of the captured image corresponds to a 12-inch-wide area of the originally printed page.
The software can finally determine the position of the user's touch as an offset from the top left corner of the original printed page. The corner of the captured image was offset (2.3", 1.6") from that corner; the touch is offset a further 4" to the right (160 pixels x 12"/480 pixels) and 5" down (200 pixels x 12"/480 pixels), for a final position 6.3" to the right of, and 6.6" below, the top left corner of the original printed page.
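The arithmetic just described can be collected into a small routine. This is a sketch of the calculation only, with invented names, assuming the frame-corner offset and scale have already been recovered from the watermark:

```python
def touch_to_page_coords(touch_px, frame_origin_in, page_in_per_frame=12.0,
                         frame_width_px=480):
    """Map a touch (x, y), in captured-image pixels, to inches (right, down)
    from the top left corner of the original printed page.

    frame_origin_in: (right, down) offset of the captured frame's top left
    corner from the page's top left corner, recovered via the watermark.
    """
    inches_per_px = page_in_per_frame / frame_width_px   # 12" / 480 px
    right = frame_origin_in[0] + touch_px[0] * inches_per_px
    down = frame_origin_in[1] + touch_px[1] * inches_per_px
    return right, down

# The worked example from the text: corner offset (2.3" right, 1.6" down),
# touch 160 px across and 200 px down in the captured frame.
pos = touch_to_page_coords((160, 200), (2.3, 1.6))   # ~(6.3, 6.6)
```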
The device then transmits these coordinates, together with the watermark payload (identifying the page), to the remote server. The server looks up the layout map of the identified page (e.g., from a database populated by the page layout software) and refers to the coordinates to determine which article/advertisement the user's touch fell within. The remote system then returns to the user device the corresponding information associated with the indicated article, as noted above.
Returning to the topic of focus, the close-focus handicap of PDA cameras can actually be turned to advantage in decoding watermarks. Watermark information is not recovered from inked areas of text; the subtle luminance modulations on which most watermarks are based are lost in areas printed full black.
When the page substrate is tinted with a watermark, useful watermark information is recovered from regions of the page that are not printed, e.g., from "white space" between columns, between lines, and at the ends of paragraphs. Inked characters are, at best, unwanted "noise." The blurring of the printed portions of the page introduced by the focus shortcomings of PDA cameras can be used to define a mask identifying areas with substantial amounts of ink. These portions can be disregarded when decoding the watermark data.
More particularly, the blurred image data can be thresholded. Any image pixels having a value darker than a threshold value can be ignored; alternatively, only image pixels having values brighter than the threshold are input to the watermark decoder. The "noise" contributed by the inked characters is thus filtered out.
In imaging devices that capture sharply focused text, a similar advantage can be produced by processing the text with a blurring kernel, and then excluding the regions found to be dominated by printed text.
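A minimal sketch of such an ink mask follows, using a simple separable box blur and a fixed threshold. Both the blur radius and the threshold are illustrative assumptions; a real decoder's parameters (and blur model) would differ:

```python
import numpy as np

def unprinted_area_mask(gray, blur_radius=2, threshold=0.5):
    """Build a mask of 'white space' pixels suitable for watermark decoding.

    Inked areas (text) are spread and darkened by a box blur, standing in
    for a camera's near-focus blur (or a deliberately applied blur kernel),
    then excluded by thresholding; only brighter pixels pass to the decoder.
    """
    g = gray.astype(float)
    k = 2 * blur_radius + 1
    kernel = np.ones(k) / k
    # Separable box blur: rows, then columns.
    blurred = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode='same'), 1, g)
    blurred = np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode='same'), 0, blurred)
    # True where the page is bright (un-inked); these pixels feed the decoder.
    return blurred > threshold

page = np.ones((32, 32))        # white page substrate
page[10:20, 10:20] = 0.0        # a block of dense black text
mask = unprinted_area_mask(page)
```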
By arrangements such as the foregoing, shortcomings of portable imaging devices are redressed, and improved print-to-web linking based on page layout data becomes possible.
Image Search, Feature Extraction, Pattern Matching, Etc.
The image retrieval functions of certain of the above-described embodiments may employ content-based image retrieval (CBIR). As is familiar to artisans, CBIR essentially involves (1) extracting a characterization of an image - generally mathematically - and (2) using this characterization to assess similarity between images. Two surveys of the field are Smeulders et al., "Content-Based Image Retrieval at the End of the Early Years," IEEE Trans. Pattern Anal. Mach. Intell., Vol. 22, pp. 1349-1380 (2000), and Datta et al., "Image Retrieval: Ideas, Influences, and Trends of the New Age," ACM Computing Surveys, Vol. 40, No. 2 (2008).
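To make those two steps concrete, here is a toy sketch (not any of the cited systems): a joint color histogram serves as the mathematical characterization, and histogram intersection serves as the similarity measure. All names and parameter values are invented for illustration:

```python
import numpy as np

def color_histogram_descriptor(rgb, bins=4):
    """A minimal CBIR-style descriptor: a joint RGB color histogram.

    rgb: H x W x 3 array of values in [0, 1]. Returns a normalized
    bins**3-element vector - a toy stand-in for richer descriptors
    such as CEDD and FCTH.
    """
    q = np.clip((rgb * bins).astype(int), 0, bins - 1)   # quantize channels
    idx = q[..., 0] * bins * bins + q[..., 1] * bins + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def similarity(d1, d2):
    """Histogram-intersection similarity in [0, 1]; 1 means identical."""
    return float(np.minimum(d1, d2).sum())

red = np.zeros((8, 8, 3)); red[..., 0] = 0.9
blue = np.zeros((8, 8, 3)); blue[..., 2] = 0.9
same = similarity(color_histogram_descriptor(red), color_histogram_descriptor(red))
diff = similarity(color_histogram_descriptor(red), color_histogram_descriptor(blue))
```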
Identifying identical-looking images in large image databases is a familiar operation in the issuance of driver's licenses. That is, the image captured from a new applicant is commonly checked against a database of all previous driver's license photos, to determine whether the applicant has already been issued a driver's license (possibly under a different name). Methods and systems known from the driver's license field can be employed in the arrangements detailed herein. (Examples include Identix patent 7,369,685, and L-1 Corp. patents 7,283,649 and 7,130,454.)
Image feature extraction algorithms known as CEDD and FCTH are useful in many of the embodiments herein. The former is detailed in Chatzichristofis et al., "CEDD: Color and Edge Directivity Descriptor - A Compact Descriptor for Image Indexing and Retrieval," Proc. 6th International Conference on Computer Vision Systems (ICVS 2008), May 2008; the latter in Chatzichristofis et al., "FCTH: Fuzzy Color and Texture Histogram - A Low Level Feature for Accurate Image Retrieval," Proc. International Workshop on Image Analysis for Multimedia Interactive Services, IEEE Computer Society, May 2008.
Open-source software implementing these techniques is available; see the web page savvash.blogspot-dot-com/2008/05/cedd-and-fcth-are-now-open-dot-html. DLLs implementing the functions can be downloaded, and classes applying them to input image data (e.g., file.jpg) can be invoked as follows:
// CEDD yields a 144-bin histogram; FCTH yields a 192-bin histogram.
double[] CEDDTable = new double[144];
double[] FCTHTable = new double[192];
Bitmap ImageData = new Bitmap("c:/file.jpg");
CEDD GetCEDD = new CEDD();
FCTH GetFCTH = new FCTH();
CEDDTable = GetCEDD.Apply(ImageData);
FCTHTable = GetFCTH.Apply(ImageData, 2);
CEDD and FCTH may also be combined, to yield improved results, using a joint composite descriptor file available from the web page just cited.
Chatzichristofis has also made available the open source application "img(Finder)" (see web page savvash.blogspot-dot-com/2008/07/image-retrieval-in-facebook-dot-html), a desktop application that uses CEDD and FCTH to retrieve and index images from the Facebook social networking site. In use, the user connects to Facebook with his or her personal account data, and the application downloads images from the user's image albums, as well as those of the user's friends, and indexes them for retrieval by their CEDD and FCTH features. The index can then be queried with a sample image.
Chatzichristofis similarly offers an online search service, "Anaktisi," using image metrics including CEDD and FCTH, by which a user uploads a photo and the service retrieves similar images from one of eleven different image archives (including Flickr). See orpheus.ee.duth-dot-gr/anaktisi/. In the associated description of the Anaktisi search service, Chatzichristofis explains:
The rapid growth of digital imagery, through the widespread popularization of computers and the Internet, has driven development of efficient image retrieval technologies. Content-based image retrieval, known as CBIR, extracts several features describing the image content, and maps the visual content of images into a new space called the feature space. The feature space values for a given image are stored in a descriptor that can be used for retrieving similar images. The key to a successful retrieval system is to choose features that represent the images as accurately and uniquely as possible. The selected features must be discriminative and sufficient to describe the objects present in the image. To achieve these goals, CBIR systems use features of three basic types: color features, texture features, and spatial features. It is very difficult to achieve satisfactory retrieval results using only one of these feature types.
Many proposed retrieval techniques therefore employ methods in which more than one feature type is combined. For example, color, texture, and shape features are used in both IBM's QBIC and MIT's Photobook. QBIC uses color histograms, moment-based shape features, and texture descriptors. Photobook uses appearance features, texture features, and 2D shape features. Other CBIR systems include SIMBA, CIRES, SIMPLIcity, IRMA, FIRE and MIRROR. The research literature provides extraction methods for these feature types.
In most retrieval systems that combine two or more feature types, such as color and texture, independent vectors are used to describe each kind of information. Although it is possible to achieve very good retrieval scores by describing images with high-dimensional vectors, this approach has several drawbacks. If the descriptor has hundreds or even thousands of bins, the search procedure is significantly slowed and is not practically usable. Also, increasing the descriptor size increases the storage requirements, which can carry significant penalties for databases containing millions of images. Many proposed methods therefore limit the length of the descriptor to a smaller number of bins, restricting the possible bin values to a compact quantized form.
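The bin-quantization idea in the last sentence can be sketched as follows. The 3-bit depth and peak-normalization scheme are assumptions chosen for illustration, not the quoted system's actual coding:

```python
import numpy as np

def quantize_descriptor(hist, levels=8):
    """Quantize each bin of a histogram descriptor to one of `levels`
    discrete values (8 levels = 3 bits per bin), shrinking storage at a
    small cost in precision."""
    peak = float(hist.max())
    if peak == 0.0:
        return np.zeros(len(hist), dtype=np.uint8)
    return np.round(hist / peak * (levels - 1)).astype(np.uint8)

codes = quantize_descriptor(np.array([0.50, 0.25, 0.25, 0.0]))
```

Packed at 3 bits per bin, a 144-bin descriptor occupies 54 bytes, illustrating how descriptors can stay in the tens-of-bytes range the passage describes.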
The Moving Picture Experts Group (MPEG) defines a standard for content-based access to multimedia data in the MPEG-7 standard. This standard identifies a set of image descriptors that maintain a balance between descriptor size and the quality of the retrieval results.
On this web site, a new set of feature descriptors is provided for the retrieval system. These descriptors are designed to be as small as possible without compromising discriminating power, paying particular attention to size and storage requirements. They integrate color and texture information into one histogram while keeping sizes between 23 and 74 bytes per image.
High retrieval scores in content-based image retrieval systems can also be achieved by adopting relevance feedback mechanisms. These mechanisms require the user to grade the quality of the query results by marking the retrieved images as either relevant or irrelevant. The search engine then uses this graded information in subsequent queries to better meet the needs of the user. Although relevance feedback mechanisms were introduced in the information retrieval field, they currently receive considerable attention in the CBIR field. Most of the relevance feedback techniques proposed in the literature are based on modifying the values of the search parameters so that they better represent the concept the user has in mind. The search parameters are computed as a function of the relevance values assigned by the user to all the images retrieved so far. For example, relevance feedback is frequently formulated in terms of modification of the query vector and/or in terms of adaptive similarity metrics.
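As a concrete instance of query-vector modification, the classic Rocchio update from text retrieval moves the query descriptor toward examples marked relevant and away from those marked irrelevant. It is named here for illustration only; the passage does not specify a particular update rule:

```python
import numpy as np

def rocchio_update(query, relevant, irrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query descriptor biased toward images the user
    marked relevant and away from those marked irrelevant."""
    q = alpha * np.asarray(query, dtype=float)
    if relevant:
        q = q + beta * np.mean(relevant, axis=0)
    if irrelevant:
        q = q - gamma * np.mean(irrelevant, axis=0)
    return np.clip(q, 0.0, None)   # histogram descriptors stay non-negative

# One relevant example pulls the query toward its feature bins.
new_q = rocchio_update([1.0, 0.0], relevant=[[0.0, 1.0]], irrelevant=[])
```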
Also on this web site, an Automatic Relevance Feedback (ARF) technique based on the proposed descriptors is introduced. The goal of the proposed ARF algorithm is to re-rank the initial retrieval results in better accord with user preferences. During this procedure, the user selects one of the first retrieved images as relevant to his initial query intent, and information from this selected image is used to modify the initial query image descriptor.
Another open source content-based image retrieval system is GIFT (the GNU Image-Finding Tool), developed by researchers at the University of Geneva. One of its tools allows the user to index directory trees containing images. The GIFT server and its client (SnakeCharmer) can then be used to retrieve indexed images based on image similarity. The system is further described at the web page gnu-dot-org/software/gift/gift-dot-html; the latest version of the software can be found on the ftp server ftp.gnu-dot-org/gnu/gift.
Another open-source CBIR system is FIRE, written by Tom Deselaers et al. at RWTH Aachen University, and available for download from the web page -i6.informatik.rwth-aachen-dot-de/~deselaers/fire/. FIRE uses techniques described, for example, in Deselaers et al., "Features for Image Retrieval: An Experimental Comparison," Information Retrieval, Vol. 11, No. 2, pp. 77-107 (Springer, March 2008).
Embodiments of the present technology generally concern objects depicted in imagery, rather than whole frames of image pixels. Recognition of objects within imagery (sometimes termed computer vision) is a large science with which the reader is presumed to be familiar. Edges and centroids are among the image features that can be used to aid in recognizing objects in images. Another is shape contexts (c.f., Belongie et al., "Matching with Shape Contexts," IEEE Workshop on Content-Based Access of Image and Video Libraries, 2000). Robustness to affine transformations (e.g., scale invariance, rotation invariance) is an advantageous property of certain object recognition/pattern matching/computer vision techniques. Methods based on the Hough transform and on Fourier-Mellin transforms exhibit rotation-invariant properties. SIFT (discussed below) is an image recognition technique with this and other advantageous properties.
In addition to object recognition/computer vision, the image processing discussed in this specification (as opposed to metadata-related processing) can employ a variety of other techniques, which go by assorted names: image analysis, pattern recognition, feature extraction, feature detection, template matching, face recognition, eigenvector methods, etc. (All of these terms are generally used interchangeably herein.) The interested reader is referred to Wikipedia, which has an article on each of the topics just listed, including explanations and citations to related information.
Image metrics of the sort just reviewed are sometimes regarded as a species of metadata, i.e., "content-dependent metadata." This is in contrast to "content-descriptive metadata," which is more familiar in most uses of the term metadata.
Interaction with Communicating Devices
Most of the examples described above involve imaging objects that have no means of communication. This section considers, more particularly, application of these techniques to objects that are equipped, or can be equipped, to communicate. Simple examples include WiFi-equipped thermostats and parking meters, Ethernet-linked telephones, and hotel bedside clocks with Bluetooth.
Consider a user driving into the city where she works. When she finds an empty parking space, she points her cell phone at the parking meter. A virtual user interface (UI) appears almost immediately on the cell phone screen, allowing the user to purchase two hours of time from the meter. Inside her office building, she finds the meeting room cool, and points the cell phone at the thermostat. After a moment, a different virtual user interface appears on the cell phone, allowing her to change the thermostat's settings. Ten minutes before the parking meter is about to expire, the cell phone rings and again presents the UI for the parking meter. The user, from her office, buys another hour.
For industrial users, for other applications where security of interaction is important, and for applications where anonymity is important, various levels of security and access privilege can be incorporated into the interaction session between the user and the imaged device. A first level includes simply encoding contact instructions, such as an IP address, overtly or covertly in surface features of the object; a second level includes conveying public-key information more subtly, either through an explicit symbology or via digital watermarking; and a third level can require information obtainable only by actively photographing unique patterns or digital watermarks on the object.
The interface provided on the user's cell phone may be customized to facilitate certain work-related interactions with the device, and/or in accordance with user preferences (e.g., office staff may be permitted temperature setting control, while a thermostat "debug" interface is withheld from them).
The cost of incorporating elements such as displays, buttons, dials, and other features intended for physical interaction into an object or device may be unnecessary. Instead, such functionality can be duplicated by a mobile device that interacts with the object or device virtually.
By integrating a wireless chip into the device, the manufacturer effectively enables a mobile GUI for the device.
According to one aspect, this technology includes using a mobile phone to obtain identification information corresponding to a device. By reference to the obtained identification information, application software corresponding to the device is then identified and downloaded to the mobile phone. This application software is then used to facilitate user interaction with the device. With such an arrangement, the mobile phone acts as a multifunction controller, adapting itself to control a particular device through the use of application software identified by reference to information corresponding to that device.
According to another aspect, this technique includes using a mobile phone to sense information from the housing of the device. Through the use of the sensed information, other information is encrypted using the public key corresponding to the device.
According to another aspect, this technique involves using a mobile phone to sense analog information from a device. The sensed analog information is then converted to digital form and the corresponding data is transmitted from the cell phone. The data thus transmitted is used to verify user proximity to the device before allowing the user to interact with the device using the mobile phone.
According to another aspect, this technology includes using a user interface on a user's cell phone to receive an instruction relating to control of a device. This user interface is presented on the screen in conjunction with a cell phone-captured image of the device. Information corresponding to the instruction is signaled to the user in a first manner while the instruction is pending, and in a second, different manner once the instruction has been successfully executed.
According to another aspect, the present technology includes initiating a transaction with a device, using a user interface presented on the screen of a user's cell phone, while the user is proximate to the device. Later, the cell phone is used for purposes unrelated to the device. Later still, the user interface is recalled and used in connection with a further transaction with the device.
According to yet another aspect, this technology includes a mobile phone comprising a processor, a memory, a sensor and a display. Instructions in the memory configure the processor to perform acts including: sensing information from a first device; downloading first user interface software corresponding to the first device, by reference to the sensed information; interacting with the first device through user interaction with the downloaded first user interface software; recalling from the memory second user interface software, earlier downloaded to the mobile phone, corresponding to a second device; and interacting with the second device through user interaction with the recalled second user interface software, regardless of whether the user is proximate to the second device.
According to another aspect, this technology includes a mobile phone comprising a processor, a memory and a display. Instructions in the memory configure the processor to present a user interface allowing selection among a number of different device-specific user interfaces stored in the memory, by which the processor can interact with a plurality of different external devices using the mobile phone.
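The several aspects above share a common flow: sense a device identifier, fetch device-specific UI software, cache it, and reuse it later without requiring proximity. A schematic sketch follows; all names, and the dictionary standing in for a remote server, are invented for illustration:

```python
class ThingPipeController:
    """Minimal model of a phone acting as a multifunction controller."""

    def __init__(self, ui_server):
        self.ui_server = ui_server      # callable: device_id -> UI software
        self.ui_cache = {}              # previously downloaded UIs, by device

    def interact(self, device_id):
        """Return the UI for a device, downloading it on first contact and
        recalling it from memory thereafter (no proximity required)."""
        if device_id not in self.ui_cache:
            self.ui_cache[device_id] = self.ui_server(device_id)
        return self.ui_cache[device_id]

# Hypothetical server mapping sensed identifiers to UI packages.
server = {"thermostat-530": "thermostat UI", "meter-77": "parking meter UI"}.get
phone = ThingPipeController(server)
ui = phone.interact("thermostat-530")     # first use: downloaded
ui2 = phone.interact("thermostat-530")    # later use: recalled from memory
```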
These arrangements are described in more detail with reference to Figs. 78-87.
Figures 78 and 79 illustrate a prior art WiFi-equipped thermostat 512. It includes a temperature sensor 514, a processor 516, and a user interface 518. The user interface comprises various buttons 518, an LCD display screen 520, and one or more indicator lights 522. A memory 524 stores programming and data for the thermostat. Finally, a WiFi transceiver 526 and antenna 528 permit communication with remote devices. (The depicted thermostat 512 is available as the Model CT80 from the Radio Thermostat Company of America; its WiFi transceiver comprises a GainSpan GS1010 System on Chip (SoC) device.)
Figure 80 shows a similar thermostat 530, but one implementing principles according to certain aspects of the present technology. Like thermostat 512, thermostat 530 includes a temperature sensor 514 and a processor 532. Its memory 534 may store the same programming and data as memory 524; however, memory 534 includes some additional software to support the functions described below. (For expository convenience, the software associated with this aspect of the technology is given a name: ThingPipe software. The thermostat memory thus includes ThingPipe code, which cooperates with other code on the device to implement the functions detailed herein.)
Thermostat 530 may include the same user interface 518 as thermostat 512. However, significant cost savings can be achieved by omitting many of the associated parts, such as the LCD display and buttons. The depicted thermostat may thus include only the indicator light 522, and even that may be omitted.
Thermostat 530 also includes some arrangement by which its identity can be sensed by a cell phone. The WiFi emissions from the thermostat may be used for this purpose (e.g., the device's MAC identifier). However, other means, such as indicia that can be sensed by the cell phone's camera, are desirable.
A steganographic digital watermark is one such indicium that can be sensed by a cell phone camera. Digital watermark technology is detailed in the assignee's patents, including 6,590,996 and 6,947,571. The watermark data may be encoded in a texture pattern on the exterior of the thermostat, on an adhesive label, in a pseudo wood-grain trim on the thermostat, or the like. (Steganographic encoding is hidden, so it is not depicted in Figure 80.)
Other suitable indicia include overt symbologies such as 1D or 2D bar codes, e.g., the bar code 536 shown in Figure 80. This can be printed on the thermostat housing, or applied by an adhesive label or the like.
Still other means of identifying the thermostat, such as an RFID chip 538, may be used. Another is short-range wireless broadcast of an identifier, such as by Bluetooth, or a network service discovery protocol (e.g., Bonjour). Object recognition, by means such as scale-invariant feature transforms (SIFT) or image fingerprinting, can also be used. Still other identifiers may be entered manually by the user, or identified by navigating a directory structure of possible devices. The artisan will recognize many other alternatives.
Figure 81 shows an exemplary cell phone 540, such as an Apple iPhone device. It includes conventional elements, including a processor 542, a camera 544, a microphone, an RF transceiver, a network adapter, a display, and a user interface. The user interface includes touch-screen sensors as well as physical controls. (Details of the user interface and associated software are provided in Apple's patent publication 20080174570.) The phone's memory 546 includes the usual operating system and application software, and additionally includes ThingPipe software for performing the functions detailed herein.
Turning to the operation of the illustrated embodiment, the user uses the cell phone camera 544 to capture an image depicting the digitally watermarked thermostat 530. The cell phone's processor 542 pre-processes the captured image data (e.g., by applying a Wiener filter or other filtering and/or compression) and transmits the processed data wirelessly to a remote server 552 (Figure 82), together with information identifying the cell phone. (This may be part of the function of the ThingPipe code in the cell phone.) The wireless communication can be by WiFi to a nearby wireless access point, and thence by the Internet to server 552; or a cell phone network can be used, etc.
The server 552 applies a decoding algorithm to the processed image data received from the cell phone, extracting the steganographically encoded digital watermark data. The decoded data, which may include an identifier of the thermostat, is transmitted over the Internet to a router 554, along with the information identifying the cell phone.
The router 554 takes the received identifier and looks it up in a namespace database 555, e.g., using the most significant bits of the identifier to identify the particular server responsible for that group of identifiers. The server 556 identified by this process has data relating to the thermostat. (This arrangement is akin to the domain name servers used in Internet routing. Patent 6,947,571 includes additional disclosure about how watermark data can be used to identify a server that knows what to do with such data.)
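Such a namespace lookup can be pictured as a prefix-delegation table, much as DNS delegates name lookups. The bit widths and server names below are invented for the sketch:

```python
def route_identifier(identifier, namespace, prefix_bits=8, id_bits=32):
    """Route a decoded watermark identifier to the server responsible for
    its group, keyed by the identifier's most significant bits."""
    msb = identifier >> (id_bits - prefix_bits)   # group = top prefix_bits
    return namespace[msb]

# Hypothetical namespace database 555: MSB group -> responsible server.
namespace = {0x12: "server556.example.net", 0x13: "server557.example.net"}
server = route_identifier(0x12AB34CD, namespace)
```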
At the cell phone 540, the ThingPipe software responds to received information by presenting a graphical user interface for thermostat 530 on its screen. The GUI may present the ambient temperature and setpoint temperature of the thermostat, obtained either from the server 556 or directly from the thermostat (as by WiFi). The presented GUI also includes controls the user can operate to change settings. To raise the setpoint temperature, the user touches the displayed control corresponding to this operation (e.g., an "increase temperature" button). The setpoint temperature shown in the UI display increases immediately in response to the user's action, perhaps flashing or otherwise indicating in some unusual fashion that the request is pending.
The user's touch also causes the ThingPipe software to send corresponding data from the cell phone 540 to the thermostat (the transmission may pass through some or all of the other devices shown in Figure 82, or may go directly to the thermostat, as by WiFi). On receipt of this data, the thermostat increases its setpoint temperature per the user's instruction. A confirmation message is then relayed back from the thermostat to the cell phone. On receipt of the confirmation message, the flashing of the increased temperature indication stops, and the setpoint temperature is thereafter displayed in static form. (Other arrangements are of course possible. For example, the confirmation can be rendered to the user as a visible signal, such as the text "Accepted" presented on the display, an audible sound, or the like.)
In one particular embodiment, the displayed UI is provided as an overlay on the screen of the cell phone, atop the image of the thermostat initially captured by the user. The features of the UI are presented in registered alignment with any corresponding physical controls (e.g., buttons) shown in the captured image. Thus, if the thermostat has temperature-up and temperature-down buttons (e.g., the "+" and "-" buttons of FIG. 79), the graphic overlay can outline these buttons in the displayed image. These outlines serve as graphical controls that the user touches to raise or lower the setpoint temperature.
This is schematically illustrated in FIG. 83, where the user has captured an image 560 of a portion of the thermostat. The image includes at least a portion of the watermark 562 (shown for illustrative purposes). Referring to the watermark, and to data obtained from the server 556 regarding the layout of the thermostat, the cell phone processor scales and overlays dotted lines atop the image - outlining the "+" and "-" buttons. When the user touches the phone's touch-screen in one of these outlined areas, the touch is reported to the ThingPipe software. The software interprets such touches as instructions to increase or decrease the thermostat temperature, and sends corresponding commands to the thermostat (e.g., via servers 552 and/or 556). Meanwhile, the overlaid "SET TEMPERATURE" graphic atop the image is incremented, and flashes until a confirmation message is received back from the thermostat.
The registered overlay of the graphical user interface atop the captured image data is enabled by the watermark data encoded on the thermostat housing. Calibration data in the watermark allows the rotation, scale, and placement of the thermostat in the image to be precisely determined. If the watermark is reliably placed on the thermostat in known spatial relationship to other device features (e.g., buttons and displays), the locations of those features in the captured image can be determined by reference to the watermark. (This technique is further described in applicant's published patent application 20080300011.)
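As a concrete (and hypothetical) sketch of this registration step, once the watermark's calibration data yields the rotation, scale, and translation of the device as imaged, button outlines stored in device coordinates can be projected into image coordinates for the overlay. All coordinate values below are invented:

```python
import math

# Illustrative projection of device-frame feature coordinates into
# image pixels, using a similarity transform (scale, rotation,
# translation) of the kind recoverable from watermark calibration data.

def project(point, scale, rotation_deg, translation):
    """Map an (x, y) point from device coordinates to image pixels."""
    x, y = point
    t = math.radians(rotation_deg)
    xr = scale * (x * math.cos(t) - y * math.sin(t)) + translation[0]
    yr = scale * (x * math.sin(t) + y * math.cos(t)) + translation[1]
    return (xr, yr)

# Corners of the "+" button, in millimeters relative to the watermark
# origin (hypothetical values):
plus_button = [(10, 5), (18, 5), (18, 12), (10, 12)]
overlay = [project(p, scale=4.0, rotation_deg=0.0, translation=(120, 80))
           for p in plus_button]
print(overlay)   # pixel corners, ready to outline on the screen
```

The same transform, applied to each stored feature, yields the dotted outlines of FIG. 83.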
If the cell phone does not have a touch-screen, the registered overlay of the UI can still be used. However, instead of providing an on-screen target for the user to touch, the outlined buttons provided on the cell phone screen may indicate corresponding buttons on the phone's keypad that the user must press to activate the outlined function. For example, an outlined box around the "+" button can periodically flash orange with the number "2" - indicating that the user must press the "2" button on the cell phone keypad to increase the thermostat temperature setting. (Flashing the number "2" atop the "+" portion of the image also allows the user to identify the "+" marking in a crowded image when the number is not flashing.) Similarly, an outlined box around the "-" button can periodically flash orange with the number "8" - indicating that the user must press the "8" button on the cell phone keypad to decrease the thermostat temperature setpoint. See 572 and 574 of FIG. 84.
It is believed that overlaying a graphical user interface in registered alignment on the captured image of the thermostat is most readily implemented through use of watermarks, but other arrangements are possible. For example, if the size and scale of a barcode, and its location on the thermostat, are known, the locations of the thermostat features for the overlay can be determined geometrically. The same is true of image fingerprint-based approaches (including SIFT). Once the canonical appearance of the thermostat is known (e.g., to server 556), the relevant locations of features in the captured image can be discerned by image analysis.
In one particular arrangement, the user captures a frame of imagery depicting the thermostat, which is buffered for static display by the phone. The overlay is then provided in registered alignment with this static image. If the user moves the camera, the static image persists, and the overlaid UI is similarly static. In another arrangement, the user captures a stream of images (e.g., video capture), and the overlay is provided in alignment with features of the imagery even as they move from frame to frame. In this case, the overlay can move across the screen in response to movement of the depicted thermostat on the cell phone screen. This arrangement allows the user to move the camera to capture different aspects of the thermostat - perhaps revealing additional features/controls. Or it allows the user to zoom the camera so that certain features (and corresponding graphic overlays) appear at a larger scale on the touch-screen display of the cell phone. In this dynamic overlay embodiment, the user can optionally lock the captured image at any time, and then continue to work with the (now static) overlaid user interface controls - regardless of whether the thermostat remains in the camera's field of view.
If the thermostat 530 is of a type that does not have visible controls, then the UI displayed on the cell phone may be in any format. If the cell phone has a touch-screen, the thermostat controls can be provided on the display. If a touch-screen is not present, the display may simply provide a corresponding menu. For example, the user may be instructed to press "2" to increase the temperature set value and "8" to decrease the temperature set value.
After the user issues a command over the cell phone, the command is relayed to the thermostat as described above, and a confirmation message is preferably returned - for rendering to the user by the ThingPipe software.
The displayed user interface is thus a function of the device (e.g., thermostat) with which the phone is interacting, and also a function of the capabilities of the cell phone itself (e.g., whether it has a touch-screen, etc.). The instructions and data that allow the cell phone's ThingPipe software to create these different UIs can be stored at the server 556 that manages the thermostat, and passed to the memory 546 of the cell phone with which the thermostat interacts.
Another example of a device that can be so controlled is a WiFi-equipped parking meter. The user captures an image of the parking meter with the cell phone camera (e.g., by pressing a button, or image capture may proceed freely - such as once a second, or several times a second). Processing generally proceeds as described above. The ThingPipe software processes the image data, and the router 554 identifies the server 556a responsible for ThingPipe interactions with that parking meter. The server returns status information about the meter (e.g., time remaining, maximum allowable time) and, optionally, UI instructions. These data are presented in the cell phone UI - overlaid, for example, on the captured image, together with controls/commands for purchasing time.
The user interacts with the cell phone to add two hours to the meter. The corresponding payment is withdrawn, for example, from the user's credit card account - stored as encrypted profile information on the cell phone or a remote server. (Online payment systems, including payment arrangements suitable for use with cell phones, are well known and are not detailed here.) The user interface on the cell phone confirms that payment has been satisfactorily made, and displays the purchased time. The display at the street-side meter may also reflect the purchased time.
The user can then leave the meter to attend to other tasks, and use the cell phone for other purposes. The cell phone may go into a low power mode - its screen dark. However, the downloaded application software keeps track of the time remaining on the meter. This can be done by periodically querying the associated server for this data, or by independently counting down the time. At a given point, for example with ten minutes remaining, the cell phone sounds an alarm.
Looking at the cell phone, the user sees that it has returned to the active state, and that the meter UI has been restored to the screen. The displayed UI reports the remaining time and offers the user the opportunity to purchase more time. The user purchases another 30 minutes. The completed purchase is confirmed on the cell phone display - showing 40 minutes now remaining. The display on the street-side meter can be similarly updated.
Note that the user need not physically return to the meter to add time. The virtual link between the cell phone and the parking meter has persisted, or been reestablished - even though the user may have walked 12 blocks and ridden an elevator up multiple floors. The parking meter control is as close as the cell phone.
(Although not specifically described, the block diagram of the parking meter is similar to that of the thermostat of FIG. 80 except that it does not have a temperature sensor.)
Example 3 - Consider a bedside alarm clock in a hotel. Most travelers know the frustration of the various illogical user interfaces provided by these clocks. It is late; the traveler is groggy from a long flight, and faces the chore of figuring out which of the black buttons on the black clock must be manipulated, in a dim hotel room, to set the alarm for 5:30 a.m. Better if such devices could be controlled by an interface provided on the user's cell phone - desirably a standardized user interface familiar to the traveler from repeated use.
FIG. 85 shows an alarm clock 580 that utilizes aspects of the present technology. Like other alarm clocks, it includes a display 582, a physical UI 584 (e.g., buttons), and a control processor 586. However, this clock also includes a Bluetooth radio interface 588, and a memory 590 in which Bluetooth and ThingPipe software for execution by the processor are stored. The clock also has some means by which it can be identified, such as a digital watermark or bar code, as described above.
As in the earlier examples, the user captures an image of the clock. The identifier is decoded from the image, by the cell phone processor or by the processor of a remote server 552b. From the identifier, the router identifies another server 556b that knows about such clocks. The router passes the identifier, together with the address of the cell phone, to this other server. The server uses the decoded watermark identifier to look up the particular clock, and recalls information about its processor, display, and other configuration data. It also recalls instructions for presenting, on the display of the cell phone, a standardized clock interface through which the clock's parameters can be set. The server packages this information in a file, which is sent back to the cell phone.
The cell phone receives this information, and presents on its screen the user interface specified by the server 556b. This is a familiar interface that appears whenever the cell phone is used to interact with a hotel alarm clock, regardless of the model or manufacturer of the clock. (In some cases, the phone can simply recall the UI from a cache of frequently used UIs, e.g., in the cell phone.)
The UI includes a control "LINK TO CLOCK." When it is selected, the cell phone establishes communication with the clock via Bluetooth. (Parameters sent from the server 556b may be required to establish the session.) Once linked by Bluetooth, the time displayed on the clock is presented on the cell phone UI, together with a menu of options.
One of the options presented on the cell phone screen is "SET ALARM." Upon its selection, the UI moves to another screen 595 (FIG. 86) and prompts the user to enter the desired alarm time by pressing digit keys on the phone's keypad. (Other paradigms can naturally be used, such as flicking displayed numerals on a touch-screen interface to rotate them, e.g., until the desired digits appear.) When the desired time has been entered, the user presses an OK button on the cell phone keypad to set the time.
As before, when an instruction is sent to the device, the entered user data (e.g., the alarm time) may flash until the device issues an acknowledgment - at which point the displayed data stops flashing.
At the clock, the instruction to set the alarm time to 5:30 a.m. is received by Bluetooth. The ThingPipe software in the alarm clock memory understands the format in which the data is conveyed by the Bluetooth signal, and parses the instructions to set the desired time and alarm. The alarm clock processor then sets the alarm to ring at the specified time.
Note that in this example, the cell phone and the clock communicate directly (rather than via one or more intermediary computers). (Although another computer was consulted by the cell phone to obtain programming particulars for the clock, it need not be contacted further.)
Note also that in this example - unlike the thermostat - the user interface is not integrated with the image of the clock captured by the user (e.g., presented in registered alignment). Such refinements are omitted in order to provide a consistent user interface experience - independent of the particular clock being programmed.
As in the earlier examples, a watermark conveniently serves to identify the particular device to the system. However, any other identification technique, including those described above, may alternatively be used.
Not yet discussed are the optional location modules 596 in each of the above-described devices. One such module is a GPS receiver. Another suitable technology for these modules relies on the wireless signals commonly exchanged between devices (e.g., WiFi, cellular, etc.). Given several communicating devices, the signals themselves - and the imperfect digital clocks that govern them - form a reference system from which both highly accurate time and position can be extracted. This technology is detailed in WO08/073347.
Knowing the locations of the devices allows for enhanced functionality to be realized. For example, it allows devices to be identified by their location (e.g., unique latitude / longitude / altitude coordinates) rather than by an identifier (e.g., watermarked or otherwise). Moreover, it allows the proximity between the cell phone and other ThingPipe devices to be determined.
Consider again the example of a user accessing a thermostat. Rather than capturing an image of the thermostat, the user can simply launch the phone's ThingPipe software (or it may already be running in the background). This software reports the current location of the cell phone to the server 552, and requests the identification of other nearby ThingPipe-capable devices. ("Nearby" is, of course, implementation dependent; it may be, e.g., 10 feet, 10 meters, 50 feet, 50 meters, etc. The parameter may be defined by the cell phone user.) The server 552 checks a database identifying the current locations of other ThingPipe-enabled devices, and returns data to the cell phone identifying those nearby. A listing 598 (FIG. 87) is presented on the cell phone screen, including the distance of each device from the user. (If the location module of the cell phone includes some means for determining the direction in which the phone is facing, the displayed listing may also include directional clues together with the distances.)
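The "nearby devices" query just described might be sketched as follows. The great-circle (haversine) distance formula stands in for whatever proximity computation the server actually uses; the device names, coordinates, and 50-meter radius are all invented:

```python
import math

# Hypothetical sketch of server 552's proximity query: the phone's
# reported position is compared against stored device positions, and
# devices within the caller's radius are returned, nearest first.

def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters (haversine formula)."""
    R = 6371000.0   # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

DEVICES = {
    "THERMOSTAT": (45.5231, -122.6765),     # invented coordinates
    "PARKING METER": (45.5300, -122.6900),
}

def nearby(lat, lon, radius_m=50):
    hits = [(name, distance_m(lat, lon, *pos)) for name, pos in DEVICES.items()]
    return sorted((name, round(d)) for name, d in hits if d <= radius_m)

print(nearby(45.5231, -122.6764))   # only devices within 50 m are listed
```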
The user selects THERMOSTAT from the displayed list (e.g., by touching the screen, if it is a touch-screen, or by entering the associated digit on the keypad). The phone then establishes a ThingPipe session with the device thus identified, as described above. (In this example, since no image is captured, the thermostat user interface is not overlaid atop an image of the thermostat.)
In the three examples just described, there is the question of who should be authorized to interact with a device, and for how long.
In the case of the hotel alarm clock, authorization is not a significant concern. Anyone in the room capable of sensing the clock's identifier may be regarded as authorized to set clock parameters (e.g., current time, alarm time, display brightness, buzzer or radio alarm, etc.). However, the authorization should persist only while the user is in the vicinity of the clock (e.g., within Bluetooth range). The previous night's guest should not be able to reprogram the alarm while the next night's guest is sleeping.
In the case of the parking meter, authorization should again be granted to anyone who approaches the meter and captures its image (or senses its identifier from a short distance).
Unlike the alarm clock, however, in the case of the parking meter the user may wish to recall the UI, and interact with the device, at a later time and from another location. This is generally fine. Perhaps 12 hours from the time of image capture is a reasonable interval within which the user may interact with the meter. (If the user adds time late in those 12 hours, after someone else has parked in the space, no great harm is done.) Alternatively, a user's permission to interact with the device may lapse when a new user establishes a session with it (e.g., by capturing an image of the device and initiating a transaction of the type identified above).
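The two expiry policies just mentioned - a fixed 12-hour window, and lapse upon a new user's session - can be sketched together (the data structures and identifiers are illustrative only):

```python
import time

# Sketch of a per-user permission window for a device: a user who has
# captured the meter's identifier stays authorized for a fixed interval
# (12 hours here), or until another user starts a session.

AUTH_WINDOW_S = 12 * 3600
sessions = {}   # device_id -> (user_id, session start time)

def begin_session(device_id, user_id, now=None):
    sessions[device_id] = (user_id, now if now is not None else time.time())

def is_authorized(device_id, user_id, now=None):
    now = now if now is not None else time.time()
    entry = sessions.get(device_id)
    if entry is None:
        return False
    holder, start = entry
    return holder == user_id and (now - start) < AUTH_WINDOW_S

begin_session("meter-17", "user-a", now=0)
print(is_authorized("meter-17", "user-a", now=3600))        # True
print(is_authorized("meter-17", "user-a", now=13 * 3600))   # False: expired
begin_session("meter-17", "user-b", now=13 * 3600)          # a new parker
print(is_authorized("meter-17", "user-a", now=13 * 3600))   # False: superseded
```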
The memory that stores the data defining the user's permission period may be located in the meter, or elsewhere, e.g., at server 556a. A corresponding identifier for the user would generally also be stored. This may be the user's telephone number, the MAC identifier for the phone device, or some other generally unique identifier.
In the case of the thermostat, there may be stricter controls on who is allowed to change the temperature, and for how long. Perhaps only the managers of an office can set the temperature; other employees may be granted lesser rights, e.g., simply to view the current ambient temperature. Again, the memory storing such data may be located in the thermostat, at the server 556, or elsewhere.
These three examples are simple ones, in which the consequences of control are small. Other applications naturally involve higher security. The field of authentication is well developed, and the artisan can draw from known techniques in implementing an authentication arrangement suited to the particular needs of any given application.
As the technology becomes more widespread, users may need to switch between multiple on-going ThingPipe sessions. The ThingPipe application may have a "Recent UIs" menu option which, when selected, invokes a list of recent or pending sessions. If any is selected, the corresponding UI is recalled, allowing the user to continue the earlier interaction with that particular device.
Physical user interfaces - such as those on thermostats - are fixed. All users are presented with the same physical display, knobs, dials, and so on. All interactions must adapt to this same vocabulary of physical controls.
Implementations using aspects of the present technology can be more varied. Users can have stored profile settings - tailoring cell phone UIs to their particular preferences - globally and/or per device. For example, a color-blind user can specify that interfaces always be presented in grayscale - instead of in colors the user may find difficult to distinguish. A person with poor vision may prefer that information be displayed in the largest possible font - regardless of aesthetics. Others may choose to have text from the display read aloud, as by synthesized speech. One particular thermostat UI may normally present text detailing current data; a user can specify that the UI not be cluttered with such information, and that - for that UI - such textual information not be displayed.
The user interface may also be customized for particular task-oriented interactions with an object. A technician may call up a "debug" interface to the thermostat, to tune the associated HVAC system; clerical staff may call up a simpler UI that simply presents current and setpoint temperatures.
Just as different interfaces may be presented to different users, different levels of security and access privilege may also be provided.
At a first security level, contact instructions for the object - such as an IP address - are simply encoded (overtly or covertly) in the surface features of the object. A session begins simply with the cell phone gleaning the contact information from the device. (Indirection may be used: the information on the device may refer to a remote store holding contact information for the device.)
A second level involves public-key information - presented overtly on the device through visible symbols, hidden more subtly through steganographic digital watermarking, or otherwise conveyed (possibly accessed indirectly). For example, machine-readable data on a device may provide the device's public key, with which transmissions from the user must be encrypted. The user's transmissions can likewise convey the user's public key - whereby the device can identify the user, and use that key to encrypt data/instructions returned to the cell phone.
Such an arrangement allows a secure session with the device. A thermostat in a shopping mall can use this technology. All passers-by can read the thermostat's public key. However, the thermostat may grant control rights only to specific users - identified by their respective public keys.
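For illustration only, the key exchange underlying such a session can be sketched with a textbook Diffie-Hellman computation. This is a toy stand-in for whatever public-key machinery an actual implementation would use; the parameters below are far too small for real security:

```python
# Toy illustration (NOT production cryptography) of the second security
# level: each side publishes a public key, and both derive a shared
# secret that can encrypt the session. Textbook Diffie-Hellman with a
# small 32-bit prime stands in for real public-key machinery.

P = 0xFFFFFFFB   # a 32-bit prime; real systems use far larger parameters
G = 5

def keypair(secret):
    """Return (private, public) for a chosen private exponent."""
    return secret, pow(G, secret, P)

thermostat_priv, thermostat_pub = keypair(123456)   # readable by any passer-by
phone_priv, phone_pub = keypair(654321)             # sent with the user's request

# Both ends derive the same session secret; the thermostat can also use
# phone_pub as the identity against which control rights are granted.
k_device = pow(phone_pub, thermostat_priv, P)
k_phone = pow(thermostat_pub, phone_priv, P)
assert k_device == k_phone
```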
A third level involves withholding control of the device unless the user presents a pattern that can only be obtained by actively photographing the device. That is, it is not sufficient simply to transmit an identifier corresponding to the device. Rather, an identifier that proves the user's physical proximity to the device must be captured and transmitted. The user can obtain the necessary data only by capturing an image of the device; the image pixels must prove that the user took the picture in its vicinity.
To avoid spoofing, all previously presented patterns can be cached - at a remote server or at the device - and checked against new data as it is received. If the same pattern is presented twice, authorization may be refused - as an apparent replay attack (i.e., each image of the device should exhibit some variation at the pixel level). In some arrangements, the appearance of the device changes over time (e.g., by a display that presents a periodically varying pattern of pixels), and the presented data must correspond to the device's appearance within the immediately preceding time interval (e.g., the past 5 seconds or 5 minutes).
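The replay check just described might be sketched as follows - caching a hash of each presented pattern and rejecting exact repeats (the hashing scheme and in-memory cache are illustrative assumptions):

```python
import hashlib

# Sketch of the replay-attack check described above: a digest of each
# presented image pattern is cached; an exact repeat is rejected, since
# two genuine captures should differ at the pixel level.

seen = set()

def verify_capture(image_bytes):
    digest = hashlib.sha256(image_bytes).hexdigest()
    if digest in seen:
        return False        # identical pattern presented twice: replay
    seen.add(digest)
    return True

first = b"...pixel data from a live capture..."
print(verify_capture(first))          # True: new capture
print(verify_capture(first))          # False: byte-identical replay
print(verify_capture(first + b"!"))   # True: differs at the pixel level
```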
In related embodiments, other analog information (appearance, sound, temperature, etc.) can be sensed from the device or its environment, and used to establish the user's proximity to the device. (The imperfect nature of analog information, when converted to digital form, can again be used to detect replay attacks.)
One simple application of this arrangement is a scavenger hunt - taking a picture of the device proves the presence of the user at the device. A more practical application is in industrial settings, guarding against people trying to remotely access devices at which they are not physically present.
A great number of variations and hybrids of such arrangements will be apparent to those skilled in the art from the foregoing.
Reference has sometimes been made to SIFT techniques. SIFT is an acronym for Scale-Invariant Feature Transform, a computer vision technique pioneered by David Lowe and described in various of his papers, including "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, 60, 2 (2004), pp. 91-110, and "Object Recognition from Local Scale-Invariant Features," International Conference on Computer Vision, Corfu, Greece (September 1999), pp. 1150-1157, as well as in patent 6,711,293.
SIFT works by identification and description - and subsequent detection - of local image features. SIFT features are local, based on the appearance of an object at particular interest points, and are invariant to image scale, rotation, and affine transformation. They are also robust to changes in illumination, to noise, and to some changes in viewpoint. In addition to these properties, they are highly distinctive, relatively easy to extract, allow correct object identification with low probability of mismatch, and are straightforward to match against a (large) database of local features. Object description by a set of SIFT features is also robust to partial occlusion; as few as three SIFT features from an object are enough to compute its location and pose.
The technique starts by identifying local image features - termed keypoints - in a reference image. This is done by convolving the image with Gaussian blur filters at different scales (resolutions), and determining differences between successive Gaussian-blurred images. Keypoints are those image features having maxima or minima of the difference of Gaussians occurring at multiple scales. (Each pixel of a difference-of-Gaussians frame is compared with its eight neighbors at the same scale, and with the corresponding pixels - and their eight neighbors - in each of the adjoining scales. If the pixel value is a maximum or minimum among all of these, it is selected as a candidate keypoint.)
(The procedure just described is a method of detecting space-scale extrema of a scale-localized Laplacian transform of the image. The difference-of-Gaussians approach is an approximation of this Laplacian operation, expressed in a pyramid setting.)
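The extremum search just described can be sketched in simplified one-dimensional form, which keeps the position-and-scale neighborhood comparison visible without full image machinery. This is an illustrative toy: a crude box blur stands in for a true Gaussian, and real SIFT operates in 2D over an image pyramid:

```python
# Simplified 1D sketch of difference-of-Gaussians extremum detection:
# blur a signal at successive scales, difference adjacent blurs, and
# keep samples that exceed (or fall below) all neighbors in both
# position AND scale - the candidate-keypoint test described above.

def blur(signal, radius):
    """Crude box blur standing in for a Gaussian at a given scale."""
    n = len(signal)
    return [sum(signal[max(0, i - radius):i + radius + 1]) /
            len(signal[max(0, i - radius):i + radius + 1]) for i in range(n)]

def dog_extrema(signal, radii=(1, 2, 3, 4)):
    blurs = [blur(signal, r) for r in radii]
    dogs = [[a - b for a, b in zip(blurs[i + 1], blurs[i])]
            for i in range(len(blurs) - 1)]
    keypoints = []
    for s in range(1, len(dogs) - 1):          # interior scales only
        for x in range(1, len(signal) - 1):    # interior positions only
            neighbors = [dogs[s2][x2]
                         for s2 in (s - 1, s, s + 1)
                         for x2 in (x - 1, x, x + 1)
                         if not (s2 == s and x2 == x)]
            v = dogs[s][x]
            if v > max(neighbors) or v < min(neighbors):
                keypoints.append((x, s))       # (position, scale)
    return keypoints

signal = [0, 0, 1, 5, 9, 4, 1, 0, 0, 0]   # a single blob
print(dog_extrema(signal))                 # candidate (position, scale) pairs
```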
The above procedure typically identifies many keypoints that are unsuitable, e.g., because they have low contrast (and are thus sensitive to noise), or because they are poorly located along an edge (the difference-of-Gaussians function has a strong response along edges, yielding many candidate keypoints that are not robust to noise). Unreliable keypoints are screened out by performing a detailed fit of each candidate keypoint to nearby data for accurate location, scale, and ratio of principal curvatures. This rejects keypoints that have low contrast, or are poorly located along an edge.
More particularly, this process starts - for each candidate keypoint - by interpolating nearby data to more accurately determine the keypoint's position. This is often done by a Taylor expansion with the keypoint as the origin, to determine a refined estimate of the maximum/minimum position.
The value of this second-order Taylor expansion can also be used to identify low-contrast keypoints. If the contrast is less than a threshold value (e.g., 0.03), the keypoint is discarded.
To eliminate keypoints that have strong edge responses but are poorly localized, a variant of a corner detection procedure is applied. Briefly, this entails computing the principal curvature across the edge, and comparing it to the principal curvature along the edge. This is done by solving for the eigenvalues of a second-order Hessian matrix.
Once unsuitable keypoints are discarded, those remaining are assessed for orientation, by a local image gradient function. The magnitude and direction of the gradient are calculated (at the keypoint's scale) for every pixel in a neighboring region around the keypoint in the Gaussian-blurred image. An orientation histogram with 36 bins is then compiled - each bin covering 10 degrees of orientation. Each pixel in the neighborhood contributes to the histogram, with the contribution weighted by its gradient magnitude and by a Gaussian with sigma 1.5 times the scale of the keypoint. The peaks in this histogram define the keypoint's dominant orientation. This orientation data allows SIFT to achieve rotation robustness, since the keypoint descriptor can be represented relative to this orientation.
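The 36-bin histogram computation can be sketched as follows. For brevity this toy version omits the Gaussian spatial weighting described above, and the gradient list is invented for illustration:

```python
# Sketch of the 36-bin orientation histogram: each neighborhood pixel's
# gradient direction contributes its gradient magnitude to a 10-degree
# bin; the peak bin defines the keypoint's dominant orientation.
# (The Gaussian spatial weight described in the text is omitted here.)

def dominant_orientation(gradients):
    """gradients: list of (magnitude, direction_in_degrees) pairs."""
    bins = [0.0] * 36
    for mag, direction in gradients:
        bins[int(direction % 360) // 10] += mag
    peak = max(range(36), key=lambda b: bins[b])
    return peak * 10 + 5   # center of the winning 10-degree bin

# Invented neighborhood gradients, mostly pointing near 15 degrees:
neighborhood = [(2.0, 12), (1.5, 17), (0.3, 200), (2.5, 14), (0.1, 90)]
print(dominant_orientation(neighborhood))   # -> 15
```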
From the foregoing, plural keypoints at different scales are identified - each with corresponding orientations. This data is invariant to image translation, scale, and rotation. 128-element descriptors are then generated for each keypoint, allowing robustness to illumination and 3D viewpoint as well.
This operation is similar to the orientation assessment procedure just reviewed. The keypoint descriptor is computed as a set of orientation histograms on (4 x 4) pixel neighborhoods. The orientation histograms are relative to the keypoint orientation, and the orientation data comes from the Gaussian image closest in scale to the keypoint's scale. As before, the contribution of each pixel is weighted by the gradient magnitude, and by a Gaussian with sigma 1.5 times the scale of the keypoint. The histograms contain 8 bins each, and each descriptor contains a 4 x 4 array of 16 histograms around the keypoint. This yields a SIFT feature vector with (4 x 4 x 8 =) 128 elements. This vector is normalized to enhance invariance to changes in illumination.
The foregoing procedure is applied to training images to compile a reference database. An unknown image is then processed as above to generate keypoint data, and the closest-matching image in the database is identified by a Euclidean distance-like measure. (A "best-bin-first" algorithm is typically used instead of a pure Euclidean distance computation, to achieve several orders of magnitude speed improvement.) To avoid false positives, a "no match" output is produced if the distance score for the best match is close - e.g., within 25% - to the distance score for the next-best match.
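The accept/reject rule just described - accept the nearest database descriptor only if the runner-up is sufficiently worse - can be sketched as follows (the three-element descriptors and labels are invented; real SIFT descriptors have 128 elements):

```python
import math

# Sketch of the matching rule described above: report the nearest
# reference descriptor only if its distance is well below that of the
# next-best candidate; otherwise report no match (a ratio test).

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match(query, database, ratio=0.75):
    """database: list of (label, descriptor). Returns a label or None."""
    scored = sorted((euclidean(query, d), label) for label, d in database)
    best, runner_up = scored[0], scored[1]
    if best[0] < ratio * runner_up[0]:
        return best[1]
    return None   # nearest and next-nearest too close: ambiguous

db = [("thermostat", [1.0, 0.0, 0.0]),
      ("clock",      [0.0, 1.0, 0.0]),
      ("meter",      [0.0, 0.0, 1.0])]
print(match([0.9, 0.1, 0.0], db))   # -> thermostat
print(match([0.5, 0.5, 0.0], db))   # -> None (two equally good candidates)
```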
To further improve performance, an image may be matched by clustering. This identifies features that belong to the same reference image - allowing unclustered results to be discarded as spurious. A Hough transform can be used - identifying clusters of features that vote for the same object pose.
One paper detailing a particular hardware embodiment for performing the SIFT procedure is Bonato et al., "Parallel Hardware Architecture for Scale and Rotation Invariant Feature Detection," IEEE Trans. on Circuits and Systems for Video Tech. A block diagram of such an arrangement 70 is provided in FIG. 18 (adapted from Bonato).
In addition to the camera 32 that produces pixel data, there are three hardware modules 72-74. Module 72 receives pixels from the camera as input and performs two types of operations: Gaussian filtering, and difference of Gaussians. The former results are sent to module 73; the latter to module 74. Module 73 calculates pixel orientation and gradient magnitude. Module 74 detects keypoints, and performs stability checks to help ensure that keypoints reliably identify features.
A software block 75 (running on an Altera NIOS II field programmable gate array) generates a descriptor for each feature detected by block 74, based on the pixel orientation and gradient magnitude produced by block 73.
In addition to running the different modules simultaneously, there is parallelism within each hardware block. An exemplary implementation by Bonato processes 30 frames per second. Cell phone implementations may run somewhat slower, such as 10 fps in initial generations.
The reader is referred to the Bonato paper for further details.
An alternative hardware architecture for executing SIFT techniques is detailed in Se et al., "Vision Based Modeling and Localization for Planetary Exploration Rovers," Proc. of Int. Astronautical Congress (IAC).
Another arrangement is detailed in Henze et al., "What is That? Object Recognition from Natural Features on a Mobile Phone," Mobile Interaction with the Real World, Bonn, 2009. Henze et al. employ techniques by Nister et al. and by Schindler et al. to extend the universe of objects that can be recognized, through use of tree schemes (see, e.g., Nister et al., "Scalable Recognition with a Vocabulary Tree," Proc. of Computer Vision and Pattern Recognition, 2006; and Schindler et al., "City-Scale Location Recognition," Proc. of Computer Vision and Pattern Recognition, 2007).
The implementations described above may run on a cell phone platform, or the processing may be distributed between the cell phone and one or more remote service providers (or all of the image processing may be executed by the phone).
Published patent application WO07/130688 concerns a cell phone-based implementation of SIFT, in which the local descriptor features are extracted by the cell phone processor and transmitted to a remote database for matching against a reference library.
SIFT is perhaps the best-known technique for generating robust local descriptors, but there are others, which may be more or less suitable depending on the application. These include GLOH (c.f., Mikolajczyk et al., "Performance Evaluation of Local Descriptors," IEEE Trans. Pattern Anal. Mach. Intell., Vol. 27, No. 10, pp. 1615-1630, 2005) and SURF (c.f., Bay et al., "SURF: Speeded Up Robust Features," Eur. Conf. on Computer Vision (1), pp. 404-417, 2006); as well as Chen et al., "Efficient Extraction of Robust Image Features on Mobile Devices," Proc. of the 6th IEEE and ACM Int. Symp. on Mixed and Augmented Reality, 2007; and Takacs et al., "Outdoors Augmented Reality on Mobile Phone Using Loxel-Based Visual Feature Organization," ACM Int. Conf. on Multimedia Information Retrieval, October 2008. A survey of local descriptor features is provided in Mikolajczyk et al., "A Performance Evaluation of Local Descriptors," IEEE Trans. on Pattern Analysis & Machine Intelligence.
The Takacs paper teaches that image-matching speed is greatly increased by restricting the universe of reference images (from which matches are derived) to those geographically proximate to the user's current location (e.g., within 30 meters). Applicants believe it may likewise be advantageous to restrict the universe to specialized domains, such as faces, groceries, homes, etc., by user selection or otherwise.
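The geographic restriction just described might be sketched as follows (an illustrative sketch, not the Takacs implementation; the reference-record fields are hypothetical). Each candidate reference image is kept only if its capture location lies within a radius of the user, using the standard haversine great-circle distance:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6371000.0  # mean Earth radius, meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearby_references(refs, user_lat, user_lon, radius_m=30.0):
    """Restrict the reference universe to images captured near the user."""
    return [ref for ref in refs
            if haversine_m(ref["lat"], ref["lon"], user_lat, user_lon) <= radius_m]
```

Matching then proceeds against the (much smaller) returned list rather than the full reference library.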
More on Audio Applications
Voice conversation on a mobile device naturally defines the structure of a session, and provides a significant amount of metadata (most notably in the form of identified callers, geographic locations, etc.) that can be leveraged to prioritize audio key vector processing.
If a call is received without CallerID information, this may trigger voice pattern matching against past calls still stored in voicemail, or against key vector data stored for this purpose. (Google Voice, as a long-term repository of voice data, is potentially useful for such recognition or matching purposes.)
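Such voice pattern matching might be sketched as follows (an illustrative sketch only; the voiceprint vectors and the 0.9 threshold are hypothetical, standing in for whatever key vector representation the system actually derives from stored calls). An unknown caller's voiceprint is compared by cosine similarity against prints retained from past calls:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two voiceprint vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_caller(unknown_print, stored_prints, threshold=0.9):
    """Return the past caller whose stored voiceprint best matches the
    unknown caller's, if the similarity exceeds the threshold, else None.
    'stored_prints' maps caller name -> voiceprint vector."""
    best, best_score = None, threshold
    for name, vec in stored_prints.items():
        s = cosine_sim(unknown_print, vec)
        if s > best_score:
            best, best_score = name, s
    return best
```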
Speech recognition functional blocks can be invoked taking the call's geographic origin into account, where the origin can be identified but the number is unfamiliar (e.g., not found in the user's contact list). For example, if the origin is foreign, speech recognition in that country's language can be started; when the user answers the call, simultaneous voice-to-text conversion into the user's native language can be initiated and displayed on the screen to assist with translation. If the geography is domestic, recalling dialect- or accent-specific speech recognition libraries for the locale can allow the system to more easily cope with, e.g., a distinctive Southern drawl or a Boston accent.
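This geography-driven selection of a recognition library might be sketched as a simple dispatch (the country/region codes and library names below are hypothetical placeholders for whatever recognition packs a handset actually ships with):

```python
# Hypothetical mapping from a call's country of origin to a speech-
# recognition language pack, and from a domestic region to an accent library.
LANG_BY_COUNTRY = {"FR": "french", "JP": "japanese", "DE": "german"}
ACCENT_BY_REGION = {"GA": "southern-us", "TX": "southern-us", "MA": "boston"}

def pick_recognizer(country, region=None, home_country="US"):
    """Choose which recognition library to load for an unfamiliar caller."""
    if country != home_country:
        # Foreign origin: recognize in that country's language if known.
        return LANG_BY_COUNTRY.get(country, "generic")
    # Domestic origin: prefer a regional accent library if one exists.
    return ACCENT_BY_REGION.get(region, "general-american")
```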
Once a conversation has been initiated, prompts based on speech recognition may be presented on the cell phone screen (or otherwise). If the speaker on the far end begins discussing a particular topic, the system may generate native-language queries to reference sites such as Wikipedia, consult the local user's calendar to check availability, and so forth.
Beyond evaluating and processing speech during a session, other audio can be analyzed as well. If the user on the far end of the conversation chooses not to, or cannot, perform local processing and key vector generation, this can be accomplished on the local user's handset, allowing the remote experience to be shared locally.
Plainly, all of the foregoing applies to video calls as well, where both the audio and the visual information can be analyzed and processed into key vectors.
Public Images for Individual Processing
Most of the foregoing discussion has concerned images captured by personal devices, such as mobile phones. However, the principles and arrangements discussed are also applicable to other imagery.
Consider the problem of finding a parked vehicle in a crowded lot. The owner of the parking lot can erect one or more pole-mounted cameras that capture imagery of the lot from an advantageous vantage point. These images can be made available publicly (e.g., by a file or page accessible from the Internet) or locally (e.g., by a file or page downloadable from a local wireless network). Individuals can acquire images from these cameras and analyze them for individual purposes, such as determining where their vehicles are parked. For example, a user's mobile phone may download one or more such images and apply image processing (machine vision) techniques of the sort described above to recognize the user's red Honda Civic, and thus find it in the parking lot (if multiple matches are found, multiple candidate locations are identified).
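As a much-simplified sketch of such machine-vision analysis (illustrative only; a practical system would use the feature-matching techniques detailed earlier, and the color thresholds here are arbitrary assumptions), an overhead lot image can be scanned in tiles, flagging tiles dominated by strongly red pixels as candidate locations for a red vehicle:

```python
import numpy as np

def red_fraction(patch):
    """Fraction of pixels in an RGB patch that are strongly red."""
    p = patch.astype(int)
    r, g, b = p[..., 0], p[..., 1], p[..., 2]
    return ((r > 150) & (r > g + 50) & (r > b + 50)).mean()

def candidate_locations(image, tile=16, min_frac=0.5):
    """Scan an overhead parking-lot image in tiles; return (row, col)
    corners of tiles that look like a red vehicle. If several tiles
    qualify, several candidate parking spots are reported."""
    h, w = image.shape[:2]
    hits = []
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            if red_fraction(image[y:y + tile, x:x + tile]) >= min_frac:
                hits.append((y, x))
    return hits
```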
In a variant arrangement, the user's mobile phone simply provides a template of data characterizing the desired vehicle to a web service (e.g., one operated by the owner of the parking lot). The web service then analyzes the available images to identify candidate vehicles matching the data template.
In some arrangements, the cameras can be mounted on adjustable gimbals and equipped with zoom lenses, providing pan/tilt/zoom controls. (One such camera is the Axis 215 PTZ-E.) These controls can be made accessible to users, allowing a user to capture images from a particular part of the lot for analysis (e.g., "I know I parked near the Macy's entrance"), or to adjust the camera to better distinguish between candidate matching vehicles when analysis of other images has proven inconclusive. (To prevent abuse, camera control privileges may be granted only to authorized users. One way to establish authorization is by the parking slip issued when the user's vehicle first enters the lot. This document can be shown to the user's mobile phone camera, and can bear printed information (e.g., alphanumeric, bar code, watermark, etc.) indicating that the user parked in that lot within a certain period of time, which information can be verified, e.g., by a server associated with the parking lot.)
Instead of (or in addition to) pole-mounted cameras provided by the lot owner, a similar "Find My Car" function can be achieved through the use of crowd-sourced imagery. If many individual users run a "Find My Vehicle" application that processes the images captured by their mobile phones to find their respective vehicles, the images so captured can be shared for others' benefit. Thus, user "A" may roam one set of aisles searching for a particular vehicle, user "B" may travel elsewhere in the lot, and the image feeds from each can be shared. The mobile phone of user B may thereby find B's vehicle in imagery captured by user A's phone.
The collected imagery may be stored in an archive accessible through the local wireless network, from which images are removed after a set period of time, e.g., two hours. Desirably, geolocation data is associated with each image, so that a vehicle matched in an image captured by a different user can be physically located in the lot. Such image archives can be maintained by the parking lot owner (or by a service contracted by the owner). Alternatively, images may be sourced from elsewhere. For example, mobile phones may post captured images, automatically or in response to user commands, for storage in one or more online archives such as Flickr, Picasa, and the like. The user's "Find My Vehicle" application may query one or more such archives for images captured geographically nearby (e.g., within 10 or 100 yards of the user's location, or of a reference location) and within a particular prior time interval (e.g., within the past 10 or 60 minutes). Such analysis of third-party imagery can serve to find the user's vehicle.
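The archive query just described might be sketched as a filter on recency and proximity (an illustrative sketch; the record fields are hypothetical, and for brevity a simple lat/lon box of roughly 100 meters stands in for a proper geodesic distance test):

```python
import time

def query_archive(records, lat, lon, max_dist_deg=0.001, max_age_s=3600, now=None):
    """Select crowd-sourced images captured nearby and within the past hour.
    Each record is a dict with 'lat', 'lon', and a POSIX timestamp 'ts'."""
    now = time.time() if now is None else now
    return [rec for rec in records
            if abs(rec["lat"] - lat) <= max_dist_deg
            and abs(rec["lon"] - lon) <= max_dist_deg
            and now - rec["ts"] <= max_age_s]
```

Only the returned, recent, nearby images then need to be analyzed for the user's vehicle.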
Expanding the scope a bit, consider all the cameras in the world whose output is now publicly accessible (the rapidly multiplying highway cameras are just one example) as simply ever-changing web pages of "data." (Imagery from private camera networks, such as security cameras, might also be made available, with suitable privacy protections or other cleansing of sensitive information.) In the past, "data" was dominated by "text," which gave rise to search engines and to Google. With highly distributed camera networks, however, an ever larger share of the ubiquitous data takes the form of pixels. Using data structures such as standardized-format key vectors (which may include time/location data), together with the other techniques described herein, service providers can compile and/or publish classifications (or indexes) of large collections of current and past public imagery, with key vector classes and time/location serving as the bases of new search components. Text search recedes in importance, and new query paradigms, based on visual stimulus and key vector properties, emerge.
Having described and illustrated the principles of our inventive work with reference to illustrative examples, it will be recognized that the technology is not so limited.
For example, while reference has been made to cell phones, it will be recognized that the technology finds utility with all manner of devices, both portable and fixed. PDAs, organizers, portable music players, desktop computers, laptop computers, tablet computers, netbooks, ultraportables, wearable computers, servers, etc., can all make use of the principles detailed herein. Particularly contemplated cell phones include the Apple iPhone, and cell phones following Google's Android specification (e.g., the G1 phone, manufactured by HTC Corp. for T-Mobile). The terms "cell phone" (and "mobile phone") should be construed to encompass all such devices, even those that are not strictly speaking cellular, nor telephones.
(Details of the iPhone, including the touch interface, are provided in Apple's published patent application 20080174570.)
The design of cell phones and other computers referenced in this disclosure is familiar to the artisan. In general terms, each may include one or more processors, one or more memories (e.g., RAM), storage (e.g., a disk or flash memory), a user interface (which may include, e.g., a keypad, a TFT LCD or OLED display screen, touch or other gesture sensors, a camera or other optical sensor, a compass sensor, a 3D magnetometer, a 3-axis accelerometer, a microphone, etc., together with software instructions for providing a graphical user interface), interconnections between these elements (e.g., buses), and an interface for communicating with other devices (which may be wireless, such as GSM, CDMA, W-CDMA, CDMA2000, TDMA, EV-DO, HSDPA, WiFi, WiMax, or Bluetooth, and/or wired, such as through an Ethernet local area network, a T-1 Internet connection, etc.).
The arrangements described herein may also be utilized in portable monitoring devices, such as the pager-sized devices that sense ambient media for audience survey purposes, e.g., Personal People Meters (PPMs). (See, for example, Nielsen patent publication 20090070797, and Arbitron patents 6,871,180 and 7,222,071.) The same principles can also be applied to the different types of content that may be delivered to users online. In this regard, reference is made to Nielsen patent application 20080320508, which describes a network-connected media monitoring device.
At the beginning of this specification, the assignee's prior patent applications were noted; the point bears repeating. These disclosures should be read in concert and construed as a whole. Applicants intend that the features in each be combined with the features in the others. Thus, for example, the signal processing disclosed in applications 12/271,772 and 12/490,980 may be implemented using the architectures and cloud arrangements described herein, and the crowd-sourced databases, cover-flow user interfaces, and other features described in the '772 and '980 applications may be incorporated into implementations of the presently disclosed technology. Accordingly, it should be understood that the methods, elements, and concepts disclosed in the present application may be combined with the methods, elements, and concepts detailed in those related applications. Some such combinations have been particularly detailed herein, but many have not, due to the large number of permutations and combinations. However, implementation of all such combinations is straightforward to the artisan from the teachings provided.
The elements and teachings of the different embodiments disclosed herein are also meant to be exchanged and combined. For example, the teachings set forth in the contexts of Figs. 1-12 may be used in the arrangements of Figs. 14-20, and vice versa.
The processes and system components detailed herein may be implemented as instructions for computing devices, including general-purpose processor instructions for a variety of programmable processors, such as microprocessors, graphics processing units (GPUs, e.g., the nVidia Tegra APX 2600), digital signal processors (e.g., Texas Instruments TMS320 series devices), etc. These instructions may be implemented as software, firmware, etc. These instructions can also be implemented in various forms of processor circuitry, including programmable logic devices, FPGAs (e.g., the well-known Xilinx Virtex series devices), and FPOAs (e.g., the well-known PicoChip devices), including digital, analog, and mixed analog/digital circuitry. Execution of the instructions can be distributed among processors and/or made parallel across processors within a device or across a network of devices. Transformation of content signal data may also be distributed among different processor and memory devices. References to "processors" or "modules" (such as a Fourier transform processor, or an FFT module) should be understood to refer to functionality, rather than to require a particular form of implementation.
References to FFTs should also be understood to include inverse FFTs and associated transforms (e.g., DFT, DCT, their respective inverse, etc.).
Software instructions for implementing the detailed functionality can be readily authored by the artisan from the descriptions provided herein, e.g., written in C, C++, VB, Java, Python, Tcl, Perl, and the like. Cell phones and other devices according to certain implementations of the present technology can include software modules for performing the different functions and operations. Known artificial intelligence systems and techniques can be employed to make the inferences, conclusions, and other determinations noted above.
Typically, each device includes operating system software that provides interfaces to hardware resources and general-purpose functions, and also includes application software that can be selectively invoked to perform particular tasks desired by a user. Known browser software, communications software, and media processing software can be adapted for many of the uses detailed herein. Software and hardware configuration data/instructions are commonly stored as instructions in one or more data structures conveyed by tangible media, such as magnetic or optical discs, memory cards, ROM, etc., which may be accessed across a network. Some embodiments may be implemented as embedded systems, special-purpose computer systems in which the operating system software and application software are indistinguishable to the user (e.g., as is common in basic cell phones). The functionality detailed in this specification can be implemented in operating system software, application software, and/or embedded system software.
Different functionality can be implemented on different devices. For example, in a system in which a cell phone communicates with a server of a remote service provider, different tasks can be performed exclusively by one device or the other, or execution can be distributed between the devices. The extraction of eigenvalue data from imagery is but one example of such an operation. Thus, description of an operation as being performed by a particular device (e.g., a cell phone) is illustrative rather than limiting; performance of the operation by another device (e.g., a remote server), or shared between devices, is also expressly contemplated. (Moreover, more than two devices may commonly be employed. E.g., a service provider may employ several servers dedicated to such tasks as image search, object segmentation, and/or image classification.)
(In like fashion, description of data as being stored on a particular device is also illustrative; data can be stored anywhere: on a local device, on a remote device, in the cloud, distributed, etc.)
Operations need not be performed exclusively by specifically-identifiable hardware. Rather, some operations can be referred out to other services (e.g., cloud computing), which attend to their execution by still other, generally anonymous, systems. Such distributed systems can be large in scale (e.g., involving computing resources around the globe), or local (e.g., as when a portable device identifies nearby devices through Bluetooth communication and involves one or more of them in a task, such as contributing data from the local geography; see in this regard patent 7,254,406 to Beros).
Similarly, while certain functions have been detailed as being performed by certain modules (e.g., control processor module 36, pipe manager 51, the query router and response manager of Fig. 7, etc.), in other implementations such functions can be performed by other modules, or by application software (or both).
The reader will note that certain of the foregoing discussions considered arrangements in which most image processing is performed on the cell phone. In such arrangements, external resources are used more as sources of data (e.g., Google) than as providers of image-processing services. Such arrangements can naturally be implemented instead using principles discussed in other sections, with some or all of the heavy processing of image-related data referred out to external processors (service providers).
Likewise, while this disclosure has detailed particular orderings of operations and particular combinations of elements in the illustrative embodiments, it will be recognized that other contemplated methods may reorder the operations, and that other contemplated combinations may omit some elements and add others.
While disclosed in the context of complete systems, sub-combinations of the detailed arrangements are also separately contemplated.
In the illustrative embodiments, reference has been made to the Internet. In other embodiments, other networks, including private networks of computers, can be employed instead, or in addition.
The reader will find that different names are sometimes used herein when referring to similar or identical components, processes, and the like. This is due, in part, to the development of this patent specification over a period of nearly a year, with usage of terms evolving over that time. Thus, for example, "visual query packet" and "key vector" may refer to similar concepts; the same is true of other terms.
In some modes, cell phones utilizing aspects of the present technique may be considered as observation state machines.
Although primarily described above in the context of systems that perform image capture and processing, corresponding arrangements are equally applicable to systems that capture and process audio, or that capture and process both imagery and audio.
In audio-based systems, some of the processing modules will naturally differ. For example, audio processing commonly relies on critical band sampling (per the human auditory system). Cepstral processing (a DCT of the power spectrum) is also frequently used.
An exemplary processing chain may include a band-pass filter applied to the audio captured by the microphone, removing low and high frequencies and leaving, e.g., the band 300-3000 Hz. A decimation stage may follow (e.g., reducing the sampling rate from 40K samples/second to 6K samples/second). An FFT can then follow. Power spectrum data can be computed by squaring the output coefficients from the FFT (and these can be grouped to effect critical band segmentation). Thereafter, a DCT can be performed to generate cepstral data. Outputs from any of these stages can be sent to the cloud for application processing, such as speech recognition, language translation, anonymization (returning the same utterances in a different voice), and the like. Remote systems can also be responsive to commands spoken by the user and captured by the microphone, e.g., to control other systems, to provide information for use by other processes, and the like.
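The stages of this chain might be sketched as follows (illustrative only: the band-pass is implemented crudely by zeroing FFT bins, the decimator is a bare sample-rate reduction by an integer factor, so the text's 40K-to-6K reduction would in practice require rational resampling, and the DCT-II is written out directly rather than drawn from a signal-processing library):

```python
import numpy as np

def bandpass_fft(signal, fs, lo=300.0, hi=3000.0):
    """Crude band-pass: zero the FFT bins outside [lo, hi] Hz."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spec[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spec, n=len(signal))

def decimate(signal, factor):
    """Naive sample-rate reduction; the band-pass above has already
    removed content near the new Nyquist frequency."""
    return signal[::factor]

def dct_ii(x):
    """DCT-II, used to turn a log power spectrum into cepstral data."""
    n = len(x)
    k = np.arange(n)
    return np.array([np.sum(x * np.cos(np.pi * (k + 0.5) * m / n))
                     for m in range(n)])

def cepstrum(frame):
    """Power spectrum (squared FFT coefficients), then DCT of its log."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    return dct_ii(np.log(power + 1e-12))
```

Any of the intermediate outputs (band-passed audio, power spectrum, cepstral frame) could be packaged as a key vector and referred to the cloud.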
It will be recognized that the detailed processing of content signals includes transformation of these signals into various physical forms. Images and video (forms of electromagnetic waves depicting physical objects and traveling through physical space) may be captured from physical objects using cameras or other capture equipment, or may be generated by computing devices. Similarly, audio pressure waves traveling through a physical medium may be captured using an audio transducer (e.g., a microphone) and converted to an electronic signal (digital or analog form). While these signals are typically processed in electronic and digital form to implement the components and processes described above, they may also be captured, processed, transferred, and stored in other physical forms, including electronic, optical, and magnetic forms. The content signals are transformed in various ways and for various purposes during processing, producing various data structure representations of the signals and related information. In turn, the data structure signals in memory are transformed for manipulation during searching, sorting, reading, writing, and retrieval. The signals are also transformed for capture, transfer, storage, and output via a display or audio transducer (e.g., speakers).
In some embodiments, an appropriate response to captured imagery can be determined by reference to data stored on the device, without reference to any external resource. (The registry database used by many operating systems is one place where response-related data for certain inputs can be specified.) Alternatively, information can be sent to a remote system, which determines the response.
The drawings not specifically identified above illustrate aspects of the disclosed technique or aspects of exemplary embodiments.
The information sent from a device may be raw pixels, an image in compressed form, a transformed counterpart of the image, features/metrics extracted from the image data, etc. All of these can be regarded as image data. The data type may be recognized by the receiving system, or may be expressly identified to it (e.g., bitmap, eigenvectors, Fourier-Mellin transform data, etc.), and the receiving system can employ the data type accordingly.
If the transmitted data is full image data (raw or compressed), the packets received by the processing system will essentially never be duplicates; every image differs somewhat. However, if the originating device processes the image to extract features, metrics, or the like, the receiving system may, from time to time, receive a packet identical to one it has encountered before. In such case, the response to that "snap packet" (also termed a "pixel packet" or "key vector") can be recalled from a cache, rather than determined anew. (If available and applicable, the response information may be adapted in accordance with user preference information.)
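Such caching might be sketched as follows (an illustrative sketch; the class and method names are hypothetical). The receiving system keys previously-determined responses by a digest of the incoming packet bytes, so an identical key vector packet hits the cache:

```python
import hashlib

class KeyVectorCache:
    """Cache of responses, keyed by a digest of the incoming key vector
    ('pixel packet') payload. Full images essentially never repeat, but
    extracted-feature packets can, making their responses worth caching."""

    def __init__(self):
        self._store = {}

    def _digest(self, packet: bytes) -> str:
        return hashlib.sha256(packet).hexdigest()

    def lookup(self, packet: bytes):
        """Return a cached response for this packet, or None."""
        return self._store.get(self._digest(packet))

    def remember(self, packet: bytes, response):
        """Record the response determined for this packet."""
        self._store[self._digest(packet)] = response
```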
In certain embodiments, it may be desirable for the capture device to include some form of biometric authentication, such as a fingerprint reader integrated with the shutter button, to ensure that a known user is operating the device.
Some embodiments may capture several images of an object from different views (e.g., a video clip). Algorithms for synthesizing a 3D model of the imaged object can then be applied. From such a model, new views of the object can be derived, views that may be better suited as stimuli for the processes detailed above (e.g., avoiding an occluding foreground object).
In embodiments employing text descriptors, it is sometimes desirable to augment the descriptors with synonyms, hyponyms (more specific terms), and/or hypernyms (more general terms). These can be obtained from a variety of sources, including the WordNet database compiled by Princeton University.
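Such augmentation might be sketched as follows (illustrative only; the tiny in-line thesaurus below is hypothetical stand-in data, where a real system would draw these relations from WordNet or a similar resource):

```python
# Hypothetical mini-thesaurus standing in for a resource such as WordNet.
THESAURUS = {
    "dog": {"synonyms": ["canine"],
            "hyponyms": ["poodle", "beagle"],   # more specific terms
            "hypernyms": ["animal"]},           # more general terms
}

def augment(descriptors):
    """Expand text descriptors with synonyms, hyponyms, and hypernyms."""
    out = set(descriptors)
    for term in descriptors:
        entry = THESAURUS.get(term, {})
        for relation in ("synonyms", "hyponyms", "hypernyms"):
            out.update(entry.get(relation, []))
    return sorted(out)
```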
While most of the foregoing embodiments were described in the context of a cell phone submitting image data to a service provider, which triggers a corresponding response, the technology is more generally applicable, wherever processing of imagery or other content occurs.
The focus of this disclosure has been on imagery. However, the detailed technologies are also useful with audio and video. They are particularly useful with UGC (user generated content) sites, such as YouTube. Videos are often uploaded with little metadata. Various techniques can be applied (e.g., watermark reading, fingerprinting, human review, etc.) to identify such content with varying degrees of certainty, and this identification metadata is stored. Other metadata is accumulated based on the profiles of users viewing the video. Still other metadata can be gleaned from comments later posted by users about the video. (UGC-related arrangements that applicants intend be incorporated into this technology are disclosed in published patent applications 20080208849 and 20080228733 (Digimarc), 20080165960 (TagStory), 20080162228 (Trivid), 20080178302 and 20080059211 (Attributor), 20080109369 (Google), 20080249961 (Nielsen), and 20080209502 (MovieLabs).) With arrangements as detailed herein, appropriate ad/content pairings can be determined, and other enhancements to the user experience can be provided.
Similarly, the detailed techniques can be used with audio captured by user devices, and with speech recognition applied thereto. Information gleaned from any captured content (e.g., OCR'd text, decoded watermark data, recognized speech) can serve as metadata for the purposes detailed herein.
Cross-media applications of this technology are also contemplated. For example, an image may be pattern-matched or GPS-matched to identify a similar set of images on Flickr. Metadata descriptors may be gleaned from that similar set of images and used to query audio and/or video metadata. Thus, a user capturing and submitting an image of a trail marker on the Appalachian Trail (Fig. 38) may trigger the download, to the user's cell phone or home entertainment system, of an orchestral recording of Aaron Copland's "Appalachian Spring." (See, e.g., patent publication 20070195987 concerning delivery of content to different destinations that may be associated with a user.)
Repeated reference has been made to GPS data. This should be understood as shorthand for any location-related information; it need not be derived from the Global Positioning System constellation of satellites. For example, another technology suitable for generating location data relies on the wireless signals (e.g., WiFi, cellular, etc.) commonly exchanged between devices. Given several communicating devices, the signals themselves, and the imperfect digital clock signals that control them, form a reference system from which both highly accurate time and position can be extracted. Such technology is detailed in published international patent publication WO08/073347. The artisan may also employ other location-estimation technologies, including those based on broadcast radio and television towers (as offered by Rosum) and on WiFi nodes (as offered by Skyhook Wireless, and employed in the iPhone).
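The position-from-signals idea can be illustrated with a standard least-squares trilateration (a textbook sketch, not the method of the cited publication; it assumes ranges to transmitters at known 2-D locations have already been derived from signal timing):

```python
import numpy as np

def trilaterate(anchors, dists):
    """Estimate a 2-D position from ranges to known transmitter locations,
    by subtracting the first circle equation from the rest to obtain a
    linear system A p = b, solved by least squares."""
    anchors = np.asarray(anchors, float)
    dists = np.asarray(dists, float)
    x0, d0 = anchors[0], dists[0]
    A = 2.0 * (anchors[1:] - x0)
    b = (d0 ** 2 - dists[1:] ** 2
         + np.sum(anchors[1:] ** 2, axis=1) - np.sum(x0 ** 2))
    return np.linalg.lstsq(A, b, rcond=None)[0]
```

With noisy real-world ranges, the least-squares solution simply returns the best-fitting position rather than an exact intersection.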
Geographic location data commonly comprises latitude and longitude data, but may alternatively comprise more or different data. For example, it may include orientation information, such as a compass direction provided by a magnetometer, or inclination information provided by gyroscopic or other sensors. It may also include elevation information, such as provided by a digital altimeter system.
Reference was made to Apple's Bonjour software. Bonjour is Apple's implementation of Zeroconf, a service discovery protocol. Bonjour locates devices on a local network and identifies the services each offers, using multicast Domain Name System service records. The software is built into the Apple Mac OS X operating system, and is also part of Apple's "Remote" application for the iPhone, where it is used to establish connections to iTunes libraries over WiFi. Bonjour services are implemented largely at the application level, using standard TCP/IP calls, rather than in the operating system. Apple has made the source code of the Bonjour multicast DNS responder, a core component of the service discovery, available as a Darwin open source project. The project provides source code to build the responder daemon for a wide range of platforms, including Mac OS X, Linux, *BSD, Solaris, and Windows. Apple also offers a user-installable set of services called Bonjour for Windows, as well as Java libraries. Bonjour can be employed in various embodiments of the present technology to facilitate communications between devices and systems.
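As an illustration of the multicast DNS service records involved, a Bonjour advertisement for an iTunes library might comprise records of the following form (the service instance and host names here are hypothetical; `_daap._tcp` and port 3689 are the service type and customary port used by iTunes sharing):

```
_daap._tcp.local.                       PTR  Living Room Library._daap._tcp.local.
Living Room Library._daap._tcp.local.   SRV  0 0 3689 mediahost.local.
Living Room Library._daap._tcp.local.   TXT  "txtvers=1" "Machine Name=Living Room"
mediahost.local.                        A    192.168.1.20
```

The PTR record enumerates instances of the service type, the SRV record names the host and port, the TXT record carries service attributes, and the A record resolves the host to an address.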
(Other software may alternatively, or additionally, be used to exchange data between devices. Examples include Universal Plug and Play (UPnP) and its successor, Devices Profile for Web Services (DPWS). These are other protocols implementing zero-configuration networking services, through which devices can connect, identify themselves, advertise their available capabilities to other devices, share content, etc.)
As noted at the outset, artificial intelligence techniques can play an important role in embodiments of the present technology. A recent entrant into this field is the Alpha product by Wolfram Research. Alpha computes answers and visualizations in response to structured input, by reference to a knowledge base of curated data. Information gleaned from metadata analysis or from semantic search engines, as detailed herein, can be presented to the Wolfram Alpha product to provide responsive information back to the user. In some embodiments, the user is involved in this submission of information, such as by structuring a query from terms and other primitives gleaned by the system, or by selecting from among a menu of different queries composed by the system. Additionally or alternatively, responsive information from the Alpha system can be provided as input to other systems, such as Google, to identify further responsive information. Wolfram's patent publications 20080066052 and 20080250347 further detail aspects of the Alpha technology.
Another recent technology introduction is Google Voice (which builds on the earlier GrandCentral product), offering a number of improvements to traditional telephone systems. Such features can be used in conjunction with aspects of the present technology.
For example, the voice-to-text transcription services offered by Google Voice can be employed to capture ambient audio from a speaker's environment using the microphone in the user's cell phone, and to generate corresponding digital data (e.g., ASCII information). The system can submit such data to services such as Google or Wolfram Alpha to obtain related information, which the system can then provide back to the user, by screen display or by voice. Similarly, the speech recognition afforded by Google Voice can provide a conversational user interface to cell phone devices, by which features of the technology detailed herein can be selectively invoked and controlled by spoken words.
In another aspect, when a user captures content (audio or visual) with a cell phone device and a system employing the presently disclosed technology returns a response, the response information can be converted from text to speech and delivered, e.g., to the user's voicemail account. The user can access this data store from any phone or from any computer. The stored voicemail can be reviewed in audible form, or the user can instead elect to review a textual counterpart presented, for example, on a cell phone or computer screen.
(Aspects of Google Voice technology are detailed in patent application 20080259918.)
More than a century of use has accustomed users to thinking of phones as communication devices - audio in at point A, audio out at point B. However, aspects of the present technology can be employed to very different effect; audio-in, audio-out is becoming a paradigm of a former age. In accordance with certain aspects of the present technology, a phone can capture stimuli at one location and cause responsive communications (e.g., audio, data, images, or video) to be delivered to other devices and other locations.
Rather than using the technology described above purely as a query device - with a single phone serving as both input and output - a user can cause responsive content to be delivered to one or more destination systems, which may or may not include the originating phone. (The recipient(s) can be selected by known UI techniques, including keypad input, scrolling through a menu of recipients, speech recognition, etc.)
A simple example of this usage model is a person using a cell phone to capture an image of a rose bush. Per the user's instruction, the image is delivered to the user's girlfriend, augmented by the synthesized scent of that particular variety of rose. (Arrangements for equipping computer devices to disperse programmable scents are known, e.g., the iSmell offering by DigiScents, and the technologies detailed in patent documents 20080147515, 20080049960, 20060067859, WO00/15268 and WO00/15269.) Stimuli captured by one user at one location can thus cause related experiential stimuli to be delivered to a different user at a different location.
As noted, the response to visual stimuli can include one or more graphical overlays ("baubles") presented on the cell phone screen - atop image data from the cell phone camera. The overlay can be geometrically registered with features in the image data, and affine-warped in correspondence with the apparent distortion of the object depicted in the imagery. Graphical features can be used to draw attention to a bauble, such as sparkles emitted in its region, or blinking/moving visual effects. Such techniques are further described, e.g., in Digimarc's patent publication 20080300011.
Such a graphical overlay can include menu features with which a user can interact to perform desired functions. Additionally or alternatively, the overlay can comprise one or more graphical user controls. For example, several different objects may be recognized within the camera's field of view. The overlay associated with each can be a graphic that the user can touch to obtain information, or to trigger a function, associated with the respective object. Baubles can be regarded as visual flags signaling the availability of information that can be accessed through user interaction with such graphical features - e.g., by tapping that position on the screen, or by circling the area with a finger. As the user changes the camera's view, different baubles may appear - tracking the movement of different objects in the underlying real-world imagery and inviting the user to explore associated auxiliary information. Again, it is desirable that the overlays be ortho-rectified, i.e., projected in affine-corrected registration onto the associated real-world features. (Pose estimation of objects imaged in the real world - determining the appropriate spatial registration of overlays - is desirably performed locally, but can be referred to the cloud depending on the application.)
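To make the registration step concrete, the following is a minimal sketch - not the patented implementation - of how a bauble's anchor point might be re-registered between frames: an affine transform is estimated from three or more tracked feature points, then applied to the overlay's coordinates. The specific feature coordinates and the use of `numpy` least squares here are illustrative assumptions.

```python
import numpy as np

def estimate_affine(src_pts, dst_pts):
    """Estimate a 2x3 affine transform mapping src_pts -> dst_pts.

    src_pts, dst_pts: sequences of (x, y) pairs, at least 3 tracked
    features. Solves the least-squares system [x y 1] @ P = [x' y'].
    """
    src = np.asarray(src_pts, dtype=float)
    dst = np.asarray(dst_pts, dtype=float)
    ones = np.ones((src.shape[0], 1))
    A_h = np.hstack([src, ones])               # (N, 3) homogeneous coords
    params, *_ = np.linalg.lstsq(A_h, dst, rcond=None)   # (3, 2)
    return params.T                            # (2, 3) affine matrix

def warp_points(affine, pts):
    """Apply a 2x3 affine matrix to a sequence of (x, y) points."""
    pts = np.asarray(pts, dtype=float)
    ones = np.ones((pts.shape[0], 1))
    return np.hstack([pts, ones]) @ affine.T

# Feature points on the object in the previous and current frames
# (hypothetical tracked features):
prev = [(100, 100), (200, 100), (100, 200)]
curr = [(110, 105), (210, 110), (105, 205)]    # object moved slightly

A = estimate_affine(prev, curr)
# Bauble anchored near the object's centroid in the previous frame:
bauble_curr = warp_points(A, [(150, 130)])
```

In a real system the tracked points would come from the feature-tracking stage, and the same transform would warp the entire overlay graphic, not just its anchor.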
Object recognition, tracking, and feedback can proceed by the operations described earlier. For example, the local processor can perform preliminary object analysis and initial object recognition (e.g., identifying proto-objects). Cloud processes can complete the recognition operations and produce appropriate interaction portals that are geometrically registered on the displayed scene (registration can be performed by the local processor or by the cloud).
It will be appreciated that, in some respects, the technology can be regarded as providing - on a cell phone - a graphical user interface to the real world.
In early implementations, general-purpose visual query systems of the kind described will be relatively crude and clumsy. However, by feeding a trickle (or torrent) of keyvector data back to the cloud for archival and analysis (together with information about user actions based on such data), a data foundation can be established from which templates and other training models can be built - allowing subsequent generations of such systems to be highly intuitive and responsive when presented with visual stimuli. (Such a trickle can also capture information about how users interact with their devices - what they do, what they decline to do, which choices they make in response to which stimuli, and so on - as may be gathered by software on the local device.)
Reference has been made to touchscreen interfaces - a form of gesture interface. Another type of gesture interface that can be used in certain embodiments operates by sensing movement of the cell phone, by tracking the movement of features within captured imagery. Further information on such gesture interfaces is detailed in Digimarc's patent 6,947,571.
Watermark decoding can be used in certain embodiments. Technology for encoding/decoding watermarks is detailed, e.g., in Digimarc's patents 6,614,914 and 6,122,403; in Nielsen's patents 6,968,564 and 7,006,555; and in Arbitron's patents 5,450,490, 5,764,763, 6,862,355, and 6,845,360.
Digimarc has various other patent filings relevant to the present subject matter, including patent publications 20070156726, 20080049971, and 20070266252, and Sharma et al.'s application 12/125,840, filed May 22, 2008.
Certain principles useful in this context are detailed in Google's book-scanning patent 7,508,978. For example, the '978 patent teaches that surface topology can be discerned by projecting a reference pattern onto a non-planar surface. Imagery captured from such a surface can then be processed to re-map it so that it appears to originate from a flat page. Such re-mapping can likewise be used in conjunction with the object recognition arrangements described herein. Similarly, Google's patent application 20080271080, which details visions for interacting with next-generation television, also details principles that can be employed with the presently described technology.
Examples of audio fingerprinting are described in patent publications 20070250716, 20070174059 and 20080300011 (Digimarc), 20080276265, 20070274537 and 20050232411 (Nielsen), 20070124756 (Google), 7,516,074 (Auditude), and 6,990,453 and 7,359,889 (both Shazam). Examples of image / video fingerprinting are described in patent publications 7,020,304 (Digimarc), 7,486,827 (Seiko-Epson), 20070253594 (Vobile), 20080317278 (Thomson), and 20020044659 (NEC).
While certain aspects of the detailed technology involve processing multiple images to glean information, it will be appreciated that related results can be obtained by having a large number of people (and/or automated processes) consider a single image (e.g., crowd-sourcing). Still greater information and utility can be achieved by combining these two approaches.
It is to be understood that the illustrations are intended to be illustrative, not limiting. For example, they sometimes show multiple databases where a single database could be used (and vice versa). Likewise, some links between the depicted blocks are omitted for clarity.
Context data can be used in many of the foregoing embodiments to further improve operation. For example, processing can be varied based on whether the originating device is a cell phone or a desktop computer; whether the ambient temperature is 30 degrees or 80 degrees; the location of the user, and other information characterizing the user; and so on.
While the detailed embodiments often present candidate results/actions as a series of cached screens on the cell phone that the user can quickly flip through, in other embodiments this need not be the case. A more conventional single-screen presentation offering a menu of results can be used, with the user selecting by keying a keypad digit or highlighting the desired option. Or bandwidth may increase to the point that the same user experience can be delivered to the cell phone on demand - without caching or buffering data locally.
Geographic-based database methods are detailed, e.g., in Digimarc's patent publication 20030110185. Other arrangements for navigating and retrieving from image collections are disclosed in patent publications 20080010276 (Executive Development Corp.) and 20060195475, 20070110338, 20080027985, and 20080028341 (Microsoft's Photosynth work).
It is impossible to expressly catalog the myriad variations and combinations of the technologies described herein. Applicants recognize and intend that the concepts of this specification can be combined, substituted, and interchanged - both among themselves, and with concepts known from the cited prior art. Moreover, it will be recognized that the detailed technology can be employed with other technologies - current and upcoming - to advantage.
The reader is presumed to be familiar with the documents (including patent documents) referenced herein. To provide a comprehensive disclosure without unduly lengthening this specification, applicants incorporate these referenced documents by reference. (Such documents are incorporated in their entireties, even if cited above in connection with specific teachings.) These references disclose technologies and teachings that can be incorporated into the arrangements detailed herein, and into which the technologies and teachings detailed herein can be incorporated.
As will be appreciated, this specification has detailed a myriad of novel arrangements. (Due to practical constraints, many such arrangements are not claimed in the initial filing of this application, but applicants intend to claim such other subject matter in subsequent applications claiming priority.) An incomplete sampling of some of the inventive arrangements is reviewed in the following paragraphs:
A work arrangement in which stimulus captured by a sensor of a user's mobile device is processed, wherein some processing tasks may be performed on processing hardware of the device, and other processing tasks may be referred to one or more remote processors. In this arrangement, a determination as to whether a first task should be performed on the device hardware or on a remote processor is made in an automated fashion, based on consideration of two or more different factors drawn from a set including: mobile device power considerations; required response time; routing constraints; the status of hardware resources within the mobile device; connection state; geographical considerations; risk of pipeline stall; information about the remote processor(s), including readiness, throughput, cost, and other attributes not of concern to the user of the mobile device; and the relation of the task to other processing tasks. In some circumstances the first task is performed on the device hardware, and in other circumstances the first task is performed on the remote processor. Further, in such arrangement, the determination is based on a score dependent on a combination of parameters associated with at least some of the recited considerations.
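One way such a score could combine the recited factors is a weighted sum compared against a threshold. The following Python sketch is purely illustrative; the factor names, weights, and threshold are invented for the example and are not taken from the specification.

```python
def route_task(factors, weights, threshold=0.0):
    """Decide whether to run a task locally or remotely.

    factors: dict of factor name -> normalized value in [0, 1], where
             higher values favor remote execution (e.g. low battery -> 1).
    weights: dict of factor name -> relative importance (sign encodes
             whether the factor pushes toward remote or local).
    Returns "remote" if the weighted score exceeds threshold, else "local".
    """
    score = sum(weights.get(name, 0.0) * value
                for name, value in factors.items())
    return "remote" if score > threshold else "local"

# Hypothetical factor values for one task:
factors = {
    "battery_low": 0.9,          # little power left: favors offloading
    "response_urgency": 0.2,     # low urgency tolerates network latency
    "link_quality": 0.8,         # good connection: offloading is cheap
    "pipeline_stall_risk": 0.1,
}
weights = {
    "battery_low": 2.0,
    "response_urgency": -3.0,    # urgent tasks favor local execution
    "link_quality": 1.0,
    "pipeline_stall_risk": -1.0,
}
decision = route_task(factors, weights, threshold=1.0)
```

Here the score is 2.0*0.9 - 3.0*0.2 + 1.0*0.8 - 1.0*0.1 = 1.9, exceeding the threshold, so the task would be referred to a remote processor; an urgent task with the same other factors would stay local.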
A work arrangement in which stimulus captured by a sensor of a user's mobile device is processed, wherein some processing tasks may be performed on processing hardware of the device, and other processing tasks may be referred to one or more remote processors. In this arrangement, a determination as to the order in which a set of tasks should be performed is made based on consideration of two or more different factors drawn from a set including: mobile device power considerations; required response time; routing constraints; the status of hardware resources within the mobile device; connection state; geographical considerations; risk of pipeline stall; information about the remote processor(s), including readiness, throughput, cost, and other attributes not of concern to the user of the mobile device; and the relation of the tasks to other processing tasks. In some circumstances the set of tasks is performed in a first order, and in other circumstances the set of tasks is performed in a second, different order. Further, in such arrangement, the determination is based on a score dependent on a combination of parameters associated with at least some of the recited considerations.
A work arrangement in which stimulus captured by a sensor of a user's mobile device is processed, wherein some processing tasks may be performed on processing hardware of the device, and other processing tasks may be referred to one or more remote processors, and wherein packets are employed to convey data between processing tasks. The contents of the packets depend on consideration of two or more different factors drawn from a set including: mobile device power considerations; required response time; routing constraints; the status of hardware resources within the mobile device; connection state; geographical considerations; risk of pipeline stall; information about the remote processor(s), including readiness, throughput, cost, and other attributes not of concern to the user of the mobile device; and the relation of the task to other processing tasks. In some circumstances the packets include data in a first form, and in other circumstances the packets include data in a second form. Further, in such arrangement, the determination is based on a score dependent on a combination of parameters associated with at least some of the recited considerations.
A work arrangement in which a venue provides data services to users over a network, and the network is configured to discourage use of electronic imagery captured by users while at the venue. Further, in such arrangement, the discouragement is effected by restricting transmission of data from user devices to certain data processing providers outside the network.
A mobile communication device with image capture capability, comprising a pipelined processing chain for performing a first operation, and a control system having a mode in which it tests image data by performing a second operation, wherein the second operation is computationally simpler than the first operation, and the control system applies the image data to the pipelined processing chain only if the second operation produces an output of a first type.
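The gating idea above can be sketched in a few lines: a cheap screening test (here a pixel-variance check, chosen purely for illustration) decides whether a frame is worth feeding to the expensive pipeline. The threshold and the stand-in pipeline are assumptions, not the patented operations.

```python
def cheap_screen(pixels, threshold=100.0):
    """Second (simple) operation: a variance test as a rough proxy for
    whether the frame contains enough detail to merit full processing."""
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n
    return variance >= threshold

def full_pipeline(pixels):
    """Stand-in for the computationally expensive first operation."""
    return {"edges": sorted(pixels)[-3:]}    # placeholder result

def process_frame(pixels):
    # Apply the frame to the pipelined chain only if the cheap test passes.
    if cheap_screen(pixels):
        return full_pipeline(pixels)
    return None    # e.g. lens-cap-on or featureless frame: skip

flat = [10] * 16         # featureless frame: screened out
busy = [0, 255] * 8      # high-contrast frame: passes the screen
```

The payoff is that the expensive chain never runs on frames the cheap test rejects, saving power and pipeline bandwidth.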
A work arrangement in which a mobile phone is equipped with a GPU to facilitate rendering of graphics (e.g., for games) for presentation on the mobile phone screen, and the GPU is also employed for machine vision operations. Further, in such arrangement, the machine vision operations include face detection.
A work arrangement in which plural socially-linked mobile devices, owned by different individuals, cooperate in performing a machine vision operation. Further, in such arrangement, a first of the devices performs an operation of extracting facial features from an image, and a second of the devices performs template matching on the extracted facial features produced by the first device.
A work arrangement in which a voice recognition operation is performed on audio from an incoming telephone call, to identify the caller. Further, in this arrangement, the voice recognition operation is performed only if the incoming call is not identified by CallerID data. Also, in this arrangement, the voice recognition operation includes reference to data corresponding to one or more previously-stored voice messages.
A work arrangement in which speech from an incoming telephone call is recognized, and corresponding text data is produced, as the call is being handled. Further, in such arrangement, the incoming call is associated with a particular geography, and such geography is taken into account in recognizing the speech. Also, in this arrangement, the text data is used to query a data structure for auxiliary information.
A work arrangement in which overlay baubles presented on a mobile device screen result from both local and cloud processing. Further, in this arrangement, the overlay baubles are tailored in accordance with user preference information.
A work arrangement in which visual query data is processed in a distributed fashion between a user's mobile device and cloud resources to produce a response, and related information is stored in the cloud so that later visual queries can yield more intuitive responses.
A work arrangement in which a user either (1) is charged by a vendor for a data processing service, or alternatively (2) receives the service from the vendor free of charge, and receives a credit, if the user takes a particular action in connection therewith.
A work arrangement in which a user is granted a commercial advantage in exchange for receiving promotional content, as sensed by the user's mobile device.
A work arrangement in which a first user allows a second party to consume credits of the first user, or to incur charges payable by the first user, by reason of a social networking connection between the first user and the second party. Further, in this arrangement, a social networking web page is configured so that the second party can interact with it to consume such credits or incur such charges.
An arrangement for charitable fundraising, in which a user interacts with a physical object associated with a charitable organization, thereby triggering computer-related processing that facilitates a user donation to the charity.
A portable device that receives input from one or more physical sensors, utilizes processing by one or more local services, and further utilizes processing by one or more remote services, wherein the software of the device includes one or more abstraction layers through which the sensors, local services, and remote services interface to the device architecture to facilitate operation.
A portable device that receives input from one or more physical sensors, processes the input, packages results in keyvector form, and sends the keyvector form from the device. Further, in such arrangement, the device receives back, from a remote resource to which the keyvector was sent, a further-processed counterpart of the keyvector. Also, in such arrangement, the keyvector form is processed in accordance with one or more instructions dependent on context - on the portable device or on a remote device.
A distributed processing architecture for responding to physical stimulus sensed by a mobile phone, the architecture utilizing local processing on the mobile phone and remote processing on a remote computer, the two linked by a packet network, the architecture further including a protocol by which the different processes communicate, comprising a message-passing paradigm with message queues and contention-handling arrangements. Further, in such arrangement, driver software for one or more physical sensor components provides sensor data in packet form, and places the packets on an output queue uniquely associated with the sensor, or associated in common with plural components; if a packet is not to be processed remotely, local processes operate on it and place resulting packets back on the queue; if a packet is to be processed remotely, it is directed by a router arrangement towards remote processing.
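The queue-and-router flow just described can be sketched in miniature. Everything here - the packet fields, the single shared queue, and the "keyvector" handler - is a hypothetical simplification chosen to show the routing logic, not the detailed protocol.

```python
from collections import deque

class Router:
    """Minimal sketch of the queue-based, message-passing architecture:
    sensor drivers emit packets onto a shared queue; local handlers
    process what they can; everything else is forwarded to the cloud."""

    def __init__(self):
        self.queue = deque()
        self.local_handlers = {}    # packet kind -> handler function
        self.cloud_outbox = []      # packets directed to remote processing

    def register_local(self, kind, handler):
        self.local_handlers[kind] = handler

    def emit(self, packet):
        """Called by sensor driver software: place a packet on the queue."""
        self.queue.append(packet)

    def step(self):
        """Process one packet from the queue."""
        packet = self.queue.popleft()
        handler = self.local_handlers.get(packet["kind"])
        if handler is None or packet.get("remote"):
            self.cloud_outbox.append(packet)   # route toward the cloud
        else:
            result = handler(packet)
            if result is not None:
                self.queue.append(result)      # back onto the queue

router = Router()
# Local handler: turn raw camera packets into "keyvector" packets.
router.register_local(
    "camera_raw",
    lambda p: {"kind": "keyvector", "data": len(p["data"])})
router.emit({"kind": "camera_raw", "data": [1, 2, 3]})
router.step()   # local processing yields a keyvector packet on the queue
router.step()   # no local handler for "keyvector": forwarded to the cloud
```

A production version would have per-sensor queues, contention handling, and network transport in place of the `cloud_outbox` list, but the decision point - local handler versus router-to-remote - is the same.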
A work arrangement in which a network associated with a particular physical site refers to traffic on the network and is adapted to automatically discern whether a group of visitors to the site has a social connection. Such arrangement further includes discerning a demographic characteristic of the group. Also, in such arrangement, the network facilitates ad hoc networking among visitors identified as having a social connection.
A work arrangement in which a network comprising computer resources at a public site is dynamically reconfigured in accordance with a predictive model of the behavior of users visiting the site. Further, in such arrangement, the network reconfiguration is based in part on context. Also, in such arrangement, the network reconfiguration involves caching certain content. Also, in such arrangement, the reconfiguration includes rendering synthesized content and storing it in one or more computer resources so that it is quickly available. Such arrangement further includes rescheduling time-insensitive network traffic in anticipation of a temporary increase in traffic from users.
A work arrangement in which an advertisement is associated with real-world content, and billing therefor is assessed based on surveys of exposure to the content, as indicated by sensors in users' mobile phones. Further, in this arrangement, billing is set through use of an automated auction arrangement.
An arrangement involving two objects in a public place, in which illumination of the objects is billed differently - based on attributes of persons in proximity to the objects.
A work arrangement in which content is presented to persons in a public place, a link exists between the presented content and auxiliary content, and the linked auxiliary content is billed in accordance with demographic attributes of the persons to whom the content is presented.
A work arrangement in which a provisional electronic license for a particular item of content is granted to a person in association with the person's visit to a public place.
A work arrangement in which a mobile phone includes an image sensor coupled to both a human vision system processing section and a machine vision processing section, the image sensor being coupled to the machine vision processing section without intervention of the human vision system processing section. Further, in such arrangement, the human vision system processing section includes a white balance correction module, a gamma correction module, an edge enhancement module, and/or a JPEG compression module. Also, in such arrangement, the machine vision processing section includes an FFT module, an edge detection module, a pattern extraction module, a Fourier-Mellin processing module, a texture classifier module, a color histogram module, a motion detection module, and/or a feature recognition module.
A work arrangement in which a mobile phone includes an image sensor and plural stages for processing image-related data, and a data-driven packet architecture is employed. Further, in this arrangement, information in a packet header determines parameters applied by the image sensor in initially capturing image data. Also, in such arrangement, information in a packet header determines the processing to be applied by the plural stages to image-related data conveyed in the packet body.
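The header-driven dispatch of such a data-driven packet architecture can be illustrated as follows. The stage names and pixel operations are invented stand-ins; the point is that the packet header, not fixed wiring, selects which stages process the body.

```python
# Each stage is a function over the packet body (hypothetical examples).
STAGES = {
    "grayscale": lambda body: [sum(px) // 3 for px in body],
    "threshold": lambda body: [1 if v > 127 else 0 for v in body],
}

def run_packet(packet):
    """Apply, in order, the processing stages named in the packet header
    to the image-related data carried in the packet body."""
    body = packet["body"]
    for stage_name in packet["header"]["stages"]:
        body = STAGES[stage_name](body)
    return body

packet = {
    "header": {"stages": ["grayscale", "threshold"]},
    "body": [(200, 200, 200), (10, 10, 10)],   # two RGB pixels
}
out = run_packet(packet)
```

A different header (say, `["grayscale"]` alone) would route the same body through a different processing sequence, with no change to the stages themselves.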
A work arrangement in which a mobile phone cooperates with one or more remote processors in performing image-related processing. Further, in this arrangement, the mobile phone packages image-related data into packets, at least some of which contain less than a single frame of image data. Also, in this arrangement, the mobile phone routes certain image-related data for processing by a processor within the mobile phone, and routes other image-related data for processing by a remote processor.
A work arrangement in which a mobile phone cooperates with a remote routing system, the remote routing system serving to distribute image-related data from the mobile phone to different remote processors for processing, and to collect processed image-related data from those processors for return to the mobile phone. Also, a work arrangement in which a mobile phone includes an internal routing system serving to distribute image-related data either to one or more processors within the mobile phone for processing, or to a remote routing system for processing by remote processors.
A work arrangement in which image-related data from a mobile phone is referred to a remote processor for processing, the remote processor being selected by an automated assessment involving plural candidate remote processors. Further, in this arrangement, the assessment includes a reverse auction. Also, in this arrangement, output data from the selected remote processor is returned to the mobile phone. Also, in such arrangement, the image-related data is processed by a processing module within the mobile phone before being sent to the selected processor. Further, in such arrangement, other image-related data from the mobile phone is referred to a remote processor other than the selected processor.
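A reverse auction of the kind recited can be sketched as: among providers whose bids satisfy a service constraint, select the cheapest. The bid fields, provider names, and latency constraint below are hypothetical.

```python
def reverse_auction(task, bids, max_latency_ms):
    """Select a remote processor by reverse auction: among processors
    whose bid meets the latency constraint, choose the lowest price.

    bids: list of dicts with 'provider', 'price', 'latency_ms' keys.
    Returns the winning provider name, or None if no bid qualifies.
    """
    qualifying = [b for b in bids if b["latency_ms"] <= max_latency_ms]
    if not qualifying:
        return None
    winner = min(qualifying, key=lambda b: b["price"])
    return winner["provider"]

bids = [
    {"provider": "cloudA", "price": 0.05, "latency_ms": 120},
    {"provider": "cloudB", "price": 0.02, "latency_ms": 400},  # too slow
    {"provider": "cloudC", "price": 0.03, "latency_ms": 150},
]
winner = reverse_auction("face-recognition", bids, max_latency_ms=200)
```

Here cloudB is cheapest but fails the latency constraint, so cloudC wins; a different task with a looser constraint could be routed to a different processor, as the arrangement contemplates.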
A work arrangement in which image data is stored in at least one plane of a multi-plane data structure, and a graphical representation of metadata associated with the image data is stored in another plane of the data structure. Further, in this arrangement, the metadata includes edge map data derived from the image data. Also, in this arrangement, the metadata includes information about faces recognized in the image data.
A work arrangement in which a camera-equipped mobile phone senses at least one of (1) rotation from horizontal, (2) rotation since an initial time, and (3) scale change since an initial time, by reference to information from the camera, and data to be displayed is determined accordingly.
A work arrangement in which a camera-equipped mobile phone includes first and second parallel processing sections, the first processing section processing image data to be rendered in perceptual form for use by human viewers, and including a demosaic processor, a white balance correction module, a gamma correction module, an edge enhancement module, and a JPEG compression module; and the second processing section analyzing the image data to derive semantic information therefrom.
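The two parallel branches can be caricatured on a single row of raw sensor values: one branch prepares pixels for human viewing, the other derives analysis features. The specific operations below (a gamma curve and a one-dimensional gradient as a stand-in for edge detection) are illustrative choices, not the modules enumerated above.

```python
def human_vision_path(raw):
    """Render-oriented branch: gamma correction for display (sketch)."""
    return [int(255 * (v / 255) ** (1 / 2.2)) for v in raw]

def machine_vision_path(raw):
    """Analysis-oriented branch: a simple horizontal gradient as a
    stand-in for edge detection / semantic feature derivation."""
    return [abs(b - a) for a, b in zip(raw, raw[1:])]

raw_row = [0, 0, 128, 255, 255]          # one row of raw sensor values
display = human_vision_path(raw_row)     # for the human viewer
edges = machine_vision_path(raw_row)     # for semantic processing
```

Note that the machine branch consumes the raw values directly, mirroring the arrangement in which the image sensor feeds the machine vision section without passing through the human vision processing chain.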
A work arrangement in which information of two or more dimensions related to an object is presented on the screen of a mobile phone, wherein operation of a first user interface control presents a sequence of screens giving information related to the object in the first dimension, and operation of a second user interface control presents a sequence of screens giving information related to the object in the second dimension. Also, in such arrangement, the object can be changed by operating a user interface control while a screen presenting the object is displayed. Also, in this arrangement, the object is an image, and the first dimension is one of (1) geographic location, (2) similarity of appearance, or (3) content-descriptive metadata, and the second dimension is a different one of (1), (2), or (3).
An arrangement for composing a text message on a camera-equipped handheld device, in which the device is tilted in a first direction to scroll through a sequence of displayed icons, each representing plural letters of the alphabet, and tilted in a second direction to select from among the plural letters. Also, in this arrangement, the tilting is sensed by reference to image data captured by the camera. Also, in this arrangement, tilts of different degrees have different meanings.
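The two-axis selection scheme can be modeled as a small state machine. The letter groupings and the mapping from tilt to steps are invented for the example; in the described arrangement the tilt magnitudes themselves would be derived from camera-tracked motion.

```python
class TiltKeyboard:
    """Sketch of tilt-driven text entry: tilting along one axis scrolls
    among icons (letter groups); tilting along the other axis selects a
    letter within the current group."""

    GROUPS = ["abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yz"]

    def __init__(self):
        self.group = 0          # index of currently highlighted icon
        self.text = []          # letters selected so far

    def tilt_horizontal(self, steps):
        """First-direction tilt: scroll through the icon sequence.
        Larger tilts would map to larger step counts."""
        self.group = (self.group + steps) % len(self.GROUPS)

    def tilt_vertical(self, index):
        """Second-direction tilt: select a letter from the current group."""
        letters = self.GROUPS[self.group]
        self.text.append(letters[index % len(letters)])

kb = TiltKeyboard()
kb.tilt_horizontal(2)     # scroll to the "ghi" icon
kb.tilt_vertical(1)       # select "h"
kb.tilt_horizontal(-2)    # scroll back to "abc"
kb.tilt_vertical(0)       # select "a"
message = "".join(kb.text)
```

The degree of tilt doing double duty (small tilt scrolls one icon, larger tilt scrolls several, or picks a different letter) corresponds to the "tilts of different degrees have different meanings" feature.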
A work arrangement in which a camera-equipped mobile phone functions as a state machine, altering aspects of its functioning based on image-related information obtained earlier.
A work arrangement in which identifying information corresponding to a processor-equipped device is sensed and used to identify corresponding application software, which is then used to program operation of a mobile phone device so that it can serve as a user interface to the processor-equipped device. Further, in this arrangement, the device is a thermostat, a parking meter, an alarm clock, or a vehicle. Also, in such arrangement, the mobile phone device captures an image of the device, and the software causes a user interface for the device to be presented as a graphic overlay on the captured image (optionally in a position or pose corresponding to the device as depicted).
A work arrangement in which a screen of a mobile phone presents one or more user interface controls for a separate device, the on-screen user interface controls being presented in conjunction with a phone-captured image of the separate device. Also, in this arrangement, the user interface controls are used to issue instructions relating to control of the separate device, and the screen signals information corresponding to the instructions in a first manner while the instructions are pending, and in a different manner once the instructions have been executed successfully.
A work arrangement in which a user interface presented on the screen of a mobile phone is used to initiate a transaction with a separate device while the phone is physically proximate to the device; the mobile phone is thereafter used for purposes unrelated to the separate device; and still later, the user interface is recalled to the screen of the mobile phone in connection with a further transaction with the device. Also, in such arrangement, the user interface is recalled for the further transaction while the mobile phone is remote from the device. Also, in such arrangement, the device comprises a parking meter, vehicle, or thermostat.
A work arrangement in which a mobile phone presents a user interface allowing selection among different user interfaces corresponding to different devices, so that the phone can be used in interacting with plural separate devices.
A work arrangement involving using a mobile phone to sense information from the housing of a network-connected device, and using the sensed information to encrypt a communication with a key corresponding to the device.
A work arrangement in which a mobile phone is used to sense information from a wirelessly-equipped device and to transmit related data from the mobile phone, the transmitted data serving to verify the user's proximity to the device. Further, in such arrangement, such proximity is required before the user is permitted to interact with the device using the mobile phone. Also, in this arrangement, the sensed information is analog information.
An arrangement employing a portable electronic device with reconfigurable hardware, in which, when the device is initialized for use, updated configuration instructions for the hardware are downloaded wirelessly from a remote source and used to configure the reconfigurable hardware.
A work arrangement in which a hardware processing component of a wireless system base station is employed to process data relating to wireless signals exchanged between the base station and plural associated remote wireless devices, and is also employed to process image-related data offloaded to the base station for processing. Further, in such arrangement, the hardware processing component includes one or more field programmable object arrays, and the remote wireless devices include mobile phones.
A work arrangement in which an optical distortion function is characterized and used to define the geometry of a corresponding virtual correction surface onto which an optically distorted image is projected, the geometry serving to neutralize the distortion of the projected image. Also, in one arrangement, an image is projected onto a virtual surface whose topology is configured to neutralize distortion present in the image. Also, in such arrangements, the distortion includes distortion introduced by a lens.
In one arrangement, a wireless station receives a service reservation message from a first mobile device, the message comprising one or more parameters of a future service that the first mobile device does not require immediately but will require at a future time. The wireless station then determines, based at least in part on the service reservation message received from the first mobile device, a resource allocation for a service provided to a second mobile device, so that the allocation reflects information about services expected to be provided to the first mobile device.
In one arrangement, a thermoelectric cooling device is coupled to the image sensor of a mobile phone and is selectively activated to reduce noise in captured image data.
In one arrangement, a mobile phone includes first and second wirelessly linked portions; the first portion includes an optical sensor and lens assembly and is adapted to be carried at a first position relative to the user's body, while the second portion includes a display and a user interface and is adapted to be carried at a second, different location. Also in such an arrangement, the second portion is adapted to be releasably received in the first portion.
In one arrangement, a mobile phone includes first and second wirelessly linked portions; the first wirelessly linked portion includes an LED light assembly detachably coupled to the second wirelessly linked portion, and the second wirelessly linked portion includes a display, a user interface, an optical sensor, and a lens. The first wirelessly linked portion can be detached from the second wirelessly linked portion and used to illuminate an object to be imaged by the photosensor of the second wirelessly linked portion.
In one arrangement, a camera-equipped mobile phone processes image data through a selection of plural processing stages, the selection of one processing stage depending on properties of the processed image data output from a previous processing stage.
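The data-dependent stage selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation; the stage names and the noise heuristic are invented for the example.

```python
# Sketch: the next processing stage is chosen based on properties of the
# data output by the previous stage (here, a residual-noise estimate).

def denoise(frame):
    # Stand-in denoiser: clamp values, report how much was changed.
    cleaned = [max(0, min(255, v)) for v in frame]
    noise = sum(abs(a - b) for a, b in zip(frame, cleaned)) / len(frame)
    return cleaned, {"noise": noise}

def sharpen(frame):
    return frame, {"sharpened": True}

def compress(frame):
    return frame, {"compressed": True}

def run_pipeline(frame):
    stages_run = []
    frame, props = denoise(frame)
    stages_run.append("denoise")
    # Conditional branch: sharpen only if the denoiser saw little noise;
    # otherwise go straight to compression.
    if props["noise"] < 1.0:
        frame, _ = sharpen(frame)
        stages_run.append("sharpen")
    frame, _ = compress(frame)
    stages_run.append("compress")
    return frame, stages_run

if __name__ == "__main__":
    print(run_pipeline([10, 20, 30, 40])[1])    # clean frame: all stages
    print(run_pipeline([-50, 300, 30, 40])[1])  # noisy frame: sharpen skipped
```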
In one arrangement, conditional branching among different image processing stages is utilized in a camera-equipped mobile phone. Also in this arrangement, the stages respond to packet data, and conditional branching instructions are carried in the packet data.
In one arrangement, a GPU of a camera-equipped mobile phone is utilized to process image data captured by the camera.
In one arrangement, a camera-equipped mobile device senses temporal variations in illumination and takes such variations into account in its operation. Also in this arrangement, the camera predicts states of the temporally varying illumination and captures an image when the illumination is expected to have a desired state.
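A capture scheduler of the kind just described might estimate the flicker period from brightness samples and time the next exposure for a predicted bright state. The sketch below is an assumption-laden illustration (synthetic sinusoidal samples stand in for sensor readings; a real device would sample its own sensor):

```python
# Sketch: estimate the period of temporally varying illumination from
# brightness samples, then predict the time of the next bright state.
import math

def predict_next_peak(samples, dt):
    # Crude period estimate: average spacing between local maxima.
    peaks = [i for i in range(1, len(samples) - 1)
             if samples[i] > samples[i - 1] and samples[i] >= samples[i + 1]]
    if len(peaks) < 2:
        return None  # not enough structure to predict
    period = (peaks[-1] - peaks[0]) / (len(peaks) - 1) * dt
    return peaks[-1] * dt + period  # time of next expected bright state

if __name__ == "__main__":
    dt = 0.001  # 1 kHz sampling
    # Synthetic 100 Hz flicker (e.g., mains-driven lighting), phase-shifted.
    samples = [math.sin(2 * math.pi * 100 * i * dt + 0.3) for i in range(30)]
    print(predict_next_peak(samples, dt))  # next predicted peak, in seconds
```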
In one arrangement, a camera-equipped mobile phone is equipped with two or more cameras.
In one arrangement, a mobile phone is equipped with two or more projectors. Further, in this arrangement, the projectors alternately project patterns onto a surface, and the projected patterns are sensed by the camera portion of the mobile phone and used to identify topology information.
In one arrangement, a camera-equipped mobile phone is equipped with a projector that projects a pattern onto a surface; the pattern is then captured by the camera, and the mobile phone can thereby identify information about the topology of the surface. Also in such an arrangement, the identified topology information is used to help identify objects. Also in such an arrangement, the identified topology information is used to normalize the image information captured by the camera.
In one arrangement, the camera and projector portions of a mobile phone share at least one optical component. Also in this arrangement, the camera and projector portions share a lens.
In one arrangement, a camera-equipped mobile phone utilizes a packet architecture to route image-related data between a plurality of processing modules. Also in such an arrangement, the packets further carry instructions to which the processing modules respond. Also in this arrangement, the image sensor of the phone responds to packets carrying image capture commands to it.
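A packet architecture of this kind can be sketched as below. The module names, packet fields, and per-module parameters are invented for illustration, not taken from the patent:

```python
# Sketch: packets carry both image-related payload and the instructions
# (route plus per-module parameters) that processing modules respond to.
from dataclasses import dataclass, field

@dataclass
class Packet:
    route: list          # remaining modules to visit, in order
    instructions: dict   # per-module parameters carried in the packet
    payload: list = field(default_factory=list)

# Hypothetical processing modules keyed by name.
MODULES = {
    "scale": lambda data, p: [v * p.get("factor", 1) for v in data],
    "threshold": lambda data, p: [1 if v > p.get("limit", 0) else 0 for v in data],
}

def dispatch(packet):
    # Each hop pops the next module from the route and applies the
    # instructions the packet carries for that module.
    while packet.route:
        name = packet.route.pop(0)
        params = packet.instructions.get(name, {})
        packet.payload = MODULES[name](packet.payload, params)
    return packet.payload

if __name__ == "__main__":
    pkt = Packet(route=["scale", "threshold"],
                 instructions={"scale": {"factor": 2}, "threshold": {"limit": 50}},
                 payload=[10, 30, 40])
    print(dispatch(pkt))  # scaled to [20, 60, 80], then thresholded
```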
In one arrangement, an image capture system of a camera-equipped mobile phone outputs first and second sequences of image sets of different types, according to automated instructions provided to it. Also in such an arrangement, the sets of the sequences differ in size, color, or resolution.
In one arrangement, a camera-equipped mobile phone captures a sequence of visual data sets, and parameters used to capture one of the sets depend on analysis of a previously captured data set.
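Such a feedback loop, in which one capture's parameters come from analyzing the prior capture, can be sketched as a simple exposure controller (the target brightness, clamping range, and mean-brightness analysis are assumptions for illustration):

```python
# Sketch: derive the exposure for the next capture from analysis of the
# previously captured frame (values are 0-255 brightness samples).

def analyze(frame):
    return sum(frame) / len(frame)  # mean brightness of the prior capture

def next_exposure(current_exposure, prev_frame, target=128.0):
    mean = analyze(prev_frame)
    if mean <= 0:
        return current_exposure * 2.0  # totally dark: open up aggressively
    # Scale exposure toward the target brightness, clamped to a sane range.
    return max(0.001, min(1.0, current_exposure * target / mean))

if __name__ == "__main__":
    # Prior frame was half the target brightness -> double the exposure.
    print(next_exposure(0.01, [64] * 10))
```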
In one arrangement, a camera-equipped mobile phone transmits image-related data to one of a plurality of competing cloud-based services for analysis. Also in this arrangement, the analysis includes face recognition, optical character recognition, or an FFT operation. Also included in such an arrangement is choosing a service from the plurality of competing services based on a set of rules.
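A rule-based chooser among competing services might look like the following. The service names, capabilities, costs, and latency figures are hypothetical:

```python
# Sketch: pick one of several competing cloud services by rule -
# it must support the requested operation and meet a latency bound,
# and among the eligible services the cheapest wins.

SERVICES = [
    {"name": "svc-a", "ops": {"ocr", "fft"},  "cost": 5, "latency_ms": 300},
    {"name": "svc-b", "ops": {"ocr", "face"}, "cost": 2, "latency_ms": 900},
    {"name": "svc-c", "ops": {"face", "fft"}, "cost": 9, "latency_ms": 100},
]

def choose_service(op, max_latency_ms=1000):
    eligible = [s for s in SERVICES
                if op in s["ops"] and s["latency_ms"] <= max_latency_ms]
    if not eligible:
        raise LookupError(f"no service offers {op!r} within {max_latency_ms} ms")
    return min(eligible, key=lambda s: s["cost"])["name"]

if __name__ == "__main__":
    print(choose_service("ocr"))                       # cheapest OCR provider
    print(choose_service("face", max_latency_ms=200))  # latency rule dominates
```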
In one arrangement, a camera-equipped mobile phone transmits image-related data to a cloud-based service for processing, and the phone pre-warms the service or a communication channel in anticipation of the transmission of image-related data. Also in such an arrangement, the pre-warmed service or channel is identified by prediction based on context.
In one arrangement, a camera-equipped mobile phone has a plurality of modes the user can select, among them a face recognition mode, an optical character recognition mode, a mode associated with purchasing the imaged item, a mode associated with selling the imaged item, or a mode that determines information about an imaged item, scene, or person (e.g., from Wikipedia, a manufacturer's web site, or a social network site). Also in this arrangement, the user selects a mode before capturing an image.
In one arrangement, a glossary of visual codes is defined; the codes are recognized by the mobile phone and serve to trigger associated functions.
In one arrangement, a camera-equipped mobile phone is used as an aid in name recall: the camera captures an image containing a face, and the face is processed by face recognition against reference data obtained from a remote resource such as Facebook, Picasa, or iPhoto.
In one arrangement, an image of an object captured by a camera-equipped mobile phone is used to link to information related to the object, such as spare parts, instruction manuals, or images of similar-appearing objects.
In one arrangement, an image is stored in association with a set of data or attributes that serve as implicit or explicit links to operations and/or other content. Also in this arrangement, the user navigates from one image to a next image, akin to navigating between nodes on a network. Also in such an arrangement, such links are analyzed to identify additional information.
In one arrangement, an image is processed to identify associated semantic information, in accordance with information from a data store. Also in such an arrangement, the identified semantic information is processed to identify further related semantic information, again in accordance with information from the data store.
One arrangement includes a plurality of mobile phones in a networked cluster. Also in this arrangement, the networked cluster comprises a peer-to-peer network.
In one arrangement, a default rule governs the sharing of content in a network, the default rule specifying that content from a first time period is not to be shared. Also in such an arrangement, the default rule specifies that content within a second time period may be shared. Also in this arrangement, the default rule specifies that content from the second time period may only be shared across a social link.
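Such a time-period default rule can be sketched as a small predicate. The concrete thresholds (7 and 30 days) and the social-link test are assumptions chosen only to make the two periods concrete:

```python
# Sketch: default content-sharing rule keyed to content age.
# First period: never shared by default. Second period: shared only
# across a social link. Older content: shareable by default.

def may_share(age_days, requester_is_social_link):
    if age_days < 7:        # first (most recent) time period
        return False
    if age_days < 30:       # second time period
        return requester_is_social_link
    return True             # content older than the second period

if __name__ == "__main__":
    print(may_share(1, True))    # recent content: withheld even from friends
    print(may_share(10, True))   # second period: shared across a social link
    print(may_share(10, False))  # second period, no social link: withheld
```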
In one arrangement, empirical data associated with a location becomes available to users at that location. Also in such an arrangement, mobile phones at that location form an ad hoc network in which empirical data are shared.
In one arrangement, an image sensor of a camera-equipped mobile phone is formed on a substrate, and one or more modules used for processing image-related data to serve automated visual queries (e.g., object recognition) are also formed on that substrate.
In one arrangement, an image is captured by one party and made available to a plurality of users for analysis, such as for object recognition applications (e.g., searching for one's vehicle).
In one arrangement, image feeds from a distributed camera network become available for public search.
Arrangements corresponding to those described above also relate to audio captured by a microphone rather than visual input captured by an image sensor (substituting, for example, voice recognition for face recognition).
The invention also relates to methods, systems and sub-combinations corresponding to those described above and to computer-readable storage media having instructions for configuring a processing system to perform some or all of these methods.
10, 81, 530: cell phone 12: image sensor
16: Cloud 32, 544: Camera
34: Setup module 35: Synchronization processor
36: control processor module 38: processing module
51: Pipeline manager 52: Data pipe
72, 73, 74: hardware module
79, 524, 534, 546, 590: memory 82: lens
84: beam splitter
86: Micro-mirror projector system 110: Portable device
111, 582: display 112: keypad
114: controller 124: roller wheel
512, 530: thermostat 514: temperature sensor
516, 542: processor
520: LCD display screen 526: WiFi transceiver
528: antenna 532: processor
552b: remote server 554: router
556b: Server 584: Physical UI
586: Control processor
- An image processing apparatus having an input for receiving image data from an image sensor and providing first and second different parallel processing paths, wherein the first processing path includes a human visual system processing section having a first processing module, and the second processing path includes a machine vision processing section having a second processing module;
Wherein the human visual system processing section generates first processed image data provided to a display device seen by a human being;
The machine vision processing section generates second processed image data to be analyzed by a recognition stage to generate object recognition data or characteristic recognition data;
Wherein the image data from the image sensor is connected to the display device via the human visual system processing unit without going through the second processing module;
Wherein the image data from the image sensor is connected to the recognition stage via the machine vision processing section without going through the first processing module.
- 90. The apparatus of claim 89,
Wherein the first processing module comprises a white balance correction module.
- 90. The apparatus of claim 89,
Wherein the first processing module comprises a gamma correction module.
- 90. The apparatus of claim 89,
Wherein the first processing module comprises a de-mosaicing module.
- 90. The apparatus of claim 89,
Wherein the first processing module comprises a JPEG compression module.
- 90. The apparatus of claim 89,
Wherein the machine vision processing section employs processing circuitry integrated on a common substrate with the image sensor.
- 90. The apparatus of claim 89,
Wherein the machine vision processing section includes at least one of an FFT module, an edge detection module, a pattern extraction module, a Fourier-Mellin module, a texture classifier module, a color histogram module, a motion detection module, or a feature recognition module.
- 90. The apparatus of claim 89,
Wherein the machine vision processing section includes an FFT module.
- 90. The apparatus of claim 89,
Wherein the machine vision processing section includes an edge detection module.
- 90. The apparatus of claim 89,
Wherein the machine vision processing unit includes a pattern extraction module.
- 90. The apparatus of claim 89,
Wherein the machine vision processing section comprises a Fourier-Mellin module.
- 90. The apparatus of claim 89,
Wherein the machine vision processor comprises a texture classifier module.
- 90. The apparatus of claim 89,
Wherein the machine vision processor comprises a color histogram module.
- 90. The apparatus of claim 89,
Wherein the machine vision processing section includes a motion detection module.
- 90. The apparatus of claim 89,
Wherein the machine vision processing section includes a feature recognition module.
- A mobile phone comprising a microphone, a wireless transceiver, an image sensor, and the apparatus according to claim 89.
- 1. An image processing method comprising:
Receiving image data from an image sensor and supplying the received image data to first and second different parallel processing paths, wherein the first processing path includes a human visual system processing section having a first processing module, and the second processing path includes a machine vision processing section having a second processing module;
Generating, by the human visual system processing unit, first processed image data provided to a display device seen by a human being;
Generating, by the machine vision processing unit, second processed image data to be analyzed by a recognition stage to generate object recognition data or characteristic recognition data;
Wherein the image data from the image sensor is connected to the display device via the human visual system processing unit without going through the second processing module;
Wherein the image data from the image sensor is connected to the recognition stage via the machine vision processing section without going through the first processing module.
- 108. The method of claim 109,
Wherein the first processing module is selected from the group consisting of a white balance correction module, a gamma correction module, a de-mosaicing module, and a JPEG compression module.
- 112. The method of claim 110,
Wherein the human visual system processing unit performs a demosaicing operation.
- 112. The method of claim 110,
Wherein the human visual system processing unit performs a JPEG compression operation.
- 112. The method of claim 110,
Wherein the machine vision processing section performs at least one of an FFT operation, an edge detection operation, a pattern extraction operation, a Fourier-Mellin transform, a texture classifier operation, a color histogram operation, or a motion detection operation.
- 90. The apparatus of claim 89,
Wherein the machine vision processing section includes a Fourier transform module and the image data from the image sensor is connected to the display device without going through the Fourier transform module.
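The parallel-path architecture recited in claim 89 (a human visual system path feeding a display and a machine vision path feeding a recognition stage, each bypassing the other's module) can be sketched as follows. The per-path operations here are stand-in arithmetic chosen only to show the routing, not real imaging code:

```python
# Sketch: sensor data feeds two parallel paths, and neither path runs
# the other's processing module.

def hvs_path(raw):
    # Human-visual-system path, e.g. a gamma-like correction for display;
    # it never invokes the machine-vision module below.
    return [round(v ** 0.5, 3) for v in raw]

def machine_vision_path(raw):
    # Machine-vision path, e.g. edge-like differencing for a recognition
    # stage; it never invokes the display-oriented correction above.
    return [b - a for a, b in zip(raw, raw[1:])]

def process(raw):
    return {"display": hvs_path(raw),
            "recognition": machine_vision_path(raw)}

if __name__ == "__main__":
    out = process([0, 1, 4, 9])
    print(out["display"])      # display-bound data (gamma-like corrected)
    print(out["recognition"])  # recognition-bound data (edge-like features)
```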
Priority Applications (27)
|Application Number||Priority Date||Filing Date||Title|
|US12/271,692 US8520979B2 (en)||2008-08-19||2008-11-14||Methods and systems for content processing|
|US12/484,115 US8385971B2 (en)||2008-08-19||2009-06-12||Methods and systems for content processing|
|US12/498,709 US20100261465A1 (en)||2009-04-14||2009-07-07||Methods and systems for cell phone interactions|
|PCT/US2009/054358 WO2010022185A1 (en)||2008-08-19||2009-08-19||Methods and systems for content processing|
|Publication Number||Publication Date|
|KR20110043775A KR20110043775A (en)||2011-04-27|
|KR101680044B1 true KR101680044B1 (en)||2016-11-28|
Family Applications (2)
|Application Number||Title||Priority Date||Filing Date|
|KR1020117006167A KR101680044B1 (en)||2008-08-19||2009-08-19||Methods and systems for content processing|
|KR1020167032337A KR101763132B1 (en)||2008-08-19||2009-08-19||Methods and systems for content processing|
Family Applications After (1)
|Application Number||Title||Priority Date||Filing Date|
|KR1020167032337A KR101763132B1 (en)||2008-08-19||2009-08-19||Methods and systems for content processing|
Country Status (5)
|EP (1)||EP2313847A4 (en)|
|KR (2)||KR101680044B1 (en)|
|CN (1)||CN102216941B (en)|
|CA (1)||CA2734613A1 (en)|
|WO (1)||WO2010022185A1 (en)|
Families Citing this family (63)
|Publication number||Priority date||Publication date||Assignee||Title|
|US8520979B2 (en)||2008-08-19||2013-08-27||Digimarc Corporation||Methods and systems for content processing|
|US20120135744A1 (en)||2009-07-21||2012-05-31||Kota Enterprises, Llc||Systems and methods for generating and managing communication rules associated with geographic locations|
|US8121618B2 (en)||2009-10-28||2012-02-21||Digimarc Corporation||Intuitive computing methods and systems|
|US8175617B2 (en) *||2009-10-28||2012-05-08||Digimarc Corporation||Sensor-based mobile search, related methods and systems|
|CA2792336C (en) *||2010-03-19||2018-07-24||Digimarc Corporation||Intuitive computing methods and systems|
|US20110320560A1 (en) *||2010-06-29||2011-12-29||Microsoft Corporation||Content authoring and propagation at various fidelities|
|US8774267B2 (en) *||2010-07-07||2014-07-08||Spinella Ip Holdings, Inc.||System and method for transmission, processing, and rendering of stereoscopic and multi-view images|
|JP2012027263A (en) *||2010-07-23||2012-02-09||Sony Corp||Imaging apparatus, control method and program thereof|
|JP2012058838A (en)||2010-09-06||2012-03-22||Sony Corp||Image processor, program, and image processing method|
|US9484046B2 (en) *||2010-11-04||2016-11-01||Digimarc Corporation||Smartphone-based methods and systems|
|US9183580B2 (en) *||2010-11-04||2015-11-10||Digimarc Corporation||Methods and systems for resource management on portable devices|
|US8762852B2 (en)||2010-11-04||2014-06-24||Digimarc Corporation||Smartphone-based methods and systems|
|US9235277B2 (en) *||2010-12-03||2016-01-12||Razer (Asia-Pacific) Pte Ltd.||Profile management method|
|CN102812497B (en) *||2011-03-03||2016-06-08||松下知识产权经营株式会社||The image experiencing image subsequently can be provided to provide device, image to provide method|
|CN102170471A (en) *||2011-04-14||2011-08-31||宋健||A real-time audio and video signal transmission method and system replacing satellite network|
|US8516607B2 (en) *||2011-05-23||2013-08-20||Qualcomm Incorporated||Facilitating data access control in peer-to-peer overlay networks|
|KR20170135977A (en) *||2011-05-27||2017-12-08||돌비 레버러토리즈 라이쎈싱 코오포레이션||Scalable systems for controlling color management comprising varying levels of metadata|
|US8861798B2 (en)||2011-06-30||2014-10-14||Shenzhen Junshenghuichuang Technologies Co., Ltd.||Method for authenticating identity of handset user|
|US8627096B2 (en)||2011-07-14||2014-01-07||Sensible Vision, Inc.||System and method for providing secure access to an electronic device using both a screen gesture and facial biometrics|
|CN103018162B (en) *||2011-09-22||2016-07-06||致茂电子股份有限公司||A kind of system and method processed for the image data tested|
|ES2632440T3 (en)||2011-10-14||2017-09-13||Siemens Aktiengesellschaft||Procedure and system for powering at least one mobile component in a wireless communication system, in particular RFID tags of an RFID system|
|TWI455579B (en)||2011-10-26||2014-10-01||Ability Entpr Co Ltd||Image processing method and processing circuit and image processing system and image capturing device using the same|
|CN103095977B (en) *||2011-10-31||2016-08-10||佳能企业股份有限公司||Image acquisition method and apply its image processing system and image capture unit|
|KR20130048035A (en) *||2011-11-01||2013-05-09||엘지전자 주식회사||Media apparatus, contents server, and method for operating the same|
|JP5851046B2 (en) *||2011-11-10||2016-02-03||エンパイア テクノロジー ディベロップメント エルエルシー||Remote display|
|US9940118B2 (en)||2012-02-23||2018-04-10||Dahrwin Llc||Systems and methods utilizing highly dynamic wireless ad-hoc networks|
|US8774147B2 (en)||2012-02-23||2014-07-08||Dahrwin Llc||Asynchronous wireless dynamic ad-hoc network|
|KR101375962B1 (en) *||2012-02-27||2014-03-18||주식회사 팬택||Flexible terminal|
|US8959092B2 (en) *||2012-06-27||2015-02-17||Google Inc.||Providing streams of filtered photographs for user consumption|
|US9164552B2 (en)||2012-09-27||2015-10-20||Futurewei Technologies, Inc.||Real time visualization of network information|
|US8811670B2 (en) *||2012-09-28||2014-08-19||The Boeing Company||Method and system for using fingerprints to track moving objects in video|
|CN103714089B (en) *||2012-09-29||2018-01-05||上海盛大网络发展有限公司||A kind of method and system for realizing cloud rollback database|
|JP2016502199A (en) *||2012-11-29||2016-01-21||エドラ シーオー エルティーディー||How to provide different contents corresponding to widgets that change visually on the smart device screen|
|US9589314B2 (en) *||2013-04-29||2017-03-07||Qualcomm Incorporated||Query processing for tile-based renderers|
|CA2885874A1 (en) *||2014-04-04||2015-10-04||Bradford A. Folkens||Image processing system including image priority|
|US9843623B2 (en)||2013-05-28||2017-12-12||Qualcomm Incorporated||Systems and methods for selecting media items|
|KR101480065B1 (en) *||2013-05-29||2015-01-09||(주)베라시스||Object detecting method using pattern histogram|
|US9443355B2 (en)||2013-06-28||2016-09-13||Microsoft Technology Licensing, Llc||Reprojection OLED display for augmented reality experiences|
|CN104424485A (en) *||2013-08-22||2015-03-18||北京卓易讯畅科技有限公司||Method and device for obtaining specific information based on image recognition|
|CN103442218B (en) *||2013-08-27||2016-12-28||宁波海视智能系统有限公司||A kind of multi-mode Activity recognition and the preprocessing method of video signal of description|
|KR101502841B1 (en) *||2013-08-28||2015-03-16||현대미디어 주식회사||Outline forming method of Bitmap font, and computer-readable recording medium for the same|
|JP2016536647A (en) *||2013-09-16||2016-11-24||トムソン ライセンシングＴｈｏｍｓｏｎ Ｌｉｃｅｎｓｉｎｇ||Color detection method and apparatus for generating text color|
|AT514861A3 (en) *||2013-09-20||2015-05-15||Asmag Holding Gmbh||Authentication system for a mobile data terminal|
|CN103530649A (en) *||2013-10-16||2014-01-22||北京理工大学||Visual searching method applicable mobile terminal|
|US9354778B2 (en)||2013-12-06||2016-05-31||Digimarc Corporation||Smartphone-based methods and systems|
|CN110263202A (en) *||2013-12-20||2019-09-20||西-奥特有限公司||Image search method and equipment|
|US20150286873A1 (en) *||2014-04-03||2015-10-08||Bruce L. Davis||Smartphone-based methods and systems|
|CN103996209B (en) *||2014-05-21||2017-01-11||北京航空航天大学||Infrared vessel object segmentation method based on salient region detection|
|CN105303506B (en) *||2014-06-19||2018-10-26||Tcl集团股份有限公司||A kind of data parallel processing method and system based on HTML5|
|KR101487461B1 (en) *||2014-06-26||2015-01-28||우원소프트 주식회사||Security control system by face recognition with private image secure function|
|CN104967790B (en)||2014-08-06||2018-09-11||腾讯科技（北京）有限公司||Method, photo taking, device and mobile terminal|
|CN104267808A (en) *||2014-09-18||2015-01-07||北京智谷睿拓技术服务有限公司||Action recognition method and equipment|
|KR101642602B1 (en) *||2014-12-02||2016-07-26||서진이엔에스(주)||System and method of detecting parking by software using analog/digital closed-circuit television image|
|CN105095398B (en) *||2015-07-03||2018-10-19||北京奇虎科技有限公司||A kind of information providing method and device|
|CN105046256B (en) *||2015-07-22||2018-10-16||福建新大陆自动识别技术有限公司||QR codes coding/decoding method based on distorted image correction and system|
|JP6493264B2 (en) *||2016-03-23||2019-04-03||横河電機株式会社||Maintenance information sharing apparatus, maintenance information sharing method, maintenance information sharing program, and recording medium|
|CN106213968A (en) *||2016-08-04||2016-12-14||轩脉家居科技（上海）有限公司||A kind of intelligent curtain based on human action identification|
|KR20180070082A (en) *||2016-12-16||2018-06-26||(주)태원이노베이션||Vr contents generating system|
|CN107465868B (en) *||2017-06-21||2018-11-16||珠海格力电器股份有限公司||Object identification method, device and electronic equipment based on terminal|
|CN107231547A (en) *||2017-07-07||2017-10-03||广东中星电子有限公司||Video monitoring system and method|
|US10360832B2 (en)||2017-08-14||2019-07-23||Microsoft Technology Licensing, Llc||Post-rendering image transformation using parallel image transformation pipelines|
|WO2019182907A1 (en) *||2018-03-21||2019-09-26||Nulman Yanir||Design, platform, and methods for personalized human interactions through digital communication devices|
|CN108830594B (en) *||2018-06-22||2019-05-07||山东高速信联支付有限公司||Multi-mode electronic fare payment system|
Family Cites Families (7)
|Publication number||Priority date||Publication date||Assignee||Title|
|GB0226014D0 (en) *||2002-11-08||2002-12-18||Nokia Corp||Camera-LSI and information device|
|US20050094730A1 (en) *||2003-10-20||2005-05-05||Chang Li F.||Wireless device having a distinct hardware video accelerator to support video compression and decompression|
|JP2007028326A (en) *||2005-07-19||2007-02-01||Alps Electric Co Ltd||Camera module and mobile phone terminal|
|US7797740B2 (en) *||2006-01-06||2010-09-14||Nokia Corporation||System and method for managing captured content|
|US8184166B2 (en) *||2006-07-06||2012-05-22||Nokia Corporation||Method, device, mobile terminal and computer program product for a camera motion detection based scheme for improving camera input user interface functionalities|
|US20080094466A1 (en)||2006-10-18||2008-04-24||Richard Eric Helvick||Target use video limit notification on wireless communication device|
|US7656438B2 (en) *||2007-01-04||2010-02-02||Sharp Laboratories Of America, Inc.||Target use video limit enforcement on wireless communication device|
- 2009-08-19 CN CN200980141567.8A patent/CN102216941B/en active IP Right Grant
- 2009-08-19 EP EP09808792.7A patent/EP2313847A4/en not_active Ceased
- 2009-08-19 KR KR1020117006167A patent/KR101680044B1/en active IP Right Grant
- 2009-08-19 WO PCT/US2009/054358 patent/WO2010022185A1/en active Application Filing
- 2009-08-19 CA CA2734613A patent/CA2734613A1/en active Pending
- 2009-08-19 KR KR1020167032337A patent/KR101763132B1/en active IP Right Grant
Patent Citations (3)
|Publication number||Priority date||Publication date||Assignee||Title|
|US20040263663A1 (en) *||2003-06-25||2004-12-30||Sunplus Technology Co., Ltd.||Digital camera image controller apparatus for a mobile phone|
|US20060012677A1 (en) *||2004-02-20||2006-01-19||Neven Hartmut Sr||Image-based search engine for mobile phones with camera|
|JP2007336538A (en) *||2006-06-09||2007-12-27||Lg Innotek Co Ltd||Camera module and mobile communication terminal having the same|
Also Published As
|Publication number||Publication date|
|CN102822817B (en)||Actionable search results for visual queries|
|KR100641791B1 (en)||Tagging Method and System for Digital Data|
|JP5866728B2 (en)||Knowledge information processing server system with image recognition system|
|US9595059B2 (en)||Image-related methods and arrangements|
|Bao et al.||Movi: mobile phone based video highlights via collaborative sensing|
|US7929809B2 (en)||Method for assembling a collection of digital images|
|KR101123217B1 (en)||Scalable visual search system simplifying access to network and device functionality|
|JP5427859B2 (en)||System for image capture and identification|
|US8165409B2 (en)||Mobile device identification of media objects using audio and image recognition|
|CN105930311B (en)||Execute method, mobile device and the readable medium with the associated action of rendered document|
|US8180396B2 (en)||User augmented reality for camera-enabled mobile devices|
|US20070159522A1 (en)||Image-based contextual advertisement method and branded barcodes|
|CN102017661B (en)||Data access based on content of image recorded by a mobile device|
|US8436911B2 (en)||Tagging camera|
|US8769437B2 (en)||Method, apparatus and computer program product for displaying virtual media items in a visual media|
|CN102523519B (en)||Automatic multimedia slideshows for social media-enabled mobile devices|
|US20140019264A1 (en)||Framework for product promotion and advertising using social networking services|
|CN103080951B (en)||For the method and apparatus identifying the object in media content|
|KR101109157B1 (en)||Method, system, computer program, and apparatus for augmenting media based on proximity detection|
|US20120311623A1 (en)||Methods and systems for obtaining still images corresponding to video|
|KR100980748B1 (en)||System and methods for creation and use of a mixed media environment|
|CN102945276B (en)||Generation and update based on event playback experience|
|US20110096992A1 (en)||Method, apparatus and computer program product for utilizing real-world affordances of objects in audio-visual media data to determine interactions with the annotations to the objects|
|US8605141B2 (en)||Augmented reality panorama supporting visually impaired individuals|
|US20130311329A1 (en)||Image-related methods and arrangements|
|A201||Request for examination|
|E902||Notification of reason for refusal|
|E902||Notification of reason for refusal|
|E701||Decision to grant or registration of patent right|
|A107||Divisional application of patent|
|GRNT||Written decision to grant|
|FPAY||Annual fee payment||
Payment date: 20190924
Year of fee payment: 4