Recently, there have been many cases of solutions (software and hardware) that follow a similar pattern. The following map organizes a selection of these recent solutions (such as Siri, Watson, Project Glass, Kinect), aims to uncover the connections between them, and helps in imagining new solutions along the same lines. The common pattern among the majority of the solutions in the map is unstructured input that is turned algorithmically into a 'semantic model', which enables different kinds of processing, visualization and realization, or, in short, 'semantic solutions'.
One can argue that all human-made software is basically semantic, but I think these recent solutions take this characteristic to a whole new level. Even though they seem unrelated, the semantic pattern described above unifies them. The callouts in the map are color-coded: green for existing solutions, yellow for solutions not yet widely available, and red for a selection of my own ideas for the future (without thorough research into their originality). Many of the callouts link to the solutions' websites (the red arrows). Some exceptions to the pattern are marked with a star or stars. A special placeholder callout in the map asks: what is your favorite semantic IDE feature?
Editing and representing photos (and other media types) at a semantic level, not at the pixel level, has big potential. My previous post from a year ago discussed the approach and its strengths and weaknesses while suggesting its realization would depend on future basic research developments (like object recognition and real-time scene rendering).
So what is new? Here is a selection of videos that demonstrate some of the major progress made in these areas, showing, in my opinion, a common trend.
Assisted recognition and editing of a photographed scene:
Note that the described algorithm still needs human assistance, but only very limited assistance, and it uses a single photo as input. Note also the parts of the video that show not only lighting but also physical interactions between inserted objects and the scene.
Assisted recognition of 3D space enabling semantic video editing:
In this video from Adobe MAX 2011, Sylvain Paris shows a sneak peek of a feature for video editing, including the ability to create 3D fly-throughs of 2D videos and to change focus and depth of field.
Note again the need for limited human assistance, and in addition the real-time editing capabilities of this software. It would be interesting to know whether a special input video enabled the change of focus (similar to the approach of the Lytro camera) or whether it was purely powerful post-production semantic editing based on the 3D characteristics of the scene, using the original input video. I would guess the latter.
Non-assisted recognition of buildings and cityscape from images:
An additional video about the technical details of this product describes it as fully automated, with a processing time of five times the flight time. Note also the real-time rendering capabilities of the end result.
Real-Time Scene Rendering:
Demonstration of a rendering technology of the kind needed for real-time semantic photo editing: Motiva COLIMO is a software tool that provides real-time 3D post-production capabilities.
This kind of technology could be coupled with recognition software to produce automatic semantic photo editing end-to-end.
And finally, here are two examples of adding a layer of computation to an input image:
The Google Goggles Sudoku feature demonstrates not just recognition of objects, as before, but also solving of a photographed puzzle.
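Once the digits in the grid have been recognized, the added 'layer of computation' is an ordinary constraint solver. As a rough illustration (not how Goggles actually works internally), a minimal backtracking Sudoku solver looks like this:

```python
def valid(grid, r, c, d):
    """Check whether digit d can legally go at row r, column c."""
    if d in grid[r] or any(grid[i][c] == d for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)  # top-left corner of the 3x3 box
    return all(grid[br + i][bc + j] != d for i in range(3) for j in range(3))

def solve(grid):
    """Solve a 9x9 Sudoku in place by backtracking; 0 marks an empty cell."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in range(1, 10):
                    if valid(grid, r, c, d):
                        grid[r][c] = d
                        if solve(grid):
                            return True
                        grid[r][c] = 0  # undo and try the next digit
                return False  # no digit fits: backtrack
    return True  # no empty cells left: solved
```

The recognition step supplies the `grid`; the solver then fills it and the result can be rendered back onto the photo.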
Real-time language translation in videos on mobile phones without an internet connection via Word Lens:
Both of these last examples hint at the potential of this trend for going beyond physical interactions.
To summarize, some would say that the recognition techniques above are still restricted more to text, scene geometry and lighting than to recognizing real-world objects, and that they may require some human assistance. Nevertheless, this progress by itself opens up a lot of opportunities; it stimulates the imagination and blazes the trail towards full semantic media editing. I expect that as we see further advances in object recognition and real-time rendering, and as the challenge of combining the two areas efficiently is overcome, semantic editing will provide even more interesting opportunities.
The book ‘Moral Machines: Teaching Robots Right from Wrong’ discusses a world that depends more and more on (ro)bots' decision making in places like self-driving cars, stock markets and war zones. The authors highlight the complexity of the issues involved in the creation of Artificial Moral Agents, from the limitations of Asimov's laws of robotics to the plethora of modern approaches and the need for a richer understanding of human morality itself. Despite the topic's complexity, I liked the thoughtful call for a practical, step-by-step approach to its realization, and the sooner the better.
Ever imagined looking at a photo of a room and then simply switching on the light in that room, inside the photo? Or interactively rearranging the fruits on the table? Or changing the time that the wall clock shows? All inside the photo.
Idea in a Nutshell
This post proposes a (new?*) approach for editing and representing photos. No more editing of photos only at the pixel level but editing at the semantic level. Photos in photo editing applications can be much more than just a series of pixels. They can be represented as a collection of semantic objects, objects with meaning: objects with a specific 3D structure, physical characteristics and functional features. This way, photo editing can become more similar to editing a scene in a 3D software. The focus in this post is on photos but SEED could be applied to other types of media like video or audio as well. Furthermore, this post discusses additional opportunities opening up by semantic representation of media.
What do I mean by an 'approach'?
The suggested approach is called Semantic Editing, Encoding and Decoding (SEED). It is NOT a new object recognition algorithm nor a scene rendering algorithm. It is simply a theoretical approach with roots in current technology. It discusses the 'What' rather than the 'How'. I tend to believe that this approach could be implemented in the future based on future big advances in certain key basic research areas, like object recognition and real time scene rendering.
First, a good-enough semantic representation should be extracted from a photo and saved - after the raw photo is taken, an object recognition algorithm runs to extract the semantic info from it and save it in a semantic file format. An example of a photo's semantic representation would be: a book with a specific ISBN catalog number, opened on page 4, located at a specific position and angle in a room that is lit by a light bulb with specific lighting characteristics. The saved semantic file can then be re-rendered algorithmically by the photo editing application, and the fun begins...
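To make the book example concrete, here is a hypothetical sketch of what such a semantic file might contain, serialized as JSON. Every field name here is an illustrative assumption, not a proposed standard:

```python
import json

# A made-up sketch of a SEED-style semantic file for the book-in-a-room
# example. Field names, units and structure are all illustrative.
scene = {
    "objects": [
        {
            "type": "book",
            "isbn": "978-0-00-000000-0",   # placeholder catalog number
            "state": {"opened_on_page": 4},
            "position": [1.2, 0.8, 0.0],    # meters, room coordinates
            "rotation_deg": [0, 35, 0],
        }
    ],
    "environment": {
        "lights": [
            {"type": "bulb", "color_temp_k": 2700, "lumens": 800,
             "position": [0.0, 2.4, 0.0]}
        ]
    },
}

encoded = json.dumps(scene)    # 'encoding': save the semantic file
decoded = json.loads(encoded)  # 'decoding': load it back for re-rendering
print(decoded["objects"][0]["type"])  # book
```

A renderer would consume such a description (plus external knowledge keyed by the ISBN) to reconstruct or manipulate the scene.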
Semantic Representation Opportunities
Searching inside photos. Example: being able to ask what the next road signs are in a certain photo or video of a road, or searching the web or archives by asking questions in a language similar to natural language (a Semantic Query Language).
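Once a video is stored semantically, the road-sign question becomes a query over metadata rather than a computer-vision problem. A minimal sketch, where the record structure and helper are hypothetical:

```python
# Illustrative semantic records for three frames of a road video.
frames = [
    {"time_s": 0.0, "objects": [{"type": "car"}]},
    {"time_s": 1.5, "objects": [{"type": "road_sign", "text": "SPEED LIMIT 50"}]},
    {"time_s": 3.0, "objects": [{"type": "road_sign", "text": "STOP"}]},
]

def next_road_signs(frames, after_s=0.0):
    """Answer 'what are the next road signs?' over semantic video data."""
    return [obj["text"]
            for f in frames if f["time_s"] > after_s
            for obj in f["objects"] if obj["type"] == "road_sign"]

print(next_road_signs(frames))  # ['SPEED LIMIT 50', 'STOP']
```

A real Semantic Query Language would parse a natural-language-like question into a filter of this kind.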
Allowing for different on-demand image sizes and resolutions by re-rendering of the semantic content. Example: retargeting a photo to look good on different screen sizes and resolutions (cinemas, large TVs, mobile phones, tablets etc.) after the photo was already taken.
Compression. Example: <auto:car manufacturer="abc" model="bcd" coordinates="xyz" /> can represent a photo of a car that could be manipulated or zoomed in on as much as one wants, based on prior information about the car in the photo editing application's engine (or at the receiving end of a communication channel). This could be applied to video compression as well. This kind of compression can potentially achieve savings of orders of magnitude of data, yet result in much higher quality.
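The order-of-magnitude claim is easy to sanity-check: the semantic tag above is tens of bytes, while the raw pixels of a single uncompressed photo are tens of megabytes. A back-of-the-envelope sketch (the photo dimensions are illustrative):

```python
# The semantic description from the example above.
semantic = '<auto:car manufacturer="abc" model="bcd" coordinates="xyz" />'
semantic_bytes = len(semantic.encode("utf-8"))

# Raw pixels of a 12-megapixel RGB photo, uncompressed.
width, height, bytes_per_pixel = 4000, 3000, 3
raw_bytes = width * height * bytes_per_pixel

print(semantic_bytes)               # tens of bytes
print(raw_bytes)                    # tens of megabytes
print(raw_bytes // semantic_bytes)  # savings factor: hundreds of thousands
```

Even against a well-compressed JPEG (a few megabytes) rather than raw pixels, the gap remains several orders of magnitude.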
Using it as an assistive technology for people who are blind or have low vision. Digital photos could be described in words and turned from photo to text to speech. Furthermore, if decoding is fast enough, then a real-time wearable device could warn of obstacles and identify streets, shops, etc.
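The photo-to-text step of such an assistive pipeline could be as simple as templating over the recognized objects; the record structure below is a made-up illustration, and the resulting string would then be fed to any standard text-to-speech engine:

```python
# Hypothetical semantic objects recognized in a street-level photo.
objects = [
    {"type": "door", "relative_position": "ahead"},
    {"type": "staircase", "relative_position": "to your left"},
]

def describe(objects):
    """Turn semantic objects into a spoken-style description."""
    return ". ".join(f"There is a {o['type']} {o['relative_position']}"
                     for o in objects) + "."

print(describe(objects))
# There is a door ahead. There is a staircase to your left.
```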
Using it to provide additional content-relevant material such as ads.
Enabling Augmented Reality applications.
Enabling accumulation of knowledge (examples: landscapes, landmarks and products) by integrating photos from many different users.
Facilitating algorithmic creation of 'photos' without using cameras at all.
Facilitating a new kind of photography equipment. In addition to cameras optimized for optical quality, create cameras optimized for obtaining semantic qualities of objects and environments. Those qualities could then be used by semantic encoding and decoding algorithms. Examples:
New equipment for obtaining detailed object information: 3D scene analysis using lasers, photographing a scene from multiple (possibly slightly different) positions, angles and zoom levels, and cameras that focus on sources of high information density like logos, barcodes, light sources, mirrors or other reflections.
New sensors for obtaining environment characteristics. Examples: recording detailed light conditions (using new flashes and filters), humidity, temperature and highly detailed space and time coordinates.
New formats for photos' metadata.
Creative Photo Editing
'Moving in space' - enabling a change of point of view, of the way we look at the photo/scene. We can now view it from different angles and distances.
Enabling high zoom levels. Example: zooming in on a photographed object in the scene which SEED has identified and has complete structural knowledge about, including information of environment variables like lighting characteristics.
Creating new objects or deleting existing objects from a scene. Again, at the semantic level, not at the pixel level. Example: removing a book to reveal the repeating pattern on the sofa where it was previously located. Or better yet, instead of identifying a repeating pattern on the sofa, the new fill might be a known pattern from the identified sofa manufacturer's catalog.
Physical editing. Examples: moving, rotating or scaling of objects.
Editing the environment. Example: adding light to a dark photo (better than a camera flash would create and applied after the photo was taken), changing the time of the day, adding rain to a scene.
Functional editing. Example: changing the opened application windows on a computer screen in the photo.
'Moving in time'. Example: A short video clip which shows a bouncing ball could be extended. We can move in time to create video clips of the ball before the original clip was taken or after the original clip was taken. This could be done by re-rendering using the same object, the same environment parameters, camera characteristics and the same physics rules.
Combination of some or all of the above - Example: switching the light on in a photographed room (changing the physical position of the knob in the photo to 'on' and re-rendering the scene after adding new lighting implied by a specific kind of light bulb which was identified from the original photo in a specific place in the room).
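The 'moving in time' example can be illustrated with a toy re-renderer: given a ball state recovered from the clip and simple physics rules, the same step function extends the clip forward, and running it with a negated time step approximates frames from before the clip (a real system would invert the dynamics more carefully). Gravity, restitution and the initial state below are all illustrative assumptions:

```python
# Toy 'moving in time' sketch: one ball, 1-D vertical motion with gravity
# and a bouncy floor at y = 0. In a real system, the state and constants
# would be recovered from the original clip.
G = -9.8           # gravity, m/s^2
RESTITUTION = 0.8  # fraction of speed kept on each bounce

def step(y, v, dt):
    """Advance the ball state by dt seconds (dt may be negative to
    approximate frames from before the original clip)."""
    v += G * dt
    y += v * dt
    if y < 0:                      # hit the floor: reflect and damp
        y, v = -y, -v * RESTITUTION
    return y, v

# Extend the clip forward in time from the last known frame.
y, v = 1.0, 0.0
frames = [(y, v)]
for _ in range(100):
    y, v = step(y, v, 0.01)
    frames.append((y, v))
```

Each `(y, v)` pair would drive the re-rendering of one synthesized frame using the same object, environment and camera parameters as the original clip.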
Challenges

Potential abuse - this should be handled by means of education, awareness, ethics and law.
Confusion between reality and virtual reality - this should be handled as well via education and awareness.
Feasibility - SEED is contingent on future advances in object recognition techniques. Some might even say this level of recognition will never be achieved, but I personally tend to believe it will.
Processing power - SEED is also contingent on advances in efficient real-time rendering of complex scenes. This aspect might benefit from advances in cloud computing and bandwidth availability.
Authentic representation of reality - Is SEED lossy or lossless? Do we lose information when we use it or not? I would say that the first SEED implementations would be very lossy, but as SEED advances it might even become a 'gainy' method. By 'gainy' I mean enabling additional freedom in manipulating images with a basis in reality. Example: enhanced resolution or higher levels of zoom based on knowledge of the 'real' structure and colors of a photographed object. However, there are obvious inherent limitations to SEED's ability to provide an authentic representation of reality (such as: if we delete an object, what do we see instead?).
Scalability. For the idea to be useful, the knowledge base - the semantic databases covering different kinds of objects, environments, etc. - would need to be enormous in scale. It would also need to span many different categories (such as commercial products, natural elements, and urban and natural landscapes). One option, especially for the first implementations of SEED, is a hybrid scene, made partly of pixels and partly of semantic information. In addition, the initial database of identifiable objects can be small and specific to a certain domain and then gradually grow to include more categories and more objects.
Some photos are hard to represent semantically - like a one-of-a-kind product or a specific pattern of colors. The hybrid approach could support these use cases. As encoding algorithms become more sophisticated, larger and larger parts of photos could be changed from pixel-based to semantic-based encoding. The number of objects that are hard to represent semantically might be smaller than expected, following industrialization processes and improvements in data sharing and in modeling algorithms.
This post was written during the very last couple of days as part of contemplating my personal wishlist for Adobe MAX Sneak Peeks - so thanks Adobe for the inspiration. It was also inspired by the video accompanying Arik Shamir and Shai Avidan's paper, "Seam Carving for Content-Aware Image Resizing" so thanks Arik and Shai.
* Disclaimer: I do not have any background knowledge in the field of object recognition and the like. I consider this both a bad and a good thing. Bad, as the idea presented here might not be refreshing at all without my knowing it; and good, as it might be extremely refreshing precisely because I'm not constrained by existing paradigms. So I decided to let go, post as-is in a draft format, and hear your valuable feedback. Comment here or find me at Adobe MAX 2010 next week and let me know what you think.