SEED: Semantic Editing, Encoding and Decoding

Intro

   Ever imagined looking at a photo of a room and simply switching the light in that room on or off, inside the photo? Or interactively rearranging the fruit on the table? Or changing the time shown on the wall clock? All inside the photo.

Idea in a Nutshell

   This post proposes a (new?*) approach for editing and representing photos: no more editing photos only at the pixel level, but editing at the semantic level. Photos in photo editing applications can be much more than a series of pixels. They can be represented as a collection of semantic objects, objects with meaning: objects with a specific 3D structure, physical characteristics and functional features. This way, photo editing becomes more similar to editing a scene in 3D software. The focus of this post is on photos, but SEED could be applied to other types of media, like video or audio, as well. Furthermore, this post discusses additional opportunities opened up by a semantic representation of media.

What do I mean by an 'approach'?

    The suggested approach is called Semantic Editing, Encoding and Decoding (SEED). It is NOT a new object recognition algorithm, nor a scene rendering algorithm. It is simply a theoretical approach with roots in current technology; it discusses the 'What' rather than the 'How'. I tend to believe that this approach could be implemented in the future, building on big advances in certain key basic research areas, like object recognition and real-time scene rendering.

SEED explained

   First, a good-enough semantic representation should be extracted from a photo and saved: after the raw photo is taken, an object recognition algorithm runs, extracts the semantic information and saves it in a semantic file format. An example of a photo's semantic representation would be: a book with a specific ISBN catalog number, opened on page 4, located at a specific position and angle in a room which is lit by a light bulb with specific lighting characteristics. The saved semantic file can then be re-rendered algorithmically by the photo editing application, and let the fun begin...
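
   To make this concrete, here is a minimal sketch in Python of what such a semantic file might contain, using the book-and-light-bulb example above. Everything in it (the field names, the 'room.seed.json' format, the numeric values) is hypothetical and invented purely for illustration; no such format exists today.

        import json

        # Hypothetical semantic representation of the example above.
        # All field names and values are invented for illustration.
        scene = {
            "objects": [
                {
                    "type": "book",
                    "isbn": "978-0-00-000000-0",  # placeholder catalog number
                    "state": {"opened_at_page": 4},
                    "pose": {"position_m": [1.2, 0.8, 0.0],  # room coordinates
                             "rotation_deg": [0, 35, 0]},
                },
                {
                    "type": "light_bulb",
                    "state": {"switched": "on"},
                    "emission": {"color_temp_k": 2700, "lumens": 800},
                    "pose": {"position_m": [0.0, 2.4, 0.0]},
                },
            ],
            "camera": {"position_m": [3.0, 1.6, 2.0], "fov_deg": 60},
        }

        # Save the 'semantic file'; an editor would later re-render
        # pixels from this description on demand.
        with open("room.seed.json", "w") as f:
            json.dump(scene, f, indent=2)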


SEED Strengths

  • Semantic Representation Opportunities
    • Searching inside photos. Example: being able to ask what the next road signs are in a certain photo or video of a road, or searching the web or archives by asking questions in a language similar to natural language (a Semantic Query Language). See the first sketch after this list.
    • Allowing for different on-demand image sizes and resolutions by re-rendering the semantic content. Example: retargeting a photo to look good on different screen sizes and resolutions (cinemas, large TVs, mobile phones, tablets etc.) after the photo was already taken.
    • Compression. Example: <auto:car manufacturer="abc" model="bcd" coordinates="xyz" /> can represent a photo of a car, which could then be manipulated or zoomed into as much as one wants, based on prior information about the car in the photo editing application's engine (or at the receiving end of a communication channel). This could be applied to video compression as well. This kind of compression can potentially save orders of magnitude of data while yielding much higher quality. See the second sketch after this list.
    • Serving as an assistive technology for people who are blind or have low vision. Digital photos could be described in words and turned from photo to text to speech. Furthermore, if decoding is fast enough, a real-time wearable device could warn of obstacles and identify streets, shops etc.
    • Using it to provide additional content-relevant material such as ads.
    • Enabling Augmented Reality applications.
    • Enabling accumulation of knowledge (examples: landscapes, landmarks and products) by integrating photos from many different users.
    • Facilitating algorithmic creation of 'photos' without using cameras at all.
    • Facilitating new kinds of photography equipment. In addition to cameras optimized for optical quality, cameras could be optimized for capturing the semantic qualities of objects and environments. Those qualities could then be used by semantic encoding and decoding algorithms. Examples:
      • New equipment for obtaining detailed object information: 3D scene analysis using lasers; photographing a scene from multiple (possibly slightly different) positions, angles and zoom levels; cameras which focus on sources of high information density like logos, barcodes, light sources, mirrors or other reflections.
      • New sensors for obtaining environment characteristics. Examples: recording detailed light conditions (using new flashes and filters), humidity, temperature and highly detailed space and time coordinates.
      • New formats for photos' metadata.
  • Creative Photo Editing
    • 'Moving in space' - enabling a change of point of view, of the way we look at the photo/scene. We can now view it from different angles and distances.
    • Enabling high zoom levels. Example: zooming in on a photographed object in the scene which SEED has identified and has complete structural knowledge about, including information about environment variables like lighting characteristics.
    • Creating new objects in a scene or deleting existing ones from it. Again, at the semantic level, not at the pixel level. Example: removing a book to reveal the repeating pattern on the sofa where it was previously located. Or better yet, instead of identifying a repeating pattern on the sofa, the new fill might be a known pattern from the identified sofa manufacturer's catalog.
    • Physical editing. Examples: moving, rotating or scaling objects.
    • Editing the environment. Examples: adding light to a dark photo (better light than a camera flash would produce, applied after the photo was taken), changing the time of day, adding rain to a scene.
    • Functional editing. Example: changing the application windows open on a computer screen in the photo.
    • 'Moving in time'. Example: a short video clip which shows a bouncing ball could be extended. We can move in time to create video of the ball from before the original clip started or after it ended. This could be done by re-rendering with the same object, the same environment parameters, the same camera characteristics and the same physics rules. See the third sketch after this list.
    • Combination of some or all of the above. Example: switching the light on in a photographed room (changing the physical position of the knob in the photo to 'on' and re-rendering the scene after adding the new lighting implied by the specific kind of light bulb identified in the original photo at a specific place in the room).
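
   A minimal sketch of the 'searching inside photos' idea, in Python. It assumes objects have already been extracted into the kind of hypothetical semantic records shown earlier; a question like 'what are the next road signs?' then reduces to filtering structured data. All field names and values are invented for illustration.

        # Hypothetical semantic objects extracted from a dash-cam photo.
        objects = [
            {"type": "road_sign", "text": "SPEED LIMIT 50", "distance_m": 40},
            {"type": "car", "manufacturer": "abc", "model": "bcd"},
            {"type": "road_sign", "text": "EXIT 12", "distance_m": 120},
        ]

        # 'What are the next road signs?' becomes a plain filter-and-sort.
        signs = sorted((o for o in objects if o["type"] == "road_sign"),
                       key=lambda o: o["distance_m"])
        for sign in signs:
            print(f"{sign['distance_m']:>4} m ahead: {sign['text']}")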
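
   Next, a back-of-the-envelope illustration of the compression claim: the semantic description of the car from the example above is a few dozen bytes, while even a modest raw image of the same scene is megabytes. The numbers are purely illustrative and ignore conventional image compression, which would shrink the gap but not close it.

        # The semantic encoding from the example above.
        semantic = '<auto:car manufacturer="abc" model="bcd" coordinates="xyz" />'

        # A modest uncompressed RGB frame for comparison.
        width, height, bytes_per_pixel = 1920, 1080, 3
        raw_bytes = width * height * bytes_per_pixel

        print(f"semantic encoding: {len(semantic)} bytes")
        print(f"raw pixels:        {raw_bytes:,} bytes")
        print(f"ratio:             ~{raw_bytes // len(semantic):,}x")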
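
   Finally, a toy sketch of 'moving in time': once a clip is reduced to an object plus physics parameters, frames outside the recorded interval can be synthesized by running the same rules forward. The gravity, restitution and frame-rate values here are assumptions chosen for illustration, and a real implementation would of course render full frames rather than print heights.

        # Toy 'moving in time': extend a clip of a bouncing ball past its
        # last recorded frame by simulating the same physics (1D height).
        G = 9.81            # gravity, m/s^2
        RESTITUTION = 0.8   # fraction of speed kept per bounce (assumed)
        DT = 1 / 30         # assumed 30 frames per second

        def extend_clip(height, velocity, extra_frames):
            """Yield (frame, height) pairs after the original clip ends."""
            for frame in range(1, extra_frames + 1):
                velocity -= G * DT        # apply gravity
                height += velocity * DT   # move the ball
                if height <= 0:           # floor hit: bounce back up
                    height = 0.0
                    velocity = -velocity * RESTITUTION
                yield frame, height

        # Ball state recovered from the clip's final frame (assumed values).
        for frame, h in extend_clip(height=1.0, velocity=-2.0, extra_frames=12):
            print(f"frame +{frame}: height {h:.2f} m")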

SEED Weaknesses

  • Potential abuse - this should be handled by means of education, awareness, ethics and law.
  • Confusion between reality and virtual reality - this, too, should be handled via education and awareness.
  • Feasibility - SEED depends on future advances in object recognition techniques. Some might even say this level of recognition will never be achieved, but I personally tend to believe it will.
  • Processing power - SEED also depends on advances in efficient real-time rendering of complex scenes. This aspect might benefit from advances in cloud computing and bandwidth availability.
  • Authentic representation of reality - is SEED lossy or lossless? Do we lose information when we use it or not? I would say that the first SEED implementations would be very lossy, but as SEED advances it might even become a 'gainy' method. By 'gainy' I mean enabling additional freedom to manipulate images with a basis in reality. Example: enhanced resolution or higher levels of zoom based on knowledge of the 'real' structure and colors of a photographed object. However, SEED has obvious inherent limitations when it comes to authentic representation of reality (for instance, if we delete an object, what do we see instead?).
  • Scalability - for the idea to be useful, the knowledge base (the semantic databases covering different kinds of objects, environments etc.) would have to be enormous in scale. It would also have to span many different categories (such as commercial products, nature elements, urban and natural landscapes). One option, especially for the first implementations of SEED, is a hybrid scene, made partly of pixels and partly of semantic information; see the sketch after this list. In addition, the initial database of identifiable objects can be small and specific to a certain domain, and then gradually grow to include more categories and more objects.
  • Some photos are hard to represent semantically - for example, a one-of-a-kind product or a specific pattern of colors. The hybrid approach could support these use cases, and as encoding algorithms become more sophisticated, larger and larger parts of photos could move from pixel-based to semantic-based encoding. The number of objects which are hard to represent semantically might turn out to be surprisingly small, thanks to industrialization processes and improvements in data sharing and in modeling algorithms.
  • Standardization - a shared semantic file format and common object databases would be needed for SEED content to work across applications and vendors.
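
   A minimal sketch of the hybrid scene mentioned above: regions the encoder recognizes are stored semantically, while everything else stays as ordinary pixels. As with the earlier sketches, every name and value here is hypothetical.

        # Hypothetical hybrid scene: recognized regions become semantic
        # objects; unrecognized regions remain raw pixel patches.
        hybrid_scene = {
            "semantic": [
                {"type": "sofa", "manufacturer": "abc",
                 "region": {"x": 0, "y": 300, "w": 800, "h": 400}},
            ],
            "pixels": [
                # A one-of-a-kind painting the encoder could not identify,
                # kept as an ordinary compressed pixel patch.
                {"region": {"x": 500, "y": 40, "w": 200, "h": 260},
                 "encoding": "jpeg", "data": "<binary payload>"},
            ],
        }

        def semantic_coverage(scene, frame_area):
            """Fraction of the frame covered by semantic objects; as
            encoders improve, this should grow toward 1.0."""
            covered = sum(o["region"]["w"] * o["region"]["h"]
                          for o in scene["semantic"])
            return covered / frame_area

        print(f"{semantic_coverage(hybrid_scene, 1920 * 1080):.1%} semantic")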


Acknowledgements

    This post was written over the last couple of days while contemplating my personal wishlist for the Adobe MAX Sneak Peeks, so thanks, Adobe, for the inspiration. It was also inspired by the video accompanying Arik Shamir and Shai Avidan's paper, "Seam Carving for Content-Aware Image Resizing", so thanks, Arik and Shai.


* Disclaimer: I do not have any background knowledge in the field of object recognition and the like. I consider this both a bad and a good thing. Bad, as the idea presented here might not be refreshing at all without me knowing it; and good, as it might be extremely refreshing precisely because I'm not constrained by existing paradigms. So I decided to let go, post as-is in draft format and hear your valuable feedback. Comment here or find me at Adobe MAX 2010 next week and let me know what you think.