Semantic Photo Editing Gains Momentum (Videos)

Editing and representing photos (and other media types) at the semantic level rather than the pixel level has great potential. My previous post from a year ago discussed the approach, its strengths and its weaknesses, and suggested that realizing it would depend on future basic research developments (such as object recognition and real-time scene rendering).


So what is new? Here is a selection of videos that demonstrate some of the big progress made in these areas and that show, in my opinion, a common trend.


Assisted recognition and editing of a photographed scene: 

In Rendering Synthetic Objects into Legacy Photographs, Kevin Karsch et al. show a method to realistically insert synthetic objects into existing photographs without access to the original scene.

Note that the described algorithm still needs human assistance, but only a very limited amount, and that it uses a single photo as input. Note also the parts of the video that show not only lighting but also physical interactions between the inserted objects and the scene.



Assisted recognition of 3D space enabling semantic video editing:

In this video from Adobe MAX 2011, Sylvain Paris shows a sneak peek of a feature for video editing, including the ability to create 3D fly-throughs of 2D videos and to change focus and depth of field.

Note again the need for limited human assistance and, in addition, the real-time editing capabilities of this software. It would be interesting to know whether a special input video enabled the change of focus (similar to the approach of the Lytro camera), or whether it is purely powerful post-production semantic editing that works on the original input video and relies on the 3D characteristics of the scene. I would guess the latter.



Non-assisted recognition of buildings and cityscape from images:

This video comes from a company acquired by Apple.

An additional video about the technical details of this product describes it as fully automated, with a processing time of five times the flight time. Note also the real-time rendering capabilities of the end result.



Real-Time Scene Rendering:

Demonstration of a rendering technology of the kind needed for real-time semantic photo editing: Motiva COLIMO is a software tool that provides real-time 3D post-production capabilities. 

This kind of technology could be coupled with recognition software to produce automatic semantic photo editing end-to-end.


And finally, here are two examples of adding a layer of computation to an input image:

The Google Goggles Sudoku feature demonstrates not just recognition of objects, as before, but now also the solving of a photographed puzzle:


Real-time language translation in videos on mobile phones without an internet connection via Word Lens:


Both of these last examples hint at the potential of this trend for going beyond physical interactions.


To summarize, some would say that the recognition techniques above are still restricted more to text, scene geometry and lighting than to recognizing real-world objects, and that they may require some human assistance. Nevertheless, this progress by itself opens up a lot of opportunities, stimulates the imagination and blazes the trail towards full semantic media editing. I expect that as we see more advances in object recognition and real-time rendering, and as the challenge of combining the two areas efficiently is overcome, semantic editing will provide even more interesting opportunities.


To read more on this subject, check out my previous post.