Demo page for the paper "Multi-view Video Summarization", to appear in IEEE Transactions on Multimedia.


Multi-view Video Summarization

Yanwei Fu1, Yanwen Guo*1, Yanshu Zhu1, Feng Liu2, Chuanming Song1 and Zhi-Hua Zhou1

1National Key Lab for Novel Software Technology, Nanjing University, Nanjing

2Department of Computer Sciences, University of Wisconsin-Madison

Multi-view Video Summarization, IEEE Trans. on Multimedia. Accepted as a regular paper, 2010. (Corresponding author: Yanwen Guo)



We conducted experiments on several multi-view videos covering typical indoor and outdoor environments. Some of the multi-view videos are semi-synchronous or non-synchronous. For instance, the three views of the office2 videos are 180 minutes 41 seconds, 170 minutes 46 seconds, and 176 minutes 43 seconds long, respectively. Most multi-view videos were captured by three or four cameras with 360-degree coverage of the scene. To further verify our method, we also deliberately shot an outdoor scene with four cameras covering only 180 degrees. Note that all of the videos were captured by non-specialists using web cameras or hand-held video cameras, so some of them are unstable and unclear. Moreover, in some videos the brightness differs considerably across views. These issues pose great challenges to multi-view video summarization.
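
To make the degree of non-synchrony in the office2 example concrete, the three reported lengths can be converted to seconds and compared. The snippet below is only an illustrative calculation; the labels "view 1" to "view 3" are assumed names for the three views, which the text lists without naming.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a (minutes, seconds) duration to whole seconds."""
    return minutes * 60 + seconds

# Lengths reported for the office2 videos; the view labels are assumptions.
office2_lengths = {
    "view 1": (180, 41),
    "view 2": (170, 46),
    "view 3": (176, 43),
}

lengths_s = {view: to_seconds(m, s) for view, (m, s) in office2_lengths.items()}

# Difference between the longest and the shortest view, in seconds.
spread = max(lengths_s.values()) - min(lengths_s.values())

print(lengths_s)           # per-view length in seconds
print(divmod(spread, 60))  # (minutes, seconds) between longest and shortest view
```

Under these figures the longest and shortest views differ by 595 seconds, that is, 9 minutes 55 seconds.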



      Fig.  Multi-view video storyboard. Without loss of generality, the multi-view office1 videos with four views are used for illustration. The blue rectangles denote the original multi-view videos. Each shot in the summary is represented by a yellow box; clicking on a box displays the corresponding shot. Each shot in the summary is assigned a number indicating its order among the shots produced by the video parsing process. The numbers are given here for convenience of discussion. Dashed lines connect shots with strong correlations. The middle frames of a few resulting shots, which allow quick browsing of the summary, are shown here.


Dataset: (1) office1 (2) campus (3) road (4) badminton (5) office2 (6) office lobby


Comparison with Mono-view Summarization

We compare our method with a previous mono-view video summarization method. We implemented the video summarization method, together with the visual attention model, presented in [11] of our manuscript. The method was applied to each view of the multi-view office1 and campus videos. For each multi-view video, we combined the resulting shots along the timeline to form a single video summary. The single video summaries produced by the mono-view summarization method and by our algorithm can be found on the demo website. The summaries produced by the mono-view summarization method clearly contain much redundant information: there are significant temporal overlaps among the summarized shots from different views, and most events are recorded simultaneously in the summaries of all four views [Office1: Mono-view] [Campus: Mono-view] [office lobby: Mono-view]. For a fair comparison, we also used the above method to summarize the single video formed by combining the multi-view videos along the timeline, and generated a dynamic single video summary [Office1:Mono-view2] [Campus:Mono-view2] [office lobby: Mono-view2].
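
The step of combining per-view shots along the timeline, and the temporal overlap that makes the mono-view summaries redundant, can be sketched as follows. This is a minimal illustration and not the code used for the experiments: the `Shot` record, its fields, and both function names are hypothetical, shot times are assumed to lie on a shared timeline, and shots within one view are assumed not to overlap each other.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Shot:
    view: int     # camera view index (hypothetical field)
    start: float  # start time in seconds on the shared timeline
    end: float    # end time in seconds on the shared timeline

def combine_along_timeline(per_view_shots: List[List[Shot]]) -> List[Shot]:
    """Flatten the per-view summaries and order the shots by start time."""
    merged = [shot for shots in per_view_shots for shot in shots]
    return sorted(merged, key=lambda s: (s.start, s.view))

def redundant_seconds(shots: List[Shot]) -> float:
    """Total time during which two or more shots are active at once."""
    events = []
    for s in shots:
        events.append((s.start, 1))   # a shot begins
        events.append((s.end, -1))    # a shot ends
    # At equal timestamps, process ends (-1) before starts (+1).
    events.sort(key=lambda e: (e[0], e[1]))
    active, prev, total = 0, 0.0, 0.0
    for t, delta in events:
        if active >= 2:
            total += t - prev
        active += delta
        prev = t
    return total

# Made-up example data for three views (not taken from the experiments).
example = [
    [Shot(0, 0, 10), Shot(0, 20, 30)],
    [Shot(1, 5, 15)],
    [Shot(2, 8, 12), Shot(2, 25, 35)],
]
combined = combine_along_timeline(example)
print([(s.view, s.start, s.end) for s in combined])
print(redundant_seconds(combined))
```

A large overlap relative to the total summary length indicates that the same events appear in several views' summaries at once.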

With our multi-view summarization method, by contrast, such redundancy is much reduced [Office1: Multi-view] [Campus: Multi-view] [office lobby: Multi-view]. Some events are recorded by the most informative summarized shot, while the most important events are preserved in multi-view summaries. Some events that are ignored by the previous method, for instance the events recorded from the 1st to the 5th second, the 14th to the 18th second, and the 39th to the 41st second in our office1 single video summary, are preserved by our method. This results from our shot clustering algorithm and the multi-objective optimization performed on the spatio-temporal shot graph. This property of our method facilitates generating a short yet highly informative summary.

We further compare our algorithm against a graph-based summarization method. A single video is first formed by combining the multi-view videos along the timeline. We then construct the graph according to the method given in [10]. The final summary is produced using normalized-cut-based event clustering and highlight detection [14] [Office1:Graph-compared] [Campus: Graph-compared] [office lobby: Graph-compared].

SIFT matching

Email: , updated March 26th, 2010.

Copyright © Department of Computer Science & Technology, Nanjing University.  All Rights Reserved.