Cricket is one of the most popular sports in the world after soccer. Played in more than a dozen countries, it is followed by over a billion people. One would therefore expect, following trends in other global sports, meaningful analysis and mining of cricket videos. There has been some interesting work in this area (for example, [1, 2]). However, by and large, the amount of computer vision research does not seem to be commensurate with the interest and the revenue in the game. Possible reasons could be the complex nature of the game and the variety of camera views, as compared to games such as tennis and soccer. Further, the long duration of the game may inhibit the use of inefficient algorithms. Segmenting a video into its meaningful units is very useful for the structure analysis of the video, and in applications such as content-based retrieval. One meaningful unit corresponds to the semantic notion of ``deliveries'' or ``balls'' (virtually all cricket games are made up of 6-ball overs).


Fig. 1. Typical views in the game of cricket (a) Pitch View (b) Ground View (c) Non-Field View. Note that the ground view may include non-trivial portions of the pitch.

Indeed, the problem of segmenting a cricket video into meaningful scenes is addressed in [1]. Specifically, the method uses the manual commentaries available for cricket videos to segment a video into its constituent balls. Once segmented, the video is annotated with the text for higher-level content access. A hierarchical framework and algorithms for cricket event detection and classification are proposed in [2]. The authors use a hierarchy of classifiers to detect the various views present in the game of cricket, including real-time, replay, field view, non-field view, pitch view, long view, boundary view, close-up, and crowd views.
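To make the idea of hierarchical view classification concrete, the following is a minimal toy sketch, not the method of [2]: a first level separates field from non-field frames by the fraction of green-ish pixels, and a second level separates pitch views from ground views by brown-ish pixels in the central band (cf. Fig. 1). All colour ranges and thresholds here are assumptions for illustration.

```python
import numpy as np

def classify_view(frame):
    """Toy two-level view classifier (illustrative only; not the method of [2]).

    Level 1: field vs non-field, by the fraction of green-ish pixels.
    Level 2: within field views, pitch vs ground, by the fraction of
    brown-ish pixels in a central vertical band.
    `frame` is an (H, W, 3) uint8 RGB array; thresholds are guesses.
    """
    rgb = frame.astype(int)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    green = (g > r) & (g > b) & (g > 80)
    if green.mean() < 0.3:            # level 1: little grass -> non-field view
        return "non-field"
    h, w, _ = frame.shape
    band = rgb[:, w // 3: 2 * w // 3]  # central band, where the pitch appears
    br, bg, bb = band[..., 0], band[..., 1], band[..., 2]
    brown = (br > 120) & (bg > 80) & (bb < 120) & (br >= bg) & (bg > bb)
    if brown.mean() > 0.2:            # level 2: pitch strip dominates centre
        return "pitch"
    return "ground"
```

A real system would replace these colour heuristics with trained classifiers at each level, but the cascaded decision structure is the same.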

Despite the interesting methods in these works, useful in their own right, there are some challenges. The authors in [2] address only view classification, without addressing the problem of video segmentation. Our work closely resembles the method in [1], and is inspired by it from a functional point of view. It differs dramatically, however, in the complexity of the solution and in the basis for the temporal segmentation. Specifically, as the title of [1] indicates, that work is text-driven. Our work operates purely in the image domain, and demonstrates that a segmentation can be obtained in many situations using a simpler modeling framework. Further, we identify the key players of each ball to help in indexing the match.
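As a hypothetical sketch of how an image-domain segmentation into balls might proceed, suppose each frame has already been assigned a view label (e.g. by a view classifier). If one assumes, as a simplification, that every delivery begins when the broadcast cuts to the pitch view for the bowler's run-up, the temporal segmentation reduces to finding transitions into the pitch view:

```python
def segment_balls(view_labels):
    """Hypothetical ball segmentation from a per-frame view-label sequence.

    Assumption (a simplification, not the paper's exact procedure): each
    delivery starts when the broadcast cuts to the pitch view. Returns
    (start, end) frame-index pairs, one per detected ball, with each
    segment running until the next ball begins.
    """
    padded = ["*"] + view_labels                 # sentinel before frame 0
    starts = [i for i, (prev, cur) in enumerate(zip(padded, view_labels))
              if cur == "pitch" and prev != "pitch"]
    ends = starts[1:] + [len(view_labels)]
    return list(zip(starts, ends))
```

For example, the label sequence `["ground", "pitch", "pitch", "ground", "non-field", "pitch", "ground"]` yields two segments, starting at frames 1 and 5. Robust systems would additionally smooth the label sequence and handle replays, which this sketch ignores.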