- L. Xie, S.-F. Chang, A. Divakaran, H. Sun, Unsupervised discovery of multilevel statistical video structures using hierarchical hidden Markov models, in: Multimedia and Expo, 2003. ICME '03. Proceedings. 2003 International Conference on, Vol. 3, 2003, pp. III-29-32.
- G. Xu, Y.-F. Ma, H.-J. Zhang, S.-Q. Yang, An HMM-based framework for video semantic analysis, Circuits and Systems for Video Technology, IEEE Transactions on 15 (11) (2005) 1422-1433. doi:10.1109/TCSVT.2005.856903.
- J. C. Niebles, H. Wang, L. Fei-Fei, Unsupervised learning of human action categories using spatial-temporal words, International Journal of Computer Vision 79 (3) (2008) 299-318. doi:10.1007/s11263-007-0122-4.
- S.-F. Wong, T.-K. Kim, R. Cipolla, Learning motion categories using both semantic and structural information, in: Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, 2007, pp. 1-6.
- I. Laptev, On space-time interest points, International Journal of Computer Vision 64 (2-3) (2005) 107-123. doi:10.1007/s11263-005-1838-7.
- S.-F. Wong, R. Cipolla, Extracting spatiotemporal interest points using global information, in: Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, 2007, pp. 1-8.
- J. Jeon, V. Lavrenko, R. Manmatha, Automatic image annotation and retrieval using cross-media relevance models, in: SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, New York, NY, USA, 2003, pp. 119-126.
- H. Tong, J. He, M. Li, C. Zhang, W.-Y. Ma, Graph based multi-modality learning, in: MULTIMEDIA '05: Proceedings of the 13th annual ACM international conference on Multimedia, ACM, New York, NY, USA, 2005, pp. 862-871.
- S.-F. Chang, W.-Y. Ma, A. Smeulders, Recent advances and challenges of semantic image/video search, in: Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, Vol. 4, 2007, pp. IV-1205-IV-1208.