Bollywood Movie Dataset


This webpage contains transcripts of 18 Bollywood movies. Together they form a Hindi-English code-mixed conversational dataset.

All the data available on this website must be used for non-commercial and research purposes only.


Dataset Description

Each movie's transcript is divided into scenes, and each scene comprises a varying number of dialogues. The scripts were scraped from a blog and then processed with simple regular expressions to extract characters, dialogues, and scenes. Click here to download the dataset.


Format Description

Each example in the provided dataset is a scene from a movie. Each scene is described by 5 key-value pairs, stored in Final_Key.json in the dataset: ID, Movie, Scene, Dialogues, and Conversation.

An example scene, with its metadata, can be found below:

ID : "850"
Movie : "M_8"
Scene : "S_34"
Dialogues : "D_236-D_238"
Conversation :
DEEPAK
Pepsi laao guru ...
WORKER
Pepsi nahi hai ..
DEEPAK
Arre jo hai laao ! Badi bottle.
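
The fields shown in the example can be unpacked with a few lines of Python. This is a minimal sketch, not the authors' own tooling. It assumes that the Conversation value is a single string in which each speaker name sits on its own upper-case line followed by that speaker's utterance, and that Dialogues always has the form "D_<first>-D_<last>". The top-level layout of Final_Key.json is not documented here, so load_scenes simply returns whatever the file contains. The names load_scenes, dialogue_range, and split_turns are hypothetical helpers introduced for illustration.

```python
import json
import re

# Hypothetical record mirroring the example scene above. The exact JSON
# layout of Final_Key.json is an assumption.
scene = {
    "ID": "850",
    "Movie": "M_8",
    "Scene": "S_34",
    "Dialogues": "D_236-D_238",
    "Conversation": (
        "DEEPAK\nPepsi laao guru ...\n"
        "WORKER\nPepsi nahi hai ..\n"
        "DEEPAK\nArre jo hai laao ! Badi bottle."
    ),
}


def load_scenes(path="Final_Key.json"):
    """Read the key file; its top-level structure is not assumed here."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def dialogue_range(field):
    """Turn 'D_236-D_238' into the inclusive pair (236, 238)."""
    match = re.fullmatch(r"D_(\d+)-D_(\d+)", field)
    if match is None:
        raise ValueError(f"unexpected Dialogues value: {field!r}")
    return int(match.group(1)), int(match.group(2))


def split_turns(conversation):
    """Split a conversation into (speaker, utterance) pairs.

    Assumes speaker names are upper-case lines of their own, as in the
    example. Blank lines are skipped.
    """
    turns = []
    speaker = None
    for line in conversation.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.isupper():
            speaker = line
        elif speaker is not None:
            turns.append((speaker, line))
    return turns


first, last = dialogue_range(scene["Dialogues"])
for speaker, utterance in split_turns(scene["Conversation"]):
    print(f"{speaker}: {utterance}")
```

For the example scene, the dialogue range covers three dialogue IDs, which matches the three turns recovered from the conversation. The upper-case test is a heuristic: an utterance written entirely in capitals would be mistaken for a speaker name, so it should be checked against the actual data before being relied on.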