Lazy Fat Pandas and SCIRPy
Pandas is widely used in data science and analytics for its simplicity, but it struggles with large datasets that exceed available memory. Existing scalable frameworks such as Dask, Modin, and Pandas on Spark require users to rewrite their code, which makes adoption difficult. To address this, we present Lazy Fat Pandas (LaFP), an optimization framework that enables seamless scalability while preserving the familiar Pandas API. LaFP combines static program analysis with lazy evaluation to optimize memory usage and execution time. With minimal code modifications, users can leverage multiple backend engines (Pandas, Dask, Modin, and Pandas on Spark). Performance evaluations demonstrate that LaFP not only outperforms Pandas but also delivers significant improvements over direct use of the scalable frameworks.
LaFP comprises two modules: SCIRPy, a rewriter that applies static optimizations to restructure Pandas programs, and a lazy-evaluation-based runtime API. At runtime, LaFP dynamically builds a task graph that represents the dataframe operations, optimizes that graph, and then executes it on the chosen backend.
Publications
- Efficient Dataframe Systems: Lazy Fat Pandas on a Diet
Bhushan Pal Singh, P Kumar, C Bhattacharya, S Sudarshan. arXiv preprint arXiv:2501.08207, Jan 2025
- Optimizing Data Science Applications using Static Analysis
Bhushan Pal Singh, Mudra Sahu, S. Sudarshan. DBPL 2021: 23-27
Talks and Posters
- Efficient Dataframe Systems: Lazy Fat Pandas on a Diet, talk (2025)
People
- S Sudarshan, IIT Bombay
- Bhushan Singh, ISRO and IIT Bombay
- Priyesh Kumar, IIT Bombay (currently at Dream 11)
- Chiranmoy Bhattacharya, IIT Bombay (currently at Fujitsu India)
- Pranab Paul, IIT Bombay