For Fuck's Sake Do Not Use toPandas

13 Jul, 2022

This rant / opinion is strictly on the point of not bringing data to driver when literally not needed.

What is it running?

toPandas() runs collect().
collect() transfers all the data to the driver node and is expensive.
It is an action that is used to retrieve all the elements of the dataframe (from all executors) to the driver. Retrieving larger datasets results in OutOfMemory error.

To put it more simply, if you are getting a big dataframe that you usually would not get it loaded in memory, do expect to get memory issues.

toPandas & error is unavoidable?

Simply fetching small data (by limiting etc) will not give you the error. If you cannot get rid of toPandas / collect & the error that comes with it & it serves some purpose then it will always be a bottleneck in your code and the only direct way would be to scale-up the driver node.

Spark is optimized for distributing datasets across a cluster and running computations in parallel. The distributed nature of Spark makes computations that collect results on a single node slow.

Don't figure out multiprocessing and distributed yourself when using spark. Use spark instead.

There's also another excellent - article on explaining what toPandas does and ways to improve it https://bryancutler.github.io/toPandas/

ok, but the whole point is I want to use Pandas, how do I use it for a spark DF ?

so what you do instead is, write to a temp location via spark
read this temp location from spark, do repartition(1) and write it to another temp location (this will give u 1 csv)
via gsfuse/gsutil/azcopy or whatever cloud of your choice's relevant tool, bring this file to local, and read this file from pandas from local instead of converting to pandas