Handling Medium to Large Datasets in R

A detailed look at using Arrow and Parquet for large dataset management in R, with a comparison to CSV and RDS formats.
R
Benchmarking
Author

Patryk Wiśniewski

Published

December 1, 2024

Introduction

Since I took on the role of Data Coordinator within the etilab, one of the most common complaints from new Research Assistants or Interns is how frustrating SAS can be to work with. Most individuals prefer using R or Python, but many datasets within CASD are available only in SAS format. To get around this, people initially started exporting SAS data from the NAS to CSVs in their project directories within the shared virtual machine. While relatiovelty convenient, this practice has some significant drawbacks:

  • It disrupts workflow, requiring frequent pauses to convert data.
  • It drastically increases storage usage as each user stores redundant copies of datasets.

Implementing Shared Resources

To address these issues, we implemented a shared folder containing datasets already converted into a more user-friendly format. This prompted the essential question: which data format should we choose? Existing usage primarily involved RDS and CSVs, but I suggested considering modern, feature-rich formats like Parquet and Feather (v2), essentially Apache Arrow on disk. Here’s a brief comparison of these formats:

Feature Parquet Feather (v2) RDS CSV
Human Readable
Filtering on Disk
Compression 🟡 🟡
Nested Structures
Data Type Fidelity
Python Support
R Support
Stata Support
Table 1: Data formats comparison

Parquet and Feather stand out due to their feature-rich nature, particularly their capability to work with data directly from disk without loading everything into memory. Using R’s Arrow implementation, we can even utilize dplyr verbs:

paris_wage <- open_dataset("workers_panel.feather", format="arrow") |>
  filter(city == "Paris") |>
  group_by(year) |>
  summarise(average_wage = mean(wage)) |>
  collect()

This is especially beneficial with large datasets and supports simultaneous operations on multiple files. Note that Feather supports optional compression (not default).

RDS falls short primarily due to limited language support and CSV is human readable readable but lacks most of the important features of a modern file format like data type fidelity.

Benchmarking Performance

Beyond features, we also benchmarked speed and file sizes. We sampled data comprising 20 columns (12 numeric, 2 logical, 6 character columns) with 10 million rows from the DADS dataset. Read/write performance metrics were collected over 100 samples in R, clearing the session between each pass. Packages used were Arrow for Feather and Parquet, data.table/readr for CSV, and base R for RDS.

Below is a visualization of the median read/write speeds for each format (Figure 1). The bars represent median speeds, and error bars indicate the 5th to 95th percentiles. File sizes are also labeled directly on the bars.

Figure 1: Read/Write speeds of different file formats

Figure 1: Read/Write speeds of different file formats

For readability I report time up to 50s, the cliped median write speed for rdata being 125s.

We remark that rdata yields the smallest file sizes with only 620MB followed by parquet, feather and finally csv. Though maybe we could have achieved slighlty smaller sizes tweaking parquet compression settings.

In terms of raw performance, feather v2 clearly outperformed other formats, achieving a blazing fast median read time of just 0.74 seconds. Parquet was also impressive at 5.06 seconds, significantly faster than both CSV implementations. Feather’s optional compression notably reduced file sizes by around 30%, with minimal impact on speed.

Conclusions

After evaluating both features and performance, we decided on Feather v2 with zstd compression as our standard data storage format. Although RDS is convenient for R-centric workflows, its slow read/write speeds made it unsuitable to this application. Parquet offered good file size management but was not compelling enough for our moderately large datasets.

Overall, our adoption of Feather v2 has significantly improved productivity and efficiency, cutting our storage use by half and greatly enhancing the experience for new users who no longer face the SAS experience.