paris_wage <- open_dataset("workers_panel.feather", format = "arrow") |>
  filter(city == "Paris") |>
  group_by(year) |>
  summarise(average_wage = mean(wage)) |>
  collect()

Introduction
Since I took on the role of Data Coordinator at the etilab, one of the most common complaints from new Research Assistants and Interns has been how frustrating SAS is to work with. Most people prefer R or Python, but many datasets within CASD are available only in SAS format. To get around this, people initially started exporting SAS data from the NAS to CSVs in their project directories on the shared virtual machine. While relatively convenient, this practice has significant drawbacks:
- It disrupts workflow, requiring frequent pauses to convert data.
- It drastically increases storage usage as each user stores redundant copies of datasets.
Benchmarking Performance
Beyond features, we also benchmarked speed and file sizes. We sampled 10 million rows with 20 columns (12 numeric, 2 logical, 6 character) from the DADS dataset. Read/write performance metrics were collected over 100 samples in R, clearing the session between each pass. The packages used were arrow for Feather and Parquet, data.table and readr for CSV, and base R for RDS.
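A single pass of the read benchmark can be sketched as below. This is a simplified illustration, not the actual harness (which also measured writes and cleared the session between passes), and the file names are illustrative:

```r
library(arrow)       # read_feather(), read_parquet()
library(data.table)  # fread()

# Wall-clock seconds for evaluating an expression once
elapsed <- function(expr) system.time(expr)[["elapsed"]]

# Time one read of the same sample stored in each candidate format
times <- c(
  feather = elapsed(read_feather("sample.feather")),
  parquet = elapsed(read_parquet("sample.parquet")),
  csv     = elapsed(fread("sample.csv")),
  rds     = elapsed(readRDS("sample.rds"))
)
```

Repeating this over many passes and taking medians with percentile error bars gives the figures reported below.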
Below is a visualization of the median read/write speeds for each format (Figure 1). The bars represent median speeds, and error bars indicate the 5th to 95th percentiles. File sizes are also labeled directly on the bars.
For readability, the time axis is capped at 50s; the clipped median write time for RDS was 125s.
We note that RDS yields the smallest files at only 620MB, followed by Parquet, Feather, and finally CSV, though we might have achieved slightly smaller sizes by tweaking Parquet's compression settings.
In terms of raw performance, Feather v2 clearly outperformed the other formats, achieving a blazingly fast median read time of just 0.74 seconds. Parquet was also impressive at 5.06 seconds, significantly faster than both CSV implementations. Feather's optional compression reduced file sizes by around 30% with minimal impact on speed.
Conclusions
After evaluating both features and performance, we settled on Feather v2 with zstd compression as our standard data storage format. Although RDS is convenient for R-centric workflows, its slow read/write speeds made it unsuitable for this application. Parquet offered good file size management but was not compelling enough for our moderately large datasets.
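Under this setup, the one-time conversion from SAS to compressed Feather looks roughly like the sketch below. The file paths and object names are illustrative:

```r
library(haven)  # read_sas()
library(arrow)  # write_feather()

# Read the SAS source once, then store it as zstd-compressed Feather v2
dads <- read_sas("dads_postes.sas7bdat")
write_feather(dads, "dads_postes.feather", compression = "zstd")
```

After this step, analyses read the Feather file directly (optionally pushing filters down via open_dataset() before collecting into memory), with no per-user CSV copies.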
Overall, our adoption of Feather v2 has significantly improved productivity and efficiency, cutting our storage use by half and greatly improving the experience for new users, who no longer have to work in SAS.