Talk

Datanomy: Understanding the Anatomy of Arrow and Parquet

Thursday, May 28

16:15 - 16:45
Room: Tortellini
Language: English
Audience level: Intermediate
Elevator pitch

Apache Arrow and Apache Parquet are the de facto standards for columnar data in memory and on disk, yet their internals are not widely understood. This talk aims to demystify them by presenting the formats in a simpler, visual way.

Abstract

Apache Arrow and Apache Parquet are the de facto standards for columnar data: Arrow as the in-memory format and Parquet as the file format. We use them in our data workflows daily, sometimes without even noticing. This talk will demystify these formats by going over their specifications and showing some real-world examples of how the data actually looks on our systems.

For Arrow, we will briefly explore some of the included batteries for exchanging data, such as Arrow IPC, Arrow Flight, and ADBC (Arrow Database Connectivity), and how they relate to the core in-memory format.

The main libraries shown will be PyArrow, with its Arrow and Parquet implementations, and datanomy, a new tool created for visualizing data formats.

Tags: Data Engineering, Data Science & Data Visualisation
Participant

Raúl Cumplido

I started working with Python in 2008, with Python 2.5, and it has been my language of choice ever since. I have been involved in the Spanish Python community as one of the co-founders of the Python Spanish Association. I have also helped organize EuroPython in Bilbao, several editions of PyCon ES (Spain), and the Barcelona meetup. A couple of years ago I started working on Apache Arrow, and I have since become a committer and a PMC member. I want to share what we have done and what we are doing in the project.