Pandas vs Polars
Flights Dataset
Polars Expressions
pip install polars
import pandas as pd
import polars as pl
import time
Pandas test
start = time.time()
df = pd.read_csv('flights.csv', dtype={"TAIL_NUMBER": "string", "ORIGIN_AIRPORT": "string", "DESTINATION_AIRPORT": "string"})
df = df[(df['MONTH'] == 12) & (df['ORIGIN_AIRPORT'] == 'SEA') & (df['DESTINATION_AIRPORT'] == 'DFW')]
end = time.time()
print(end - start)
df
6.601531744003296
| | YEAR | MONTH | DAY | DAY_OF_WEEK | AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | ... | ARRIVAL_TIME | ARRIVAL_DELAY | DIVERTED | CANCELLED | CANCELLATION_REASON | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5339849 | 2015 | 12 | 1 | 2 | AA | 1230 | N3KSAA | SEA | DFW | 5 | ... | 538.0 | -17.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 5339922 | 2015 | 12 | 1 | 2 | AA | 1228 | N3AMAA | SEA | DFW | 500 | ... | 1056.0 | 2.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 5341203 | 2015 | 12 | 1 | 2 | AS | 658 | N302AS | SEA | DFW | 640 | ... | 1210.0 | -10.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 5341275 | 2015 | 12 | 1 | 2 | AA | 1308 | N3BDAA | SEA | DFW | 645 | ... | 1234.0 | -4.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 5343046 | 2015 | 12 | 1 | 2 | AA | 143 | N3FCAA | SEA | DFW | 830 | ... | 1425.0 | 2.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5810434 | 2015 | 12 | 31 | 4 | AA | 1489 | N3HTAA | SEA | DFW | 1015 | ... | 1633.0 | 28.0 | 0 | 0 | NaN | 6.0 | 0.0 | 22.0 | 0.0 | 0.0 |
| 5811917 | 2015 | 12 | 31 | 4 | AA | 1402 | N3GEAA | SEA | DFW | 1155 | ... | 1743.0 | 1.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 5813075 | 2015 | 12 | 31 | 4 | AS | 662 | N469AS | SEA | DFW | 1310 | ... | 1902.0 | 12.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 5813463 | 2015 | 12 | 31 | 4 | AA | 1512 | N3DLAA | SEA | DFW | 1340 | ... | 1923.0 | -7.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 5819038 | 2015 | 12 | 31 | 4 | AS | 660 | N407AS | SEA | DFW | 2350 | ... | 536.0 | 6.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
319 rows × 31 columns
Polars test
The read_csv() method uses eager execution, which means it loads the entire dataset before performing any filtering. In that respect this code block works like the Pandas one, but it is much more efficient.
start = time.time()
df = pl.read_csv('flights.csv').filter(
(pl.col('MONTH') == 12) & (pl.col('ORIGIN_AIRPORT') == 'SEA') & (pl.col('DESTINATION_AIRPORT') == 'DFW'))
end = time.time()
print(end - start)
display(df)
0.6479246616363525
| YEAR | MONTH | DAY | DAY_OF_WEEK | AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | TAXI_OUT | WHEELS_OFF | SCHEDULED_TIME | ELAPSED_TIME | AIR_TIME | DISTANCE | WHEELS_ON | TAXI_IN | SCHEDULED_ARRIVAL | ARRIVAL_TIME | ARRIVAL_DELAY | DIVERTED | CANCELLED | CANCELLATION_REASON | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | i64 | str | i64 | str | str | str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | str | i64 | i64 | i64 | i64 | i64 |
| 2015 | 12 | 1 | 2 | "AA" | 1230 | "N3KSAA" | "SEA" | "DFW" | 5 | 9 | 4 | 10 | 19 | 230 | 209 | 194 | 1660 | 533 | 5 | 555 | 538 | -17 | 0 | 0 | null | null | null | null | null | null |
| 2015 | 12 | 1 | 2 | "AA" | 1228 | "N3AMAA" | "SEA" | "DFW" | 500 | 459 | -1 | 16 | 515 | 234 | 237 | 198 | 1660 | 1033 | 23 | 1054 | 1056 | 2 | 0 | 0 | null | null | null | null | null | null |
| 2015 | 12 | 1 | 2 | "AS" | 658 | "N302AS" | "SEA" | "DFW" | 640 | 644 | 4 | 12 | 656 | 220 | 206 | 190 | 1660 | 1206 | 4 | 1220 | 1210 | -10 | 0 | 0 | null | null | null | null | null | null |
| 2015 | 12 | 1 | 2 | "AA" | 1308 | "N3BDAA" | "SEA" | "DFW" | 645 | 640 | -5 | 24 | 704 | 233 | 234 | 197 | 1660 | 1221 | 13 | 1238 | 1234 | -4 | 0 | 0 | null | null | null | null | null | null |
| 2015 | 12 | 1 | 2 | "AA" | 143 | "N3FCAA" | "SEA" | "DFW" | 830 | 830 | 0 | 25 | 855 | 233 | 235 | 200 | 1660 | 1415 | 10 | 1423 | 1425 | 2 | 0 | 0 | null | null | null | null | null | null |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 2015 | 12 | 31 | 4 | "AA" | 1489 | "N3HTAA" | "SEA" | "DFW" | 1015 | 1037 | 22 | 21 | 1058 | 230 | 236 | 201 | 1660 | 1619 | 14 | 1605 | 1633 | 28 | 0 | 0 | null | 6 | 0 | 22 | 0 | 0 |
| 2015 | 12 | 31 | 4 | "AA" | 1402 | "N3GEAA" | "SEA" | "DFW" | 1155 | 1154 | -1 | 13 | 1207 | 227 | 229 | 207 | 1660 | 1734 | 9 | 1742 | 1743 | 1 | 0 | 0 | null | null | null | null | null | null |
| 2015 | 12 | 31 | 4 | "AS" | 662 | "N469AS" | "SEA" | "DFW" | 1310 | 1308 | -2 | 16 | 1324 | 220 | 234 | 209 | 1660 | 1853 | 9 | 1850 | 1902 | 12 | 0 | 0 | null | null | null | null | null | null |
| 2015 | 12 | 31 | 4 | "AA" | 1512 | "N3DLAA" | "SEA" | "DFW" | 1340 | 1340 | 0 | 15 | 1355 | 230 | 223 | 203 | 1660 | 1918 | 5 | 1930 | 1923 | -7 | 0 | 0 | null | null | null | null | null | null |
| 2015 | 12 | 31 | 4 | "AS" | 660 | "N407AS" | "SEA" | "DFW" | 2350 | 2352 | 2 | 11 | 3 | 220 | 224 | 206 | 1660 | 529 | 7 | 530 | 536 | 6 | 0 | 0 | null | null | null | null | null | null |
The next version replaces read_csv() with scan_csv(), which does use deferred (lazy) execution.
scan_csv() delays execution until the collect() method is called; it analyzes all the intermediate operations and tries to optimize the whole query.
start = time.time()
df = pl.scan_csv('flights.csv').filter(
(pl.col('MONTH') == 11) & (pl.col('ORIGIN_AIRPORT') == 'SEA') & (pl.col('DESTINATION_AIRPORT') == 'DFW')).collect()
end = time.time()
print(end - start)
display(df)
0.7065532207489014
| YEAR | MONTH | DAY | DAY_OF_WEEK | AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | TAXI_OUT | WHEELS_OFF | SCHEDULED_TIME | ELAPSED_TIME | AIR_TIME | DISTANCE | WHEELS_ON | TAXI_IN | SCHEDULED_ARRIVAL | ARRIVAL_TIME | ARRIVAL_DELAY | DIVERTED | CANCELLED | CANCELLATION_REASON | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | i64 | str | i64 | str | str | str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | str | i64 | i64 | i64 | i64 | i64 |
| 2015 | 11 | 1 | 7 | "AS" | 658 | "N307AS" | "SEA" | "DFW" | 640 | 640 | 0 | 16 | 656 | 220 | 223 | 195 | 1660 | 1211 | 12 | 1220 | 1223 | 3 | 0 | 0 | null | null | null | null | null | null |
| 2015 | 11 | 1 | 7 | "AA" | 1308 | "N3CSAA" | "SEA" | "DFW" | 645 | 643 | -2 | 15 | 658 | 238 | 211 | 189 | 1660 | 1207 | 7 | 1243 | 1214 | -29 | 0 | 0 | null | null | null | null | null | null |
| 2015 | 11 | 1 | 7 | "AA" | 143 | "N3JXAA" | "SEA" | "DFW" | 828 | 826 | -2 | 12 | 838 | 237 | 204 | 188 | 1660 | 1346 | 4 | 1425 | 1350 | -35 | 0 | 0 | null | null | null | null | null | null |
| 2015 | 11 | 1 | 7 | "AA" | 1489 | "N3LRAA" | "SEA" | "DFW" | 1008 | 1742 | 454 | 16 | 1758 | 242 | 204 | 183 | 1660 | 2301 | 5 | 1610 | 2306 | 416 | 0 | 0 | null | 0 | 0 | 0 | 416 | 0 |
| 2015 | 11 | 1 | 7 | "AA" | 1402 | "N3BWAA" | "SEA" | "DFW" | 1152 | 1237 | 45 | 13 | 1250 | 236 | 207 | 184 | 1660 | 1754 | 10 | 1748 | 1804 | 16 | 0 | 0 | null | 0 | 0 | 5 | 11 | 0 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 2015 | 11 | 30 | 1 | "AA" | 1489 | "N3DCAA" | "SEA" | "DFW" | 1015 | 1012 | -3 | 15 | 1027 | 230 | 227 | 201 | 1660 | 1548 | 11 | 1605 | 1559 | -6 | 0 | 0 | null | null | null | null | null | null |
| 2015 | 11 | 30 | 1 | "AA" | 1402 | "N3AXAA" | "SEA" | "DFW" | 1155 | 1513 | 198 | 14 | 1527 | 227 | 228 | 202 | 1660 | 2049 | 12 | 1742 | 2101 | 199 | 0 | 0 | null | 1 | 0 | 9 | 189 | 0 |
| 2015 | 11 | 30 | 1 | "AS" | 662 | "N487AS" | "SEA" | "DFW" | 1305 | 1301 | -4 | 12 | 1313 | 225 | 223 | 198 | 1660 | 1831 | 13 | 1850 | 1844 | -6 | 0 | 0 | null | null | null | null | null | null |
| 2015 | 11 | 30 | 1 | "AA" | 1512 | "N3GDAA" | "SEA" | "DFW" | 1340 | 1500 | 80 | 18 | 1518 | 230 | 227 | 195 | 1660 | 2033 | 14 | 1930 | 2047 | 77 | 0 | 0 | null | 0 | 0 | 76 | 1 | 0 |
| 2015 | 11 | 30 | 1 | "AA" | 2310 | "N3NAAA" | "SEA" | "DFW" | 1521 | 1518 | -3 | 13 | 1531 | 224 | 222 | 200 | 1660 | 2051 | 9 | 2105 | 2100 | -5 | 0 | 0 | null | null | null | null | null | null |
Parquet format
The real power of lazy evaluation shows as datasets grow, especially when they are stored in a columnar format. Apache Parquet is a columnar storage file format optimized for massive Big Data processing scenarios. Unlike row-based storage formats such as CSV, Parquet stores data by column, which makes it very efficient for read-heavy workloads, especially on large datasets.
This columnar layout allows better compression and encoding, since values of similar types are stored together. For example, if you have a dataset with millions of rows and only need to query a few columns, Parquet lets you read just the relevant columns (projection pushdown), which significantly reduces I/O and improves performance. Parquet also supports complex nested data structures, which makes it well suited for semi-structured data such as JSON.
Example: search-tree pruning (predicate pushdown) in Polars
lf = pl.scan_parquet("events/*.parquet")
fast = (
lf.filter(pl.col("country") == "IN")
.filter(pl.col("ts") >= pl.datetime(2025, 9, 1))
.select(["ts", "country", "path"])
.collect()
)
Example with partitioned data (Polars Hive Partitioning)
NYC Taxi Dataset
The dataset is 325 MB. Scanning multiple files with lazy evaluation.
TLC Trip Record Data
# Plan the query (lazy: nothing is read yet)
q1 = (
pl.scan_parquet("Parquet/**/*.parquet")
.filter(pl.col("passenger_count") > 0) # Filter
.with_columns([
# Derive date/time features
(pl.col("tpep_dropoff_datetime") - pl.col("tpep_pickup_datetime"))
.dt.total_minutes()
.alias("trip_duration_minutes")
])
.group_by("passenger_count")
.agg([
pl.mean("trip_duration_minutes"),
pl.mean("total_amount").alias("avg_fare")
])
.sort("passenger_count")
)
# Execute the query
r1 = q1.collect()
print(r1)
| passenger_count | trip_duration_minutes | avg_fare |
|---|---|---|
| i64 | f64 | f64 |
| 1 | 15.547899 | 27.008371 |
| 2 | 17.243722 | 30.352485 |
| 3 | 16.934745 | 29.482887 |
| 4 | 17.764046 | 32.436805 |
| 5 | 20.555213 | 26.760919 |
| 6 | 23.465055 | 26.841099 |
| 7 | 18.3 | 83.639 |
| 8 | 9.911765 | 102.342059 |
| 9 | 11.666667 | 62.724444 |
Structured partitioning
If the partitioning structure is known and matches the most frequent query patterns, pruning improves response times dramatically, because the scan focuses only on the partitions that contain the data.
lf = pl.scan_parquet("path/YYYY=2025/MM=10/*/*.parquet")
today = lf.filter(pl.col("DD") == "08").collect()
Joining data in different formats: Parquet and CSV
The following example includes a join, showing Polars' ability to analyze datasets stored in different formats.
q2 = (
pl.scan_parquet("Parquet/*/*/*.parquet")
.join(pl.scan_csv("taxi_zone_lookup.csv"), left_on="PULocationID", right_on="LocationID")
.filter(pl.col("total_amount") > 25)
.group_by("Zone")
.agg(
(pl.col("total_amount") /
(pl.col("tpep_dropoff_datetime") - pl.col("tpep_pickup_datetime")).dt.total_minutes()
).mean().alias("cost_per_minute")
    ).sort("cost_per_minute", descending=True)
)
# Collect the results (executes the query)
r2 = q2.collect()
print(r2)
| Zone | cost_per_minute |
|---|---|
| str | f64 |
| Co-Op City | inf |
| Whitestone | inf |
| UN/Turtle Bay South | inf |
| Eltingville/Annadale/Prince's … | inf |
| Pelham Bay Park | inf |
| … | … |
| New Dorp/Midland Beach | 3.958707 |
| Governor's Island/Ellis Island… | 3.238929 |
| Rikers Island | 2.802784 |
| Green-Wood Cemetery | 1.832354 |
| Crotona Park | 1.514066 |

The `inf` values come from trips whose recorded duration truncates to zero: `dt.total_minutes()` returns 0 for trips shorter than one minute, so the division produces infinity.
Converting data between formats: CSV and Parquet
The following example shows a CSV → Parquet transformation with two essential characteristics:
- It needs no intermediate storage at all
- Data types are explicit (minimizing the problems typical of migration/integration processes)
lf = pl.scan_csv("raw/*.csv", dtypes={"ts": pl.Datetime, "user_id": pl.Int64})
(
    lf.filter(pl.col("ts") >= pl.datetime(2025, 10, 1))
    .with_columns(pl.col("path").str.slice(0, 64).alias("path_64"))
    .sink_parquet("curated/2025-10.parquet")
)