Polars Joins
SuperPower Polars SQL Join

Dataset Eurotrains
Consulta CSV en archivo local¶
pip install polars connectorx
import polars as pl
DefiniciΓ³n explΓcita del esquema de los datos externos¶
Esta lectura podrΓa parecer anΓ‘loga al ejemplo de escritura en la tabla del laboratorio anterior pero entre ambas hay dos diferencias esenciales:
- El esquema de los datos se define de manera explΓcita
- La lectura de datos es diferida ya que scan_csv()* no se ejecuta hasta la llamada a *collect() en la penΓΊltima lΓnea
# DefiniciΓ³n de metadatos
schema = {
"StopCode": pl.String,
"StopName": pl.String,
"Country": pl.String,
"StopLat": pl.Float64,
"StopLng": pl.Float64,
}
# DeclaraciΓ³n de la operaciΓ³n
lf = pl.scan_csv(
"Eurostar/stations.csv",
schema=schema,
ignore_errors=False,
null_values=["", "NA", "null"]
)
# EjecuciΓ³n
df = lf.select(["StopName", "StopCode", "StopLat", "StopLng"]).collect()
print(df)
βββββββββββββββββββββββββ¬βββββββββββ¬ββββββββββββ¬βββββββββββ
β StopName β StopCode β StopLat β StopLng β
β --- β --- β --- β --- β
β str β str β f64 β f64 β
βββββββββββββββββββββββββͺβββββββββββͺββββββββββββͺβββββββββββ‘
β 's-Hertogenbosch β HT β 51.69048 β 5.29362 β
β 's-Hertogenbosch Oost β HTO β 51.700554 β 5.318333 β
β 't Harde β HDE β 52.409168 β 5.893611 β
β Aachen Hbf β AHBF β 50.7678 β 6.091499 β
β Aachen West β AW β 50.78036 β 6.070715 β
β β¦ β β¦ β β¦ β β¦ β
β Praha hl.n. β PRAHA β 50.085 β 14.441 β
β Praha-Holesovice β PRAHOL β 50.11 β 14.439 β
β Baden-Baden β RBB β 48.7905 β 8.1908 β
β SSN Stoomdepot β RTNG β 51.9363 β 4.5098 β
β Aschaffenburg Hbf β NAH β 49.98028 β 9.14361 β
βββββββββββββββββββββββββ΄βββββββββββ΄ββββββββββββ΄βββββββββββ
MΓ‘s Info Polars Lazy Eval
SQL Join con datos externos a la base de datos¶
WebDAV (Web-based Distributed Authoring and Versioning)
Conjunto de extensiones del protocolo HTTP que permite gestionar, editar, compartir y transferir archivos de forma remota y colaborativa desde servidores web (alojados en la web). ActΓΊa como una unidad de red que convierte un servidor remoto en un disco duro local accesible desde cualquier lugar y sistema operativo (Windows, MacOS, Linux, etc.)
MΓ‘s info: π Utilizar WevDAV en Linux
Join entre Parquet y CSV alojados en la Web¶
Q1: Horarios de salida en orden temporal creciente de los servicios Eurostar en cada estaciΓ³n la semana n (13)
schema = {
"StopCode": pl.String,
"StopName": pl.String,
"Country": pl.String,
"StopLat": pl.Float64,
"StopLng": pl.Float64,
}
# WebDAV UPMdrive
stations_csv = "https://drive.upm.es/public.php/dav/files/X99My74a3LKyAK4"
# Usar scan_csv para lectura perezosa
lazy_stations = pl.scan_csv(
stations_csv,
schema=schema,
ignore_errors=False,
null_values=["", "NA", "null"]
)
# WebDAV UPMdrive
eurotrain_parquet = "https://drive.upm.es/public.php/dav/files/XTzftsPGJMARfDj"
# Usar scan_parquet para lectura perezosa
join = (
pl.scan_parquet(eurotrain_parquet)
.filter((pl.col("ServiceType") == 'Eurostar') & ((pl.col("ns") == 13)) & (pl.col("StopDepartureTime").is_not_null()))
.join(lazy_stations, left_on="StopCode", right_on="StopCode").filter(pl.col("StopName") == 'Rotterdam Centraal')
.select(["ServiceDate","ServiceID","StopArrivalTime","StopDepartureTime"])
.sort("ServiceID","StopArrivalTime","StopDepartureTime")
)
# Collect (ejecutar la consulta)
res = join.collect()
print(res)
βββββββββββββββ¬ββββββββββββ¬ββββββββββββββββββ¬ββββββββββββββββββββ
β ServiceDate β ServiceID β StopArrivalTime β StopDepartureTime β
β --- β --- β --- β --- β
β str β str β time β time β
βββββββββββββββͺββββββββββββͺββββββββββββββββββͺββββββββββββββββββββ‘
β 25-03-24 β 13224525 β 06:55:00 β 06:58:00 β
β 25-03-24 β 13224733 β 07:55:00 β 07:58:00 β
β 25-03-24 β 13224820 β 10:02:00 β 10:04:00 β
β 25-03-24 β 13224865 β 08:25:00 β 08:28:00 β
β 25-03-24 β 13224878 β 11:32:00 β 11:34:00 β
β β¦ β β¦ β β¦ β β¦ β
β 25-03-30 β 13265290 β 16:24:00 β 16:28:00 β
β 25-03-30 β 13265556 β 19:02:00 β 19:05:00 β
β 25-03-30 β 13266081 β 20:32:00 β 20:35:00 β
β 25-03-30 β 13266299 β 18:54:00 β 18:58:00 β
β 25-03-30 β 13266369 β 21:02:00 β 21:05:00 β
βββββββββββββββ΄ββββββββββββ΄ββββββββββββββββββ΄ββββββββββββββββββββ
Join entre tabla MariaDB y CSV alojado en la Web¶
Q2: Itinerario (secuencia de estaciones en orden temporal creciente) de un servicio concreto (13224525)
schema = {
"StopCode": pl.String,
"StopName": pl.String,
"Country": pl.String,
"StopLat": pl.Float64,
"StopLng": pl.Float64,
}
# URL archivo. Lee directamente desde la Web
stations_csv = "https://drive.upm.es/public.php/dav/files/X99My74a3LKyAK4"
# Usar scan_csv para lectura perezosa
lazy_stations = pl.scan_csv(
stations_csv,
schema=schema,
ignore_errors=False,
null_values=["", "NA", "null"]
)
# ConexiΓ³n con la base de datos
uri = "mysql://eurostar:eurostar@mariadb:3306/eurostar"
sql = """
SELECT tts, StopCode, ServiceDate, ServiceType, ServiceID, StopArrivalTime, StopDepartureTime
FROM eurostar.EuroTrain
"""
# Usar read_database_uri para acceder a la tabla
lazy_join = (
pl.read_database_uri(query=sql, uri=uri).lazy()
.filter(pl.col("ServiceID") == '13224525')
.join(lazy_stations, left_on="StopCode", right_on="StopCode")
.sort("tts")
.select(["StopName","ServiceDate","ServiceType","StopArrivalTime","StopDepartureTime"])
)
# Invocar collect provoca la ejecuciΓ³n de lazy_stations y lazy_join
res = lazy_join.collect()
print(res)
βββββββββββββββββββββββββββββ¬ββββββββββββββ¬ββββββββββββββ¬ββββββββββββββββββ¬ββββββββββββββββββββ
β StopName β ServiceDate β ServiceType β StopArrivalTime β StopDepartureTime β
β --- β --- β --- β --- β --- β
β str β str β str β time β time β
βββββββββββββββββββββββββββββͺββββββββββββββͺββββββββββββββͺββββββββββββββββββͺββββββββββββββββββββ‘
β Amsterdam Centraal β 25-03-24 β Eurostar β null β 05:48:00 β
β Rotterdam Centraal β 25-03-24 β Eurostar β 06:55:00 β 06:58:00 β
β Antwerpen-Centraal β 25-03-24 β Eurostar β 07:30:00 β 07:33:00 β
β Brussel-Zuid Midi β 25-03-24 β Eurostar β 08:08:00 β 08:20:00 β
β Airport Charles de Gaulle β 25-03-24 β Eurostar β 09:38:00 β 09:43:00 β
β Marne-la-VallΓ©e-Chessy β 25-03-24 β Eurostar β 09:53:00 β null β
βββββββββββββββββββββββββββββ΄ββββββββββββββ΄ββββββββββββββ΄ββββββββββββββββββ΄ββββββββββββββββββββ