Safe Biking in Paris: Modern Data Engineering Pipeline

Description

This project presents a production-ready data engineering pipeline for analyzing bicycle safety in France, particularly in Île-de-France and Paris, using real-world datasets.

By setting up this data engineering and analysis pipeline, you will get:

  • Clean, analysis-ready datasets in BigQuery
  • Automated EDA reports with key insights and visualizations
  • ML-ready features for predictive modeling

Data Engineering Stack:

  • Infrastructure as Code: Terraform for reproducible cloud environments
  • Containerized Pipeline: Docker for consistent deployment across environments
  • Data Transformation: dbt with version control and built-in testing
  • Cloud-Native Architecture: Google Cloud Platform (BigQuery + GCS)

Data Quality Issues:

For those looking to reuse this dataset, the project addresses and corrects several data quality issues, including:

  • Inconsistent File Naming: some source file names are misspelled, e.g., carcteristiques-2021.csv (instead of caracteristiques-2021.csv)

  • Join Complexity: Requires choosing a granularity (e.g., one row per accident vs. per user)

  • Data Quality Fixes:

    • Mixed data types (e.g., latitude column contains strings)
    • Non-breaking spaces disrupting numeric parsing
    • Cross-year schema changes (e.g., vehicle identifiers changed format from 2020 to 2021)
    • Inconsistent time formats (e.g., 4-character time codes with invalid patterns)
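
As an illustration of the fixes listed above, here is a minimal Python sketch using only the standard library. It is not the project's actual code (the pipeline itself uses dbt). The function names, the decimal-comma handling, the zero-padding rule for short time codes, and the `Num_Acc` accident key are all assumptions made for this example.

```python
import re
from collections import Counter
from typing import Optional

# Whitespace-like characters, including the non-breaking spaces
# (U+00A0, U+202F) that make numeric parsing fail.
_SPACE_LIKE = re.compile(r"[\u00a0\u202f\s]")


def parse_coordinate(raw) -> Optional[float]:
    """Convert a latitude/longitude cell to float, or None if unparseable.

    Assumption: string cells may contain a decimal comma and stray spaces.
    """
    if raw is None:
        return None
    text = _SPACE_LIKE.sub("", str(raw)).replace(",", ".")
    try:
        return float(text)
    except ValueError:
        return None


def parse_hhmm(raw) -> Optional[str]:
    """Normalize a time code to 'HH:MM', or None if the pattern is invalid.

    Assumption: valid codes look like 'HHMM' or 'HH:MM'; shorter digit
    strings are left-padded with zeros (e.g. '930' becomes '09:30').
    """
    if raw is None:
        return None
    digits = _SPACE_LIKE.sub("", str(raw)).replace(":", "")
    if not re.fullmatch(r"[0-9]{1,4}", digits):
        return None
    digits = digits.zfill(4)
    hour, minute = int(digits[:2]), int(digits[2:])
    if hour > 23 or minute > 59:
        return None
    return f"{hour:02d}:{minute:02d}"


def users_per_accident(user_rows, key="Num_Acc"):
    """Collapse per-user rows into one count per accident.

    This is one way to pick the 'one row per accident' granularity
    before joining; `key` is a hypothetical accident identifier column.
    """
    return Counter(row[key] for row in user_rows)
```

Returning `None` for unparseable values keeps bad cells visible as nulls downstream, rather than silently coercing them to a default.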

Topic
Transport and mobility
Type
Application
Keywords
No keywords
Last updated
June 10, 2025
Created
June 10, 2025
