
ETL Pipeline Project: NYC Taxi Data

Designed and implemented an end-to-end ETL pipeline for New York City taxi data using Python, Docker, and PostgreSQL.


Introduction

I’ve always had a deep appreciation for structure and cleanliness — not just in the physical world, but especially in data. During my final thesis, I developed an NLP-based application and had to deal with messy, unstructured text from various sources. Surprisingly, instead of feeling overwhelmed, I truly enjoyed the process of cleaning, transforming, and organizing the raw data into something usable and reliable. That experience sparked my passion for data engineering. There’s a unique satisfaction in turning chaos into structure, and ever since, I’ve been determined to pursue this path. This project marks the beginning of my journey to strengthen my foundation as a Data Engineer. Through every step of the process, I hope to keep learning, growing, and building better data systems.

Project Overview

This project demonstrates a complete ETL (Extract, Transform, Load) pipeline using Python, PostgreSQL, and Docker to process the New York City Taxi Trip dataset. The main goal is to extract raw data from a public URL, perform the necessary transformations to clean the data, load the processed data into a data warehouse, and visualize it. I used GitHub Codespaces as my primary development environment, which provided a fast and consistent setup, especially for working with Docker and PostgreSQL in a cloud-based workspace.

Key components of this project:

  1. Data Extraction: Downloading raw CSV and Parquet data from the NYC Taxi public dataset using Python.
  2. Data Transformation: Cleaning the data and preparing it for loading.
  3. Data Loading: Saving the transformed data into PostgreSQL.
  4. Containerization: Wrapping and executing all steps in a Dockerized environment to ensure reproducibility and isolation.
  5. Data Visualization: Connecting Power BI to the PostgreSQL database to create interactive dashboards and visual insights.

The diagram below shows the workflow of this project.
[Image: project workflow diagram]

Dataset

For this project, I worked with the Yellow and Green Taxi Trip datasets for January 2025 and the Taxi Zone lookup dataset, available as public Parquet and CSV files from the New York City Taxi and Limousine Commission (TLC) website. The datasets contain millions of rows of real-world taxi trip data. The dataset schemas can also be viewed on these pages: Yellow and Green.

Tech Stack and Tools

  1. GitHub Codespaces (for cloud-based development)
  2. Python
  3. Docker
  4. PostgreSQL & PgAdmin 4
  5. Power BI

Project Workflow

Here’s a breakdown of how I approached this project—from getting messy taxi data to turning it into clean insights.

1. Deploying PostgreSQL database and PgAdmin 4

In this project, I used Docker to deploy a PostgreSQL database and PgAdmin 4. PostgreSQL serves as the data warehouse for storing the cleaned taxi trip data, while PgAdmin 4 provides a visual interface for interacting with the database. Here’s the docker-compose configuration I used:

services:
  postgres:
    image: postgres:13
    container_name: nyc_postgres
    environment:
      - POSTGRES_USER=*****
      - POSTGRES_PASSWORD=********
      - POSTGRES_DB=ny_taxi
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - project1-network
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U admin -d ny_taxi"]
      interval: 1m30s
      timeout: 30s
      retries: 5
      start_period: 30s

  pgadmin:
    image: dpage/pgadmin4
    container_name: nyc_pgadmin
    environment: 
      - PGADMIN_DEFAULT_EMAIL=****@****.com
      - PGADMIN_DEFAULT_PASSWORD=*****
    volumes:
      - pgadmin_data:/var/lib/pgadmin
    networks:
      - project1-network
    ports:
      - "8080:80"
    depends_on:
      postgres:
        condition: service_healthy
    
networks:
  project1-network:
    name: project1-network
    driver: bridge

volumes:
  postgres_data:
  pgadmin_data:

Run this command in the terminal to start the services in detached mode:

docker-compose up -d

Wait until the services are up, then run docker ps in the terminal to confirm that both containers are running. Here’s what it looks like:
[Image: container status from docker ps]
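As an optional extra check, you can ask PostgreSQL itself whether it is accepting connections by running the same pg_isready command that the healthcheck uses (the username and database match the compose file):

docker exec nyc_postgres pg_isready -U admin -d ny_taxi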

Once the containers are running, access PgAdmin 4 in your browser at localhost:8080 and log in using the email and password defined earlier in the docker-compose.yaml file.

After logging in, create a new database server for our project: fill in the server name, host name (the same as the PostgreSQL container name), port, username (the same as POSTGRES_USER), and password (the same as POSTGRES_PASSWORD).

If everything is set up correctly, we will be able to see our PostgreSQL database inside PgAdmin, ready to receive data.

Here’s the PgAdmin 4 interface after successfully connecting to the PostgreSQL database:
[Image: PgAdmin 4 interface]

2. ETL Process

In this step, I used a Python script to perform the ETL (Extract, Transform, Load) process as part of the project pipeline. The first stage is Extract, where we download the raw data from a URL.

Extract

Below is the Python code used for the extract process:

import os

def extract(url):
    """Download a file from a URL and return the local file name"""
    
    # Take the last path segment of the URL as the file name
    file_name = url.rsplit('/', 1)[-1].strip()
    print(f'Downloading {file_name} ...')
    # Download the file into the working directory with curl
    os.system(f'curl {url.strip()} -o {file_name}')
    return file_name

The extract() function is responsible for downloading a file from the given URL. It first derives the filename by splitting the URL, then uses the curl command to download the file. Once the download is complete, the function returns the filename so it can be used in the next step of the ETL pipeline.
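For example, calling it on the taxi zone lookup file used later in this post would look roughly like this (illustrative only):

# Illustrative call; the URL is the taxi zone lookup file from table_urls.txt
file_name = extract('https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv')
print(file_name)  # taxi_zone_lookup.csv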

Transform

Once the raw dataset is extracted, the next step is transforming the data. This involves reading the downloaded files and preparing the data for loading into the PostgreSQL database. Here’s the function I used to perform the transformation:

import sys

import pandas as pd
import pyarrow.parquet as pq

def transform(file_name):
    """Read data from CSV or Parquet and return a preview DataFrame plus an iterator of chunks"""
    
    if file_name.endswith('.csv'):
        # Small preview plus a chunked iterator for large CSV files
        df = pd.read_csv(file_name, nrows=10)
        df_iter = pd.read_csv(file_name, iterator=True, chunksize=100000)
    elif file_name.endswith('.parquet'):
        # Read Parquet files in batches with PyArrow
        file = pq.ParquetFile(file_name)
        df = next(file.iter_batches(batch_size=10)).to_pandas()
        df_iter = file.iter_batches(batch_size=100000)
    else:
        print('Error: Only .csv or .parquet files are allowed.')
        sys.exit()
    return df, df_iter

This transform() function is designed to handle both .csv and .parquet file formats (a short usage example follows the list):

  • For CSV files, it reads a small preview (the first 10 rows) and then prepares an iterator that reads the file in chunks of 100,000 rows. This is useful for processing large datasets in manageable parts.
  • For Parquet files, it uses PyArrow to read the data in batches. It also returns a preview of the first 10 rows and prepares a batch iterator for chunked processing.
  • If the file format is unsupported, the function prints an error and stops the script.
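As a quick illustration, here is how transform() might be called on one of the downloaded Parquet files (the file name is the January 2025 yellow taxi file used later in the pipeline; illustrative only):

# Preview and chunk iterator for a Parquet file
df, df_iter = transform('yellow_tripdata_2025-01.parquet')
print(df.dtypes)        # column types inferred from the 10-row preview
batch = next(df_iter)   # first batch of up to 100,000 rows (a pyarrow RecordBatch)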
Load

The final step in the ETL process is loading the transformed data into the PostgreSQL database. This step takes the cleaned and chunked data and writes it into a target table inside the data warehouse. Below is the Python code used to load the data:

def load(df, df_iter, engine, table_name):
    """Load transformed data into the database"""
    
    # Create (or replace) the target table using only the column headers
    df.head(0).to_sql(name=table_name, con=engine, if_exists='replace')
    count = 0
    
    for batch in df_iter:
        count += 1
        # Parquet batches arrive as pyarrow RecordBatches; convert them to DataFrames first
        batch_df = batch.to_pandas() if hasattr(batch, 'to_pandas') else batch
        print(f'Inserting batch {count}...')
        batch_df.to_sql(name=table_name, con=engine, if_exists='append')
        print('Inserted!')
    
    print(f'Completed inserting {count} batches.')

This load() function first creates the table schema using df.head(0).to_sql(), replacing the table if it already exists. It then loops through each chunk of data in df_iter and appends it to the table using to_sql().
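For completeness, here is a minimal sketch of how these three functions could be wired together in etl.py, assuming an argparse-based entry point whose flags mirror the ones passed by the automation script later in this post (the actual script in the repository may differ):

import argparse

from sqlalchemy import create_engine


def main():
    parser = argparse.ArgumentParser(description='Ingest NYC taxi data into PostgreSQL')
    parser.add_argument('--user', required=True)
    parser.add_argument('--password', required=True)
    parser.add_argument('--host', required=True)
    parser.add_argument('--port', required=True)
    parser.add_argument('--db', required=True)
    parser.add_argument('--table_name', required=True)
    parser.add_argument('--url', required=True)
    args = parser.parse_args()

    # SQLAlchemy engine pointing at the Dockerized PostgreSQL instance
    engine = create_engine(
        f'postgresql://{args.user}:{args.password}@{args.host}:{args.port}/{args.db}'
    )

    # Run the three steps defined above
    file_name = extract(args.url)
    df, df_iter = transform(file_name)
    load(df, df_iter, engine, args.table_name)


if __name__ == '__main__':
    main()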

3. Create the Docker image

To ensure that the ETL process runs in a consistent environment, I wrapped etl.py inside a Docker container using a Dockerfile. It looks like this:

FROM python:3.9.1

RUN apt-get update
RUN apt-get install -y wget

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

RUN mkdir /app
WORKDIR /app

COPY . .

ENTRYPOINT [ "python", "etl.py" ]

This image installs the Python dependencies, copies the project files into the container, and sets etl.py as the default command. When the container runs, it automatically executes etl.py with the arguments passed in.
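The requirements.txt referenced in the Dockerfile is not shown in this post; at a minimum it would need to cover the libraries used by etl.py, roughly along these lines (versions omitted; the actual file may differ):

pandas
pyarrow
sqlalchemy
psycopg2-binary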

Run this command in the terminal to build the image:

docker build -t ny-taxi-etl:v01 .
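Before automating the whole pipeline, the image can be tested on a single dataset with a one-off docker run, using the same flags and network that the automation script in the next section passes (illustrative; the credentials mirror that script):

docker run --rm -t \
    --network=project1-network \
    ny-taxi-etl:v01 \
        --user=admin \
        --password=admin123 \
        --host=nyc_postgres \
        --port=5432 \
        --db=ny_taxi \
        --table_name=taxi_zone \
        --url=https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv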

4. Create a Bash script to automate the pipeline

Once the Docker image was ready, I created a bash script to automate the ETL process for multiple datasets. This script reads a .txt file (table_urls.txt) that contains the list of target table names and dataset URLs:

yellow_tripdata_jan2025 https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet
green_tripdata_jan2025 https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2025-01.parquet
taxi_zone https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv
#!/bin/bash

# Ensure the script exits if any command fails
set -e

# Path to the file containing table names and URLs
FILE="table_urls.txt"

# Debug: Display the file contents before looping
echo "Reading table_urls.txt..."
cat "$FILE"
echo "------------------------------------"
echo ""

# Record the start time of the entire process
START_TOTAL=$(date +%s)

# Loop through each line in the file
while IFS=" " read -r TABLE_NAME URL || [[ -n "$TABLE_NAME" ]]
do
    echo "Processing table: $TABLE_NAME"

    # Record the start time for this table's process
    START_TIME=$(date +%s)

    docker run --rm -t \
        --network=project1-network \
        --name "ny-taxi-container-$TABLE_NAME" \
        ny-taxi-etl:v01 \
            --user=admin \
            --password=admin123 \
            --host=nyc_postgres \
            --port=5432 \
            --db=ny_taxi \
            --table_name="$TABLE_NAME" \
            --url="$URL"

    # Calculate the duration for this table's process
    END_TIME=$(date +%s)
    DURATION=$((END_TIME - START_TIME))

    echo "Finished processing $TABLE_NAME"
    echo "Total time: $DURATION seconds"
    echo "------------------------------------"

done < "$FILE"

# Calculate the total duration for all tables
END_TOTAL=$(date +%s)
TOTAL_DURATION=$((END_TOTAL - START_TOTAL))

echo "All tables processed!"
echo "Total execution time: $TOTAL_DURATION seconds"

For each entry in table_urls.txt, the script runs a Docker container from the ny-taxi-etl:v01 image, passing the necessary parameters such as the database credentials, table name, and data URL. The ETL process inside the container will:

  • Extract the data file from the provided URL using the extract() function from etl.py.
  • Transform the file by reading it in chunks.
  • Load the cleaned and transformed data into the specified table in the PostgreSQL database.

This process is repeated for each line in the table_urls.txt file. At the end of each run, the script logs the time it took to process each dataset and then prints the total execution time once all datasets are processed.

5. Run Everything

Run this command in the terminal to execute the script:

bash run_etl.sh

Below is the initial output when the process starts running:
[Image: ETL process starting]

Once the process is complete, you’ll see a summary of the execution, including how long it took to process each table and the total execution time for all datasets, as shown in the image below.
[Image: end-of-process summary]

To verify that the data has been successfully loaded into the database, we can check it using PgAdmin 4.
[Image: tables listed in PgAdmin 4]
The image above shows that the tables have been created successfully.

That alone is not enough — we also need to check the data inside those tables. Let’s check the total number of records in each table with the following SQL query:

SELECT 'yellow_tripdata' AS table_name, COUNT(*) FROM yellow_tripdata_jan2025
UNION ALL
SELECT 'green_tripdata', COUNT(*) FROM green_tripdata_jan2025
UNION ALL
SELECT 'taxi_zone', COUNT(*) FROM taxi_zone;

And the result:
[Image: total number of records in each table]

Finally, we verified that the data was successfully extracted, transformed, and loaded into the PostgreSQL database.
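As one more optional sanity check (not part of the original verification), the pickup timestamp range can be inspected to confirm that the yellow taxi table really covers January 2025; tpep_pickup_datetime is the pickup column in the standard TLC yellow taxi schema:

SELECT MIN(tpep_pickup_datetime) AS first_pickup,
       MAX(tpep_pickup_datetime) AS last_pickup
FROM yellow_tripdata_jan2025;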

6. Visualize with Power BI

Once the data is successfully loaded into the PostgreSQL database, it can be visualized with Power BI to uncover meaningful insights and trends. Since this project uses GitHub Codespaces and PostgreSQL runs inside a Docker container, a few steps are required to connect Power BI on your local computer to the PostgreSQL database.

Steps to connect Power BI to PostgreSQL (docker container):

  1. Install GitHub CLI (if not already)
    👉 https://cli.github.com/
  2. Authenticate and start a port forwarding session. Run these commands in a local terminal:
    
    gh auth login
    gh codespace ports forward 5432:5432
    
  3. Open Power BI Desktop
    • Choose PostgreSQL database as the data source.
    • Set server to localhost and database to ny_taxi.
  4. Load the tables and start building the dashboard.

Snapshots of the dashboards I created are shown below.
[Image: Overall Taxi Data dashboard]

[Image: Yellow Taxi dashboard]

[Image: Green Taxi dashboard]

You can check out the interactive version here.

Conclusion

This project demonstrates a complete end-to-end ETL pipeline built using Docker, PostgreSQL, PgAdmin 4, and Python. The raw NYC Taxi datasets were successfully extracted from external URLs, transformed into clean and structured formats, and finally loaded into a PostgreSQL database hosted in a containerized environment. The final data was verified using PgAdmin.

With this project, I learned how to integrate multiple technologies into a cohesive data engineering solution. This setup can be extended further to include orchestration tools like Apache Airflow.

The full code for this project can be accessed in my GitHub repository here.

Let’s learn and grow together. Thank you for reading! 🌷

This post is licensed under CC BY 4.0 by the author.