Dagster & Iceberg

warning

This feature is in a preview stage and under active development. It can change significantly or be removed entirely, and it is not considered ready for production use.

This library provides I/O managers for reading and writing Apache Iceberg tables. It also provides a Dagster resource for accessing Iceberg tables.
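The resource counterpart is dagster_iceberg.resource.IcebergTableResource, which gives an asset direct access to an existing table rather than routing data through an I/O manager. A minimal sketch, assuming the resource exposes a load() method that returns a PyIceberg table handle (the asset and resource key names here are placeholders):

import pyarrow as pa

import dagster as dg
from dagster_iceberg.resource import IcebergTableResource


@dg.asset
def existing_table(iceberg: IcebergTableResource) -> pa.Table:
    # load() returns a PyIceberg table; scan it into an in-memory PyArrow table
    return iceberg.load().scan().to_arrow()

The resource is configured with the same IcebergCatalogConfig used by the I/O managers, plus the namespace and table to read; see the Example section below for a catalog configuration.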

Installation

uv add dagster-iceberg

dagster-iceberg defines the following extras for interoperability with various DataFrame libraries:

  • daft for interoperability with Daft DataFrames
  • pandas for interoperability with pandas DataFrames
  • polars for interoperability with Polars DataFrames
  • spark for interoperability with PySpark DataFrames (specifically, via Spark Connect)
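To pull in one of these extras, quote the requirement so the brackets survive the shell; for example, to install with Polars support:

uv add "dagster-iceberg[polars]"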

pyarrow is a core dependency of the package, so dagster_iceberg.io_manager.arrow.PyArrowIcebergIOManager is always available.

Example

import pyarrow as pa
from dagster_iceberg.config import IcebergCatalogConfig
from dagster_iceberg.io_manager.arrow import PyArrowIcebergIOManager

import dagster as dg


@dg.asset
def my_table() -> pa.Table:
    n_legs = pa.array([2, 4, 5, 100])
    animals = pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
    names = ["n_legs", "animals"]
    return pa.Table.from_arrays([n_legs, animals], names=names)


warehouse_path = "/tmp/warehouse"

defs = dg.Definitions(
    assets=[my_table],
    resources={
        "io_manager": PyArrowIcebergIOManager(
            name="default",
            config=IcebergCatalogConfig(
                properties={
                    "type": "sql",
                    "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
                    "warehouse": f"file://{warehouse_path}",
                }
            ),
            namespace="default",
        )
    },
)
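Because the asset is handled by the I/O manager, downstream assets can read my_table back from the Iceberg catalog simply by declaring it as an input. A minimal sketch; many_legs is a hypothetical asset added for illustration and would need to be included in assets=[...] above:

import pyarrow.compute as pc


@dg.asset
def many_legs(my_table: pa.Table) -> pa.Table:
    # The I/O manager loads `my_table` from Iceberg as a PyArrow table and
    # writes the returned table to a new Iceberg table named `many_legs`.
    return my_table.filter(pc.field("n_legs") > 4)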

About Apache Iceberg

Apache Iceberg is a high-performance format for huge analytic tables. It brings the reliability and simplicity of SQL tables to big data while making it possible for multiple engines to work safely with the same tables at the same time.