Dagster & Iceberg
This feature is in a preview stage and under active development. It can change significantly or be removed completely, and it is not considered ready for production use.
This library provides I/O managers for reading and writing Apache Iceberg tables, as well as a Dagster resource for accessing them directly; both are shown in the examples below.
Installation
- uv: `uv add dagster-iceberg`
- pip: `pip install dagster-iceberg`
`dagster-iceberg` defines the following extras for interoperability with various DataFrame libraries:

- `daft` for interoperability with Daft DataFrames
- `pandas` for interoperability with pandas DataFrames
- `polars` for interoperability with Polars DataFrames
- `spark` for interoperability with PySpark DataFrames (specifically, via Spark Connect)

`pyarrow` is a core package dependency, so the `io_manager.arrow.PyArrowIcebergIOManager` is always available.
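For example, to include the pandas integration, install the package with the corresponding extra (standard extras syntax; any extra listed above can be substituted):

```shell
uv add "dagster-iceberg[pandas]"
# or, with pip
pip install "dagster-iceberg[pandas]"
```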
Example
```python
import pyarrow as pa

from dagster_iceberg.config import IcebergCatalogConfig
from dagster_iceberg.io_manager.arrow import PyArrowIcebergIOManager

import dagster as dg


@dg.asset
def my_table() -> pa.Table:
    # A small in-memory table; the I/O manager writes it to Iceberg on materialization
    n_legs = pa.array([2, 4, 5, 100])
    animals = pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
    names = ["n_legs", "animals"]
    return pa.Table.from_arrays([n_legs, animals], names=names)


warehouse_path = "/tmp/warehouse"

defs = dg.Definitions(
    assets=[my_table],
    resources={
        "io_manager": PyArrowIcebergIOManager(
            name="default",
            config=IcebergCatalogConfig(
                properties={
                    # SQL catalog backed by a local SQLite database; data files
                    # are written under the local warehouse path
                    "type": "sql",
                    "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
                    "warehouse": f"file://{warehouse_path}",
                }
            ),
            namespace="default",
        )
    },
)
```
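Because the I/O manager handles reads as well as writes, a downstream asset can consume `my_table` by declaring it as an input; the I/O manager loads it back from Iceberg as a `pyarrow.Table`. A minimal sketch (the aggregation itself is illustrative, and the new asset would also need to be added to `assets=[...]` above):

```python
@dg.asset
def my_table_summary(my_table: pa.Table) -> pa.Table:
    # `my_table` is loaded from the Iceberg table written above; the returned
    # summary is written back to Iceberg under its own table name
    return my_table.group_by("animals").aggregate([("n_legs", "sum")])
```

For direct access without an I/O manager, the library also ships an Iceberg table resource. A hedged sketch, assuming `IcebergTableResource` from `dagster_iceberg.resource`, whose `load()` returns a PyIceberg `Table`, pointed at the same local catalog as above:

```python
from dagster_iceberg.resource import IcebergTableResource


@dg.asset
def my_table_printed(iceberg: IcebergTableResource) -> None:
    # load() returns a pyiceberg.table.Table; scan it into pyarrow for inspection
    print(iceberg.load().scan().to_arrow())


resource_defs = dg.Definitions(
    assets=[my_table_printed],
    resources={
        "iceberg": IcebergTableResource(
            name="default",
            config=IcebergCatalogConfig(
                properties={
                    "type": "sql",
                    "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
                    "warehouse": f"file://{warehouse_path}",
                }
            ),
            namespace="default",
            table="my_table",
        )
    },
)
```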
About Apache Iceberg
Iceberg is a high-performance format for huge analytic tables. It brings the reliability and simplicity of SQL tables to big data, while making it possible for multiple engines to safely work with the same tables, at the same time.