2 changes: 2 additions & 0 deletions README.md
@@ -6,6 +6,7 @@
Introduced in Spark 4.x, the Python Data Source API lets you create PySpark data sources that leverage long-standing Python libraries to handle unique file types or specialized interfaces through the Spark `read`, `readStream`, `write`, and `writeStream` APIs.
| Data Source Name | Purpose |
| --- | --- |
| [dxf](dxf/README.md) | Read AutoCAD DXF drawing exchange files |
| [zipdcm](zipdcm/README.md) | Read DICOM files from Zip file archives |

## Documentation
@@ -27,5 +28,6 @@ Please see our [installation guide](./INSTALL.md)

| Datasource | Package | Purpose | License | Source |
| ---------- | ---------- | --------------------------------- | ----------- | ------------------------------------ |
| dxf | ezdxf | Python library for DXF files | MIT | https://github.com/mozman/ezdxf |
| zipdcm | pydicom | Python API for DICOM files | MIT | https://github.com/pydicom/pydicom |
| zipdcm | pylibjpeg | Decoding / Encoding pixel formats | GPLv3 & MIT | https://github.com/pydicom/pylibjpeg |
28 changes: 28 additions & 0 deletions dxf/Makefile
@@ -0,0 +1,28 @@
.PHONY: all clean dev test style check

all: clean style test

clean: ## Remove build artifacts and cache files
	rm -rf build/
	rm -rf dist/
	rm -rf *.egg-info/
	rm -rf htmlcov/
	rm -rf .coverage
	rm -rf coverage.xml
	rm -rf .pytest_cache/
	rm -rf .mypy_cache/
	rm -rf .ruff_cache/
	# find's -delete cannot remove non-empty directories, so use rm -rf instead
	find . -type d -name __pycache__ -exec rm -rf {} +
	find . -type f -name "*.pyc" -delete

test:
	pip install -r requirements.txt
	pytest .

dev:
	pip install -r requirements.txt

style:
	pre-commit run --all-files

check: style test
75 changes: 75 additions & 0 deletions dxf/README.md
@@ -0,0 +1,75 @@
# DXF Data Source

A PySpark Python Data Source for reading AutoCAD DXF (Drawing Exchange Format) files using the [ezdxf](https://ezdxf.readthedocs.io/en/stable/) library.

DXF is an open exchange format for CAD systems. This data source extracts geometric entities from DXF files into a tabular format suitable for analysis in Spark.

## Schema

| Column | Type | Description |
|--------|------|-------------|
| file_path | STRING | Path to the source DXF file |
| entity_type | STRING | DXF entity type (LINE, CIRCLE, ARC, TEXT, etc.) |
| layer | STRING | The layer the entity belongs to |
| handle | STRING | Unique entity handle within the DXF file |
| attributes | STRING | JSON string containing entity-specific attributes |
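
Because `attributes` is a JSON string, any JSON parser can decode it once a row has been collected. The sketch below only illustrates the row shape in plain Python. The `make_row` helper and all sample values (file path, handle, coordinates) are hypothetical and are not part of this data source.

```python
import json

# Column names taken from the schema table above.
COLUMNS = ("file_path", "entity_type", "layer", "handle", "attributes")


def make_row(file_path, entity_type, layer, handle, attributes):
    """Hypothetical helper: build one row with attributes serialized as JSON."""
    return {
        "file_path": file_path,
        "entity_type": entity_type,
        "layer": layer,
        "handle": handle,
        "attributes": json.dumps(attributes),
    }


# Sample values are made up for illustration.
row = make_row(
    "/path/to/files/plan.dxf",
    "CIRCLE",
    "walls",
    "2A",
    {"center_x": 1.0, "center_y": 2.0, "center_z": 0.0, "radius": 0.5},
)

# The attributes column is a string until it is explicitly decoded.
decoded = json.loads(row["attributes"])
print(decoded["radius"])
```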

## Supported Entity Types

| Entity Type | Attributes |
|-------------|-----------|
| LINE | start_x/y/z, end_x/y/z |
| CIRCLE | center_x/y/z, radius |
| ARC | center_x/y/z, radius, start_angle, end_angle |
| POINT | location_x/y/z |
| TEXT | text, insert_x/y/z, height, rotation |
| MTEXT | text, insert_x/y/z, height |
| LWPOLYLINE | points (array), is_closed |
| POLYLINE | vertices (array), is_closed |
| SPLINE | control_points (array), degree, is_closed |
| ELLIPSE | center_x/y/z, major_axis_x/y/z, ratio, start_param, end_param |
| INSERT | block_name, insert_x/y/z, x/y/z_scale, rotation |
| DIMENSION | dimtype, defpoint_x/y/z |
| HATCH | pattern_name, solid_fill |
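
The `_x/y/z` shorthand in the table stands for three separate keys, for example `center_x`, `center_y`, and `center_z`. A minimal sketch of that naming convention, assuming points arrive as `(x, y, z)` tuples. `flatten_point` and `circle_attributes` are hypothetical names used for illustration and are not functions from this data source or from ezdxf.

```python
import json


def flatten_point(prefix, point):
    """Expand an (x, y, z) tuple into prefix_x / prefix_y / prefix_z keys."""
    x, y, z = point
    return {f"{prefix}_x": x, f"{prefix}_y": y, f"{prefix}_z": z}


def circle_attributes(center, radius):
    """Hypothetical: assemble the CIRCLE attributes listed in the table."""
    attrs = flatten_point("center", center)
    attrs["radius"] = radius
    return attrs


# Serialize the flattened attributes the way the attributes column stores them.
attrs_json = json.dumps(circle_attributes((1.0, 2.0, 0.0), 0.5))
print(attrs_json)
```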

## Usage

```python
# Register the data source
from dxf_datasource import DXFDataSource
spark.dataSource.register(DXFDataSource)

# Read all entities from DXF files
df = spark.read.format("dxf").option("path", "/path/to/files").load()

# Filter by layer at read time
df = spark.read.format("dxf") \
    .option("path", "/path/to/files") \
    .option("layerFilter", "walls") \
    .load()

# Extract attributes from JSON
from pyspark.sql.functions import get_json_object, col

circles = df.filter(col("entity_type") == "CIRCLE") \
    .select(
        "layer",
        get_json_object(col("attributes"), "$.center_x").alias("cx"),
        get_json_object(col("attributes"), "$.center_y").alias("cy"),
        get_json_object(col("attributes"), "$.radius").alias("radius"),
    )
```

## Options

| Option | Default | Description |
|--------|---------|-------------|
| path | (required) | Path to DXF file(s) or directory |
| pathGlobFilter | `*.dxf` | Glob pattern for file matching |
| numPartitions | 4 | Number of partitions to split files across |
| recursiveFileLookup | false | Recursively search subdirectories |
| layerFilter | (all layers) | Filter entities by layer name |
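
This diff does not show how the data source discovers files or assigns them to partitions. The sketch below is one plausible reading of `pathGlobFilter`, `recursiveFileLookup`, and `numPartitions`, using only the standard library. `list_files` and `split_round_robin` are hypothetical names, and the round-robin split is an assumption rather than the data source's confirmed behavior.

```python
import fnmatch
import tempfile
from pathlib import Path


def list_files(path, glob_filter="*.dxf", recursive=False):
    """Return the path itself if it is a file, else sorted matching files under it."""
    root = Path(path)
    if root.is_file():
        return [root]
    candidates = root.rglob("*") if recursive else root.iterdir()
    return sorted(
        p for p in candidates
        if p.is_file() and fnmatch.fnmatch(p.name, glob_filter)
    )


def split_round_robin(files, num_partitions=4):
    """Deal files into at most num_partitions non-empty groups (assumed strategy)."""
    groups = [files[i::num_partitions] for i in range(num_partitions)]
    return [g for g in groups if g]


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.dxf", "b.dxf", "notes.txt", "sub/c.dxf"):
        (root / name).touch()

    flat = [p.name for p in list_files(root)]
    deep = [p.name for p in list_files(root, recursive=True)]
    parts = split_round_robin(deep, num_partitions=2)

print(flat, deep, parts)
```

With fewer files than partitions, this sketch returns fewer groups instead of padding with empty ones.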

## Dependencies

- [ezdxf](https://ezdxf.readthedocs.io/en/stable/) - Python library for reading/writing DXF files
1 change: 1 addition & 0 deletions dxf/requirements.txt
@@ -0,0 +1 @@
ezdxf==1.4.2
Empty file added dxf/src/__init__.py