TreesFormer: Multimodal Grammar-Based 3D Tree Reconstruction from Sparse Geodata

Submission 1008

Links:
- Publication: Paper
- Code: TreesFormer (core model), Visualizer (L-system tools), GeoTree3D (dataset generation)
- Data: synthetic GeoTree3D dataset (Google Drive)
- Pretrained: model weights

Figure 1: TreesFormer overview.

Abstract

We present TreesFormer, the first grammar-based framework for reconstructing hierarchical 3D tree structures directly from sparse top-down geodata, using only a single orthophoto and its corresponding Digital Surface Model (DSM). TreesFormer employs a multimodal autoregressive transformer that generates compact parametric L-system grammars from DSM point clouds and orthophoto features, jointly predicting symbolic structure and geometric parameters while enforcing grammar constraints during decoding. To enable supervision in the absence of real-world grammar annotations, we introduce a synthetic multimodal dataset of procedurally generated trees with aligned aerial inputs and ground-truth L-system labels. Experiments show that DSMs drive overall geometric accuracy and crown shape, while orthophoto conditioning improves structural regularity and branching depth; their combination consistently outperforms either modality alone. The model generalizes to real-world Austrian and French aerial data, producing interpretable branching structures suitable for large-scale rural 3D mapping.
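To make "compact parametric L-system grammars" concrete, the minimal rewriting sketch below expands a parametric L-system into a symbol string. The symbols, the single rule, and all parameter values are illustrative assumptions for this sketch, not the grammar actually used in the paper.

```python
# Minimal parametric L-system sketch. Each symbol is a (name, params) pair;
# rules map a symbol to a list of successor symbols whose parameters are
# computed from the predecessor's parameters.

def expand(axiom, rules, depth):
    """Apply all rewriting rules in parallel for `depth` steps."""
    string = axiom
    for _ in range(depth):
        out = []
        for name, params in string:
            rule = rules.get(name)
            out.extend(rule(*params) if rule else [(name, params)])
        string = out
    return string

# Illustrative rule: F(l) is a branch segment of length l; "[" / "]" push and
# pop the turtle state; "+"(a) rotates by a degrees. Each step keeps the
# segment and sprouts a shorter side branch.
rules = {
    "F": lambda l: [
        ("F", (l,)),
        ("[", ()), ("+", (25.0,)), ("F", (l * 0.7,)), ("]", ()),
    ],
}

tree = expand([("F", (1.0,))], rules, depth=2)
print(len(tree))  # 13 symbols after two rewriting steps
```

A neural generator in this setting emits exactly such a sequence: discrete symbols (the structural part) interleaved with continuous parameters such as segment lengths and angles (the geometric part).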

Methodology

Network Architecture

Figure 2: Overall architecture of the multimodal L-system generator. A multimodal visual encoder conditions an autoregressive decoder backbone, which is factorized into a structural branch for token prediction and a parameter branch for geometric prediction; supervision signals are shown in green.
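As a sketch of what "enforcing grammar constraints during decoding" can look like for the structural branch, the toy masking step below zeroes out tokens that would be structurally invalid at the current prefix (closing an unopened branch, or terminating with open brackets). The vocabulary and validity rules are illustrative assumptions, not the paper's actual grammar.

```python
import numpy as np

# Hypothetical structural vocabulary for this sketch.
VOCAB = ["F", "[", "]", "+", "<end>"]

def valid_mask(prefix):
    """Boolean mask of tokens allowed after the given prefix."""
    depth = prefix.count("[") - prefix.count("]")
    mask = np.ones(len(VOCAB), dtype=bool)
    if depth == 0:
        mask[VOCAB.index("]")] = False      # cannot close an unopened branch
    if depth > 0:
        mask[VOCAB.index("<end>")] = False  # cannot stop with open brackets
    return mask

def constrained_argmax(logits, prefix):
    """Greedy decoding step with invalid tokens masked to -inf."""
    masked = np.where(valid_mask(prefix), logits, -np.inf)
    return VOCAB[int(np.argmax(masked))]

# The model prefers <end>, but an open branch forces a valid token instead.
logits = np.array([0.1, 0.2, 0.3, 0.0, 5.0])
print(constrained_argmax(logits, ["["]))  # prints "]"
```

The same idea extends to richer rules (e.g. which symbols may follow which), and guarantees that every decoded sequence is a well-formed grammar.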

Qualitative Results

Reconstructions from real-world Austrian landmark trees and French IGN data

[Image grid: four typical reconstructions (left) and four difficult cases (right); each example shows the DSM, orthophoto, target, and our output.]

Table 1: Reconstructed landmark trees (left) and difficult cases (right). Difficult cases: (1) the DSM underestimates tree size, (2) cluttered orthophoto, (3) dead tree, (4) failure case with simultaneous ambiguity in both modalities.

State-of-the-Art Comparison & Modality Ablation

[Image grid: inputs (DSM, orthophoto) and ground truth alongside the outputs of Tree D-Fusion [Lee et al. 2024], SVDTree [10656708], Latent L-Systems [10.1145/3627101], TreeON [Grammatikaki et al. 2026], and our method (rendered, and with leaves), for three test trees.]

(a) Visual outputs for test trees.

Method             NCD ↓   F1 ↑   COV ↑
Tree D-Fusion      0.66    0.01   16%
SVDTree            0.31    0.58   72%
Latent L-Systems   0.33    0.57   44%
TreeON             0.24    0.76   86%
Ours               0.21    0.81   75%

(b) Quantitative results.

Table 2: Comparison with state-of-the-art methods.
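The point-cloud metrics in Table 2 can be sketched with common formulations. The symmetric Chamfer distance and the thresholded F1 below are standard definitions and assumptions for this sketch; the paper's exact normalization for NCD and its F1/COV thresholds may differ.

```python
import numpy as np

def pairwise(a, b):
    """All pairwise Euclidean distances between two (N, 3) point clouds."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def chamfer(pred, gt):
    """Symmetric Chamfer distance: mean nearest-neighbor distance both ways."""
    d = pairwise(pred, gt)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def f1_score(pred, gt, tau=0.1):
    """F1 at distance threshold tau (tau is an assumed value here)."""
    d = pairwise(pred, gt)
    precision = (d.min(axis=1) < tau).mean()  # predicted points near GT
    recall = (d.min(axis=0) < tau).mean()     # GT points covered by pred
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pts = np.random.rand(64, 3)
print(chamfer(pts, pts), f1_score(pts, pts))  # 0.0 and 1.0 for identical clouds
```

Lower is better for the Chamfer-style NCD, higher for F1; coverage (COV) is typically the fraction of reference points matched within the threshold, which the recall term above already captures.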

[Image row: target, DSM, and orthophoto inputs alongside predictions from DSM only, orthophoto only, and DSM + orthophoto.]

Figure 3: Qualitative modality ablation on a representative tree.

Generalization on French IGN Data (LiDAR)

[Image grid: two trees, each showing the DSM, orthophoto, IGN LiDAR point cloud, and our reconstruction.]

Table 3: Qualitative comparison against IGN LiDAR point clouds (37 and 25 pts/m²) in the French Pyrenees.