I built a small deterministic gridworld benchmark to isolate one decision behavior: choosing between viable policies in a case where standard scalar or constrained formulations pick a different policy than the PV–PP system does.
The example is intentionally minimal (fixed policies, no learning, two domains: energy and safety).
In the benchmark, both policies reach the goal and remain viable, but one preserves safety while the other reduces it from 10 to 4. The PV–PP system selects the safety-preserving policy; a scalar baseline selects the one that degrades safety.
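To make the disagreement concrete, here is a minimal sketch of the two selection styles. It is not the code from the repository: the energy values, the viability floor, the equal-weight scalar score, and the "smallest safety loss first" rule are all assumptions chosen for illustration. Only the safety values (10 preserved vs. 10 reduced to 4) come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    name: str
    reached_goal: bool
    energy_end: int     # hypothetical value, not taken from the benchmark
    safety_start: int
    safety_end: int


VIABILITY_FLOOR = 1     # hypothetical: every domain must stay at or above this


def is_viable(o: Outcome) -> bool:
    """An outcome is viable if the goal is reached and no domain is depleted."""
    return (o.reached_goal
            and o.energy_end >= VIABILITY_FLOOR
            and o.safety_end >= VIABILITY_FLOOR)


def scalar_score(o: Outcome) -> int:
    """Hypothetical scalar baseline: equal-weight sum of the two domains."""
    return o.energy_end + o.safety_end


def select_scalar(outcomes):
    """Pick the viable outcome with the highest scalar score."""
    viable = [o for o in outcomes if is_viable(o)]
    return max(viable, key=scalar_score)


def select_preserving(outcomes):
    """Stand-in for a preservation-first rule (an assumption, not PV-PP itself):
    smallest safety loss wins, scalar score only breaks ties."""
    viable = [o for o in outcomes if is_viable(o)]
    return min(viable,
               key=lambda o: (o.safety_start - o.safety_end, -scalar_score(o)))


preserving = Outcome("preserving", True, energy_end=3,
                     safety_start=10, safety_end=10)
degrading = Outcome("degrading", True, energy_end=12,
                    safety_start=10, safety_end=4)

candidates = [preserving, degrading]
print(select_scalar(candidates).name)      # degrading
print(select_preserving(candidates).name)  # preserving
```

Both outcomes pass the viability check, so a viability constraint alone does not separate them. The two selectors disagree only because of how they rank outcomes that are already viable.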
Known limitations:
• selection rule is partially benchmark-specific
• no general trade-off mechanism yet
GitHub:
https://github.com/AmundsenLance/pvpp-gridworld-safe-benchmark.git
I’m looking for critique on whether this behavior can always be represented as a scalar or lexicographic ordering, or whether it requires a genuinely non-scalar selection rule.
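One way to frame the question, as a sketch under hypothetical numbers: on a single two-policy instance, a weighted sum with a large enough safety weight also picks the safety-preserving policy. The energy values below are assumptions. Only the safety values 10 and 4 come from the benchmark description.

```python
# Hypothetical end-of-episode values as (energy, safety) pairs.
# Only the safety values 10 and 4 come from the post.
PRESERVING = (3, 10)
DEGRADING = (12, 4)


def weighted_choice(w_safety: float) -> str:
    """Choose by the scalar score energy + w_safety * safety."""
    def score(outcome):
        energy, safety = outcome
        return energy + w_safety * safety
    return "preserving" if score(PRESERVING) > score(DEGRADING) else "degrading"


def flip_weight() -> float:
    """Safety weight at which the two scalar scores tie."""
    energy_gap = DEGRADING[0] - PRESERVING[0]
    safety_gap = PRESERVING[1] - DEGRADING[1]
    return energy_gap / safety_gap


print(flip_weight())          # 1.5
print(weighted_choice(1.0))   # degrading
print(weighted_choice(2.0))   # preserving
```

If this framing is right, a single instance cannot separate the rule from a scalar one, because some weight always reproduces the choice. A separation would seem to need a family of instances on which no single fixed weight, and no fixed lexicographic order, matches the rule everywhere.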