Minimal benchmark: non-scalar policy selection in gridworld (seeking critique)

asked 3 months ago in General Economics Questions by Lance_Amundsen (510 points)

I built a small deterministic gridworld benchmark to isolate a specific decision behavior: choosing between viable policies in cases where standard scalar or constrained formulations make a different choice.

The example is intentionally minimal (fixed policies, no learning, two domains: energy and safety).

In the benchmark, both policies reach the goal and remain viable, but one preserves safety while the other reduces it from 10 → 4. The PV–PP system selects the former; a scalar baseline selects the latter.

Known limitations:
• selection rule is partially benchmark-specific
• no general trade-off mechanism yet

GitHub: https://github.com/AmundsenLance/pvpp-gridworld-safe-benchmark.git

I’m looking for critique on whether this behavior can always be represented as a scalar or lexicographic ordering, or whether it requires a genuinely non-scalar selection rule.

1 Answer

Benjamin_Hardman · Answer 1 · 2026-05-24T21:33:00+0000

I don't know the answer, but this is what AI said.

This is a useful benchmark, but I would be careful about the claim that the behavior cannot be represented by a scalar ordering.

For a small deterministic gridworld with fixed policies, the answer is probably: yes, the observed choice can always be represented by some scalar or lexicographic rule. With only two policies, that is almost trivial. You can always define a scoring function that assigns a higher value to the policy you want selected.

So I do not think this benchmark, by itself, proves that a non-scalar selection rule is necessary.

But I also do not think that makes the benchmark uninteresting. The better question is not whether a scalar function can be built after the fact. It is whether a stable, non-ad hoc scalar function can recover the same behavior across a family of cases without smuggling in the PV-PP structure.

In your benchmark, the important behavior seems to be this:

Both policies reach the goal and remain viable, but one unnecessarily degrades the safety domain. PV-PP chooses the policy that preserves the safety domain, while the scalar baseline chooses the faster or higher-scoring path.

That can be represented lexicographically:

1. First require goal achievement and viability.
2. Then prefer policies that avoid unnecessary safety degradation.
3. Then use energy, speed, or path length as a tie-breaker.
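
The three steps above can be sketched in a few lines of Python. This is only an illustration of a lexicographic rule, not the PV-PP implementation from the repository: the `Outcome` type, the route names, and the energy costs are hypothetical, and only the safety loss of 6 (from 10 → 4) comes from the question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    name: str
    reaches_goal: bool
    viable: bool
    safety_loss: int  # initial safety minus final safety
    energy_cost: int  # hypothetical value, not taken from the benchmark


def lexicographic_choice(outcomes):
    # Step 1: keep only policies that reach the goal and remain viable.
    admissible = [o for o in outcomes if o.reaches_goal and o.viable]
    if not admissible:
        return None
    # Steps 2 and 3: least safety loss first, energy cost as tie-breaker.
    return min(admissible, key=lambda o: (o.safety_loss, o.energy_cost))


# Safety loss 6 mirrors the 10 -> 4 drop; the energy costs are made up.
safe = Outcome("safe_route", True, True, safety_loss=0, energy_cost=8)
fast = Outcome("fast_route", True, True, safety_loss=6, energy_cost=5)

print(lexicographic_choice([safe, fast]).name)
```

Because the comparison key is a tuple, energy cost only matters when two admissible policies lose exactly the same amount of safety.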

It can also be represented as a scalar with a very large penalty for safety degradation or hazard exposure. But that scalar representation may be doing the wrong explanatory work. The reason for the choice is not really “the safety penalty happened to be large enough.” The reason is that safety preservation has structural priority once both policies are viable.
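
A small sketch of the penalty version makes the "happened to be large enough" point concrete. The scoring formula, the goal reward, and the energy costs here are assumptions made for illustration; only the safety loss of 6 (10 → 4) is from the question.

```python
def scalar_score(energy_cost, safety_loss, weight, goal_reward=100.0):
    # One number per policy: reward minus energy minus weighted safety loss.
    return goal_reward - energy_cost - weight * safety_loss


# (energy_cost, safety_loss) per policy; the energy costs are hypothetical.
policies = {"safe_route": (8, 0), "fast_route": (5, 6)}


def scalar_choice(weight):
    # Pick the policy with the highest scalar score for this penalty weight.
    return max(policies, key=lambda name: scalar_score(*policies[name], weight))


for weight in (0.1, 1.0, 10.0):
    print(weight, scalar_choice(weight))
```

With these made-up numbers the scalar rule picks the safe route only when the weight exceeds 0.5 (the 3 units of extra energy divided by the 6 units of safety loss), so the choice depends on a tuned constant rather than on an explicit priority.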

That is where your framework may have value. It makes the priority structure explicit instead of hiding it inside a tuned penalty weight.

The strongest critique I would offer is this: the benchmark currently shows divergence from one scalar baseline, not from scalarization generally. To make the stronger claim, you would need to test against a family of scalar baselines or prove that no fixed scalar weighting scheme can reproduce the PV-PP choices across a wider class of gridworlds.

For example, vary:

* the energy cost of the safe route,
* the safety loss from hazards,
* the number of hazards,
* the initial safety reserve,
* the goal reward,
* and whether the safety loss remains above or falls below an adequacy threshold.

Then ask whether one stable scalar objective can reproduce the PV-PP choices across all cases. If it cannot do so without adding hard constraints, lexicographic priority, discontinuous penalties, or environment-specific weights, then you have a much stronger argument.
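
One way to run that test is a brute-force sweep. The sketch below is a toy model under strong assumptions: both routes are always viable, the safe route costs `extra_energy` more than the fast one, the fast route loses `safety_loss`, and a simple "any avoidable safety loss decides" rule stands in for PV-PP. All function names and parameter ranges are hypothetical.

```python
from itertools import product


def lexicographic_prefers_safe(extra_energy, safety_loss):
    # Stand-in for the priority rule: any avoidable safety loss decides.
    return safety_loss > 0


def scalar_prefers_safe(extra_energy, safety_loss, weight):
    # Safe route pays extra_energy; fast route pays weight * safety_loss.
    return weight * safety_loss > extra_energy


def disagreements(weight, extra_energies, safety_losses):
    # Cases where a fixed-weight scalar rule and the priority rule differ.
    return [
        (e, s)
        for e, s in product(extra_energies, safety_losses)
        if scalar_prefers_safe(e, s, weight) != lexicographic_prefers_safe(e, s)
    ]


extra_energies = range(1, 11)  # hypothetical range of extra energy cost
safety_losses = range(1, 7)    # hypothetical range of safety loss

for weight in (1.0, 5.0, 11.0):
    print(weight, len(disagreements(weight, extra_energies, safety_losses)))
```

In this toy family a large enough weight removes every disagreement, but the weight that works depends on the largest extra energy cost in the family; widening that range (for example `range(1, 21)`) brings disagreements back for the same weight. That is the sense in which the scalar weight ends up environment-specific.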

So my answer would be:

This benchmark does not yet show that scalar representation is impossible. A scalar or lexicographic model can probably represent this particular choice. But it does show something potentially important: the scalar representation may be derivative, while the real selection logic is domain-structured viability plus preservation of safety.

That is a serious distinction. The next step is to move from “this scalar baseline chooses differently” to “no stable scalar baseline can recover the PV-PP rule across a structured class of cases without importing the non-scalar rule in disguise.”
