GraduatedSanction-v0
Category: Reciprocity Environment (TR-4)
Agents: 6
Difficulty: Advanced
Source: coopetition_gym/envs/reciprocity_envs.py
Overview
GraduatedSanction-v0 implements a six-agent common-pool resource game with TR-4 graduated reciprocity sanctions. Agents share a common resource and decide how much to contribute. Reciprocity manifests as graduated sanctions: mild response to first defection, escalating with repeated violations.
The environment captures Ostrom's (1990) insight about proportional punishment: effective governance relies on graduated rather than draconian responses to rule violations.
MARL Classification
| Property | Value |
|---|---|
| Game Type | 6-player Markov Game (general-sum) |
| Cooperation Structure | Common-pool resource dilemma |
| Observability | Full |
| Communication | Implicit |
| Agent Symmetry | Symmetric (identical capabilities) |
| Reward Structure | Integrated utility with graduated reciprocity |
| Action Space | Continuous: $A_i = [0, 100]$ |
| State Dynamics | Deterministic |
| Horizon | Finite, T = 200 steps |
| Canonical Comparison | Common-pool resource; Ostrom (1990); Fehr & Gächter (2000) |
Formal Specification
Common-Pool Resource Structure
Six symmetric agents with uniform interdependence:
\[D_{ij} = 0.35 \quad \text{for all } i \neq j\]
Higher baselines ($b_i = 40$) reflect the social expectation in a commons setting.
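This structure can be written down directly; a minimal NumPy sketch of the interdependence matrix and baselines (the zero diagonal is an assumption, since $D_{ij}$ is only specified for $i \neq j$):

```python
import numpy as np

N = 6  # six symmetric agents

# Uniform interdependence D_ij = 0.35 for all i != j; the zero
# diagonal is an assumption, since D_ij is only defined off-diagonal
D = np.full((N, N), 0.35)
np.fill_diagonal(D, 0.0)

# Baseline contributions b_i = 40 for every agent
b = np.full(N, 40.0)
```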
Graduated Sanction Mechanism
Graduated sanctions emerge from the interaction of TR-4 parameters (a sketch of the resulting response follows the list):
1. Lower $\kappa = 0.8$: The bounded response $\varphi(x) = \tanh(0.8x)$ has a gentler slope, producing proportional (not binary) reactions to defection
2. Long memory $k = 10$: An extended memory window tracks behavioral patterns over time, enabling escalation based on repeated violations
3. High $\lambda_R = 1.8$: A strong reciprocity weight amplifies the aggregate sanction effect across 5 partners
4. High $\omega = 1.0$: Maximum dependency amplification in trust gating
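A minimal sketch of the bounded response these parameters induce (`bounded_response` is an illustrative helper, not part of the package API):

```python
import numpy as np

KAPPA = 0.8  # response sensitivity from the list above

def bounded_response(x: float, kappa: float = KAPPA) -> float:
    """Graduated response phi(x) = tanh(kappa * x)."""
    return float(np.tanh(kappa * x))

# Gentler slope than tanh(x): small defections draw small sanctions
for signal in (-0.1, -0.5, -2.0, -5.0):
    print(f"phi({signal:5.1f}) = {bounded_response(signal):+.3f}")
```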
Reciprocity Sensitivity
With $\rho_0 = 0.6$, $\eta = 1.5$, and $D_{ij} = 0.35$:
\[\rho_{ij} = 0.6 \cdot 0.35^{1.5} \approx 0.124 \quad \text{per pair}\]
Per-pair sensitivity is low, but summed over 5 partners with $\lambda_R = 1.8$, at full trust ($T_{ij} = 1$) and a saturated response ($\varphi = 1$):
\[\text{Maximum aggregate} = 5 \times 1.8 \times 1 \times (1 + 1.0 \times 0.35) \times 0.124 \times 1 \approx 1.51\]
The aggregate effect is substantial when all 5 partners sanction simultaneously.
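The same arithmetic as a quick check in plain Python (values taken from the TR-4 parameter table below):

```python
rho_0, eta, d_ij = 0.6, 1.5, 0.35
rho_ij = rho_0 * d_ij ** eta                  # per-pair sensitivity, ~0.124

lambda_R, omega, n_partners = 1.8, 1.0, 5
# Full trust (T_ij = 1) and saturated response (phi = 1) assumed
max_aggregate = n_partners * lambda_R * 1.0 * (1 + omega * d_ij) * rho_ij * 1.0

print(f"rho_ij        = {rho_ij:.3f}")         # 0.124
print(f"max aggregate = {max_aggregate:.2f}")  # 1.51
```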
Distinction from PublicGoods-v0 (TR-3)
| Aspect | PublicGoods-v0 | GraduatedSanction-v0 |
|---|---|---|
| Mechanism | Static TR-3 collective action modifiers | Adaptive TR-4 history-dependent reciprocity |
| Sanctions | Fixed free-rider penalties | Graduated proportional sanctions |
| Adaptation | Loyalty score adjusts slowly | Memory window ($k=10$) enables rapid response |
| Escalation | No escalation, penalty is constant | Repeated defection compounds via memory |
| Key Equation | Loyalty modifier (TR-3 Eq 5) | Reciprocity modifier (TR-4 Eq 44) |
| Agents | 5 (default) | 6 |
Game-Theoretic Background
Ostrom’s Design Principles
Ostrom (1990) identified graduated sanctions as a key institutional design principle for sustainable commons governance:
1. Proportional monitoring: All agents observe each other's contributions
2. Graduated sanctions: First offenses receive mild punishment
3. Escalation: Repeated violations trigger increasingly severe responses
4. Low-cost enforcement: Reciprocity provides decentralized sanctions
Strategic Implications
Free-Riding Temptation:
- Each agent faces incentive to under-contribute to the commons
- With 6 agents, each free-rider benefits from 5 others’ contributions
Reciprocity as Governance:
- 5 partners simultaneously detect and respond to free-riding
- Graduated response ($\kappa = 0.8$) avoids over-punishment of small deviations
- Long memory ($k = 10$) ensures persistent free-riders face escalating sanctions
Environment Specification
Basic Usage
```python
import coopetition_gym
import numpy as np

# Create environment
env = coopetition_gym.make("GraduatedSanction-v0")
obs, info = env.reset(seed=42)

# All agents contribute a constant 50 units per step
for step in range(200):
    actions = np.full(6, 50.0, dtype=np.float32)
    obs, rewards, terminated, truncated, info = env.step(actions)
    if terminated or truncated:
        break

print(f"Mean trust: {info['mean_trust']:.3f}")
```
Parameters
| Parameter | Default | Description |
|---|---|---|
| max_steps | 200 | Extended horizon for graduated dynamics |
| render_mode | None | Rendering mode |
TR-4 Parameters
| Parameter | Symbol | Value | Rationale |
|---|---|---|---|
| Base reciprocity | $\rho_0$ | 0.6 | Lower per-pair (but 5 pairs) |
| Dependency elasticity | $\eta$ | 1.5 | Superlinear dependency effect |
| Response sensitivity | $\kappa$ | 0.8 | Gradual response (graduated) |
| Memory window | $k$ | 10 | Long memory for escalation |
| Reciprocity weight | $\lambda_R$ | 1.8 | Strong aggregate reciprocity |
| Dependency amplification | $\omega$ | 1.0 | Maximum dependency boost |
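Read together, these parameters determine the per-pair sanction. The sketch below follows the aggregate formula from the Reciprocity Sensitivity section; it illustrates the documented arithmetic, not the package's internal implementation:

```python
import numpy as np

# TR-4 values from the table above
RHO_0, ETA, KAPPA = 0.6, 1.5, 0.8
LAMBDA_R, OMEGA = 1.8, 1.0

def reciprocity_effect(trust: float, d_ij: float, s_bar: float) -> float:
    """Per-pair reciprocity contribution: lambda_R * T_ij *
    (1 + omega * D_ij) * rho_ij * phi(s_bar)."""
    rho_ij = RHO_0 * d_ij ** ETA        # dependency-scaled sensitivity
    phi = np.tanh(KAPPA * s_bar)        # graduated bounded response
    return LAMBDA_R * trust * (1 + OMEGA * d_ij) * rho_ij * phi

# Full trust, uniform D_ij = 0.35, strongly negative window average:
print(f"{reciprocity_effect(1.0, 0.35, -5.0):+.3f}")  # ~ -0.302 per partner
```

Multiplied by 5 sanctioning partners, this recovers the maximum aggregate of roughly 1.51 computed in the Reciprocity Sensitivity section.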
Spaces
Observation Space
Type: Box
Dtype: float32
Includes actions, trust matrix (6×6), reputation (6×6), interdependence (6×6), and step info.
Action Space
Type: Box
Shape: (6,)
Dtype: float32
Range: [0.0, 100.0] for each agent
Metrics and Info
The info dictionary contains:
| Key | Type | Description |
|---|---|---|
| step | int | Current timestep |
| mean_trust | float | Average trust level |
| cooperation_signals | dict | Per-pair $s_{ij}$ values (30 directed pairs) |
| reciprocity_effects | dict | Per-pair reciprocity contributions |
| memory_averages | dict | Per-pair memory averages |
| tr4_memory_window | int | Memory window $k = 10$ |
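These keys can be inspected directly after a step; a short sketch assuming the key names in the table above (the exact format of the per-pair keys, e.g. tuples versus strings, is an assumption):

```python
import coopetition_gym
import numpy as np

env = coopetition_gym.make("GraduatedSanction-v0")
obs, info = env.reset(seed=42)
actions = np.full(6, 50.0, dtype=np.float32)
obs, rewards, terminated, truncated, info = env.step(actions)

print(info["step"], f"{info['mean_trust']:.3f}", info["tr4_memory_window"])

# Per-pair diagnostics over the 30 directed pairs
for pair, signal in info["cooperation_signals"].items():
    print(pair, signal, info["reciprocity_effects"].get(pair))
```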
Key Dynamics
Graduated Response Profile
The $\kappa = 0.8$ parameter creates a proportional response:
| Defection Magnitude | $\varphi(s)$ | Response Level |
|---|---|---|
| Large ($s \approx -5$) | $\approx -1.00$ | Near-maximum |
| Moderate ($s \approx -2$) | $\approx -0.92$ | Strong |
| Minor ($s \approx -0.5$) | $\approx -0.38$ | Mild |
| Negligible ($s \approx -0.1$) | $\approx -0.08$ | Minimal |
With standard $\kappa = 1.0$, these responses would be sharper. The lower $\kappa = 0.8$ provides the graduated quality.
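These values are straightforward to reproduce, and comparing against the standard $\kappa = 1.0$ shows the softer slope directly:

```python
import numpy as np

# Reproduce the response profile and compare kappa = 0.8 vs. 1.0
for s in (-5.0, -2.0, -0.5, -0.1):
    print(f"s = {s:5.1f}   phi(0.8) = {np.tanh(0.8 * s):+.2f}"
          f"   phi(1.0) = {np.tanh(1.0 * s):+.2f}")
```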
Escalation Through Memory
- First defection: Memory average barely changes → mild sanction
- Repeated defection: Memory average drops → cooperation signal becomes more negative
- Persistent defection: Full memory window contaminated → maximum sanction from all 5 partners (illustrated in the sketch below)
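The escalation is easy to see in a toy simulation of the window average; this assumes the sanction is driven by the mean cooperation signal over the last $k$ steps, consistent with the mechanism described above:

```python
import numpy as np
from collections import deque

K, KAPPA = 10, 0.8  # memory window and response sensitivity

# Five cooperative steps (+0.5), then persistent defection (-2.0)
signals = [0.5] * 5 + [-2.0] * 10

memory = deque(maxlen=K)
for t, s in enumerate(signals):
    memory.append(s)
    s_bar = float(np.mean(memory))  # window average drives the sanction
    phi = np.tanh(KAPPA * s_bar)    # graduated bounded response
    print(f"t={t:2d}  s_bar={s_bar:+.2f}  phi={phi:+.2f}")
```

The first defection barely moves the window average (mild sanction); once the window is saturated with defections, $\varphi$ approaches its near-maximum value.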
Research Applications
GraduatedSanction-v0 is suitable for studying:
- Commons Governance: Can decentralized reciprocity sustain cooperation?
- Graduated Sanctions: Does proportional punishment outperform binary?
- Scalability: How does 6-agent reciprocity compare to 2-agent?
- Institutional Design: Ostrom’s principles in computational settings
- Free-Rider Detection: Multi-agent monitoring and enforcement
Related Environments
- PublicGoods-v0: Static TR-3 collective action variant
- CoalitionFormation-v0: Dynamic exclusion mechanism
- IndirectReciprocity-v0: 4-agent reputation dynamics
References
- Pant, V. & Yu, E. (2026). Computational Foundations for Strategic Coopetition: Formalizing Sequential Interaction and Reciprocity. arXiv:2604.01240.
- Ostrom, E. (1990). Governing the Commons: The Evolution of Institutions for Collective Action. Cambridge University Press.
- Ostrom, E., Walker, J. & Gardner, R. (1992). Covenants With and Without a Sword: Self-Governance Is Possible. American Political Science Review.
- Fehr, E. & Gächter, S. (2000). Cooperation and Punishment in Public Goods Experiments. American Economic Review.