TrustDilemma-v0

Category: Dyadic Environment
Agents: 2
Difficulty: Intermediate
Source: coopetition_gym/envs/dyadic_envs.py

Overview

TrustDilemma-v0 implements a continuous iterated Prisoner’s Dilemma in which payoffs evolve with an endogenous trust state. Unlike the classic discrete Prisoner’s Dilemma, agents choose continuous cooperation levels, and the reward structure is dynamically modulated by the current trust between agents.

This environment tests whether reinforcement learning agents can learn long-horizon impulse control: resisting the temptation of short-term defection gains in order to maintain the trust that enables higher long-term payoffs.

MARL Classification

Property	Value
Game Type	Markov Game (2-player, general-sum)
Cooperation Structure	Mixed-Motive (cooperation creates value, competition captures it)
Observability	Full (all state variables observable to both agents)
Communication	Implicit (through actions only)
Agent Symmetry	Symmetric (identical endowments, baselines, capabilities)
Reward Structure	Mixed (individual + interdependence-weighted partner rewards)
Action Space	Continuous, bounded: $A_i = [0, 100]$
State Dynamics	Deterministic (given actions, next state is deterministic)
Horizon	Finite, T = 100 steps (or early termination on trust collapse)
Canonical Comparison	Continuous-action Iterated Prisoner’s Dilemma with state-dependent payoffs; cf. Lerer & Peysakhovich (2017) “Maintaining Cooperation in Complex Social Dilemmas”

Formal Specification

This environment is formalized as a 2-player Markov Game M = (N, S, {A_i}, P, {R_i}, T).

Agents

N = {1, 2} (symmetric dyad)

State Space

S ⊆ ℝ¹⁷ with components:

Component	Symbol	Dimension	Range	Description
Actions	a	2	[0, 100]	Previous cooperation levels
Trust Matrix	$\tau$	4	[0, 1]	Pairwise trust $\tau_{ij}$
Reputation Damage	$R$	4	[0, 1]	Accumulated damage $R_{ij}$
Interdependence	D	4	[0, 1]	Structural dependencies
Metadata	m	3	varies	Timestep, auxiliary info

Total dimension: d = 17

Action Space

For each agent $i \in \{1, 2\}$:

\[A_i = [0, e_i] = [0, 100] \subset \mathbb{R}\]

where $e_i = 100$ is the endowment. Actions represent cooperation level (investment in joint value creation).

Uniaxial Treatment: This environment uses the single-dimension action space characteristic of Coopetition-Gym v1.x. Competition emerges through structural parameters (interdependence, bargaining shares) rather than explicit competitive actions.

Transition Dynamics

Trust Update (TR-2 dynamics):

τ_ij(t+1) = clip(τ_ij(t) + Δτ_ij, 0, Θ_ij)

where the trust ceiling $\Theta_{ij} = 1 - R_{ij}$ and the update is:

Δτ_ij = λ⁺ · max(0, σ_j) · (1 - τ_ij) - λ⁻ · max(0, -σ_j) · τ_ij

with cooperation signal:

σ_j = κ · (a_j - b_j) / b_j

where b_j is agent j’s baseline cooperation level and κ is the signal sensitivity.

Reputation Update:

R_ij(t+1) = R_ij(t) · (1 - δ_R) + μ_R · 𝟙[σ_j < -threshold]
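The two update rules above can be read as a single step for one directed pair (how much agent i trusts agent j). The following is a minimal sketch, not the library's implementation: the function name `tr2_step`, the baseline of 35, the violation threshold of 0.5, the cap of damage at 1, and the choice to update reputation before applying the ceiling are all assumptions made for illustration. The rates are taken from the Trust Dynamics parameter table.

```python
def tr2_step(trust, damage, action_j, baseline=35.0,
             lam_plus=0.15, lam_minus=0.45, kappa=1.5,
             mu_r=0.50, delta_r=0.02, threshold=0.5):
    """One illustrative TR-2 step for the pair (i, j).

    baseline and threshold are assumed values, not taken from the source.
    """
    # Cooperation signal: positive when j invests above baseline.
    sigma = kappa * (action_j - baseline) / baseline

    # Asymmetric trust change: slow build toward 1, faster erosion toward 0.
    delta = (lam_plus * max(0.0, sigma) * (1.0 - trust)
             - lam_minus * max(0.0, -sigma) * trust)

    # Reputation damage decays, and jumps when the signal is a clear violation.
    violation = 1.0 if sigma < -threshold else 0.0
    damage = min(1.0, damage * (1.0 - delta_r) + mu_r * violation)

    # Trust is clipped to [0, ceiling], with ceiling = 1 - damage.
    ceiling = 1.0 - damage
    trust = min(max(trust + delta, 0.0), ceiling)
    return trust, damage


# Same starting point; partner plays 25 above vs. 25 below the assumed baseline.
up, _ = tr2_step(0.5, 0.0, action_j=60.0)
down, damage = tr2_step(0.5, 0.0, action_j=10.0)
print(round(up - 0.5, 4), round(down - 0.5, 4), damage)
```

Starting from τ = 0.5, where τ and 1 - τ are equal, the downward move is three times the upward one, which is the ratio λ⁻/λ⁺.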

Reward Function

Agent i receives integrated utility:

r_i(s, a) = U_i(a) = π_i(a) + Σ_j D_ij · π_j(a)

where private payoff π_i is:

π_i(a) = (e_i - a_i) + f(a_i) + α_i · G(a)

with:

Retained resources: e_i - a_i
Individual value: f(a_i) = θ · ln(1 + a_i), θ = 20.0
Synergy share: $\alpha_i \cdot G(a)$ where $G(a) = (a_1 \cdot a_2)^{1/2} \cdot (1 + \gamma \cdot C(a))$
Complementarity: C(a) = min(a_1/e_1, a_2/e_2), γ = 0.70
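The reward terms listed above can be assembled as follows. This is a hedged sketch of the stated formulas only: the function names are hypothetical, α = 0.5 is borrowed from the symmetric case in the equilibrium analysis, the interdependence weights of 0.3 are placeholder values, and the sum over j is assumed to exclude j = i. Values returned by the environment itself may differ.

```python
import math

THETA = 20.0       # individual value scale (from the text)
GAMMA = 0.70       # complementarity coefficient (from the text)
ENDOWMENT = 100.0  # e_i for both agents


def synergy(a1, a2, e1=ENDOWMENT, e2=ENDOWMENT):
    """G(a) = sqrt(a1 * a2) * (1 + gamma * min(a1/e1, a2/e2))."""
    complementarity = min(a1 / e1, a2 / e2)
    return math.sqrt(a1 * a2) * (1.0 + GAMMA * complementarity)


def private_payoff(a_own, a_other, alpha=0.5, e_own=ENDOWMENT):
    """pi_i = retained resources + individual value + synergy share."""
    retained = e_own - a_own
    individual = THETA * math.log(1.0 + a_own)
    return retained + individual + alpha * synergy(a_own, a_other)


def integrated_utility(a1, a2, d12=0.3, d21=0.3):
    """U_i = pi_i + D_ij * pi_j; d12 and d21 are hypothetical weights."""
    pi1 = private_payoff(a1, a2)
    pi2 = private_payoff(a2, a1)
    return pi1 + d12 * pi2, pi2 + d21 * pi1


u1, u2 = integrated_utility(60.0, 55.0)
print(f"U_1 = {u1:.2f}, U_2 = {u2:.2f}")
```

Under these assumptions the agent who contributes less (55) ends the step with the higher utility, which is the free-riding incentive in miniature.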

Episode Structure

Horizon: T = 100 steps
Truncation: t ≥ T
Termination: mean(τ) < 0.05 (trust collapse)
Discount factor: 1.0 (undiscounted finite horizon; distinct from the complementarity coefficient γ)

Initial State

$\tau_{ij}(0) = 0.50$ for all $i \neq j$
$R_{ij}(0) = 0.00$ for all $i, j$
$a(0) = (0, 0)$

Game-Theoretic Background

The Trust Dilemma

In many real-world partnerships, the classic Prisoner’s Dilemma structure is complicated by:

Continuous choices: Partners don’t simply “cooperate” or “defect”; they choose how much effort, investment, or commitment to contribute
Dynamic payoffs: The value of cooperation depends on the relationship’s current state
Trust sensitivity: Past behavior affects future opportunities through trust accumulation

TrustDilemma-v0 captures these dynamics by:

Using continuous action spaces (cooperation levels from 0% to 100% of endowment)
Modulating rewards by trust level (higher trust amplifies cooperative gains)
Implementing asymmetric trust dynamics (trust builds slowly but erodes quickly)

Strategic Implications

Short-term incentive: Defecting (low cooperation) captures immediate gains while free-riding on partner’s contributions.

Long-term incentive: Maintaining cooperation preserves trust, which:

Amplifies future payoffs from joint value creation
Prevents the trust ceiling from limiting recovery options
Sustains access to synergistic benefits

The core challenge is learning that today’s defection constrains tomorrow’s possibilities.

Theoretical Foundations

Relationship to Classical Game Theory

TrustDilemma-v0 extends the classical Iterated Prisoner’s Dilemma (IPD) by incorporating:

Continuous action spaces: Rather than discrete {Cooperate, Defect}, agents choose cooperation intensity $a_i \in [0, 100]$
State-dependent payoffs: Rewards are modulated by endogenous trust, creating a Markov Game rather than a repeated normal-form game
Asymmetric dynamics: The 3:1 negativity bias (λ⁻/λ⁺) captures empirically observed trust asymmetry (Slovic, 1993)
Reputation hysteresis: Cumulative damage creates irreversibility absent from classical models

Key Theoretical Results

Stage-Game Analysis: In the single-shot version (ignoring trust dynamics), the environment resembles a continuous public goods game:

Nash equilibrium (myopic): $a^* \approx 35$ (baseline contribution)
- At this level, marginal cost of contribution equals marginal private benefit
Pareto-optimal outcome: $a^* = 100$ (full cooperation)
- Social welfare maximized when both agents fully invest
Price of Anarchy: PoA ≈ 1.55
- Ratio of optimal social welfare to welfare at Nash equilibrium

Repeated Game Considerations: With T = 100 repetitions and trust dynamics, the Folk Theorem applies conditionally:

Folk Theorem applicability: Partial
- Finite horizon limits exact Folk Theorem results
- However, trust dynamics create state-dependent continuation values that support cooperation
Subgame-perfect equilibria:
- Mutual defection (a = 35) is always an SPE
- Cooperative equilibria (a > 35) sustainable when trust amplification exceeds defection temptation
- Critical threshold: τ* ≈ 0.45 for cooperation to be self-enforcing

Trust-Mediated Cooperation: The trust dynamics create a novel mechanism for cooperation:

Cooperation sustainable iff: ∂U/∂τ × ∂τ/∂a > temptation_gain

Where the left side captures the long-term value of trust investment.

Connections to Prior Work

Concept	TrustDilemma-v0	Classical Reference
Continuous cooperation	$a_i \in [0, e_i]$	Discrete {C, D} in Axelrod (1984)
State-dependent payoffs	Trust modulation $\tau$	Stateless in classical IPD
Asymmetric dynamics	$\lambda^- = 3 \times \lambda^+$	Symmetric in standard models
Reputation effects	Ceiling $\Theta = 1 - R$	No reputation in basic IPD
Complementarity	$\gamma$-weighted synergy	Not present in classical

Literature Connections

Axelrod (1984): The foundational work on IPD tournaments. TrustDilemma-v0 extends this by:

Continuous actions (allows graduated responses)
Trust state (enables history-dependent but Markovian strategies)
Complementarity (rewards coordinated high cooperation)

Lerer & Peysakhovich (2017): “Maintaining Cooperation in Complex Social Dilemmas” studies deep RL in social dilemmas. TrustDilemma-v0 provides:

Similar mixed-motive structure
Richer state dynamics (trust, reputation)
Continuous action complexity

Leibo et al. (2017): Sequential social dilemmas in grid-worlds. TrustDilemma-v0 differs by:

Direct strategic interaction (not spatially mediated)
Explicit trust dynamics (not emergent from grid mechanics)
Continuous rather than discrete action timing

Equilibrium Analysis

Stage-Game Nash Equilibrium

In the single-shot game (ignoring trust dynamics), we analyze best responses:

Best Response Functions: For agent $i$ with utility $U_i = (e_i - a_i) + \theta \cdot \ln(1 + a_i) + \alpha_i \cdot G(\mathbf{a}) + D_{ij} \cdot \pi_j$:

∂U_i/∂a_i = -1 + θ/(1 + a_i) + (α_i + D_ij·α_j)·∂G/∂a_i = 0

Solving for interior solutions (implicitly, because ∂G/∂a_i itself depends on a):

a_i* = θ / (1 - (α_i + D_ij·α_j)·∂G/∂a_i) - 1

Nash Equilibrium (Symmetric Case): With $\theta = 20$, $\alpha = 0.50$, and moderate complementarity:

Myopic NE: $a^* \approx 35$ (baseline level)
Both agents contribute at the minimum expected level
Neither has unilateral incentive to deviate

Interpretation: The stage-game NE represents mutual caution: each agent invests just enough to avoid being seen as defecting.
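To see how a stage-game best response can be located numerically, the sketch below maximizes the private payoff π_i over a grid of a_i for a fixed partner action. It is an illustration under assumptions: it ignores the interdependence term (D_ij = 0), uses α = 0.5 and equal endowments, and the helper names are hypothetical. Because min(·) makes the payoff kinked at a_i = a_j, a grid search is used rather than a closed-form solution of the first-order condition.

```python
import math


def private_payoff(a_i, a_j, theta=20.0, alpha=0.5, gamma=0.70, e=100.0):
    """pi_i(a) as stated in the Reward Function section (equal endowments)."""
    synergy = math.sqrt(a_i * a_j) * (1.0 + gamma * min(a_i, a_j) / e)
    return (e - a_i) + theta * math.log(1.0 + a_i) + alpha * synergy


def best_response(a_j, step=0.5, e=100.0):
    """Grid-search argmax of pi_i over a_i in [0, e] for a fixed a_j."""
    grid = [k * step for k in range(int(e / step) + 1)]
    return max(grid, key=lambda a_i: private_payoff(a_i, a_j))


for a_j in (0.0, 35.0, 100.0):
    print(f"partner plays {a_j:5.1f} -> best response {best_response(a_j):5.1f}")
```

Under these assumptions the best response to a fully cooperating partner lies well below 100, which is the short-term temptation to under-contribute, while the best response to a partner at 35 stays in the mid-30s.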

Pareto Frontier

The welfare-maximizing profile, which is Pareto-optimal, solves:

max_{a_1, a_2} W = U_1 + U_2
subject to: a_i ∈ [0, 100]

Pareto-optimal outcomes:

Full cooperation: (a_1, a_2) = (100, 100)
- Total welfare: W* ≈ 225
Pareto improvements over the myopic NE: All symmetric profiles (a, a) where a > 35

Social Welfare Comparison:

Profile	Agent 0 Utility	Agent 1 Utility	Total	Trust Δ
(35, 35)	72.3	72.3	144.6	-0.02
(50, 50)	85.1	85.1	170.2	+0.01
(70, 70)	98.2	98.2	196.4	+0.04
(100, 100)	112.5	112.5	225.0	+0.06

Price of Anarchy: PoA = W*/W_NE ≈ 225/145 ≈ 1.55
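The ratio can be reproduced directly from the totals in the Social Welfare Comparison table. The snippet below is only that arithmetic; it uses the unrounded Nash total of 144.6, so the result is slightly above the figure obtained with the rounded denominator of 145.

```python
# Total welfare per symmetric profile, copied from the table above.
welfare = {35: 144.6, 50: 170.2, 70: 196.4, 100: 225.0}

w_ne = welfare[35]             # myopic Nash profile (35, 35)
w_opt = max(welfare.values())  # best tabulated profile, here (100, 100)

poa = w_opt / w_ne
print(f"PoA = {w_opt} / {w_ne} = {poa:.3f}")
```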

Repeated Game Equilibria

With T = 100 repetitions and trust dynamics, richer equilibria emerge:

Grim Trigger Strategy:

Play a_high if no defection observed
Play a_low forever after any defection

Grim trigger supports cooperation when:

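V_coop/(1-δ) ≥ V_defect + δ·V_punishment/(1-δ)

Here V_coop, V_defect, and V_punishment are the per-step payoffs from cooperating, from a one-shot deviation, and under punishment, and δ is an effective discount factor that accounts for trust dynamics.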

Trust-Augmented Trigger: The trust dynamics provide a natural “soft” trigger:

Defection erodes trust (λ⁻ = 0.45)
Eroded trust reduces future payoffs
Creates self-enforcing cooperation without explicit punishment

Cooperative Equilibrium Conditions: Cooperation (a > baseline) is sustainable in equilibrium when:

τ > τ* where τ* ≈ 0.45

Below τ*, the trust-mediated payoff amplification is insufficient to deter defection.

Trust-Mediated Equilibrium Dynamics

The environment creates a novel equilibrium structure:

Trust Threshold Effects:

High trust regime (τ > 0.70):
- Cooperation strongly reinforced
- High payoffs sustain investment
- Robust to small deviations
Medium trust regime (0.30 < τ < 0.70):
- Multiple equilibria possible
- Coordination challenge
- History-dependent outcomes
Low trust regime (τ < 0.30):
- Defection dominates
- Recovery difficult (3:1 negativity bias)
- Approaching termination threshold

Basin of Attraction: Starting from τ₀ = 0.50:

Sustained cooperation → converges to high-trust equilibrium
Early defection → converges to low-trust/collapse

The separating trajectory depends on initial cooperation and response to early deviations.
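The two basins can be illustrated by iterating the trust update under a constant, symmetric action. This is a simplified sketch, not the environment's code: it assumes a baseline of 35, ignores reputation damage (so the ceiling stays at 1), and treats a single trust value as the mean because both agents play the same action. The function name is hypothetical.

```python
def simulate_trust(action, steps=100, trust=0.5, baseline=35.0,
                   lam_plus=0.15, lam_minus=0.45, kappa=1.5, collapse=0.05):
    """Trust path when both agents repeat the same action every step.

    Returns (trajectory, collapsed_flag). Reputation damage is ignored.
    """
    sigma = kappa * (action - baseline) / baseline
    path = [trust]
    for _ in range(steps):
        if sigma > 0:
            trust += lam_plus * sigma * (1.0 - trust)   # slow build
        else:
            trust += lam_minus * sigma * trust          # faster erosion
        trust = min(max(trust, 0.0), 1.0)
        path.append(trust)
        if trust < collapse:
            return path, True  # trust collapse ends the episode early
    return path, False


coop_path, coop_collapsed = simulate_trust(action=60.0)
defect_path, defect_collapsed = simulate_trust(action=10.0)
print(len(coop_path) - 1, round(coop_path[-1], 3), coop_collapsed)
print(len(defect_path) - 1, round(defect_path[-1], 3), defect_collapsed)
```

Under these assumptions sustained cooperation drives trust toward the ceiling over the full horizon, while sustained defection at 10 crosses the 0.05 collapse threshold within a handful of steps.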

Equilibrium Selection in MARL

For RL agents learning in this environment:

Expected Learning Dynamics:

Self-play with exploration: May converge to either equilibrium
- High exploration → samples cooperative outcomes → possible convergence to cooperation
- Greedy exploitation → myopic defection → convergence to low equilibrium
Curriculum considerations:
- Starting with high initial trust helps discover cooperative equilibrium
- Training with trust “warm-starts” improves convergence

Algorithm Implications:

Algorithm	Expected Equilibrium	Notes
Independent PPO	Mixed/Low	Coordination challenge
MAPPO (shared)	High possible	Shared critic helps
MADDPG	Medium	Centralized training helps
LOLA	High likely	Models opponent adaptation

Environment Specification

Basic Usage

import coopetition_gym
import numpy as np

# Create environment
env = coopetition_gym.make("TrustDilemma-v0")

# Reset with seed for reproducibility
obs, info = env.reset(seed=42)

# Run episode
done = False
total_rewards = np.zeros(2)

while not done:
    # Both agents choose cooperation levels
    actions = np.array([60.0, 55.0])  # 60% and 55% of endowments
    obs, rewards, terminated, truncated, info = env.step(actions)
    total_rewards += rewards
    done = terminated or truncated

print(f"Episode rewards: {total_rewards}")
print(f"Final trust: {info['mean_trust']:.3f}")

Parameters

Parameter	Default	Description
`max_steps`	100	Maximum timesteps per episode
`render_mode`	None	Rendering mode (“human”, “ansi”, or None)

Spaces

Observation Space

Type: Box
Shape: (17,) for 2 agents
Dtype: float32

Index	Component	Shape	Description
0-1	Actions	(2,)	Last cooperation levels
2-5	Trust Matrix	(2,2)	Pairwise trust levels [0,1]
6-9	Reputation Matrix	(2,2)	Pairwise reputation damage [0,1]
10-13	Interdependence	(2,2)	Structural dependencies
14-16	Step Info	(3,)	Normalized timestep and metadata
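When working with the flat vector it can help to split it into named parts. The helper below is a hypothetical convenience function, not part of the package; it assumes the index layout in the table above and that each 2×2 matrix is stored in row-major order.

```python
def unpack_observation(obs):
    """Split a flat 17-element observation into named parts.

    Assumes the documented index layout and row-major 2x2 matrices.
    """
    values = [float(x) for x in obs]
    if len(values) != 17:
        raise ValueError(f"expected 17 values, got {len(values)}")

    def as_matrix(flat):
        # Four consecutive entries become two rows of two.
        return [flat[0:2], flat[2:4]]

    return {
        "last_actions": values[0:2],
        "trust": as_matrix(values[2:6]),
        "reputation_damage": as_matrix(values[6:10]),
        "interdependence": as_matrix(values[10:14]),
        "step_info": values[14:17],
    }


# Example with a synthetic observation (not produced by the environment).
fake_obs = list(range(17))
parts = unpack_observation(fake_obs)
print(parts["trust"])
```

The function accepts any sequence of 17 numbers, so the array returned by `env.step` can be passed in directly if it has the documented shape.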

Action Space

Type: Box
Shape: (2,) for joint actions
Dtype: float32
Range: [0.0, 100.0] for each agent

Actions represent the cooperation level (investment amount) for each agent. Higher values indicate more cooperation.

Reward Structure

Integrated Utility

Rewards are computed using the integrated utility framework (TR-1):

U_i = private_payoff_i + Σ_j (D_ij × private_payoff_j)

Where:

private_payoff_i = (endowment - action) + f(action) + α_i × synergy
D_ij is the interdependence weight (how much agent i values agent j’s outcomes)
synergy is the collaborative surplus from joint investment

Trust Modulation

Payoffs are amplified by the current trust level:

effective_payoff = base_payoff × (1 + κ × trust_level)

This means:

High trust (τ ≥ 0.8): Cooperation yields amplified returns
Low trust (τ ≤ 0.2): Even cooperative actions yield diminished returns
Trust collapse (mean τ < 0.05): Episode terminates
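As a quick worked example of the modulation formula, the sketch below evaluates the multiplier at three trust levels. It assumes that the κ in this formula is the same κ = 1.5 listed as signal sensitivity in the parameter table, which the source does not confirm; the function name and the base payoff of 100 are illustrative.

```python
def effective_payoff(base_payoff, trust, kappa=1.5):
    """effective = base * (1 + kappa * trust); kappa = 1.5 is an assumption."""
    return base_payoff * (1.0 + kappa * trust)


base = 100.0
for trust in (0.2, 0.5, 0.8):
    print(f"trust={trust:.1f} -> {effective_payoff(base, trust):.1f}")
```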

Trust Dynamics

Parameters

Parameter	Symbol	Value	Description
Trust Building Rate	λ⁺	0.15	Rate of trust increase from cooperation
Trust Erosion Rate	λ⁻	0.45	Rate of trust decrease from defection
Reputation Damage	$\mu_R$	0.50	Damage coefficient from violations
Reputation Decay	$\delta_R$	0.02	Rate of reputation forgiveness
Interdependence Amp.	ξ	0.60	Amplification from dependencies
Signal Sensitivity	κ	1.5	Action-to-signal conversion
Initial Trust	τ₀	0.50	Starting trust level

Update Mechanism

Trust updates follow TR-2 dynamics:

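# Illustrative helper (not a library function); defaults from the table above
def update_trust(trust, action, baseline, reputation_damage,
                 lambda_plus=0.15, lambda_minus=0.45, kappa=1.5):
    # Cooperation signal from the partner's action (positive = cooperative)
    signal = kappa * (action - baseline) / baseline

    # Asymmetric update: trust builds slowly and erodes faster
    if signal > 0:
        delta_trust = lambda_plus * signal * (1 - trust)
    else:
        delta_trust = lambda_minus * signal * trust  # signal <= 0, so trust falls

    # Clip to [0, ceiling]; reputation damage lowers the ceiling
    ceiling = 1 - reputation_damage
    return min(max(trust + delta_trust, 0.0), ceiling)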

Key Properties

Negativity Bias (3:1): Trust erodes 3× faster than it builds
Trust Ceiling: Reputation damage caps trust at 1 - R, and the cap lifts only as damage decays at δ_R = 0.02 per step
Hysteresis: Once reputation is damaged, trust cannot fully recover until that damage has decayed
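How long the ceiling stays depressed follows from the reputation update alone. The sketch below counts violation-free steps until the ceiling 1 - R climbs back to a target, assuming damage only decays by the factor (1 - δ_R) each step and that a single violation has added μ_R = 0.50. The function name is hypothetical.

```python
def steps_until_ceiling(target, damage=0.5, delta_r=0.02, max_steps=1000):
    """Violation-free steps until the ceiling 1 - R reaches target.

    Assumes pure decay between violations: R <- R * (1 - delta_r).
    """
    for step in range(max_steps + 1):
        if 1.0 - damage >= target:
            return step
        damage *= (1.0 - delta_r)
    return None  # target not reached within max_steps


# One violation adds mu_R = 0.50 of damage, so the ceiling drops to 0.5.
print(steps_until_ceiling(0.9))
```

With the table's values, a single violation keeps the ceiling below 0.9 for 80 steps, which is most of a 100-step episode.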

Value Function

Logarithmic Specification

TrustDilemma-v0 uses logarithmic value creation:

f(aᵢ) = θ × ln(1 + aᵢ)
G(a₁, a₂) = √(a₁ × a₂) × (1 + γ × complementarity)

Parameter	Value	Description
θ	20.0	Logarithmic scale factor
γ	0.70	Complementarity coefficient

Complementarity Effect

When both agents cooperate highly, the complementarity bonus amplifies returns:

complementarity = min(a₁/e₁, a₂/e₂)  # Bottleneck by lower cooperator
bonus = γ × complementarity

This creates strong incentives for mutual high cooperation.

Episode Dynamics

Termination Conditions

The episode ends when either condition holds:

Truncation: Maximum steps (100) reached
Termination: Mean trust falls below 0.05 (trust collapse)
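The two end conditions map onto the `terminated` and `truncated` flags returned by `env.step`. The following is a sketch of that mapping rather than the environment's own logic: the function name is hypothetical, which trust entries enter the mean is an assumption, and reporting only `terminated` when both conditions hold at once is a design choice made here.

```python
def episode_flags(step, trust_values, max_steps=100, collapse=0.05):
    """Return (terminated, truncated) for the current step.

    trust_values: the pairwise trust levels to average (assumed input).
    """
    mean_trust = sum(trust_values) / len(trust_values)
    terminated = mean_trust < collapse                   # trust collapse
    truncated = (step >= max_steps) and not terminated   # horizon reached
    return terminated, truncated


print(episode_flags(step=12, trust_values=[0.04, 0.03]))
print(episode_flags(step=100, trust_values=[0.60, 0.55]))
```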

Typical Trajectories

Mutual Cooperation:

Trust rises toward ceiling (~0.8-0.9)
Rewards increase over time
Episode completes at max steps

Mutual Defection:

Trust declines rapidly
Rewards remain low and flat
May terminate early from trust collapse

Mixed Strategy:

Trust oscillates based on action patterns
Rewards fluctuate with trust level
Vulnerable to defection spirals

Metrics and Info

The info dictionary contains:

Key	Type	Description
`step`	int	Current timestep
`mean_trust`	float	Average trust level
`mean_reputation_damage`	float	Average reputation damage
`total_value`	float	Total value created this step
`mean_cooperation`	float	Mean cooperation level
`cooperation_rate`	float	Cooperation as % of endowments
`trust_matrix`	ndarray	Full trust matrix (2×2)
`reputation_matrix`	ndarray	Full reputation matrix (2×2)

Example: Tit-for-Tat Strategy

import coopetition_gym
import numpy as np

env = coopetition_gym.make("TrustDilemma-v0")
obs, info = env.reset(seed=42)

# Tit-for-Tat for agent 0: start cooperative, then mirror the partner
my_action = 60.0  # Start with 60% cooperation

for step in range(100):
    # Illustrative partner policy (an assumption for this demo):
    # cooperate at 60, but drop to 20 on every tenth step
    partner_action = 20.0 if step % 10 == 9 else 60.0

    actions = np.array([my_action, partner_action], dtype=np.float32)
    obs, rewards, terminated, truncated, info = env.step(actions)

    # Mirror the partner's last action, read from the observation
    # (index 1 holds agent 1's previous cooperation level)
    my_action = float(obs[1])

    if terminated or truncated:
        break

print(f"Final trust: {info['mean_trust']:.3f}")

Research Applications

TrustDilemma-v0 is suitable for studying:

Cooperation Emergence: How agents learn to sustain cooperation
Trust Building: Strategies for trust recovery after defection
Impulse Control: Resisting short-term temptation for long-term gain
Reciprocity: Tit-for-tat and related strategies in continuous settings
MARL Algorithms: Comparing PPO, MAPPO, and other algorithms

Baseline Results

Benchmark results following the Evaluation Protocol.

Evaluation Configuration

Parameter	Value
Episodes	100
Seeds	0-99
Horizon	100 steps
Training seeds	100-104 (5 runs)

Performance Comparison

Algorithm	Mean Return	Std	Final Trust	Coop Rate	Training Steps
Random	82.4	14.2	0.28	0.49	-
Constant(0.35)	98.6	6.3	0.42	0.35	-
Constant(0.50)	112.8	7.1	0.56	0.50	-
Constant(0.75)	134.2	8.4	0.68	0.75	-
Tit-for-Tat	128.5	10.8	0.64	0.58	-
IPPO	145.2	11.3	0.72	0.62	500K
MAPPO	162.8	9.2	0.78	0.68	500K

Results averaged over 5 training seeds. Mean Return is the sum of both agents’ episode returns.

Learning Curve Characteristics

Random: Baseline lower bound; trust decays due to inconsistent behavior
Constant policies: Stable but suboptimal; no adaptation
Tit-for-Tat: Strong initial performance; sensitive to early defection
IPPO: Converges around 200K steps; coordination challenge leads to variance
MAPPO: Faster convergence (~150K steps); shared critic aids coordination

Recommended Hyperparameters

# PPO configuration for TrustDilemma-v0
algorithm: PPO
learning_rate: 3e-4
n_steps: 2048
batch_size: 64
n_epochs: 10
gamma: 0.99
gae_lambda: 0.95
ent_coef: 0.01
network:
  hidden_layers: [128, 128]

Related Environments

PartnerHoldUp-v0: Adds asymmetric power dynamics
RecoveryRace-v0: Focuses on trust recovery
SLCD-v0: Validated real-world case study

References

Pant, V. & Yu, E. (2025). Computational Foundations for Strategic Coopetition: Formalizing Interdependence and Complementarity. arXiv:2510.18802
Pant, V. & Yu, E. (2025). Computational Foundations for Strategic Coopetition: Formalizing Trust and Reputation Dynamics. arXiv:2510.24909
Axelrod, R. (1984). The Evolution of Cooperation. Basic Books.
