
ReciprocalDilemma-v0

Category: Reciprocity Environment (TR-4)
Agents: 2
Difficulty: Intermediate
Source: coopetition_gym/envs/reciprocity_envs.py


Overview

ReciprocalDilemma-v0 implements a continuous iterated Prisoner’s Dilemma with TR-4 reciprocity dynamics. Two symmetric firms decide cooperation levels in a shared project, where reciprocity enables tit-for-tat-like conditional cooperation through bounded memory windows.

The environment tests whether reinforcement learning agents can learn conditional cooperation: responding to partner behavior over recent history rather than relying solely on slow-moving trust dynamics.


MARL Classification

| Property | Value |
| --- | --- |
| Game Type | 2-player Markov Game (general-sum) |
| Cooperation Structure | Mixed-Motive (cooperation vs. exploitation) |
| Observability | Full (all state variables observable) |
| Communication | Implicit (through actions only) |
| Agent Symmetry | Symmetric (identical capabilities) |
| Reward Structure | Integrated utility with reciprocity modifier |
| Action Space | Continuous, bounded: $A_i = [0, 100]$ |
| State Dynamics | Deterministic |
| Horizon | Finite, $T = 100$ steps |
| Canonical Comparison | Iterated PD; Axelrod (1984); Killingback & Doebeli (2002) |

Formal Specification

Mathematical Framework (TR-4)

Cooperation Signal (Eq 19): \(s_{ij} = a_j - \bar{a}_j\)

Where $\bar{a}_j$ is the memory average of agent $j$’s recent actions.

Memory Average (Eq 20): \(\bar{a}_j = \frac{1}{\min(k, t-1)} \sum_{\tau=\max(1,t-k)}^{t-1} a_j^\tau\)

Where $k = 5$ is the memory window length.

Bounded Response (Eq 21): \(\varphi(x) = \tanh(\kappa \cdot x)\)

Where $\kappa = 1.0$ controls response sensitivity.

Reciprocity Sensitivity (Eq 23): \(\rho_{ij} = \rho_0 \cdot D_{ij}^\eta\)

Where $\rho_0 = 1.0$ and $\eta = 1.0$.

Reciprocity Modifier (Eq 44): \(U_{\text{recip},i} = \lambda_R \sum_{j \neq i} T_{ij} \cdot (1 + \omega D_{ij}) \cdot \rho_{ij} \cdot \varphi(s_{ij})\)

Where $\lambda_R = 1.0$ and $\omega = 0.6$.
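
Taken together, Eqs 19-21, 23, and 44 compose as in the minimal Python sketch below, written against the default parameter values listed under TR-4 Parameters. The authoritative implementation lives in coopetition_gym/envs/reciprocity_envs.py; treat this as an illustrative reading of the equations, not the library's API.

```python
import numpy as np

# Default TR-4 parameters (see the TR-4 Parameters table below)
K = 5           # memory window k
KAPPA = 1.0     # response sensitivity
RHO_0 = 1.0     # base reciprocity
ETA = 1.0       # dependency elasticity
LAMBDA_R = 1.0  # reciprocity weight
OMEGA = 0.6     # dependency amplification

def memory_average(history_j):
    """Eq 20: mean of agent j's last min(k, t-1) actions."""
    return float(np.mean(history_j[-K:]))

def reciprocity_term(a_j, abar_j, T_ij, D_ij):
    """Single-partner term of Eq 44, composed from Eqs 19, 21, and 23."""
    s_ij = a_j - abar_j              # Eq 19: cooperation signal
    phi = np.tanh(KAPPA * s_ij)      # Eq 21: bounded response
    rho_ij = RHO_0 * D_ij ** ETA     # Eq 23: reciprocity sensitivity
    return LAMBDA_R * T_ij * (1 + OMEGA * D_ij) * rho_ij * phi
```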

State Space

$S \subseteq \mathbb{R}^d$ with components:

| Component | Symbol | Description |
| --- | --- | --- |
| Actions | $a$ | Previous cooperation levels |
| Trust Matrix | $T$ | Pairwise trust (from TR-2) |
| Reputation | $R$ | Accumulated reputation damage |
| Interdependence | $D$ | Structural dependencies |
| Memory | $\bar{a}$ | Recent action averages |

Action Space

For each agent $i$: \(A_i = [0, e_i] = [0, 100] \subset \mathbb{R}\)

Actions represent cooperation level in the shared project.

Uniaxial Treatment: This environment uses the single-dimension action space characteristic of Coopetition-Gym v1.x. Competitive dynamics emerge through the PD payoff structure and reciprocity responses rather than explicit competitive actions.

Reward Function

Rewards combine integrated utility (TR-1/TR-2) with reciprocity modifier (TR-4):

\[r_i = \pi_i^{\text{base}} \cdot m_{\text{recip},i}\]

Where $m_{\text{recip},i}$ is the multiplicative reciprocity modifier derived from Eq 44.
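
As a worked instance of Eq 44 with the default parameters (values purely illustrative): suppose $T_{ij} = 0.5$, $D_{ij} = 0.5$, and the partner plays $a_j = 40$ against a memory average $\bar{a}_j = 60$. Then $s_{ij} = -20$, $\varphi(s_{ij}) = \tanh(-20) \approx -1$, $\rho_{ij} = 1.0 \cdot 0.5 = 0.5$, and

\[U_{\text{recip},i} = 1.0 \cdot 0.5 \cdot (1 + 0.6 \cdot 0.5) \cdot 0.5 \cdot (-1) \approx -0.325\]

a negative reciprocity contribution that pulls the associated reward multiplier below its cooperative baseline.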


Distinction from TrustDilemma-v0

| Aspect | TrustDilemma-v0 | ReciprocalDilemma-v0 |
| --- | --- | --- |
| Mechanism | TR-2 trust dynamics (slow erosion/building) | TR-4 behavioral reciprocity (fast, 1-5 step response) |
| Response Time | Trust changes over 10-20+ steps | Memory window of $k = 5$ steps |
| Adaptation | Gradual trust adjustment | Immediate reciprocal response |
| Key Equation | Trust update (Eqs 8-9) | Reciprocity modifier (Eq 44) |
| Strategy | Long-horizon impulse control | Conditional cooperation (tit-for-tat) |

Environment Specification

Basic Usage

```python
import coopetition_gym
import numpy as np

# Create environment
env = coopetition_gym.make("ReciprocalDilemma-v0")

# Reset
obs, info = env.reset(seed=42)

# Run episode with a constant cooperative strategy
for step in range(100):
    actions = np.array([60.0, 60.0])
    obs, rewards, terminated, truncated, info = env.step(actions)

    if terminated or truncated:
        break

print(f"Mean trust: {info['mean_trust']:.3f}")
```

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| max_steps | 100 | Maximum timesteps |
| render_mode | None | Rendering mode |
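
Assuming coopetition_gym.make forwards keyword arguments to the environment constructor, as gymnasium.make does (an assumption, not confirmed by this page), these can be overridden at creation time:

```python
import coopetition_gym

# Assumes make() forwards kwargs to the env constructor (Gymnasium convention).
env = coopetition_gym.make("ReciprocalDilemma-v0", max_steps=200)
```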

TR-4 Parameters

| Parameter | Symbol | Value | Description |
| --- | --- | --- | --- |
| Base reciprocity | $\rho_0$ | 1.0 | Reciprocity strength |
| Dependency elasticity | $\eta$ | 1.0 | How dependency scales reciprocity |
| Response sensitivity | $\kappa$ | 1.0 | Steepness of bounded response |
| Memory window | $k$ | 5 | Steps of recent history considered |
| Reciprocity weight | $\lambda_R$ | 1.0 | Overall reciprocity scaling |
| Dependency amplification | $\omega$ | 0.6 | Dependency boost in trust gating |

Spaces

Observation Space

Type: Box
Dtype: float32

Includes actions, trust matrix, reputation, interdependence, and step info.

Action Space

Type: Box
Shape: (2,)
Dtype: float32
Range: [0.0, 100.0] for each agent
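
Both spaces can be inspected at runtime; this assumes the environment exposes the standard Gymnasium observation_space and action_space attributes, consistent with the reset/step API shown above.

```python
import coopetition_gym

env = coopetition_gym.make("ReciprocalDilemma-v0")
print(env.observation_space)  # expected: a float32 Box
print(env.action_space)       # expected: Box(0.0, 100.0, (2,), float32)
```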


Metrics and Info

The info dictionary contains:

| Key | Type | Description |
| --- | --- | --- |
| step | int | Current timestep |
| mean_trust | float | Average trust level |
| cooperation_signals | dict | Per-pair $s_{ij}$ values |
| reciprocity_effects | dict | Per-pair reciprocity contributions |
| memory_averages | dict | Per-pair memory averages $\bar{a}_j$ |
| tr4_memory_window | int | Memory window $k$ |
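
Continuing from the usage example above, the TR-4 diagnostics can be read out of info after each step. How the per-pair dicts are keyed (index tuples vs. strings) is not specified on this page, so the loop below stays agnostic about it.

```python
obs, rewards, terminated, truncated, info = env.step(np.array([60.0, 60.0]))
print(info["step"], info["mean_trust"], info["tr4_memory_window"])
for pair, s in info["cooperation_signals"].items():
    print(f"s[{pair}] = {s:.3f}")  # cooperation signal per pair
```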

Key Dynamics

Reciprocity-Driven Cooperation

With symmetric dependencies ($D_{12} = D_{21} = 0.5$), Eq 23 gives identical reciprocity sensitivities for both agents, $\rho_{12} = \rho_{21} = \rho_0 \cdot 0.5^\eta = 0.5$, so each firm's cooperation is met with an equally weighted reciprocal response.

Defection Response

When one agent defects (a step-by-step demonstration follows the list):

  1. Cooperation signal $s_{ij}$ becomes negative (action below memory average)
  2. Bounded response $\varphi(s)$ maps to a negative value in $(-1, 0)$
  3. Reciprocity modifier reduces the defector's reward multiplier
  4. Fast response within $k = 5$ steps (unlike slow trust erosion)
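
The sketch below stages exactly this: mutual cooperation for 50 steps, then agent 1 drops its cooperation level. The chosen levels and probe steps are illustrative, and exact reward magnitudes depend on the implementation.

```python
import coopetition_gym
import numpy as np

env = coopetition_gym.make("ReciprocalDilemma-v0")
obs, info = env.reset(seed=0)

for step in range(100):
    # Mutual cooperation for 50 steps, then agent 1 defects to 20.0.
    a1 = 60.0 if step < 50 else 20.0
    obs, rewards, terminated, truncated, info = env.step(np.array([60.0, a1]))
    if step in (49, 55):  # just before, and k = 5 steps after, the defection
        print(f"step {step}: rewards = {rewards}, "
              f"mean_trust = {info['mean_trust']:.3f}")
    if terminated or truncated:
        break
```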

Research Applications

ReciprocalDilemma-v0 is suitable for studying:

  - Emergence of conditional cooperation (tit-for-tat-like strategies) in continuous action spaces
  - Fast behavioral reciprocity (TR-4) as a mechanism distinct from slow trust dynamics (TR-2)
  - Defection response under bounded memory windows

References

  1. Pant, V. & Yu, E. (2026). Computational Foundations for Strategic Coopetition: Formalizing Sequential Interaction and Reciprocity. arXiv:2604.01240.
  2. Axelrod, R. (1984). The Evolution of Cooperation. Basic Books.
  3. Killingback, T. & Doebeli, M. (2002). The Continuous Prisoner’s Dilemma and the Evolution of Cooperation through Reciprocal Altruism with Variable Investment. American Naturalist.