PartnerHoldUp-v0

Category: Dyadic Environment Agents: 2 (Strong + Weak) Difficulty: Advanced Source: coopetition_gym/envs/dyadic_envs.py

Overview

PartnerHoldUp-v0 models an asymmetric vertical relationship where one agent (the “strong” partner) holds structural power over another (the “weak” partner). This environment captures the dynamics of supplier-manufacturer relationships, small-firm-large-firm partnerships, and other scenarios where one party has greater bargaining power or outside options.

The key challenge is understanding how power asymmetry affects cooperation dynamics, and whether the weak partner can develop effective defensive strategies against potential exploitation.

Partner Hold-Up Asymmetry Dynamics Left: Structural power asymmetry showing differential dependencies (D_strong=0.35, D_weak=0.85). Right: Resulting asymmetric trust trajectories where the weak partner’s trust erodes faster under exploitation.

MARL Classification

Property	Value
Game Type	Markov Game (2-player, general-sum, asymmetric)
Cooperation Structure	Mixed-Motive with power asymmetry (exploitation risk)
Observability	Full (all state variables observable to both agents)
Communication	Implicit (through actions only)
Agent Symmetry	Asymmetric (different endowments: 120 vs 80, different dependencies)
Reward Structure	Mixed with asymmetric interdependence (D_strong=0.35, D_weak=0.85)
Action Space	Continuous, agent-specific: A_0=[0,120], A_1=[0,80]
State Dynamics	Deterministic
Horizon	Finite, T=100 (early termination if weak partner’s trust < 0.10)
Canonical Comparison	Asymmetric Prisoner’s Dilemma; cf. principal-agent models, Stackelberg games with endogenous commitment

Formal Specification

This environment is formalized as a 2-player asymmetric Markov Game M = (N, S, {A_i}, P, {R_i}, T).

Agents

N = {Strong, Weak} with heterogeneous capabilities:

Agent	Index	Endowment $e_i$	Baseline $b_i$	Bargaining $\alpha_i$
Strong	0	120.0	42.0 (35%)	0.60
Weak	1	80.0	28.0 (35%)	0.40

State Space

S ⊆ ℝ¹⁷ with identical structure to TrustDilemma-v0:

Component	Dimension	Description
Actions a	2	Previous cooperation levels
Trust τ	4	Pairwise trust matrix (flattened)
Reputation R	4	Reputation damage matrix
Interdependence D	4	Dependency matrix (fixed)
Metadata	3	Timestep, auxiliary

Action Space

Agent-specific continuous action spaces:

Strong: $A_0 = [0, 120] \subset \mathbb{R}$
Weak: $A_1 = [0, 80] \subset \mathbb{R}$

Uniaxial Treatment: This environment uses the single-dimension action space characteristic of Coopetition-Gym v1.x. Competition emerges through structural power asymmetry (D_strong=0.35, D_weak=0.85) rather than explicit competitive actions.

Interdependence Matrix (Critical Asymmetry)

D = | 0.00  0.35 |   (Strong depends moderately on Weak)
    | 0.85  0.00 |   (Weak depends heavily on Strong)

This creates vulnerability asymmetry: Weak’s utility is strongly coupled to Strong’s behavior, but not vice versa.

Transition Dynamics

Trust dynamics follow TR-2 with amplified asymmetry:

Trust Update:

τ_ij(t+1) = clip(τ_ij(t) + Δτ_ij · (1 + ξ · D_ij), 0, Θ_ij)

where $\xi = 0.70$ amplifies trust sensitivity based on dependency.

Critical Threshold: $\tau_{\text{weak} \to \text{strong}} < 0.10$ triggers termination (weak partner exits).

Reward Function

Integrated utility with asymmetric weights:

r_i(s, a) = π_i(a) + Σ_j D_ij · π_j(a)

Strong’s reward: Includes only 35% of Weak’s payoff Weak’s reward: Includes 85% of Strong’s payoff (highly coupled)

Private payoffs:

π_Strong = (120 - a_0) + 20·ln(1+a_0) + 0.60·G(a)
π_Weak   = (80 - a_1) + 20·ln(1+a_1) + 0.40·G(a)

Episode Structure

Horizon: T = 100 steps
Truncation: t ≥ T
Termination: τ[1,0] < 0.10 (weak partner exits due to exploitation)
Discount: γ = 1.0

Initial State

$\tau_{ij}(0) = 0.55$ (moderate initial trust)
$R_{ij}(0) = 0.00$
$\mathbf{D}$ fixed as above

Game-Theoretic Background

The Hold-Up Problem

The classic hold-up problem arises when: 1. Relationship-specific investments: One party makes investments that have less value outside the relationship

Incomplete contracts: Not all contingencies can be specified ex ante
Asymmetric switching costs: One party can exit more easily than the other

This creates a tension between:

Ex ante efficiency: Both parties benefit from investment
Ex post opportunism: The stronger party may exploit the weaker party’s lock-in

Theoretical Foundations

Relationship to Classical Game Theory

PartnerHoldUp-v0 extends the classical hold-up problem literature by incorporating: 1. Continuous investments: Rather than binary invest/not-invest decisions, agents choose continuous contribution levels

Dynamic power relations: Trust and reputation evolve over time, affecting the effective power balance
Endogenous exit: The weak partner’s exit threshold creates a credible threat mechanism
Repeated interaction: Unlike one-shot hold-up models, the environment captures ongoing relationship dynamics

Key Theoretical Results

Stage-Game Analysis: In the single-shot version (ignoring trust dynamics and exit):

Strong agent’s dominant strategy: Contribute near baseline (a₀* ≈ 42)
- Captures maximum surplus given weak’s limited response
Weak agent’s best response: Match effort level proportionally (a₁* ≈ 28-35)
- Cannot profitably exceed strong’s proportional contribution
Nash equilibrium: (a₀, a₁) ≈ (45, 30) - Mutual low cooperation
Pareto frontier: Achievable only with binding commitments

Asymmetric Bargaining Analysis: With bargaining shares $\alpha = (0.60, 0.40)$:

Nash Bargaining Solution (with disagreement point at baseline):
- Strong receives: 60% × (total surplus - disagreement payoffs)
- Weak receives: 40% × (total surplus - disagreement payoffs)
Outside option effect: Strong’s lower dependency (D = 0.35) improves BATNA

Repeated Game Equilibria:

Exploitation equilibrium: Strong exploits maximally until exit threshold approached
Sustainable equilibrium: Strong moderates exploitation to maintain relationship
Trigger strategy by weak: Credible exit threat at $\tau < 0.10$ disciplines strong

The critical insight is that the exit threshold $\tau^* = 0.10$ creates a credible commitment device for the weak partner, partially offsetting the power asymmetry.

Connections to Prior Work

Concept	PartnerHoldUp-v0	Classical Reference
Relationship-specific investment	Continuous $a_i \in [0, e_i]$	Binary in Williamson (1985)
Incomplete contracts	Trust dynamics as implicit contract	Explicit contracts in Hart & Moore (1990)
Exit option	Endogenous $\tau < 0.10$ termination	Exogenous outside option in Nash bargaining
Power asymmetry	$D_{ij}$ captures structural dependency	Bargaining power in Rubinstein (1982)
Dynamic adjustment	Trust evolution	Static in classical models

Literature Connections

Williamson (1985): Transaction cost economics and the hold-up problem. PartnerHoldUp-v0 operationalizes:

Asset specificity → High D_weak = 0.85
Opportunism risk → Trust erosion from defection
Governance mechanisms → Trust dynamics as relational contract

Hart & Moore (1990): Property rights theory of the firm. The environment captures:

Residual control rights → Bargaining share α
Investment incentives → Continuous action choice
Renegotiation → Ongoing trust-mediated interaction

Klein, Crawford & Alchian (1978): The appropriable quasi-rent. Represented by:

Weak partner’s relationship-specific value captured by high interdependence D = 0.85
Strong partner’s ability to extract this rent via exploitation

Rubinstein (1982): Alternating offers bargaining. The environment extends this with:

Continuous-time (each step) rather than discrete rounds
State-dependent patience (trust affects future value)
Endogenous breakdown (exit threshold)

Power Balance Mechanisms

The environment embeds several mechanisms that affect the power balance: 1. Exit credibility: τ < 0.10 termination gives weak partner leverage

Reputation effects: Exploitation damages strong’s ability to sustain relationships
Trust ceiling: Θ = 1 - R limits recovery from exploitation
Interdependence amplification: High ξ = 0.70 magnifies trust effects for weak partner

Power Dynamics in PartnerHoldUp-v0

Agent 0 (Strong Partner):

Larger endowment (120 vs. 80)
Higher bargaining share (60% vs. 40%)
Low dependency on partner (D = 0.35)
More outside options (can replace weak partner)

Agent 1 (Weak Partner):

Smaller endowment
Lower bargaining share
High dependency on partner (D = 0.85)
Few alternatives (locked into relationship)

Environment Specification

Basic Usage

import coopetition_gym
import numpy as np

# Create environment
env = coopetition_gym.make("PartnerHoldUp-v0")

# Reset
obs, info = env.reset(seed=42)

# Run episode
for step in range(100):
    # Strong partner: 70 out of 120 (58%)
    # Weak partner: 50 out of 80 (62%)
    actions = np.array([70.0, 50.0])
    obs, rewards, terminated, truncated, info = env.step(actions)

    if terminated: print(f"Relationship ended at step {step}")
        print(f"Weak partner exited due to exploitation")
        break

print(f"Strong reward: {rewards[0]:.1f}, Weak reward: {rewards[1]:.1f}")

Parameters

Parameter	Default	Description
`max_steps`	100	Maximum timesteps per episode
`render_mode`	None	Rendering mode

Agent Configuration

Endowments

Agent	Role	Endowment	Description
0	Strong	120.0	Large manufacturer with more resources
1	Weak	80.0	Small supplier with limited capacity

Bargaining Shares (Alpha)

Agent	Alpha	Description
0	0.60	Strong captures 60% of surplus
1	0.40	Weak captures 40% of surplus

Baselines (Cooperation Threshold)

Agent	Baseline	As % of Endowment
0	42.0	35%
1	28.0	35%

Actions below baseline are considered “defection” and erode trust.

Interdependence Structure

The key asymmetry is in the dependency matrix:

D = [[ 0.00,  0.35 ],    # Strong's row
     [ 0.85,  0.00 ]]    # Weak's row

Interpretation:

D[0,1] = 0.35: Strong depends moderately on weak (has alternatives)
D[1,0] = 0.85: Weak depends heavily on strong (locked in)

Implications

Weak values strong’s payoff highly: Included in weak’s integrated utility
Strong values weak’s payoff less: Can afford to exploit
Asymmetric trust dynamics: Weak’s trust in strong is critical
Exit threshold matters: If weak’s trust falls too low, relationship ends

Trust Dynamics

Parameters

Parameter	Symbol	Value	Description
Trust Building Rate	λ⁺	0.10	Slow trust building
Trust Erosion Rate	λ⁻	0.35	Faster erosion than TrustDilemma
Reputation Damage	$\mu_R$	0.55	Moderate-high damage
Reputation Decay	$\delta_R$	0.025	Slow forgetting
Interdependence Amp.	ξ	0.70	High amplification (magnifies asymmetry)
Signal Sensitivity	κ	1.2	Moderate sensitivity
Initial Trust	τ₀	0.55	Moderate starting trust

Asymmetric Trust Effects

The high interdependence amplification (ξ = 0.70) means:

For weak partner: Trust erosion is amplified due to high dependency
For strong partner: Trust erosion has less impact (lower dependency)

This creates a vulnerability asymmetry where the weak partner is more affected by trust dynamics.

Termination Conditions

Normal Truncation

Episode ends at max_steps (100) if relationship persists.

Relationship Exit

Critical condition: If weak agent’s trust in strong falls below 0.10, the episode terminates.

if trust_matrix[1, 0] < 0.10:  # Weak's trust in strong
    terminated = True
    # Weak partner "exits" the relationship

This represents the weak partner:

Finding an alternative supplier
Exiting the market
Seeking legal recourse
Refusing further cooperation

Reward Structure

Integrated Utility with Asymmetry

Rewards follow the standard integrated utility formula, but the asymmetry manifests through: 1. Bargaining shares: Strong gets 60% of synergy, weak gets 40%

Interdependence: Weak’s utility includes 85% weight on strong’s payoff
Trust modulation: Both payoffs are affected by trust levels

Value Function

Parameter	Value	Description
θ	20.0	Logarithmic scale
γ	0.60	Moderate complementarity

Equilibrium Analysis

Asymmetric Stage-Game Equilibrium

Unlike TrustDilemma-v0, PartnerHoldUp-v0 has asymmetric equilibria due to differing capabilities.

Best Response Functions: For strong agent ($i=0$) with $e_0 = 120$, $\alpha_0 = 0.60$, $D_{01} = 0.35$:

BR_strong(a_weak) = argmax_{a_0} [(120-a_0) + θ·ln(1+a_0) + 0.60·G(a_0,a_1) + 0.35·π_weak]

For weak agent ($i=1$) with $e_1 = 80$, $\alpha_1 = 0.40$, $D_{10} = 0.85$:

BR_weak(a_strong) = argmax_{a_1} [(80-a_1) + θ·ln(1+a_1) + 0.40·G(a_0,a_1) + 0.85·π_strong]

Nash Equilibrium (Myopic):

Agent	Equilibrium Action	% of Endowment
Strong	$a_0^* \approx 45$	37.5%
Weak	$a_1^* \approx 30$	37.5%

Both contribute near their baselines (35%), reflecting mutual caution.

Equilibrium Payoffs:

Agent	Myopic NE Payoff	Full Cooperation Payoff
Strong	~95	~145
Weak	~68	~110

Pareto Frontier with Asymmetry

The Pareto frontier is asymmetric:

Strong prefers: High (a_s, a_w) with more surplus capture
Weak prefers: Moderate cooperation with trust protection

Pareto-Optimal Profiles:

Profile	Strong Utility	Weak Utility	Total	Sustainable?
(45, 30) NE	95	68	163	Yes (marginally)
(70, 50)	125	92	217	Yes
(90, 65)	138	105	243	Yes
(120, 80) Max	145	112	257	Fragile

Price of Anarchy: PoA = 257/163 ≈ 1.58

Stackelberg Equilibrium

Given strong’s power advantage, a Stackelberg analysis is relevant:

Strong as Leader: If strong commits first, the optimal strategy is:

Choose $a_0$ anticipating weak’s best response $\text{BR}_{\text{weak}}(a_0)$
Maximize own payoff subject to weak’s participation constraint

Stackelberg Outcome:

Strong: $a_0^* \approx 55\text{-}60$ (moderate exploitation)
Weak: $\text{BR}_{\text{weak}} \approx 35\text{-}40$ (defensive response)
Strong extracts more surplus than simultaneous-move NE

Weak’s Participation Constraint: Weak will participate (not exit) iff:

U_weak(a_0, BR_weak(a_0)) ≥ Outside_Option

The exit threshold $\tau < 0.10$ operationalizes this constraint dynamically.

Repeated Game with Exit Threat

The exit threshold creates a credible commitment mechanism:

Exit as Punishment: Weak’s threat to exit if $\tau < 0.10$ is credible because:

Weak loses from staying in exploitative relationship
Exit terminates strong’s long-term gains
Creates dynamic incentive for strong to moderate

Modified Folk Theorem: With exit threats, a range of equilibria becomes feasible:

Lower bound: Strong exploits until exit threshold approached
Upper bound: Full cooperation if credible punishment exists

Equilibrium Exploitation Level: Strong’s optimal exploitation satisfies:

∂π_strong/∂(exploitation) = λ⁻ × (value_of_relationship) × ∂τ/∂a

Balancing marginal gain from extraction against marginal loss from trust erosion.

Trust Ceiling Effects

Reputation damage creates strategic constraints:

Dynamic Constraints:

τ_max(t) = 1 - R(t)

Once strong exploits and causes reputation damage R:

Trust ceiling drops permanently
Future cooperation limited
Weak’s defensive options reduced

Irreversibility Premium: Strong must weigh:

Short-term exploitation gains
Permanent ceiling reduction
Reduced future surplus to extract

MARL Implications

Learning Challenges:

Challenge	Source	MARL Implication
Power asymmetry	Different BRs	Agents need different policies
Exit threshold	Termination risk	Long-horizon credit assignment
Trust ceiling	Irreversibility	Exploration/exploitation tension
Stackelberg dynamics	Sequential structure	Leader-follower learning

Expected Algorithm Performance:

Algorithm	Strong’s Outcome	Weak’s Outcome	Notes
Independent PPO	Exploitation	Defensive	Power asymmetry amplified
MAPPO	Moderate	Moderate	Centralization helps coordination
Hierarchical	Optimal extraction	Constrained	Matches Stackelberg

Strategic Analysis

Strong Partner’s Dilemma

The strong partner faces a choice:

Exploitation Strategy:

Contribute less than fair share
Capture most of the surplus
Risk triggering weak partner’s exit

Sustainable Strategy:

Maintain moderate cooperation
Keep weak partner’s trust above exit threshold
Accept lower per-period returns for longevity

Weak Partner’s Options

The weak partner can:

Defensive Strategy:

Maintain minimal viable cooperation
Reduce relationship-specific investments
Keep exit option credible

Trust-Building Strategy:

Over-cooperate to build trust cushion
Invest in relationship to increase strong’s dependency
Signal long-term commitment

Equilibrium Considerations

In repeated games:

Strong partner may restrain exploitation to avoid triggering exit
Weak partner may invest if strong signals long-term orientation
“Shadow of the future” moderates short-term opportunism

Metrics and Info

The info dictionary includes:

Key	Type	Description
`step`	int	Current timestep
`mean_trust`	float	Average trust level
`weak_trust_in_strong`	float	Critical metric for exit
`strong_trust_in_weak`	float	Less critical
`power_asymmetry`	float	Measure of dependency imbalance
`cooperation_rate`	float	Overall cooperation level

Example: Defensive Weak Partner Strategy

import coopetition_gym
import numpy as np

env = coopetition_gym.make("PartnerHoldUp-v0")
obs, info = env.reset(seed=42)

episode_rewards = np.zeros(2)

for step in range(100):
    # Strong: Moderate exploitation (50% of 120 = 60, below fair share)
    strong_action = 60.0

    # Weak: Defensive (slightly above baseline to maintain trust)
    # Baseline is 28, so 35 is modest cooperation
    weak_action = 35.0

    actions = np.array([strong_action, weak_action])
    obs, rewards, terminated, truncated, info = env.step(actions)
    episode_rewards += rewards

    if terminated: print(f"Relationship ended at step {step}")
        break

print(f"Strong total: {episode_rewards[0]:.1f}")
print(f"Weak total: {episode_rewards[1]:.1f}")
print(f"Final weak trust in strong: {info.get('weak_trust_in_strong', 'N/A')}")

Research Applications

PartnerHoldUp-v0 is suitable for studying:

Power Dynamics: How asymmetry affects cooperation
Contract Theory: Hold-up problems and incomplete contracts
Supply Chain Management: Buyer-supplier relationships
Strategic Management: Alliance design with power imbalances
Fair Division: Bargaining with unequal positions

Baseline Results

Benchmark results following the Evaluation Protocol.

Evaluation Configuration

Parameter	Value
Episodes	100
Seeds	0-99
Horizon	100 steps
Training seeds	100-104 (5 runs)

Performance Comparison

Algorithm	Strong Return	Weak Return	Final Trust	Exploitation Rate
Random	78.2	45.6	0.24	0.52
Constant(Fair)	95.4	62.8	0.48	0.35
Strong Exploits	108.3	38.2	0.22	0.68
Weak Defensive	86.5	58.4	0.36	0.42
IPPO	102.6	52.4	0.38	0.55
MAPPO	98.2	64.2	0.52	0.40

Exploitation Rate = (Strong action/Strong endowment) - (Weak action/Weak endowment). Higher indicates strong exploiting power asymmetry.

Key Observations

Power amplification: IPPO tends to amplify power asymmetry (strong learns exploitation)
Centralized training helps: MAPPO achieves more balanced outcomes
Trust-fairness tradeoff: Higher exploitation yields higher strong returns but lower trust
Exit threshold matters: Weak trust < 0.15 often leads to early termination

Recommended Hyperparameters

# PPO configuration for PartnerHoldUp-v0
algorithm: PPO
learning_rate: 3e-4
n_steps: 2048
batch_size: 64
gamma: 0.99
# Higher entropy helps explore fair strategies
ent_coef: 0.02
network: hidden_layers: [128, 128]

TrustDilemma-v0: Symmetric version
PlatformEcosystem-v0: Multi-agent power asymmetry
CooperativeNegotiation-v0: Explicit contracts

References

Williamson, O.E. (1985). The Economic Institutions of Capitalism. Free Press.
Hart, O. & Moore, J. (1990). Property Rights and the Nature of the Firm. Journal of Political Economy.
Pant, V. & Yu, E. (2025). Computational Foundations for Strategic Coopetition: Formalizing Trust and Reputation Dynamics. arXiv:2510.24909

PartnerHoldUp-v0

Overview

MARL Classification

Formal Specification

Agents

State Space

Action Space

Interdependence Matrix (Critical Asymmetry)

Transition Dynamics

Reward Function

Episode Structure

Initial State

Game-Theoretic Background

The Hold-Up Problem

Theoretical Foundations

Relationship to Classical Game Theory

Key Theoretical Results

Connections to Prior Work

Literature Connections

Power Balance Mechanisms

Power Dynamics in PartnerHoldUp-v0

Environment Specification

Basic Usage

Parameters

Agent Configuration

Endowments

Bargaining Shares (Alpha)

Baselines (Cooperation Threshold)

Interdependence Structure

Implications

Trust Dynamics

Parameters

Asymmetric Trust Effects

Termination Conditions

Normal Truncation

Relationship Exit

Reward Structure

Integrated Utility with Asymmetry

Value Function

Equilibrium Analysis

Asymmetric Stage-Game Equilibrium

Pareto Frontier with Asymmetry

Stackelberg Equilibrium

Repeated Game with Exit Threat

Trust Ceiling Effects

MARL Implications

Strategic Analysis

Strong Partner’s Dilemma

Weak Partner’s Options

Equilibrium Considerations

Metrics and Info

Example: Defensive Weak Partner Strategy

Research Applications

Baseline Results

Evaluation Configuration

Performance Comparison

Key Observations

Recommended Hyperparameters

Related Environments

References