# Custom Rewards
One of JobShopLab’s strengths is its flexibility in defining reward functions. This tutorial shows how to create and use custom reward functions to optimize different scheduling objectives.
## Reward Functions in Job Shop Scheduling
Different applications have different optimization goals:
- **Makespan**: Minimize total completion time (the most common objective)
- **Tardiness**: Minimize lateness against due dates
- **Throughput**: Maximize the number of completed jobs
- **Resource utilization**: Maximize machine usage
- **Energy consumption**: Minimize energy usage
- **Multiple objectives**: Weighted combinations of the above
JobShopLab’s reward factory system makes it easy to define custom reward functions for any of these objectives.
## Understanding the Reward Factory
Reward factories in JobShopLab:

1. Receive the current state and transition information
2. Calculate a reward value based on the state
3. Return a scalar reward to the agent
The `RewardFactory` abstract base class defines the interface:

```python
from abc import ABC, abstractmethod

from jobshoplab.types import StateMachineResult


class RewardFactory(ABC):
    def __init__(self, loglevel, config, instance, *args, **kwargs):
        # Initialize with config and instance information
        ...

    @abstractmethod
    def make(
        self, state_result: StateMachineResult, terminated: bool, truncated: bool
    ) -> float:
        """Calculate and return the reward value."""
```
## Creating a Custom Reward Factory
To create a custom reward function, subclass `RewardFactory`:

```python
from jobshoplab.env.factories.rewards import RewardFactory
from jobshoplab.types import StateMachineResult


class CustomRewardFactory(RewardFactory):
    def __init__(self, loglevel, config, instance, bias_a, bias_b, *args, **kwargs):
        super().__init__(loglevel, config, instance)
        self.bias_a = bias_a
        self.bias_b = bias_b

    def make(self, state_result: StateMachineResult, terminated: bool, truncated: bool) -> float:
        if not (terminated or truncated):
            # During the episode: constant per-step reward
            return self.bias_a
        # Episode completed: reward scaled by the final time (makespan);
        # use a negative bias_b to penalize longer schedules
        return self.bias_b * state_result.state.time.time
```
## Using a Custom Reward Factory
There are two ways to use your custom reward factory:
**Via dependency injection** (for quick experiments):

```python
from functools import partial

# Create factory with specific parameters
reward_factory = partial(CustomRewardFactory, bias_a=0, bias_b=1)

# Use in environment
env = JobShopLabEnv(config=config, reward_factory=reward_factory)
```
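Once wired up this way, rewards from your factory flow through the standard environment loop. A minimal rollout sketch, assuming JobShopLab follows the Gymnasium-style `reset`/`step` API (the random-action policy here is purely illustrative):

```python
obs, info = env.reset()
done = False
while not done:
    # Sample a random action; replace with your agent's policy
    action = env.action_space.sample()
    # `reward` is whatever your factory's make() returned for this step
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
```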
**Via configuration** (for reproducible experiments):

```python
# First, register your factory with JobShopLab (in your module)
from jobshoplab.env.factories import register_reward_factory

register_reward_factory("CustomRewardFactory", CustomRewardFactory)
```

Then, in your config file:

```yaml
env:
  reward_factory: "CustomRewardFactory"

reward_factory:
  custom_reward_factory:
    bias_a: 0
    bias_b: 1
```
## Reward Design Considerations
When designing custom rewards, consider:
### Sparse vs. Dense Rewards
**Sparse rewards**: Only given at episode end (e.g., final makespan)

- Pros: Clear global objective
- Cons: Delayed feedback makes learning difficult

**Dense rewards**: Given at each step

- Pros: Immediate feedback helps learning
- Cons: May lead to suboptimal policies if not aligned with the global objective
A common approach is to combine both:
```python
def make(self, state_result, terminated, truncated):
    # Dense reward component based on current progress
    current_progress = self._calculate_progress(state_result)
    dense_reward = self.dense_weight * current_progress

    # If the episode is done, add the sparse reward component
    if terminated or truncated:
        makespan = state_result.state.time.time
        sparse_reward = self.sparse_weight * (1000 / makespan)
        return dense_reward + sparse_reward
    return dense_reward
```
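The `_calculate_progress` helper above is left to you. One option is the fraction of operations already finished; here is a sketch using hypothetical attribute names (`state.jobs`, `job.operations`, `op.done` are illustrative placeholders, not the actual JobShopLab state schema):

```python
def _calculate_progress(self, state_result):
    # Hypothetical accessors: adapt to the real state schema
    ops = [op for job in state_result.state.jobs for op in job.operations]
    completed = sum(1 for op in ops if op.done)
    return completed / len(ops)  # fraction of operations finished, in [0, 1]
```

Note that rewarding cumulative progress at every step counts early progress repeatedly; rewarding the *change* in progress since the previous step is often better behaved.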
### Reward Scaling
Rewards should typically be in a reasonable range (e.g., -1 to 1) for most RL algorithms. Consider normalizing rewards:
```python
def make(self, state_result, terminated, truncated):
    if terminated:
        # Final makespan of the produced schedule
        makespan = state_result.state.time.time
        # Lower-bound estimate for the instance
        lower_bound = self._calculate_lower_bound()
        # Normalized reward in (0, 1]: 1.0 means the schedule matches the bound
        return lower_bound / makespan
    return 0
```
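A standard lower bound for job shop makespan is the larger of the busiest machine's total workload and the longest job's total processing time. A sketch of `_calculate_lower_bound` along those lines, with hypothetical instance attributes (`self.instance.jobs`, `op.machine`, `op.duration` are placeholders for the real schema):

```python
def _calculate_lower_bound(self):
    # Hypothetical accessors: adapt to the real instance schema
    machine_load = {}
    job_durations = []
    for job in self.instance.jobs:
        total = 0
        for op in job.operations:
            machine_load[op.machine] = machine_load.get(op.machine, 0) + op.duration
            total += op.duration
        job_durations.append(total)
    # The makespan can never beat the busiest machine or the longest job
    return max(max(machine_load.values()), max(job_durations))
```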
### Multi-objective Rewards
For multiple objectives, use weighted combinations:
```python
def make(self, state_result, terminated, truncated):
    if terminated:
        # Makespan component
        makespan = state_result.state.time.time
        makespan_reward = self.makespan_weight * (1000 / makespan)

        # Tardiness component
        tardiness = self._calculate_tardiness(state_result)
        tardiness_reward = self.tardiness_weight * (1 / (1 + tardiness))

        # Energy component (the +1 guards against division by zero)
        energy = self._calculate_energy(state_result)
        energy_reward = self.energy_weight * (1 / (1 + energy))

        return makespan_reward + tardiness_reward + energy_reward
    return 0
```
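To keep experiments comparable, choose weights that sum to 1 and pass them in the same way as the dependency-injection example above. A sketch, assuming a subclass named `MultiObjectiveRewardFactory` (a hypothetical name for a factory implementing the `make` shown here):

```python
from functools import partial

# MultiObjectiveRewardFactory is a hypothetical subclass wrapping the
# multi-objective make() above
reward_factory = partial(
    MultiObjectiveRewardFactory,
    makespan_weight=0.6,
    tardiness_weight=0.3,
    energy_weight=0.1,
)
env = JobShopLabEnv(config=config, reward_factory=reward_factory)
```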