RLHF 101: A Technical Tutorial on Reinforcement Learning from Human Feedback – Machine Learning Blog | ML@CMU

Reinforcement Learning from Human Feedback (RLHF) is a popular technique used to align AI systems with human preferences by training them on feedback from people, rather than relying solely on predefined reward functions. Instead of hand-coding every desirable behavior (which is often infeasible for complex tasks), RLHF allows models, especially large language models (LLMs), to learn from examples of what humans consider good or bad outputs. This approach is particularly important for tasks where success is subjective or hard to quantify, such as generating helpful and safe text responses. RLHF has become a cornerstone of building more aligned and controllable AI systems, making it essential for developing AI that behaves the way humans intend.

This blog post dives into the full training pipeline of the RLHF framework. We will walk through every stage, from data generation and reward model inference to the final training of an LLM. Our goal is to make everything fully reproducible by providing all the necessary code and the exact specifications of the environments used. By the end of this post, you should know the general pipeline for training any model on any instruction dataset with the RLHF algorithm of your choice!

Preliminaries: Setup & Environment

We will use the following setup for this tutorial:

  • Dataset: UltraFeedback, a well-curated dataset consisting of general chat prompts. (While UltraFeedback also contains LLM-generated responses to the prompts, we won't be using those.)
  • Base Model: Llama-3-8B-it, a state-of-the-art instruction-tuned LLM. This is the model we will fine-tune.
  • Reward Model: Armo, a strong reward model optimized for evaluating generated outputs. We will use Armo to assign scalar reward values to candidate responses, indicating how "good" or "aligned" a response is.
  • Training Algorithm: REBEL, a state-of-the-art algorithm tailored for efficient RLHF optimization.

To get started, clone our repo, which contains all the resources required for this tutorial:

git clone https://github.com/ZhaolinGao/REBEL
cd REBEL

We use two separate environments for different stages of the pipeline:

  • vllm: Handles data generation, leveraging the efficient vllm library.
  • rebel: Used for training the RLHF model.

You can install both environments using the provided YAML files:

conda env create -f ./envs/rebel_env.yml
conda env create -f ./envs/vllm_env.yml

Part 1: Data Generation

The first step in the RLHF pipeline is generating samples from the policy to receive feedback on. Concretely, in this part, we will load the base model using vllm for fast inference, prepare the dataset, and generate multiple responses for each prompt in the dataset. The complete code for this part is available here.

Activate the vllm environment:

conda activate vllm

First, load the base model and tokenizer using vllm:

from transformers import AutoTokenizer
from vllm import LLM
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=8,
)

Here, tensor_parallel_size specifies the number of GPUs to use.
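
If the number of available GPUs varies across machines, you can size the tensor parallelism dynamically instead of hard-coding it and pass the result to LLM(...) above (a small convenience sketch, not part of the original script):

import torch
# Size tensor parallelism to however many GPUs are visible to this process.
tensor_parallel_size = torch.cuda.device_count()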

Next, load the UltraFeedback dataset:

from datasets import load_dataset
dataset = load_dataset("allenai/ultrafeedback_binarized_cleaned_train", split="train")

You can select a subset of the dataset using dataset.select. For example, to select the first 10,000 rows:

dataset = dataset.select(range(10000))

Alternatively, you can split the dataset into chunks using dataset.shard for implementations like SPPO, where each iteration trains on only one of the chunks. A minimal sketch is shown below.
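
For instance, a hypothetical four-way split where the current iteration trains on the first chunk could look like this (the number of shards and the index are illustrative):

# Split the dataset into 4 equal chunks and keep chunk 0 for this iteration (illustrative values).
dataset = dataset.shard(num_shards=4, index=0)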

Now, let's prepare the dataset for generation. The Llama model uses special tokens to distinguish prompts from responses. For example:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

What is France's capital?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Therefore, for every prompt in the dataset, we need to convert it from plain text into this format before generating:

def get_message(instruction):
    message = [
        {"role": "user", "content": instruction},
    ]
    return message
prompts = [tokenizer.apply_chat_template(get_message(row['prompt']), tokenize=False, add_generation_prompt=True) for row in dataset]
  • get_message transforms the plain-text prompt into a dictionary indicating that it comes from the user.
  • tokenizer.apply_chat_template adds the required special tokens and, with add_generation_prompt=True, appends the assistant header tokens (<|start_header_id|>assistant<|end_header_id|>\n\n) at the end.

Finally, we can generate the responses using vllm with the prompts we just formatted. We will generate 5 responses per prompt:

import torch
import random
import numpy as np
from vllm import SamplingParams

def set_seed(seed=5775709):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

for p in range(5):
    set_seed(p * 50)
    sampling_params = SamplingParams(
        temperature=0.8,
        top_p=0.9,
        max_tokens=2048,
        seed=p * 50,
    )
    response = llm.generate(prompts, sampling_params)
    output = list(map(lambda x: x.outputs[0].text, response))
    dataset = dataset.add_column(f"response_{p}", output)
  • temperature=0.8 and top_p=0.9 are common settings to control diversity during generation.
  • set_seed is used to ensure reproducibility and sets a different seed for each response.
  • llm.generate generates the responses, and the results are added to the dataset with dataset.add_column.

You can run the complete script with:

python ./src/ultrafeedback_largebatch/generate.py --world_size NUM_GPU --output_repo OUTPUT_REPO
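
The script pushes the augmented dataset (the prompts plus the five generated responses) to the repo given by OUTPUT_REPO, which Part 2 then reads as INPUT_REPO. If you are adapting the code yourself, the save step amounts to something like the following (a sketch that assumes you are authenticated with the Hugging Face Hub and have write access to OUTPUT_REPO):

# Push the dataset, now containing response_0 ... response_4 columns, to the Hub (sketch).
dataset.push_to_hub(OUTPUT_REPO)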

Part 2: Reward Model Inference

The second step in the RLHF pipeline is querying the reward model to tell us how good a generated sample is. Concretely, in this part, we will calculate reward scores for the responses generated in Part 1, which are later used for training. The complete code for this part is available here.

Activate the rebel environment:

conda activate rebel

To begin, we will initialize the Armo reward model pipeline. This reward model is a fine-tuned sequence classification model that assigns a scalar reward score to a given dialogue based on its quality.

rm = ArmoRMPipeline("RLHFlow/ArmoRM-Llama3-8B-v0.1", trust_remote_code=True)
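
ArmoRMPipeline is a thin wrapper defined in the repo. A minimal sketch of what such a wrapper can look like is below; it assumes the ArmoRM checkpoint follows the Hugging Face sequence-classification interface and that its custom forward returns a scalar .score per sequence (check the repo and the model card for the exact implementation):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class ArmoRMPipeline:
    # Minimal sketch of a reward-model wrapper; the repo's version may differ in details.
    def __init__(self, model_id, trust_remote_code=False, device="cuda"):
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_id,
            trust_remote_code=trust_remote_code,
            torch_dtype=torch.bfloat16,
        ).to(device)
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.device = device

    def __call__(self, messages):
        # messages: a list of {"role": ..., "content": ...} dicts, as built by get_message below.
        input_ids = self.tokenizer.apply_chat_template(
            messages, return_tensors="pt"
        ).to(self.device)
        with torch.no_grad():
            output = self.model(input_ids)
        # ArmoRM's custom head exposes a scalar preference score per sequence (assumption).
        return output.score.float().item()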

Now, we can gather the reward scores:

def get_message(instruction, response):
    return [{"role": "user", "content": instruction}, {"role": "assistant", "content": response}]

rewards = {}
for i in range(5):
    rewards[f"response_{i}_reward"] = []
    for row in dataset:
        reward = rm(get_message(row['prompt'], row[f'response_{i}']))
        rewards[f"response_{i}_reward"].append(reward)
for k, v in rewards.items():
    dataset = dataset.add_column(ok, v)
  • get_message formats the user prompt and assistant response into a list of dictionaries.
  • rm computes a reward score for each response in the dataset.

You can run the complete script with:

python ./src/ultrafeedback_largebatch/rank.py --input_repo INPUT_REPO
  • INPUT_REPO is the saved repo from Part 1 that contains the generated responses.

Part 3: Filter and Tokenize

While the previous two parts are all we need in theory to do RLHF, in practice it is often advisable to perform a filtering pass so that training runs smoothly. Concretely, in this part, we will walk through the process of preparing a dataset for training by filtering out excessively long prompts and responses to prevent out-of-memory (OOM) issues, picking the best and worst responses for training, and removing duplicate responses. The complete code for this part is available here.

Let's first initialize two different tokenizers, one that pads from the right and one that pads from the left:

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
tokenizer_left = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", padding_side="left")
tokenizer_left.add_special_tokens({"pad_token": "[PAD]"})

These two tokenizers allow us to pad the prompt from the left and the response from the right so that they meet in the middle. By combining left-padded prompts with right-padded responses, we ensure that:

  • Prompts and responses meet at a consistent position.
  • Relative position embeddings remain correct for model training.

Here's an example format:

[PAD] ... [PAD] <|begin_of_text|><|start_header_id|>user<|end_header_id|>

PROMPT<|eot_id|><|start_header_id|>assistant<|end_header_id|>

RESPONSE<|eot_id|>[PAD] ... [PAD]

We want to make sure that the length of

[PAD] ... [PAD] <|begin_of_text|><|start_header_id|>user<|end_header_id|>

PROMPT<|eot_id|><|start_header_id|>assistant<|end_header_id|>

is the same for all prompts, and the length of

RESPONSE<|eot_id|>[PAD] ... [PAD]

is the same for all responses.

We filter out prompts longer than 1,024 tokens and responses exceeding 2,048 tokens to prevent OOM during training:

dataset = dataset.filter(lambda row: tokenizer.apply_chat_template(get_message(row['prompt']), tokenize=True, add_generation_prompt=True, return_tensors="pt").shape[-1] <= 1024)
Note that we skip the first 5 tokens of responses when counting lengths, to exclude the leading special tokens (i.e., <|begin_of_text|><|start_header_id|>assistant<|end_header_id|>\n\n) and only count the actual length of the response plus the EOS token (<|eot_id|>) at the end.
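
The corresponding response-length filter can be written along the lines below (a sketch under the assumptions just described; the repo's filter_tokenize.py is the reference, and get_message here is the repo helper that accepts a response without an instruction, as used later in this part):

# Keep only rows where every response, excluding the 5 leading special tokens
# and including the trailing EOS, fits within 2,048 tokens (sketch).
for i in range(5):
    dataset = dataset.filter(
        lambda row: len(
            tokenizer.apply_chat_template(
                get_message(response=row[f"response_{i}"]),
                tokenize=True,
                add_generation_prompt=False,
            )[5:]
        ) <= 2048
    )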

Now we can tokenize the prompt with left padding to a maximum length of 1,024 tokens:

llama_prompt_tokens = []
for row in dataset:
    llama_prompt_token = tokenizer_left.apply_chat_template(
            get_message(row['prompt']), 
            add_generation_prompt=True,
            tokenize=True,
            padding='max_length',
            max_length=1024,
    )
    assert len(llama_prompt_token) == 1024
    assert (llama_prompt_token[0] == 128000 or llama_prompt_token[0] == 128256) and llama_prompt_token[-1] == 271
    llama_prompt_tokens.append(llama_prompt_token)
dataset = dataset.add_column("llama_prompt_tokens", llama_prompt_tokens)

The assertions ensure that the length is always 1,024 and that the tokenized prompt either starts with the [PAD] token or the <|begin_of_text|> token and ends with the \n\n token.
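
If you want to check what those token ids correspond to, you can decode them directly; the mapping below is what we expect for the Meta-Llama-3-8B-Instruct tokenizer (128256 is the [PAD] token we added above):

# Expected: 128000 -> <|begin_of_text|>, 271 -> "\n\n", 128009 -> <|eot_id|> for this tokenizer.
print(tokenizer.convert_ids_to_tokens([128000, 271, 128009]))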

Then, we select the responses with the highest and lowest rewards for each prompt as the chosen and reject responses, and tokenize them with right padding:

chosen, reject, llama_chosen_tokens, llama_reject_tokens, chosen_reward, reject_reward = [], [], [], [], [], []

for row in dataset:

    all_rewards = [row[f"response_{i}_reward"] for i in range(5)]
    chosen_idx, reject_idx = np.argmax(all_rewards), np.argmin(all_rewards)

    chosen.append(row[f"response_{chosen_idx}"])
    reject.append(row[f"response_{reject_idx}"])

    llama_chosen_token = tokenizer.apply_chat_template(
            get_message(response=row[f"response_{chosen_idx}"]),
            add_generation_prompt=False,
            tokenize=True,
            padding='max_length',
            max_length=2048+5,
    )[5:]
    llama_chosen_tokens.append(llama_chosen_token)
    chosen_reward.append(row[f"response_{chosen_idx}_reward"])
    assert len(llama_chosen_token) == 2048
    assert llama_chosen_token[-1] == 128009 or llama_chosen_token[-1] == 128256

    llama_reject_token = tokenizer.apply_chat_template(
            get_message(response=row[f"response_{reject_idx}"]),
            add_generation_prompt=False,
            tokenize=True,
            padding='max_length',
            max_length=2048+5,
    )[5:]
    llama_reject_tokens.append(llama_reject_token)
    reject_reward.append(row[f"response_{reject_idx}_reward"])
    assert len(llama_reject_token) == 2048
    assert llama_reject_token[-1] == 128009 or llama_reject_token[-1] == 128256

dataset = dataset.add_column("chosen", chosen)
dataset = dataset.add_column("chosen_reward", chosen_reward)
dataset = dataset.add_column("llama_chosen_tokens", llama_chosen_tokens)
dataset = dataset.add_column("reject", reject)
dataset = dataset.add_column("reject_reward", reject_reward)
dataset = dataset.add_column("llama_reject_tokens", llama_reject_tokens)

Again, the assertions ensure that the lengths of the tokenized responses are always 2,048 and that the tokenized responses end with either the [PAD] token or the <|eot_id|> token.

Finally, we filter out rows where the chosen and reject responses are identical:

dataset = dataset.filter(lambda row: row['chosen'] != row['reject'])

and split the dataset into a training set and a test set with 1,000 prompts:

dataset = dataset.train_test_split(test_size=1000, shuffle=True)

You can run the complete script with:

python ./src/ultrafeedback_largebatch/filter_tokenize.py --input_repo INPUT_REPO
  • INPUT_REPO is the saved repo from Part 2 that contains the rewards for each response.

Part 4: Training with REBEL

Finally, we are ready to update the parameters of our model with an RLHF algorithm! We will use our curated dataset and the REBEL algorithm to fine-tune our base model.

At each iteration \(t\) of REBEL, we aim to solve the following square loss regression problem:
$$\theta_{t+1}=\arg\min_{\theta\in\Theta}\sum_{(x, y, y')\in \mathcal{D}_t}\left(\frac{1}{\eta}\left(\ln \frac{\pi_{\theta}(y|x)}{\pi_{\theta_t}(y|x)} - \ln \frac{\pi_{\theta}(y'|x)}{\pi_{\theta_t}(y'|x)}\right) - \left(r(x, y) - r(x, y')\right)\right)^2$$

where \(\eta\) is a hyperparameter, \(\theta\) is the parameter of the model, \(x\) is the prompt, \(\mathcal{D}_t\) is the dataset we collected in the previous three parts, \(y\) and \(y'\) are responses to \(x\), \(\pi_\theta(y|x)\) is the probability of generating response \(y\) given prompt \(x\) under the parameterized policy \(\pi_\theta\), and \(r(x, y)\) is the reward of response \(y\) for prompt \(x\), obtained in Part 2. The detailed derivations of the algorithm are given in our paper. In short, REBEL lets us avoid the complexity (e.g., clipping, critic models) of other RLHF algorithms like PPO while offering stronger theoretical guarantees!

In this tutorial, we demonstrate a single iteration of REBEL (\(t=0\)) using the base model \(\pi_{\theta_0}\). For multi-iteration training, you can repeat Parts 1 through 4, initializing each iteration with the model trained in the previous iteration.

The complete code for this part is available here. To enable full-parameter training on 8 GPUs, we use the Accelerate library with DeepSpeed Stage 3 by running:

accelerate launch --config_file accelerate_cfgs/deepspeed_config_stage_3.yaml --main-process-port 29080 --num_processes 8 src/ultrafeedback_largebatch/rebel.py --task.input_repo INPUT_REPO --output_dir OUTPUT_DIR
  • INPUT_REPO is the saved repo from Part 3 that contains the tokenized prompts and responses.
  • OUTPUT_DIR is the directory in which to save the models.

Step 1: Initialization & Loading

We start by initializing the batch sizes for distributed training:

args.world_size = accelerator.num_processes
args.batch_size = args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps
args.local_batch_size = args.per_device_train_batch_size * args.gradient_accumulation_steps
args.rebel.num_updates = args.total_episodes // args.batch_size
  • args.world_size is the number of GPUs we are using.
  • args.local_batch_size is the batch size for each GPU.
  • args.batch_size is the effective batch size for training.
  • args.rebel.num_updates is the total number of updates to perform, and args.total_episodes is the number of data points to train on. Typically, we set args.total_episodes to the size of the training set for one epoch. A concrete worked example follows below.
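
As a purely illustrative example of how these quantities relate (the real values come from the launch arguments), suppose we train on 48,000 examples with 8 GPUs, a per-device batch size of 2, and 4 gradient-accumulation steps:

# Illustrative numbers only, not the repo's defaults.
world_size = 8
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
total_episodes = 48000

local_batch_size = per_device_train_batch_size * gradient_accumulation_steps  # 8 examples per GPU per update
batch_size = world_size * local_batch_size                                     # 64 examples per update
num_updates = total_episodes // batch_size                                     # 750 updates for one epoch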

Next, we load the model and tokenizer, making sure dropout layers are disabled so that the log probabilities of the generations are computed without randomness:

tokenizer = AutoTokenizer.from_pretrained(
                args.base_model,
                padding_side="right",
                trust_remote_code=True,
            )
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
policy = AutoModelForCausalLM.from_pretrained(
            args.base_model,
            trust_remote_code=True,
            torch_dtype=torch.bfloat16,
            attn_implementation="flash_attention_2",
        )
disable_dropout_in_model(policy)

Step 2: Training

Looking again at the REBEL objective, the only quantities we need for training are \(\pi_\theta(y|x)\) and \(\pi_{\theta_0}(y|x)\). We can compute each of them with:

output = policy(
    input_ids=input_ids, 
    attention_mask=attention_mask,
    return_dict=True,
    output_hidden_states=True,
)
logits = output.logits[:, args.task.maxlen_prompt - 1 : -1]
logits /= args.task.temperature + 1e-7
all_logprobs = F.log_softmax(logits, dim=-1)
logprobs = torch.gather(all_logprobs, 2, input_ids[:, args.task.maxlen_prompt:].unsqueeze(-1)).squeeze(-1)
logprobs = (logprobs * seq_mask).sum(-1)
  • output.logits contains the logits over the vocabulary for every token in input_ids.
  • output.logits[:, args.task.maxlen_prompt - 1 : -1] is the logits over the vocabulary for the response tokens only. It is shifted by 1 because the logits at position \(p\) predict the token at position \(p+1\).
  • We divide the logits by args.task.temperature to recover the actual probabilities used during generation.
  • torch.gather collects the log probabilities of the tokens that were actually generated in the response.
  • seq_mask masks out the padding tokens.
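
The same forward pass is used for the base policy \(\pi_{\theta_0}\): since it never changes during the iteration, its log probabilities can be computed under torch.no_grad() with a frozen copy of the base model (or precomputed and cached) so that only the current policy builds a gradient graph. Below is a sketch of that pattern, assuming a hypothetical compute_logprobs helper that wraps the snippet above and a frozen ref_policy copy of the base model:

# Sketch: reference log probabilities are computed without gradients (assumed helper and names).
with torch.no_grad():
    pi_0_logprobs_y = compute_logprobs(ref_policy, chosen_input_ids, chosen_attention_mask)
    pi_0_logprobs_y_prime = compute_logprobs(ref_policy, reject_input_ids, reject_attention_mask)

# Current-policy log probabilities keep gradients for the REBEL update.
pi_logprobs_y = compute_logprobs(policy, chosen_input_ids, chosen_attention_mask)
pi_logprobs_y_prime = compute_logprobs(policy, reject_input_ids, reject_attention_mask)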

Step 4: Loss Computation

Finally, we compute the loss with:

reg_diff = ((pi_logprobs_y - pi_0_logprobs_y) - (pi_logprobs_y_prime - pi_0_logprobs_y_prime)) / eta - (chosen_reward - reject_reward)
loss = (reg_diff ** 2).mean()
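
The parameter update itself then follows the standard Accelerate pattern (a sketch; the repo's training script additionally handles gradient accumulation, scheduling, and logging):

# Backpropagate through the REBEL regression loss and take an optimizer step (sketch).
accelerator.backward(loss)
optimizer.step()
optimizer.zero_grad()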

Performance

With just one iteration of the four parts above, we can considerably improve the performance of the base model on AlpacaEval, MT-Bench, and ArenaHard, three benchmarks commonly used to evaluate the quality, alignment, and helpfulness of responses generated by LLMs.

Takeaway

In this post, we outlined the pipeline for implementing RLHF, covering the entire process from data generation to the actual training phase. While we focused specifically on the REBEL algorithm, the pipeline is versatile and can be readily adapted to other methods such as DPO or SimPO; the components needed for those methods are already included, apart from the specific loss formulation. There is also a natural extension of this pipeline to multi-turn RLHF, where we optimize for performance over an entire conversation rather than a single generation; check out our follow-up paper here for more information!

If you find this implementation useful, please consider citing our work:

@misc{gao2024rebel,
      title={REBEL: Reinforcement Learning via Regressing Relative Rewards},
      author={Zhaolin Gao and Jonathan D. Chang and Wenhao Zhan and Owen Oertell and Gokul Swamy and Kianté Brantley and Thorsten Joachims and J. Andrew Bagnell and Jason D. Lee and Wen Sun},
      year={2024},
      eprint={2404.16767},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
