Job Description
Job Title: RLHF Specialist
Location: Remote (Worldwide)
Job Summary: An RLHF Specialist is responsible for improving and aligning AI models using Reinforcement Learning from Human Feedback (RLHF) methodologies. This role focuses on designing, implementing, and optimizing feedback pipelines that enhance model performance, safety, factual accuracy, and alignment with human values.
Responsibilities:
• Generate high-quality preference data by comparing multiple model responses and ranking them based on criteria such as helpfulness, honesty, and harmlessness (HHH); a data-format sketch follows this list.
• Design complex, multi-turn prompts to stress-test model behavior and expose weaknesses in reasoning or safety.
• Write detailed chain-of-thought explanations and rationales to train reward models on why specific responses are superior.
• Collaborate with Machine Learning Engineers to analyze model failure modes and identify data gaps that, when filled, will improve reinforcement learning outcomes.
• Develop and iterate on annotation strategies for preference scoring and reinforcement signals, ensuring consistency across a global team.
• Proactively probe models to identify vulnerabilities, biases, or hallucination patterns, documenting findings for model optimization.
• Analyze edge cases where the reward model behaves unexpectedly (e.g., over-indexing on verbosity or style over substance). Provide detailed feedback to ML engineers on reward model failure modes and suggest specific data interventions to correct model behavior.
• Develop and document templated instruction sets for larger annotation teams. Translate complex reinforcement learning concepts into simple, repeatable tasks for junior reviewers, ensuring high-quality data collection at scale.
• Monitor model performance over time by maintaining a personal test set of prompts. Regularly re-evaluate new model versions against historical benchmarks to track improvements or regressions in reasoning and alignment.
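For illustration only (not a prescribed tool or format for the role): a minimal Python sketch of the kind of preference record described above, together with the pairwise (Bradley-Terry style) loss commonly used to train reward models on such rankings. The prompt, responses, and scalar scores below are hypothetical stand-ins.

```python
# Minimal sketch: a human-ranked preference record and the pairwise loss
# a reward model is typically trained with. All values are illustrative.

import torch
import torch.nn.functional as F

# One illustrative preference record produced by a human ranker.
preference_record = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "Sunlight scatters off air molecules; shorter (blue) wavelengths scatter most.",
    "rejected": "The sky reflects the ocean.",
    "rationale": "The chosen answer is factually grounded; the rejected one repeats a misconception.",
}

def pairwise_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the chosen response's score above the rejected one's."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Dummy scalar scores stand in for reward-model outputs on the two responses.
loss = pairwise_loss(torch.tensor([1.3]), torch.tensor([0.4]))
print(f"pairwise preference loss: {loss.item():.4f}")
```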
Requirements:
• Minimum of 5 years of experience in Data Annotation, Model Evaluation, Computational Linguistics, or Trust and Safety, specifically working with AI/ML training data.
• Strong proficiency in Python and deep learning frameworks (PyTorch, JAX, or TensorFlow).
• Deep understanding of Reinforcement Learning concepts (PPO, Trust Regions, Reward Hacking) and how they apply to language generation.
• Hands-on experience fine-tuning open-source models (e.g., Llama 2/3, Mistral, Gemma) using techniques like LoRA/QLoRA; a minimal LoRA sketch follows the Requirements list.
• Experience working with annotation tools (Labelbox, Scale AI, Snorkel) and managing human-in-the-loop workflows.
• Ability to diagnose why an RL policy collapsed and adjust hyperparameters or reward structure accordingly.
• Experience with Constitutional AI or Self-Alignment techniques.
• Contributions to open-source alignment libraries such as TRL (Transformer Reinforcement Learning) or Axolotl.
• Experience with cloud platforms (AWS SageMaker, GCP Vertex AI).
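For context on the LoRA/QLoRA requirement above, here is a minimal sketch of attaching LoRA adapters to an open-source causal language model, assuming the Hugging Face transformers and peft libraries. The checkpoint name and hyperparameters are illustrative assumptions, not project settings.

```python
# Minimal sketch: wrap an open-source causal LM with LoRA adapters using peft.
# The model name and hyperparameters below are examples only.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "mistralai/Mistral-7B-v0.1"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Low-rank adapters on the attention projections keep trainable parameters small.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model

# From here, the adapted model can be passed to a standard Trainer, or to
# TRL's trainers for preference-based fine-tuning.
```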
Method of Application:
Qualified candidates should send a copy of their CV and portfolio to , with the job title as the subject of the email.
Salary:
$1,000