Overview
You will design and implement a production-oriented ML ranking pipeline for a Reddit-style feed. Using the fixed-seed synthetic dataset generated by the script below, you will engineer a rich, well-justified feature matrix, train and compare multiple model families, and deploy a Dockerized FastAPI serving endpoint — the single highest-weight deliverable at 30%, where reviewers will invest the most scrutiny. You will also document, in concrete technical depth, how this system would operate at Reddit's actual scale.
Canonical Dataset — Fixed-Seed Starter Code: Every candidate must run the script below exactly as written to produce data/posts.csv. Do not modify the seed, parameters, or logic in any way. Do not substitute any other data source.
```python
# generate_dataset.py — run once to produce data/posts.csv
# Requires: pandas, numpy (both pinned in your requirements.txt)
import numpy as np
import pandas as pd

RNG = np.random.default_rng(seed=42)
N = 50_000
SUBREDDITS = ["AskReddit","worldnews","gaming","science","todayilearned",
              "movies","funny","aww","technology","sports"]

age_hours    = RNG.exponential(scale=24, size=N).clip(0.1, 720)
base_score   = (RNG.pareto(a=1.5, size=N) * 100).astype(int).clip(1, 50_000)
upvote_ratio = RNG.beta(a=9, b=2, size=N).round(4)
num_comments = (RNG.negative_binomial(n=5, p=0.05, size=N)).clip(0, 5_000)
author_karma = RNG.integers(1, 500_000, size=N)
subreddit    = RNG.choice(SUBREDDITS, size=N)

# Synthetic titles sampled from a small vocabulary
WORDS = ["breaking","new","why","how","best","worst","top","daily",
         "amazing","free","official","weekly","update","thread","discussion"]
titles = [" ".join(RNG.choice(WORDS, size=RNG.integers(3, 9))) for _ in range(N)]

# Engagement label: logistic function of several signals + noise
log_odds = (
    0.03 * base_score / 1000
    - 0.05 * age_hours
    + 1.2 * upvote_ratio
    + 0.002 * num_comments
    + RNG.normal(scale=0.8, size=N)
)
prob_engaged = 1 / (1 + np.exp(-log_odds))
engaged = (RNG.uniform(size=N) < prob_engaged).astype(int)

df = pd.DataFrame({
    "post_id":       np.arange(N),
    "title":         titles,
    "subreddit":     subreddit,
    "score":         base_score,
    "num_comments":  num_comments,
    "upvote_ratio":  upvote_ratio,
    "post_age_hours":age_hours,
    "author_karma":  author_karma,
    "engaged":       engaged,
})
df.to_csv("data/posts.csv", index=False)
print(f"Saved {len(df):,} rows to data/posts.csv "
      f"(engagement rate: {engaged.mean():.3f})")
```
Schema: post_id (int), title (str), subreddit (str), score (int), num_comments (int), upvote_ratio (float), post_age_hours (float), author_karma (int), engaged (int, 0/1 label).
The platform holds a hidden holdout test set generated with the same script and seed strategy. Automated scoring uses NDCG@10 and AUC-ROC against that holdout — do not attempt to reconstruct or augment it.
Estimated effort: 8–18 hours.
What you'll build
- A feature engineering pipeline producing a rich, well-justified feature matrix from raw post records — with time-decay transformations, interaction terms, text-derived signals, leakage prevention, and importance or ablation evidence.
- At least two distinct model families trained and compared on a held-out validation split using both NDCG@10 and AUC-ROC, with calibration checks and documented overfitting mitigations.
- A Dockerized FastAPI endpoint (`POST /rank`), the deliverable carrying the most weight at 30%. This is where you should invest the most time and polish: correct ranked output, thorough edge-case handling, a latency benchmark script, and a substantive written design discussion for high-QPS operation at Reddit scale.
- Fully reproducible code that runs end-to-end from a single command on a fresh environment with pinned dependencies.
- A README containing a dedicated `## Scale` section (350+ words) covering distributed ML infrastructure, and either an inline or linked discussion (400+ words) on high-QPS endpoint design.
Requirements
Feature Engineering Quality (weight: 15%)
- Construct time-decay features (e.g., exponential or power-law decay applied to `post_age_hours`) and justify your choice of decay constant with reference to the data-generating process or empirical validation against the label.
- Create interaction terms (e.g., `score × upvote_ratio`, `num_comments / (post_age_hours + 1)`) and explain the intuition behind each interaction relative to the engagement prediction task.
- Derive text signals from `title`: at minimum TF-IDF weighted scores; optionally lightweight hashed n-gram embeddings. Justify in writing why these signals are expected to correlate with `engaged`.
- Explicitly prevent and document data leakage: no information derived from the label, or fitted on validation or test rows, may flow into features during cross-validation or held-out evaluation. Describe your leakage prevention strategy in `evaluation_report.md`.
- Include feature importance or ablation results (permutation importance, SHAP values, or hold-one-group-out ablation) demonstrating which feature groups drive performance. Summarize findings in `evaluation_report.md`.
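The time-decay and interaction features described above can be sketched per post as follows. This is a minimal illustration, not a prescribed feature set. It assumes an exponential decay whose rate mirrors the `-0.05 * age_hours` term in `generate_dataset.py`: because age enters the log-odds linearly, `exp(-0.05 * age)` is the factor by which age scales the odds of engagement. The constant `DECAY_RATE_PER_HOUR`, the function name `engineer_features`, and the output feature names are hypothetical, and the rate should still be validated empirically.

```python
import math
from typing import Dict

# Assumption: 0.05 per hour mirrors the post_age_hours coefficient in
# generate_dataset.py; treat it as a starting point, not a tuned value.
DECAY_RATE_PER_HOUR = 0.05


def engineer_features(post: Dict[str, float]) -> Dict[str, float]:
    """Derive time-decay and interaction features from one raw post record."""
    age = post["post_age_hours"]
    score = post["score"]
    comments = post["num_comments"]
    ratio = post["upvote_ratio"]
    return {
        # Exponential decay: 1.0 for a brand-new post, approaching 0 with age.
        "age_decay": math.exp(-DECAY_RATE_PER_HOUR * age),
        # Log transform compresses the heavy-tailed score distribution.
        "log_score": math.log1p(score),
        # Raw score weighted by the share of votes that were upvotes.
        "score_x_ratio": score * ratio,
        # Comments per hour; the +1 avoids division by zero for new posts.
        "comment_velocity": comments / (age + 1.0),
    }
```

All four features are computed from a single row and involve no fitted statistics, so they cannot leak information across splits. Fitted transforms such as TF-IDF vocabularies are different: they should be fitted on the training fold only.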
Model Design and Evaluation Rigor (weight: 15%)
- Train at least two distinct model families (e.g., a gradient-boosted tree such as XGBoost or LightGBM, plus a logistic regression baseline or lightweight PyTorch MLP).
- Evaluate every model on a held-out validation split using both NDCG@10 (ranking objective) and AUC-ROC (classification objective). Explain in writing what each metric captures and why both matter for a feed-ranking system.
- Include a calibration check (reliability diagram or Brier score) for at least the best-performing model and discuss what miscalibration would mean in a live feed-ranking context.
- Explicitly document overfitting mitigations (early stopping, regularization, cross-validation strategy).
- Collect all metrics, plots, and narrative in `evaluation_report.md`.
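For reference, the three metrics named above can be written out in a few lines of standard-library Python. This is a sketch for building intuition rather than a replacement for a tested library implementation. The function names are hypothetical. NDCG is computed here for a single ranked list with linear gain; how posts are grouped into lists (for example, random slates of candidate posts) is not specified in this brief and is an assumption you would need to state.

```python
import math
from typing import Sequence


def ndcg_at_k(labels: Sequence[int], scores: Sequence[float], k: int = 10) -> float:
    """NDCG@k for one ranked list, using linear gain (label / log2(rank + 1))."""
    def dcg(rels: Sequence[int]) -> float:
        return sum(r / math.log2(pos + 2) for pos, r in enumerate(rels[:k]))

    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    ideal_dcg = dcg(sorted(labels, reverse=True))
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC via the rank-sum (Mann-Whitney) identity, averaging tied ranks."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(lbl for _, lbl in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def brier_score(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)
```

NDCG@10 only rewards getting engaged posts into the top ten positions of a list, while AUC-ROC measures pairwise ordering across all examples. A model can therefore improve one without improving the other.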
Production-Readiness of Serving Endpoint (weight: 30% — invest the most depth here)
This criterion carries the highest weight. Reviewers will build your Docker image from scratch, exercise the endpoint with valid and invalid payloads, run your benchmark script against the live container, and read your written design discussion in full. Polish and thoroughness matter most here.
- Implement a FastAPI application with a `POST /rank` route that accepts a JSON body containing a list of 1–20 post objects (same schema as training, minus `engaged`) and returns them sorted by predicted engagement score, descending, with the predicted score attached to each item.
- Handle all edge cases gracefully with correct HTTP status codes and informative error messages: empty list (422), list exceeding 20 items (422), missing or malformed required fields (422), and unknown `subreddit` values not seen during training (handle gracefully without crashing; document your chosen strategy).
- Containerize the serving stack in a `Dockerfile`; the image must build and run with standard `docker build` + `docker run` commands without any manual intervention or host-side dependency installation.
- Include a `benchmark.py` script that sends at least 200 requests to the running container and reports mean, p50, p95, and p99 per-request inference latency. Include the benchmark results in your README.
- Write a minimum 400-word discussion (in `README.md` or a dedicated `SERVING.md`) on how this endpoint's design would change to serve Reddit-scale QPS (thousands of requests per second). Address batching strategies, model quantization or distillation, response caching, horizontal scaling, and load balancing. For each approach, state concretely why it fits this specific feed-ranking workload.
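The validation and ranking rules listed above can be sketched independently of the web framework. The function below returns a status code and a body, which a FastAPI route could translate into a response. In practice the length bound and required fields would more likely be declared on the request model so that the framework produces the 422 itself. The names `rank_posts`, `score_fn`, `predicted_score`, `ranked`, and the `__unknown__` fallback bucket are all hypothetical, and mapping unseen subreddits to a shared bucket is only one of several reasonable strategies.

```python
from typing import Any, Callable, Dict, List, Tuple

MAX_POSTS = 20
REQUIRED_FIELDS = ("post_id", "title", "subreddit", "score", "num_comments",
                   "upvote_ratio", "post_age_hours", "author_karma")
KNOWN_SUBREDDITS = {"AskReddit", "worldnews", "gaming", "science", "todayilearned",
                    "movies", "funny", "aww", "technology", "sports"}
UNKNOWN_SUBREDDIT = "__unknown__"  # hypothetical fallback category


def rank_posts(
    posts: Any, score_fn: Callable[[Dict[str, Any]], float]
) -> Tuple[int, Dict[str, Any]]:
    """Validate a batch of posts and return (status_code, response_body)."""
    if not isinstance(posts, list) or not 1 <= len(posts) <= MAX_POSTS:
        return 422, {"detail": f"expected a list of 1 to {MAX_POSTS} posts"}

    scored: List[Dict[str, Any]] = []
    for idx, post in enumerate(posts):
        if not isinstance(post, dict):
            return 422, {"detail": f"post at index {idx} is not an object"}
        missing = [name for name in REQUIRED_FIELDS if name not in post]
        if missing:
            return 422, {"detail": f"post at index {idx} is missing {missing}"}

        # Score a copy so the caller's payload is echoed back unchanged.
        features = dict(post)
        if features["subreddit"] not in KNOWN_SUBREDDITS:
            features["subreddit"] = UNKNOWN_SUBREDDIT
        scored.append({**post, "predicted_score": float(score_fn(features))})

    # Highest predicted engagement first.
    scored.sort(key=lambda item: item["predicted_score"], reverse=True)
    return 200, {"ranked": scored}
```

Keeping this logic in a plain function makes the edge cases unit-testable without starting a server. Type and range checks on individual fields are omitted here for brevity and would still be needed to cover the malformed-field case.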
Reproducibility and Code Quality (weight: 20%)
- The entire pipeline (dataset generation, feature engineering, model training, evaluation, and serving) must run end-to-end via a single top-level command (e.g., `make all` or `bash run_pipeline.sh`) on a fresh environment with pinned dependencies installed.
- All dependencies must be pinned in `requirements.txt` or `pyproject.toml`.
- Code must be modular: separate Python modules or packages for feature engineering, training, and serving. Use type annotations throughout and follow consistent style (PEP 8 or equivalent).
- Include at minimum a linting or static type-checking step (`flake8`, `ruff`, or `mypy`) invokable via `make lint` or an equivalent Makefile target. A brief CI config (e.g., `.github/workflows/lint.yml`) is strongly encouraged and will be rewarded.
System Scalability Reasoning in README (weight: 20%)
- Include a `## Scale` section in your README of at least 350 words covering all of the following with concrete, realistic proposals rather than a vague buzzword list:
  - Offline vs. online feature computation trade-offs and where a feature store (e.g., Feast, Tecton) fits into the architecture; explain why a feature store is the right tool for this specific workload.
  - How a Spark or Flink pipeline would replace the current batch preprocessing; state which framework you'd choose and why for Reddit's event volume.
  - Kafka (or a comparable stream-processing layer) for real-time signal ingestion; describe what signals would be streamed and why low latency matters.
  - Model serving latency budgets in a feed-ranking context; state what p99 latency is acceptable, justify it from a user-experience perspective, and explain how you'd enforce it.
  - A/B testing frameworks for safely rolling out a new ranking model; describe traffic-splitting strategy, guardrail metrics, and minimum detectable effect considerations.
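To make the minimum detectable effect (MDE) consideration concrete, the standard two-proportion sample-size approximation can be computed with the standard library. This is a rough planning sketch that assumes a two-sided z-test on a binary engagement metric with equal traffic per arm. The function name `samples_per_arm` and the default `alpha` and `power` values are illustrative assumptions.

```python
import math
from statistics import NormalDist


def samples_per_arm(
    baseline_rate: float, mde_abs: float, alpha: float = 0.05, power: float = 0.8
) -> int:
    """Approximate users needed per arm to detect an absolute lift of mde_abs."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)
```

The required sample size grows with the inverse square of the effect size, so halving the MDE roughly quadruples the traffic each arm needs. That relationship is what ties the traffic-splitting strategy to how long an experiment must run.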
Deliverables
Submit a public GitHub repository containing:
| File / Directory | Purpose |
|---|---|
| `generate_dataset.py` | The fixed-seed script above (included verbatim; do not modify) |
| `data/` | Output directory for `posts.csv` (gitignored); include `.gitkeep` |
| `features/` | Feature engineering module |
| `training/` | Model training and evaluation scripts |
| `serving/` | FastAPI application source |
| `Dockerfile` | Containerized serving stack |
| `benchmark.py` | Latency benchmark script |
| `evaluation_report.md` | All model metrics, calibration plots, ablation/importance results, leakage documentation |
| `Makefile` or `run_pipeline.sh` | Single-command reproducibility |
| `requirements.txt` or `pyproject.toml` | Pinned dependencies |
| `README.md` | Design decisions, feature insights, benchmark results, `## Scale` section (350+ words), high-QPS discussion (400+ words) or pointer to `SERVING.md` |
How we'll evaluate
- Feature Engineering Quality (15%): We will review whether time-decay, interaction, and text-derived features are present, clearly justified, and demonstrated to matter via importance or ablation results in `evaluation_report.md`. We will verify that data leakage is explicitly prevented and documented with a clear strategy.
- Model Design and Evaluation Rigor (15%): We will verify that at least two distinct model families are trained and compared using both NDCG@10 and AUC-ROC on a held-out split, that a calibration check is present for the best model, that overfitting mitigations are discussed, and that the candidate demonstrates understanding of the difference between ranking and classification objectives. Automated scoring uses the platform's hidden holdout test set.
- Production-Readiness of Serving Endpoint (30%): We will build your Docker image from scratch and call `POST /rank` with valid and invalid payloads (empty list, missing fields, oversized list, unknown subreddit), verifying correct ranked output and HTTP error codes. We will run `benchmark.py` against the live container and check that mean, p50, p95, and p99 latencies are reported. We will read your written discussion of high-QPS design changes and assess it for depth, concreteness, and correctness. This criterion carries the most weight; invest accordingly.
- Reproducibility and Code Quality (20%): We will clone the repository into a clean environment, run your single command, and verify it completes without errors. We will assess code modularity, type annotation coverage, dependency pinning, and whether a linting step is present and passes cleanly. A CI config will be rewarded.
- System Scalability Reasoning in README (20%): We will read your `## Scale` section and assess whether proposals are concrete, correctly motivated, and demonstrate genuine understanding of distributed ML systems, including feature stores, the choice between Spark and Flink, Kafka-based signal ingestion, latency budgets, and A/B testing frameworks. Vague buzzword lists without justification will score poorly.
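Both the benchmark check and the latency-budget discussion above rest on percentile latencies, so it helps to be precise about how they are computed. The sketch below summarizes a list of per-request timings, which `benchmark.py` might collect by wrapping each request in `time.perf_counter()` calls. The function name `summarize_latencies` and the choice of the inclusive interpolation method are assumptions; other percentile conventions give slightly different tail values on small samples.

```python
import statistics
from typing import Dict, List


def summarize_latencies(latencies_ms: List[float]) -> Dict[str, float]:
    """Return mean, p50, p95, and p99 for per-request latencies in milliseconds."""
    # quantiles(n=100) yields 99 cut points; index i - 1 is the i-th percentile.
    # At least two samples are required.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }
```

With only 200 requests, p99 is determined by the two or three slowest samples, so reporting the sample count alongside the percentiles makes the figures easier to interpret.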
Out of scope
- Training on any dataset other than the one produced by running `generate_dataset.py` verbatim.
- Modifying `generate_dataset.py` in any way that changes its output (the seed and parameters are fixed).
- Large language model fine-tuning or any model requiring a GPU at inference time.
- A frontend UI of any kind.
- Authentication or rate-limiting on the FastAPI endpoint.
- Multi-armed bandit or reinforcement learning approaches.
- Deployment to a live cloud environment (local Docker is sufficient).