Introduction
Model evaluation is one of the most deceptively dangerous areas in machine learning interviews.
On the surface, it looks straightforward. Most candidates can define accuracy, precision, recall, ROC curves, and bias–variance tradeoffs. Many can even write formulas or compute metrics from confusion matrices. And yet, model evaluation questions are where a large number of strong ML candidates quietly fail interviews.
Why?
Because interviewers are not testing whether you know evaluation metrics. They are testing whether you can trust metrics in the presence of real-world complexity.
By 2026, ML interviewers across Big Tech, startups, and AI-first companies have converged on a hard-earned lesson: most ML failures are not caused by bad models, but by bad evaluation. Models are optimized correctly, but against the wrong signal. Systems look good offline and fail in production. Metrics improve while user experience degrades.
Interviewers use model evaluation questions to identify whether you might repeat those mistakes.
Why Model Evaluation Questions Carry So Much Weight
From an interviewer’s perspective, model evaluation sits at the intersection of:
- Statistics
- ML theory
- Product impact
- Risk management
It reveals how you think about:
- Tradeoffs rather than absolutes
- Proxies rather than truths
- Uncertainty rather than confidence
Candidates who treat metrics as objective facts often sound confident, but unsafe. Candidates who treat metrics as tools with failure modes sound cautious, and trustworthy.
That distinction is often decisive.
The Most Common Misconception Candidates Have
Most candidates believe:
“If I know all the metrics, I’ll be fine.”
In reality, interviewers assume you know the metrics.
What they want to know is:
- When accuracy is misleading
- Why ROC-AUC can hide business failure
- When PR curves matter more
- How bias–variance shows up in production
- Why offline metrics often disagree with online outcomes
Candidates who answer evaluation questions mechanically (“Accuracy works when classes are balanced”) stop too early. Interviewers push further:
- Balanced in what sense?
- What if the class distribution shifts?
- What if false positives and false negatives have asymmetric cost?
- What if your metric incentivizes harmful behavior?
Candidates who cannot navigate these follow-ups are often downgraded, even if their initial answer was correct.
How Interviewers Actually Use Evaluation Questions
Model evaluation questions are rarely isolated. They appear embedded in:
- ML system design interviews
- Coding interviews (metric implementation)
- Project walkthroughs
- Debugging and error analysis discussions
Interviewers use them to test:
- Statistical intuition: Do you understand distributions, variance, and tradeoffs?
- Decision quality: Can you choose metrics that match objectives?
- Failure awareness: Do you anticipate misleading signals?
- Communication clarity: Can you explain evaluation to non-ML stakeholders?
A candidate who chooses the “wrong” metric but explains the reasoning clearly often scores higher than a candidate who chooses the “right” metric without justification.
Why Accuracy, ROC, and PR Are Interview Traps
Metrics like accuracy, ROC-AUC, and PR curves are intentionally common interview topics because they expose shallow thinking quickly.
For example:
- Accuracy hides class imbalance
- ROC-AUC can look strong while precision collapses
- PR curves are sensitive to base rates
- Threshold choice matters more than the curve itself
Interviewers expect candidates to go beyond definitions and talk about behavioral consequences.
When a candidate says:
“ROC-AUC is threshold-independent, so it’s better”
An experienced interviewer hears:
“This person may optimize a metric without understanding deployment consequences.”
Bias–Variance: More Than a Curve
Bias–variance questions are another favorite, not because interviewers care about the textbook tradeoff, but because they want to know:
- How you diagnose underfitting vs. overfitting
- How you respond when both appear simultaneously
- How bias–variance interacts with data quality, not just model choice
Candidates who treat bias–variance as an abstract curve often struggle to apply it in real scenarios.
Interviewers are listening for:
- Data-centric reasoning
- Error decomposition intuition
- Practical mitigation strategies
What This Blog Will Focus On
This blog is not a glossary. It is an interview playbook.
In the sections that follow, we will cover:
- High-frequency model evaluation interview questions
- Accuracy, precision, recall, ROC, PR, calibration, and beyond
- Bias–variance tradeoffs as interviewers expect you to explain them
- Common traps and misleading intuitions
- Example answers that balance rigor and practicality
- Typical follow-up questions interviewers ask to probe depth
Each explanation is structured to reflect how strong candidates actually answer in interviews, not how metrics are presented in textbooks.
Who This Blog Is For
This guide is designed for:
- ML Engineers
- Data Scientists
- Applied Scientists
- Software Engineers transitioning into ML
- Candidates preparing for FAANG, Big Tech, and AI-focused roles
You do not need to be a statistician. You need to demonstrate evaluation judgment.
The Core Principle to Keep in Mind
As you read the rest of this blog, remember:
Metrics do not tell the truth. They tell a story, and you are responsible for choosing which story matters.
Interviewers hire candidates who understand that responsibility.
Section 1: Accuracy, Precision, Recall - When Each One Fails
Accuracy, precision, and recall are often the first metrics discussed in ML interviews, and also the fastest way to expose shallow evaluation thinking. Interviewers do not ask about these metrics because they are simple. They ask because most candidates misuse them with confidence.
Understanding when each metric fails is far more important than knowing how to compute it.
Accuracy: The Most Misused Metric in Interviews
What candidates usually say
“Accuracy works well when classes are balanced.”
This answer is correct, and insufficient.
Why interviewers push further
Interviewers know that “balanced” is rarely static. They want to know whether you understand:
- How base rates change over time
- How deployment conditions differ from training
- How accuracy hides asymmetric costs
Accuracy collapses multiple types of errors into a single number. That makes it attractive, but also dangerous.
When Accuracy Actively Misleads
Accuracy fails when:
- One class dominates (fraud, spam, churn, defects)
- Error costs are asymmetric
- The operating threshold matters
- Distribution shifts occur post-deployment
A model that predicts “not fraud” 99% of the time can achieve 99% accuracy and still be useless.
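A minimal sketch of that failure mode, assuming scikit-learn and a synthetic dataset with roughly 1% positives: a "model" that never flags fraud still scores about 99% accuracy while catching nothing.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.01).astype(int)  # ~1% positives ("fraud")
y_pred = np.zeros_like(y_true)                     # always predict "not fraud"

print(f"accuracy: {accuracy_score(y_true, y_pred):.3f}")  # ~0.99
print(f"recall:   {recall_score(y_true, y_pred):.3f}")    # 0.00 -- misses every positive
```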
Strong interview framing
“Accuracy can look good even when the model fails at the only cases we care about.”
Interviewers listen for this cost-awareness.
Precision: Optimizing for Trust, Not Coverage
What precision measures
Of the cases the model flags as positive, how many are actually positive?
Candidates often say:
“Precision is important when false positives are costly.”
Again, correct, but shallow.
Where precision breaks down
Precision ignores what the model misses. You can increase precision by being conservative, flagging fewer positives, while silently destroying recall.
High precision can coexist with:
- Severe under-detection
- Missed risk
- Business failure
Precision as an Interview Trap
Interviewers often follow up with:
“What happens to recall when you optimize precision?”
Strong candidates respond:
“It usually drops, which is acceptable only if missing positives is cheaper than acting on false ones.”
This shows understanding of metric tradeoffs, not metric worship.
Recall: Coverage Without Trust
What recall measures
Of all actual positives, how many did the model catch?
Candidates often say:
“Recall is important when missing positives is costly.”
Still not enough.
Where recall fails
Recall can be maximized trivially by flagging everything as positive, destroying precision and overwhelming downstream systems.
High recall alone often leads to:
- Alert fatigue
- User friction
- Operational overload
Why Interviewers Care About the Precision–Recall Tension
Interviewers are not asking you to pick a metric. They are asking whether you understand incentives.
Each metric shapes behavior:
- Accuracy rewards majority-class correctness
- Precision rewards conservatism
- Recall rewards aggressiveness
Strong candidates talk about metrics as levers, not scores.
The Confusion Matrix Is the Real Evaluation Tool
Interviewers often pivot to:
“Can you walk through the confusion matrix here?”
This is intentional.
Strong candidates reason directly about:
- False positives vs. false negatives
- Who is harmed by each error
- How errors propagate downstream
Metrics are summaries. Confusion matrices expose consequences.
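A minimal sketch, assuming scikit-learn and toy labels, of how candidates can pull the raw error counts out of a confusion matrix and reason about who each error harms:

```python
from sklearn.metrics import confusion_matrix

# hypothetical labels and predictions for illustration
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"false positives: {fp}  (who gets wrongly flagged?)")
print(f"false negatives: {fn}  (what risk slips through undetected?)")
```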
Thresholds Matter More Than Metrics
Another common trap:
“ROC-AUC or precision is high, so the model is good.”
Interviewers know that:
- Curve metrics like ROC-AUC are threshold-agnostic
- Systems are not
Deployment requires a threshold. Thresholds determine:
- Error tradeoffs
- System load
- User experience
Strong candidates say:
“Metric choice is only half the decision. Threshold selection defines behavior.”
This distinction separates senior candidates from junior ones.
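To make that concrete, here is a minimal sketch with hypothetical model scores, assuming scikit-learn: the same scores produce very different precision/recall behavior depending on where the threshold sits.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.10, 0.30, 0.40, 0.45, 0.55, 0.80, 0.35, 0.20, 0.90, 0.60])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
# Raising the threshold trades recall for precision; the model never changed.
```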
Distribution Shift: The Silent Killer
Accuracy, precision, and recall all assume the evaluation distribution matches reality.
Interviewers expect you to acknowledge that:
- Base rates drift
- User behavior changes
- Adversaries adapt
A metric that looked correct at launch may fail six months later.
How Strong Candidates Frame Metric Choice
Strong candidates typically structure answers like this:
1. Define the business or system goal
2. Explain error costs
3. Choose a metric as a proxy
4. Acknowledge what the metric hides
5. Discuss thresholds and monitoring
Weak candidates stop at step 3.
Section 1 Summary: What Interviewers Are Really Testing
Interviewers are not testing whether you know definitions of accuracy, precision, and recall. They are testing whether you understand that:
- Metrics encode priorities
- Priorities create incentives
- Incentives shape behavior
- Behavior determines success or failure
Candidates who treat metrics as absolute truths are risky. Candidates who treat them as imperfect tools are trusted.
Section 2: ROC Curves, PR Curves, and Threshold Selection in Interviews
ROC curves and precision–recall (PR) curves are among the most common, and most misunderstood, topics in ML interviews. Interviewers do not ask about them to test whether you can draw curves. They ask because these metrics look sophisticated while hiding dangerous assumptions.
Candidates who treat ROC-AUC or PR-AUC as “better metrics” often fail follow-ups. Candidates who explain when each curve misleads tend to pass.
ROC Curves: What They Measure and What They Hide
What ROC curves show
ROC curves plot the true positive rate against the false positive rate across all thresholds.
Candidates often say:
“ROC-AUC is threshold-independent, so it’s a good metric.”
That statement is true, and incomplete.
What ROC-AUC hides
- Class imbalance
- Base-rate sensitivity
- Operational cost
- Real-world thresholds
A model can have a strong ROC-AUC while producing unusably low precision at realistic deployment thresholds.
The Interview Trap With ROC-AUC
Interviewers frequently ask:
“Would you use ROC-AUC for a fraud detection problem?”
Strong candidates respond:
“ROC-AUC is useful for ranking, but it can be misleading when positives are rare. I’d want to look at precision–recall as well.”
This shows metric skepticism, not metric ignorance.
Why ROC-AUC Looks Better Than It Is
ROC-AUC gives equal weight to false positives and false negatives in rate space. In highly imbalanced problems, small false positive rates can still produce overwhelming absolute numbers of false alerts.
Interviewers want you to understand that:
“A low false positive rate is not the same as few false positives.”
That single insight often distinguishes strong candidates.
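A back-of-the-envelope sketch with assumed traffic volumes makes the point: at a 0.1% base rate, a 1% false positive rate that looks harmless on an ROC curve still swamps the true detections.

```python
negatives, positives = 1_000_000, 1_000   # assumed daily volume, ~0.1% base rate
fpr, tpr = 0.01, 0.90                     # rates that look strong on an ROC curve

false_alerts = fpr * negatives            # 10,000 false alerts
true_alerts = tpr * positives             # 900 true detections
precision = true_alerts / (true_alerts + false_alerts)
print(f"false alerts: {false_alerts:.0f}, true alerts: {true_alerts:.0f}, "
      f"precision: {precision:.3f}")      # ~0.08 -- roughly 1 in 12 alerts is real
```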
Precision–Recall Curves: When They Matter More
What PR curves show
PR curves plot precision vs. recall across thresholds, directly reflecting class imbalance.
PR curves answer a more practical question:
“When the model fires, can we trust it?”
This makes PR curves critical for:
- Fraud detection
- Medical diagnosis
- Defect detection
- Abuse and spam systems
The Common PR Curve Misinterpretation
Candidates often say:
“PR curves are better when data is imbalanced.”
Interviewers push:
“Why?”
Strong candidates explain:
“Because PR curves reflect the positive class base rate and show how precision degrades as recall increases.”
This demonstrates distribution awareness, not memorization.
PR-AUC Is Not a Free Win
PR-AUC has its own traps:
- It’s sensitive to base-rate changes
- It’s harder to compare across datasets
- Improvements can be misleading without context
Interviewers respect candidates who say:
“PR-AUC is useful, but I’d still inspect specific operating points.”
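A minimal sketch of that divergence, assuming scikit-learn and a synthetic, heavily imbalanced dataset: the exact numbers will vary, but ROC-AUC typically looks far more comfortable than average precision (a common PR-AUC estimate) for the same model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# synthetic problem with roughly 1% positives and some label noise
X, y = make_classification(n_samples=20_000, weights=[0.99], flip_y=0.02,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print(f"ROC-AUC:           {roc_auc_score(y_te, proba):.3f}")
print(f"average precision: {average_precision_score(y_te, proba):.3f}")
```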
Threshold Selection: Where Interviews Are Actually Won
Metrics summarize performance. Thresholds define behavior.
Interviewers increasingly care about:
- How candidates choose thresholds
- How thresholds reflect business cost
- How thresholds are monitored over time
A common interviewer prompt:
“How would you choose a threshold?”
Weak answer:
“I’d pick the one that maximizes F1.”
Strong answer:
“I’d choose a threshold based on error cost, system capacity, and downstream impact, not just a metric optimum.”
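One way to operationalize that answer is a cost-weighted threshold sweep. This is a minimal sketch under an assumed 20:1 false-negative-to-false-positive cost ratio, not a prescribed recipe; the cost inputs are business decisions.

```python
import numpy as np

def pick_threshold(y_true, scores, cost_fp=1.0, cost_fn=20.0):
    """Return the threshold minimizing total error cost under assumed unit costs."""
    y_true = np.asarray(y_true)
    best_t, best_cost = 0.5, float("inf")
    for t in np.linspace(0.01, 0.99, 99):
        y_pred = scores >= t
        fp = np.sum(y_pred & (y_true == 0))
        fn = np.sum(~y_pred & (y_true == 1))
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# hypothetical held-out scores: positives tend to score higher than negatives
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)
scores = np.clip(0.35 * y_true + rng.normal(0.3, 0.2, 500), 0.0, 1.0)
print(pick_threshold(y_true, scores))  # a capacity cap could be added as a constraint
```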
Why F1 Is Often a Bad Default
F1 balances precision and recall, but assumes they matter equally.
Interviewers expect you to know:
- Many systems care asymmetrically
- Optimizing F1 can harm business outcomes
- F1 ignores true negatives entirely
A senior-level answer includes:
“F1 is a useful diagnostic, but rarely the deployment objective.”
Operating Points Matter More Than Curves
Interviewers listen closely when candidates talk about:
- Specific recall targets
- Alert volume constraints
- SLA-driven thresholds
- Human review capacity
Strong candidates reason about capacity and cost, not curves.
How Strong Candidates Talk About ROC/PR in Interviews
They:
1. State what the curve measures
2. Explain what it hides
3. Tie it to class imbalance
4. Discuss threshold choice explicitly
5. Acknowledge drift and monitoring
Weak candidates stop at step 1.
Distribution Shift Breaks Both Curves
ROC and PR curves assume stable distributions.
Interviewers expect you to mention:
- Base-rate drift
- Adversarial behavior
- Monitoring for metric decay
This theme connects strongly to real-world evaluation issues and is discussed more deeply in Common Pitfalls in ML Model Evaluation and How to Avoid Them, where offline–online metric mismatch is a recurring failure mode.
Section 2 Summary: What Interviewers Want to Hear
Interviewers are not testing curve literacy. They are testing whether you understand that:
- Metrics are summaries
- Thresholds define action
- Base rates change
- Costs dominate elegance
Candidates who frame ROC and PR curves as diagnostic tools, not deployment decisions, consistently score higher.
Section 3: Bias–Variance Tradeoff in Interviews (Beyond the Textbook)
Bias–variance questions appear in almost every ML interview, but not because interviewers care about the curve you memorized in school. They ask because bias–variance reasoning exposes how you diagnose failure, allocate effort, and make tradeoffs under constraints.
Candidates who treat bias–variance as a static diagram often fail follow-ups. Candidates who treat it as a debugging lens usually pass.
What Interviewers Assume You Already Know
Interviewers assume you know the basics:
- High bias → underfitting
- High variance → overfitting
- Increasing model complexity typically reduces bias and increases variance
Repeating this earns no points.
What interviewers want to know is whether you can apply bias–variance thinking to real systems, where data is messy, metrics drift, and multiple failure modes coexist.
How Bias–Variance Appears in Real Interviews
Interviewers rarely ask:
“Explain the bias–variance tradeoff.”
They ask questions like:
- “Training error is low, validation error is high, what now?”
- “Both training and validation performance are poor, what does that tell you?”
- “Performance improved offline but worsened online, why?”
Each question tests whether you can map symptoms to causes, not whether you know definitions.
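A minimal diagnostic sketch of that symptom-to-cause mapping, assuming scikit-learn; the `good_enough` and `gap_tolerance` cutoffs are hypothetical placeholders, not standard values.

```python
from sklearn.model_selection import cross_validate

def diagnose(model, X, y, good_enough=0.85, gap_tolerance=0.05):
    """Map train/validation scores to a first bias-vs-variance hypothesis."""
    res = cross_validate(model, X, y, cv=5, return_train_score=True)
    train, val = res["train_score"].mean(), res["test_score"].mean()
    if train < good_enough:
        hint = "likely high bias: check features, labels, and the objective first"
    elif train - val > gap_tolerance:
        hint = "likely high variance: check data volume, leakage, and over-tuning"
    else:
        hint = "offline looks fine: compare against online behavior and watch for drift"
    return round(train, 3), round(val, 3), hint
```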
Bias Is Not Just About Model Simplicity
A common candidate mistake is equating bias exclusively with:
- Linear models
- Simple architectures
Interviewers expect you to recognize other sources of bias:
- Poor or incomplete features
- Label noise or weak labels
- Incorrect objective or metric
- Data that fails to represent reality
A strong interview answer sounds like:
“High bias might come from model limitations, but also from feature gaps or a misaligned objective.”
This signals data-centric thinking, not algorithm fixation.
Variance Is Not Just About Overfitting
Similarly, variance is not just “the model is too complex.”
Interviewers listen for awareness of:
- Small or skewed datasets
- Data leakage
- Distribution mismatch between train and validation
- Over-tuning on validation sets
A senior-level response:
“High variance can be architectural, but it’s often driven by limited data or leakage inflating training performance.”
This shows that you think in systems, not silos.
The Bias–Variance Trap Interviewers Set
A classic interviewer trap:
“Your model underfits. What do you do?”
Weak answer:
“I’d use a more complex model.”
Strong answer:
“I’d first verify whether the features or labels are limiting signal before increasing complexity.”
Interviewers reward restraint. Jumping to complexity too quickly signals poor prioritization.
When Bias and Variance Coexist
Real systems often exhibit both:
- High bias in some segments
- High variance in others
Interviewers increasingly test this nuance.
Strong candidates say:
“Aggregate metrics may hide segment-level bias–variance issues. I’d break down errors by cohort.”
This reflects real-world debugging practice.
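A minimal sketch of that cohort breakdown, assuming pandas and a hypothetical `segment` column: the aggregate error rate hides the fact that one cohort is doing much worse.

```python
import pandas as pd

# hypothetical evaluation frame with per-example segment labels
df = pd.DataFrame({
    "segment": ["new_users"] * 4 + ["power_users"] * 4,
    "y_true":  [1, 0, 1, 0, 1, 0, 1, 0],
    "y_pred":  [0, 0, 0, 0, 1, 0, 1, 1],
})
df["error"] = (df["y_true"] != df["y_pred"]).astype(int)

print(f"overall error rate: {df['error'].mean():.2f}")
print(df.groupby("segment")["error"].mean())  # per-cohort error rates diverge
```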
Bias–Variance and Data Quantity: Not Always Intuitive
Another misconception:
“More data always reduces variance.”
Interviewers expect you to recognize exceptions:
- Adding noisier data can increase effective variance
- More biased data can worsen bias
- More data from the wrong distribution can degrade performance
A strong framing:
“More data helps only if it improves signal quality and coverage.”
Bias–Variance in the Presence of Distribution Shift
Bias–variance thinking must extend beyond static datasets.
Interviewers listen for awareness that:
- Models with low variance offline may behave unpredictably online
- Distribution shift can introduce new bias
- Overfitting to historical data can amplify future error
This connects directly to interview expectations discussed in Understanding the Bias–Variance Tradeoff in Machine Learning, where production failures are often misdiagnosed as “model issues” instead of data issues.
How Strong Candidates Use Bias–Variance as a Diagnostic Tool
Strong candidates structure answers like this:
- Identify the symptom (training vs. validation vs. online gap)
- Hypothesize bias vs. variance causes
- Propose low-risk interventions first
- Escalate complexity only if needed
For example:
“If validation performance is poor, I’d inspect feature quality and label noise before increasing model capacity.”
This sequence signals maturity and efficiency.
Bias–Variance and Business Constraints
Interviewers also care about whether you can operate under constraints:
- Latency
- Cost
- Interpretability
- Deployment complexity
A strong candidate says:
“Even if variance is high, we may accept it if constraints limit model complexity, and instead mitigate risk with monitoring.”
This demonstrates engineering judgment, not academic optimization.
Section 3 Summary: What Interviewers Are Really Testing
Bias–variance questions are not about curves. They are about whether you can:
- Diagnose failure correctly
- Avoid premature complexity
- Think data-first
- Adapt under constraints
- Communicate uncertainty
Candidates who treat bias–variance as a living diagnostic framework consistently outperform those who treat it as a definition.
Section 4: Calibration, Log Loss, and Probabilistic Evaluation in Interviews
As ML systems move from prediction to decision-making, interviewers have shifted their expectations accordingly. It is no longer enough for a model to rank examples correctly. Interviewers want to know whether your model’s probabilities can be trusted.
That is why calibration, log loss, and probabilistic evaluation appear more frequently in ML interviews, especially for senior and applied roles.
Candidates who treat probability outputs casually often fail these questions. Candidates who understand what probabilities mean operationally stand out immediately.
Why Interviewers Care About Probabilities (Not Just Rankings)
Metrics like accuracy, ROC-AUC, and PR-AUC focus on ordering. Many real systems, however, require:
- Thresholding based on confidence
- Risk scoring
- Expected cost optimization
- Human-in-the-loop decisions
In these systems, a score of 0.9 must actually mean “~90% likelihood.” If it doesn’t, downstream decisions become unreliable, even if ranking metrics look strong.
Interviewers use probabilistic evaluation questions to test whether you understand this distinction.
Calibration: What It Really Means
Most candidates define calibration as:
“Predicted probabilities matching observed frequencies.”
That definition is correct, but incomplete.
Interviewers expect you to understand that:
- Calibration measures confidence correctness
- A well-calibrated model is not necessarily accurate
- An accurate model is not necessarily calibrated
A strong interview explanation:
“Calibration tells us whether we can trust probability outputs for decision-making, not whether predictions are correct.”
This framing shifts the conversation from math to risk management.
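A minimal sketch of how that trust gets checked, assuming scikit-learn and hypothetical held-out probabilities: bin the predictions and compare the average predicted probability in each bin to the observed positive rate.

```python
from sklearn.calibration import calibration_curve

# hypothetical held-out labels and model probabilities
y_true = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
y_prob = [0.1, 0.2, 0.9, 0.3, 0.7, 0.8, 0.4, 0.6, 0.2, 0.9, 0.5, 0.7]

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=3)
for predicted, observed in zip(mean_pred, frac_pos):
    print(f"predicted ~{predicted:.2f} -> observed {observed:.2f}")
# A well-calibrated model keeps these two columns close to each other.
```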
Why Models Are Often Poorly Calibrated
Interviewers frequently ask:
“Why are modern models often miscalibrated?”
Strong candidates mention:
- Overparameterization
- Aggressive optimization of the training loss (driving cross-entropy toward zero)
- Class imbalance
- Distribution shift
- Training objectives misaligned with deployment needs
A particularly strong answer:
“Many models optimize for discrimination, not probability quality, so calibration degrades unless explicitly addressed.”
This signals applied understanding.
Calibration Failure Modes Interviewers Look For
Interviewers listen for awareness of:
- Overconfident predictions
- Underconfident predictions
- Calibration drift post-deployment
- Segment-level miscalibration
Strong candidates mention that calibration can vary across cohorts, even if global calibration looks acceptable.
Log Loss (Cross-Entropy): Why It Matters
Log loss is often introduced as:
“A loss function for probabilistic classifiers.”
Interviewers want more.
They expect you to understand that:
- Log loss penalizes confident wrong predictions heavily
- It encourages probability correctness, not just correctness
- It aligns better with probabilistic decision-making
A strong interview framing:
“Log loss discourages overconfidence, which is critical when probabilities drive downstream actions.”
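A minimal illustration, assuming scikit-learn and toy predictions: at a 0.5 threshold, both score vectors misclassify the same example, but the confidently wrong one pays a far larger log-loss penalty.

```python
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 0]
cautious  = [0.45, 0.40, 0.80, 0.20]   # wrong on the first example, but uncertain
confident = [0.01, 0.40, 0.80, 0.20]   # wrong on the first example, and very sure

print(f"cautiously wrong:  {log_loss(y_true, cautious):.3f}")   # ~0.44
print(f"confidently wrong: {log_loss(y_true, confident):.3f}")  # ~1.39
```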
The Log Loss Interview Trap
Interviewers often ask:
“Why not just optimize accuracy?”
Strong candidates respond:
“Accuracy ignores confidence. A model that’s slightly wrong but extremely confident is more dangerous than one that’s uncertain.”
This answer demonstrates risk-aware reasoning, which is heavily rewarded.
Log Loss vs. AUC: A Subtle but Important Distinction
Interviewers often probe this comparison.
Strong candidates explain:
- AUC evaluates ranking quality
- Log loss evaluates probability quality
- A model can improve AUC while worsening log loss
A senior-level insight:
“Optimizing AUC alone can encourage sharper rankings at the cost of probability reliability.”
This nuance separates experienced candidates from metric-driven ones.
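A small sketch of that distinction, assuming scikit-learn: the two score vectors below induce the same ordering, so ROC-AUC is identical, but the overconfident version is punished heavily on the example it misranks, so its log loss is much worse.

```python
from sklearn.metrics import roc_auc_score, log_loss

y_true = [0, 0, 1, 1, 0, 1]
modest = [0.20, 0.45, 0.40, 0.60, 0.30, 0.70]   # one misranked positive, mild scores
sharp  = [0.01, 0.97, 0.05, 0.98, 0.02, 0.99]   # same ordering, extreme confidence

for name, scores in (("modest", modest), ("sharp", sharp)):
    print(f"{name}: ROC-AUC={roc_auc_score(y_true, scores):.2f}  "
          f"log loss={log_loss(y_true, scores):.3f}")
# ROC-AUC is ~0.89 for both; log loss is ~0.49 (modest) vs ~1.09 (sharp).
```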
Calibration Techniques Interviewers Expect You to Know
You don’t need to implement them, but you should understand why they exist:
- Platt scaling
- Isotonic regression
- Temperature scaling
Strong candidates emphasize:
“Calibration is often a post-training step and must be validated on held-out data.”
They also mention that calibration itself can overfit if done incorrectly.
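A minimal sketch of that workflow, assuming scikit-learn: Platt-style sigmoid calibration applied as a post-training step via `CalibratedClassifierCV`, with the comparison made on held-out data rather than assumed.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(GradientBoostingClassifier(random_state=0),
                                    method="sigmoid", cv=3).fit(X_tr, y_tr)

# Calibration is not guaranteed to help on every dataset, which is exactly why
# it must be validated on held-out data rather than assumed.
print(f"raw log loss:        {log_loss(y_te, raw.predict_proba(X_te)[:, 1]):.3f}")
print(f"calibrated log loss: {log_loss(y_te, calibrated.predict_proba(X_te)[:, 1]):.3f}")
```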
Probabilistic Evaluation Beyond a Single Number
Interviewers are impressed when candidates mention:
- Reliability diagrams
- Calibration curves
- Expected calibration error (ECE)
But they are more impressed when candidates say:
“I’d inspect calibration visually, not just trust a summary metric.”
This signals skepticism toward single-number evaluation.
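For reference, a minimal sketch of expected calibration error for binary positive-class probabilities, under the usual equal-width binning assumption. It is the kind of summary that should sit next to the reliability diagram, not replace it.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average gap between predicted probability and observed rate per bin."""
    y_true, y_prob = np.asarray(y_true, float), np.asarray(y_prob, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_prob >= lo) & ((y_prob < hi) if hi < 1.0 else (y_prob <= hi))
        if mask.any():
            confidence = y_prob[mask].mean()   # average predicted probability in bin
            observed = y_true[mask].mean()     # observed positive rate in bin
            ece += mask.mean() * abs(confidence - observed)
    return ece

print(expected_calibration_error([0, 0, 1, 1], [0.1, 0.4, 0.6, 0.9]))
```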
Calibration Under Distribution Shift
A critical interview insight:
“Calibration degrades faster than ranking under distribution shift.”
Interviewers expect you to mention:
- Periodic recalibration
- Monitoring confidence drift
- Segment-level calibration checks
How Strong Candidates Talk About Probabilistic Evaluation
They:
- Explain why probabilities matter
- Describe what calibration measures
- Connect log loss to overconfidence risk
- Acknowledge tradeoffs with ranking metrics
- Discuss monitoring and drift
Weak candidates stop at definitions.
Section 4 Summary: What Interviewers Are Really Testing
When interviewers ask about calibration and log loss, they are not testing metric trivia. They are asking:
- Can you trust your model’s confidence?
- Do you understand risk beyond accuracy?
- Will your decisions degrade safely under uncertainty?
Candidates who treat probabilities as first-class outputs, not side effects, are consistently evaluated more favorably.
Conclusion
Model evaluation is where ML interviews are most often decided, and least often understood.
By the time interviewers ask evaluation questions, they already assume you can train a model and compute metrics. What they are testing is something subtler and far more important: whether you can be trusted to decide when a model is good enough, when it is misleading, and what to fix first when it fails.
Throughout this blog, a consistent pattern emerges. Candidates who struggle with evaluation tend to:
- Treat metrics as objective truths rather than proxies
- Optimize single numbers without considering incentives
- Ignore thresholds, costs, and capacity constraints
- Assume offline validation generalizes cleanly to production
- Escalate model complexity instead of diagnosing error sources
Candidates who succeed do the opposite. They treat evaluation as a decision-making discipline, not a reporting task. They:
- Explain what each metric measures and what it hides
- Tie metric choice to business or system cost
- Discuss thresholds explicitly
- Expect distribution shift and monitor for it
- Use error analysis to guide incremental, low-risk fixes
Interviewers are not looking for perfect metrics. They are looking for evaluation judgment.
One of the most important insights to internalize is this:
A model that looks good on the wrong metric is worse than a weaker model evaluated honestly.
This is why interviewers push so hard on accuracy vs. precision/recall, ROC vs. PR curves, bias–variance diagnostics, calibration, and error analysis. Each of these topics exposes how you reason under uncertainty, and whether you understand the consequences of your choices.
Another recurring theme is restraint. Senior candidates are not those who know the most metrics, but those who know when not to trust them. Saying “this metric alone isn’t sufficient” is often a stronger signal than presenting a higher score.
These expectations align with real-world ML failures, where evaluation, not modeling, accounts for the majority of costly mistakes. Similar themes appear in The Complete ML Interview Prep Checklist (2026), where evaluation rigor consistently ranks as a top differentiator between offers and rejections.
If you approach evaluation questions as opportunities to show caution, prioritization, and ownership, rather than mathematical correctness, you will consistently outperform candidates who treat them as trivia.
Ultimately, strong evaluation answers make interviewers feel safe. And in ML hiring, safety beats sophistication.
Frequently Asked Questions (FAQs)
1. Why do interviewers focus so much on model evaluation?
Because evaluation errors cause silent failures in production. Interviewers use evaluation questions to assess judgment and risk awareness.
2. Is accuracy ever a good metric to use in interviews?
Yes, but only when classes are balanced and error costs are symmetric. You must explain why those conditions hold.
3. What’s the most common evaluation mistake candidates make?
Choosing a metric without explaining what behavior it incentivizes or what it hides.
4. Do I need to memorize formulas for ROC, PR, or log loss?
No. Interviewers care more about intuition, tradeoffs, and failure modes than formulas.
5. Why is ROC-AUC often misleading in real systems?
Because it ignores class imbalance and deployment thresholds, which dominate real-world behavior.
6. When should I prefer PR curves over ROC curves?
When the positive class is rare and precision matters operationally, such as fraud or abuse detection.
7. How should I talk about threshold selection in interviews?
Frame thresholds around cost, capacity, and downstream impact, not metric optimization alone.
8. Is F1-score a good default metric?
Rarely. It assumes precision and recall are equally important, which is often false in practice.
9. How deep should I go into bias–variance in interviews?
Focus on diagnosis and mitigation under constraints, not textbook curves.
10. What do interviewers want to hear about calibration?
That you understand probability quality, overconfidence risk, and monitoring under distribution shift.
11. Can a model have good AUC but poor log loss?
Yes. Ranking quality and probability quality are different, and interviewers expect you to know that.
12. How do interviewers expect me to approach error analysis?
By segmenting errors, prioritizing by cost, and proposing targeted fixes, not by retraining blindly.
13. Should I always suggest changing the model when performance is poor?
No. Interviewers prefer candidates who investigate data, labels, thresholds, and features first.
14. How do I handle evaluation questions if I lack production experience?
Reason carefully, state assumptions, and focus on tradeoffs and failure modes rather than anecdotes.
15. How do I know if my evaluation answers are strong enough?
If your answers consistently explain why a metric was chosen, what it hides, and how you’d monitor it, you’re meeting the bar.