Question 10 of 10
What evaluation metrics are appropriate for different NLP tasks? How do you evaluate generation quality, and what are the limitations of automated metrics like BLEU and ROUGE?
Sample answer preview
Evaluation metrics for NLP vary significantly across task types. Choosing appropriate metrics is crucial because the metric determines which model behaviors are rewarded during development and model selection. Classification tasks use standard metrics such as precision, recall, and F1.
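To make the classification metrics and the n-gram overlap idea behind BLEU concrete, here is a minimal sketch in plain Python. The function names, the whitespace tokenization, and the example sentences are illustrative assumptions, not part of the original answer. The second function computes only clipped unigram precision (one ingredient of BLEU, with no brevity penalty or higher-order n-grams), so it should not be read as a full BLEU implementation.

```python
from collections import Counter


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one positive class (binary view)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


def clipped_unigram_precision(candidate, reference):
    """Fraction of candidate tokens that also occur in the reference.

    Counts are clipped so a token cannot match more often than it appears
    in the reference. Assumes naive lowercase whitespace tokenization.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count per token (the "clipping").
    matches = sum((cand_counts & ref_counts).values())
    return matches / total


if __name__ == "__main__":
    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 1, 1, 0, 0, 1]
    print(precision_recall_f1(y_true, y_pred))

    reference = "the cat sat on the mat"
    paraphrase = "a feline rested on the rug"
    print(clipped_unigram_precision(paraphrase, reference))
```

In this toy example the paraphrase shares only two of its six tokens with the reference, so its overlap score is low even though the meaning is close. That is the core limitation of surface-overlap metrics such as BLEU and ROUGE: they reward matching words, not matching meaning.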
precision, recall, F1, perplexity, BLEU, ROUGE