Assignment 6: Fine-tuning a Transformer for an NLP Task
Build a small, reproducible experiment that fine-tunes one Transformer model for one NLP task using a public dataset. Train the model and evaluate performance using task-appropriate metrics. All explanations and analysis must be included directly in the notebook using Markdown cells.
1. Choose 1 task and dataset (pick one option)
Pick one option and clearly document which dataset split you used.
Option A — Sentiment Classification
- Task: classify a text as positive / negative / neutral (TweetEval Sentiment has all three classes; SST-2 is binary, positive / negative only)
- Dataset: a small subset (e.g., 500–2,000 examples) from a public dataset such as TweetEval Sentiment or SST-2
- Metrics: Accuracy, Macro-F1, Confusion Matrix
Option B — Natural Language Inference (NLI)
- Task: classify a pair of sentences as entailment / contradiction / neutral
- Dataset: a small subset of MNLI or SNLI
- Metrics: Accuracy, Macro-F1
Option C — Summarization
- Task: generate a short summary (1–2 sentences)
- Dataset: a small subset of CNN/DailyMail, XSum, or another short-news dataset
- Metrics: ROUGE-1, ROUGE-2, ROUGE-L (and optionally BERTScore)
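Whichever option you pick, the "small subset" should be drawn reproducibly. The following is a minimal sketch, not a required implementation: it assumes the labels are available as a plain Python list, and the helper name `stratified_subset` is hypothetical. It samples with a fixed seed and keeps each class's share of the data roughly intact.

```python
import random
from collections import defaultdict


def stratified_subset(labels, n_total, seed=42):
    """Return sorted indices of a subset that preserves class proportions.

    Because of rounding, the subset size can differ from n_total by a
    few examples when the class shares are not round numbers.
    """
    rng = random.Random(seed)  # local generator: does not touch global state
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    chosen = []
    for label in sorted(by_label):
        idxs = by_label[label]
        # Keep each class's share of the full data (at least one example).
        k = max(1, round(n_total * len(idxs) / len(labels)))
        chosen.extend(rng.sample(idxs, min(k, len(idxs))))
    return sorted(chosen)


# Toy label list standing in for a real dataset's label column.
toy_labels = [0] * 60 + [1] * 30 + [2] * 10
subset_indices = stratified_subset(toy_labels, n_total=20, seed=0)
```

If the data is loaded with the Hugging Face `datasets` library, the resulting indices can be passed to `Dataset.select(...)` to materialize the subset. Record the seed in the notebook so the same subset can be recreated.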
2. Choose 1 Transformer model
Use one pre-trained Transformer model from Hugging Face for the entire experiment. Examples:
- distilbert-base-uncased
- bert-base-uncased
- roberta-base
- google-t5/t5-small (for summarization tasks)
- facebook/bart-base (for summarization)
Use only one model for the whole experiment.
3. Required Experiments
You must fine-tune the same model on the same task and dataset and evaluate it under different training configurations. Run at least four experiments by varying one of the following factors:
1. Training set size
Example: 10%, 25%, 50%, and 100% of the training data
2. Number of epochs
Example: 1 epoch, 2 epochs, 3 epochs, 5 epochs
3. Learning rate
Example: 1e-5, 2e-5, 3e-5, 5e-5
4. A combination of two controlled settings
Example: compare two dataset sizes and two learning rates
Important
- Use the same validation/test set for all experiments.
- Keep the evaluation protocol identical across runs.
- Clearly explain which variable you changed and why.
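One way to keep the comparison controlled is to define a single base configuration and derive every run from it, so that only the chosen factor changes. The sketch below is illustrative only: the class `RunConfig`, its field names, and the default values are assumptions, not part of the assignment.

```python
from dataclasses import dataclass, replace
from itertools import product


@dataclass(frozen=True)
class RunConfig:
    """Settings for one fine-tuning run (defaults are illustrative)."""
    model_name: str = "distilbert-base-uncased"
    train_fraction: float = 1.0
    num_epochs: int = 3
    learning_rate: float = 2e-5
    seed: int = 42


def vary_one_factor(base, factor, values):
    """Build one config per value, changing only `factor`."""
    if not hasattr(base, factor):
        raise ValueError(f"unknown factor: {factor}")
    return [replace(base, **{factor: v}) for v in values]


base = RunConfig()

# Four runs that differ only in learning rate.
lr_sweep = vary_one_factor(base, "learning_rate", [1e-5, 2e-5, 3e-5, 5e-5])

# A 2 x 2 combination: two training set sizes and two learning rates.
grid = [
    replace(base, train_fraction=frac, learning_rate=lr)
    for frac, lr in product([0.5, 1.0], [2e-5, 5e-5])
]

for cfg in lr_sweep:
    print(cfg)
```

Because the dataclass is frozen, a run cannot silently modify a setting that is supposed to stay fixed.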
4. Model and Training Requirements
Your implementation must include:
- Loading the dataset (train/dev/test or train/test)
- Preprocessing the data for the selected model
- Tokenizing the inputs using the corresponding tokenizer
- Fine-tuning the model on the training split
- Evaluating the model on a held-out evaluation set
You must explain:
- why you selected that model
- how you prepared the inputs
- which hyperparameters you used
- what training settings changed across experiments
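The implementation steps listed in this section could look roughly as follows for single-text classification (Option A). This is a hedged sketch rather than a reference solution. It assumes the Hugging Face `transformers` library is installed, that `train_ds` and `eval_ds` are `datasets.Dataset` objects with a text column and an integer `label` column, and that the batch sizes, `max_length=128`, and weight decay are placeholder values. The function names `training_kwargs` and `fine_tune` are hypothetical. NLI would pass both sentences to the tokenizer, and summarization would need the sequence-to-sequence model and trainer classes instead.

```python
def training_kwargs(learning_rate, num_epochs, seed=42, output_dir="runs/demo"):
    """Keyword arguments for transformers.TrainingArguments (illustrative values)."""
    return {
        "output_dir": output_dir,
        "learning_rate": learning_rate,
        "num_train_epochs": num_epochs,
        "per_device_train_batch_size": 16,
        "per_device_eval_batch_size": 32,
        "weight_decay": 0.01,
        "seed": seed,          # fixed seed for reproducibility
        "report_to": "none",   # no external experiment tracker
    }


def fine_tune(train_ds, eval_ds, model_name, num_labels, learning_rate,
              num_epochs, text_column="text", compute_metrics=None):
    """Tokenize, fine-tune, and evaluate; returns the dict from Trainer.evaluate()."""
    # Imported lazily so training_kwargs() works without the library installed.
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              DataCollatorWithPadding, Trainer, TrainingArguments)

    # Use the tokenizer that matches the chosen checkpoint.
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize(batch):
        return tokenizer(batch[text_column], truncation=True, max_length=128)

    train_tok = train_ds.map(tokenize, batched=True)
    eval_tok = eval_ds.map(tokenize, batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    trainer = Trainer(
        model=model,
        args=TrainingArguments(**training_kwargs(learning_rate, num_epochs)),
        train_dataset=train_tok,
        eval_dataset=eval_tok,  # the same held-out set for every run
        data_collator=DataCollatorWithPadding(tokenizer),  # pads per batch
        compute_metrics=compute_metrics,
    )
    trainer.train()
    return trainer.evaluate()
```

Calling `fine_tune` once per configuration, always with the same `eval_ds`, keeps the evaluation protocol identical across experiments.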
5. Evaluation Requirements
Compute task-appropriate metrics.
Classification
- Accuracy
- Macro-F1
- Confusion Matrix (for classification tasks)
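To make the classification metrics concrete, here is a small pure-Python sketch on a toy example. The function names are illustrative; in the notebook you may prefer a library such as scikit-learn (for example `f1_score` with `average="macro"`). Macro-F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    cm = confusion_matrix(y_true, y_pred, labels)
    f1_scores = []
    for i in range(len(labels)):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(len(labels))) - tp  # column minus diagonal
        fn = sum(cm[i]) - tp                                 # row minus diagonal
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted examples gets F1 = 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(labels)


labels = ["neg", "neu", "pos"]
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos"]

print(confusion_matrix(y_true, y_pred, labels))  # [[2, 0, 0], [0, 1, 1], [1, 0, 1]]
print(round(accuracy(y_true, y_pred), 3))        # 0.667
print(round(macro_f1(y_true, y_pred, labels), 3))  # 0.656
```

In the toy example the per-class F1 scores are 0.8 (neg), 0.667 (neu), and 0.5 (pos), which average to about 0.656.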
Summarization
- ROUGE
- BLEU
- BERTScore
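For intuition about what ROUGE measures, the following simplified sketch computes ROUGE-N and ROUGE-L as F1 scores over lowercase whitespace tokens. It is an assumption-laden teaching version: it skips the stemming and tokenization rules of the standard implementations, so its numbers will not match a library exactly. Use an established ROUGE package for the reported results.

```python
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    if overlap == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference, candidate, n=1):
    """F1 of clipped n-gram overlap between candidate and reference."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # & keeps the minimum count per n-gram
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """F1 based on the longest common subsequence of tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    return f1(lcs_length(ref, cand), len(cand), len(ref))


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(round(rouge_n(reference, candidate, 1), 3))  # 0.833
print(round(rouge_n(reference, candidate, 2), 3))  # 0.6
print(round(rouge_l(reference, candidate), 3))     # 0.833
```

In the example, five of six unigrams and three of five bigrams overlap, and the longest common subsequence ("the cat on the mat") has five tokens.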
Present the results in a single comparative table covering all experiments (at least four).
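A comparative table can be produced directly from the per-run results. The sketch below is one possible layout, assuming each run is stored as a dictionary; the helper `format_results_table` and the column names are hypothetical, and the metric cells are left empty (`None`, shown as "TBD") because they must come from your own runs.

```python
def format_results_table(rows, columns):
    """Render a list of dicts as an aligned plain-text table."""
    def cell(value):
        if value is None:
            return "TBD"
        if isinstance(value, float):
            return f"{value:.4g}"  # compact form, also for small learning rates
        return str(value)

    body = [[cell(row[col]) for col in columns] for row in rows]
    widths = [
        max([len(col)] + [len(r[i]) for r in body])
        for i, col in enumerate(columns)
    ]

    def line(cells):
        return " | ".join(text.ljust(w) for text, w in zip(cells, widths))

    separator = "-+-".join("-" * w for w in widths)
    return "\n".join([line(columns), separator] + [line(r) for r in body])


columns = ["run", "learning_rate", "epochs", "accuracy", "macro_f1"]
rows = [
    {"run": i + 1, "learning_rate": lr, "epochs": 3, "accuracy": None, "macro_f1": None}
    for i, lr in enumerate([1e-5, 2e-5, 3e-5, 5e-5])
]
table = format_results_table(rows, columns)
print(table)
```

Listing the varied factor next to the metrics in the same row makes it easy to see which setting produced which result.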
6. Submission
Submit the Jupyter Notebook (.ipynb) through Moodle, containing:
- the code
- explanations in Markdown cells
- results and analysis
Learning Goal
The goal of this assignment is not to achieve state-of-the-art performance, but to understand:
- how Transformer fine-tuning works
- how training choices affect model performance
- how to evaluate NLP models in a controlled and reproducible way