Build a small, reproducible experiment that fine-tunes one Transformer model for one NLP task using a public dataset. Train the model and evaluate performance using task-appropriate metrics. All explanations and analysis must be included directly in the notebook using Markdown cells.

1. Choose 1 task and dataset (pick one option)

Clearly document which option you chose and which dataset splits you used for training and evaluation.

Option A — Sentiment Classification

Task: classify a text by sentiment (positive / negative / neutral for TweetEval Sentiment; positive / negative only for SST-2, which is a binary dataset)
Dataset: a small subset (e.g., 500–2,000 examples) from a public dataset such as TweetEval Sentiment or SST-2
Metrics: Accuracy, Macro-F1, Confusion Matrix

Option B — Natural Language Inference (NLI)

Task: classify a pair of sentences as entailment / contradiction / neutral
Dataset: a small subset of MNLI or SNLI
Metrics: Accuracy, Macro-F1

Option C — Summarization

Task: generate a short summary (1–2 sentences)
Dataset: a small subset of CNN/DailyMail, XSum, or another short-news dataset
Metrics: ROUGE-1, ROUGE-2, ROUGE-L (and optionally BERTScore)

2. Choose 1 Transformer model

Use one pre-trained Transformer model from Hugging Face for the entire experiment. Examples:

distilbert-base-uncased
bert-base-uncased
roberta-base
t5-small (also published as google-t5/t5-small; for summarization)
facebook/bart-base (for summarization)

Do not switch models between experiments: the runs are only comparable if the model is held fixed.

3. Required Experiments

You must fine-tune the same model on the same task and dataset and evaluate it under different training configurations. Run at least four experiments, varying one of the following factors (or, for option 4, a controlled combination of two) while keeping every other setting fixed:

1. Training set size

Example: 10%, 25%, 50%, and 100% of the training data

2. Number of epochs

Example: 1, 2, 3, and 5 epochs

3. Learning rate

Example: 1e-5, 2e-5, 3e-5, 5e-5

4. A combination of two controlled settings

Example: compare two training set sizes and two learning rates (a 2 x 2 design, which gives four runs)
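
The combined design in option 4 can be enumerated programmatically so that no run is skipped or duplicated. The following is a minimal sketch, not a required implementation: the specific fractions, learning rates, fixed settings, and the helper name build_runs are all illustrative assumptions.

```python
from itertools import product

# Hypothetical values for a 2 x 2 design (assumptions, not requirements).
train_fractions = [0.25, 1.0]
learning_rates = [2e-5, 5e-5]

# Settings held constant in every run (illustrative values only).
fixed_settings = {
    "model_name": "distilbert-base-uncased",
    "num_epochs": 3,
    "batch_size": 16,
    "seed": 42,
}


def build_runs(fractions, rates, fixed):
    """Return one config dict per (fraction, learning rate) combination."""
    runs = []
    for i, (fraction, rate) in enumerate(product(fractions, rates), start=1):
        run = dict(fixed)  # copy so that the runs do not share state
        run.update({"run_id": f"run{i}",
                    "train_fraction": fraction,
                    "learning_rate": rate})
        runs.append(run)
    return runs


runs = build_runs(train_fractions, learning_rates, fixed_settings)
for run in runs:
    print(run["run_id"], run["train_fraction"], run["learning_rate"])
```

Each resulting dictionary can then be passed to whatever training function the notebook defines, which keeps the varied and fixed settings visible in one place.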

Important

Use the same validation/test set for all experiments.
Keep the evaluation protocol identical across runs.
Clearly explain which variable you changed and why.
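
When the varied factor is training set size, one way to keep the comparison controlled is to draw the subsets from a single seeded shuffle of the training indices and never touch the evaluation split. The sketch below assumes a training set of 2,000 examples and a seed of 42; both numbers and the helper name nested_subsets are illustrative.

```python
import random


def nested_subsets(n_examples, fractions, seed=42):
    """Map each fraction to a sorted list of training indices.

    All subsets are prefixes of one seeded shuffle, so every smaller
    subset is contained in every larger one and reruns are identical.
    """
    rng = random.Random(seed)
    order = list(range(n_examples))
    rng.shuffle(order)
    subsets = {}
    for fraction in fractions:
        k = max(1, round(n_examples * fraction))
        subsets[fraction] = sorted(order[:k])
    return subsets


# Only the training indices are subsampled; the evaluation split stays as is.
subsets = nested_subsets(2000, [0.10, 0.25, 0.50, 1.00])
for fraction, indices in subsets.items():
    print(f"{fraction:.0%}: {len(indices)} training examples")
```

With the Hugging Face datasets library, such an index list could be passed to Dataset.select to materialize each training subset.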

4. Model and Training Requirements

Your implementation must include:

Loading the dataset (train/dev/test or train/test)
Preprocessing the data for the selected model
Tokenizing the inputs using the corresponding tokenizer
Fine-tuning the model on the training split
Evaluating the model on a held-out evaluation set

You must explain:

why you selected that model
how you prepared the inputs
which hyperparameters you used
what training settings changed across experiments
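
To document the hyperparameters and what changed across experiments, it can help to store each run's settings in one record and let code report which fields differ. This is a sketch under assumed field names and values; RunConfig and changed_settings are hypothetical helpers, not part of any library.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunConfig:
    """All settings of one fine-tuning run (field names are illustrative)."""
    model_name: str
    max_length: int
    batch_size: int
    learning_rate: float
    num_epochs: int
    train_fraction: float
    seed: int


def changed_settings(configs):
    """Return the names of fields that take more than one value across runs."""
    rows = [asdict(config) for config in configs]
    return [key for key in rows[0] if len({row[key] for row in rows}) > 1]


# Assumed base settings; only the learning rate varies in this example.
base = dict(model_name="distilbert-base-uncased", max_length=128,
            batch_size=16, num_epochs=3, train_fraction=1.0, seed=42)
configs = [RunConfig(learning_rate=lr, **base) for lr in (1e-5, 2e-5, 3e-5, 5e-5)]

print("Varied settings:", changed_settings(configs))
```

If the reported list contains more fields than intended, the runs are not a controlled comparison and the configuration should be corrected before training.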

5. Evaluation Requirements

Compute task-appropriate metrics.

Classification

Accuracy
Macro-F1
Confusion Matrix
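
To make the classification metrics concrete, the sketch below computes them from scratch on a tiny made-up example. The labels and predictions are illustrative toy data, not results. In practice, scikit-learn's accuracy_score, f1_score(average="macro"), and confusion_matrix are the more likely choice.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true, pred in zip(y_true, y_pred):
        matrix[index[true]][index[pred]] += 1
    return matrix


def accuracy(matrix):
    """Share of examples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def macro_f1(matrix):
    """Unweighted mean of per-class F1; a class with no support or predictions scores 0."""
    n = len(matrix)
    scores = []
    for i in range(n):
        tp = matrix[i][i]
        fp = sum(matrix[r][i] for r in range(n)) - tp  # predicted i, true label differs
        fn = sum(matrix[i]) - tp                       # true i, predicted label differs
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Toy data for illustration only.
labels = ["negative", "neutral", "positive"]
y_true = ["negative", "negative", "neutral", "neutral", "positive", "positive"]
y_pred = ["negative", "neutral", "neutral", "neutral", "positive", "negative"]

cm = confusion_matrix(y_true, y_pred, labels)
print(cm)
print(f"accuracy={accuracy(cm):.3f}  macro_f1={macro_f1(cm):.3f}")
```

Macro-F1 averages the per-class scores without weighting by class frequency, so it penalizes a model that ignores a minority class even when accuracy looks acceptable.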

Summarization

ROUGE (ROUGE-1, ROUGE-2, ROUGE-L)
BLEU
BERTScore (optional)
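
To show what the ROUGE variants measure, here is a deliberately simplified sketch: ROUGE-N as n-gram overlap F1 and ROUGE-L as F1 over the longest common subsequence. It assumes lowercased whitespace tokenization and a single reference. Standard implementations (for example the rouge_score package or the rouge metric in Hugging Face evaluate) use their own tokenization and optional stemming, so their numbers may differ, and they should be preferred for reported results.

```python
from collections import Counter


def tokenize(text):
    """Simplified tokenization (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def f1(overlap, n_candidate, n_reference):
    """F1 from an overlap count and the candidate/reference unit counts."""
    if overlap == 0:
        return 0.0
    precision = overlap / n_candidate
    recall = overlap / n_reference
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """F1 over clipped n-gram overlap between candidate and reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(tokenize(candidate)), ngrams(tokenize(reference))
    overlap = sum((cand & ref).values())  # & keeps the minimum count per n-gram
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def rouge_l(candidate, reference):
    """F1 over the length of the longest common subsequence of tokens."""
    a, b = tokenize(candidate), tokenize(reference)
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return f1(prev[-1], len(a), len(b))


# Toy sentences for illustration only.
candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(rouge_n(candidate, reference, 1),
      rouge_n(candidate, reference, 2),
      rouge_l(candidate, reference))
```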

Present the results in a single comparative table covering all experiments (at least four), with one row per run and one column per metric.
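
One possible way to produce such a table is to collect one record per run and render the records as a Markdown table. The sketch below is only an illustration: the column names are assumptions, and the metric cells are left empty (shown as n/a) until real results are filled in.

```python
def format_cell(value):
    """Render None as n/a, floats with three significant digits, the rest as text."""
    if value is None:
        return "n/a"
    if isinstance(value, float):
        return f"{value:.3g}"
    return str(value)


def markdown_table(rows, columns):
    """Build a Markdown table with one line per run."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = ["| " + " | ".join(format_cell(row[c]) for c in columns) + " |"
            for row in rows]
    return "\n".join([header, divider] + body)


# Metric values are intentionally None: fill them in from the actual runs.
columns = ["run", "train_size", "accuracy", "macro_f1"]
rows = [
    {"run": "run1", "train_size": "10%", "accuracy": None, "macro_f1": None},
    {"run": "run2", "train_size": "25%", "accuracy": None, "macro_f1": None},
    {"run": "run3", "train_size": "50%", "accuracy": None, "macro_f1": None},
    {"run": "run4", "train_size": "100%", "accuracy": None, "macro_f1": None},
]

table = markdown_table(rows, columns)
print(table)
```

In a notebook, the resulting string can be pasted into a Markdown cell or rendered with IPython.display.Markdown.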

6. Submission

Submit the Jupyter Notebook (.ipynb) through Moodle, containing:

- the code
- explanations in Markdown cells
- results and analysis

Learning Goal

The goal of this assignment is not to achieve state-of-the-art performance, but to understand:

how Transformer fine-tuning works
how training choices affect model performance
how to evaluate NLP models in a controlled and reproducible way