PLN-AVZ-LCD: Tarea 5.3 - Implementation Assignment: Prompting Strategies for an Open-Source LLM

Abrió: miércoles, 5 de marzo de 2026, 00:00

Cierre: jueves, 12 de marzo de 2026, 14:00

Build a small, reproducible experiment that compares Zero-shot (ZSL), 1-shot, 3-shot, and 5-shot prompting for one NLP task using one open-source instruction-tuned LLM. Save predictions and evaluate them with task-appropriate metrics.

Choose 1 task and dataset (pick one option)

Option A — Sentiment Classification (recommended)

Task: classify a text as positive / negative / neutral
Dataset: a small subset (e.g., 100–500 examples) from a public dataset such as TweetEval Sentiment or SST-2
Metrics: Accuracy, Macro-F1, Confusion Matrix

Option B — Natural Language Inference (NLI)

Task: classify a pair of sentences as entailment / contradiction / neutral
Dataset: small subset of MNLI or SNLI
Metrics: Accuracy, Macro-F1

Option C — Extractive Information Extraction

Task: extract structured fields (e.g., date, location, person) from short texts
Dataset: create a small dataset yourself (50–150 examples) or use a public IE dataset subset
Metrics: Exact Match for each field, overall EM, and partial-span F1 (optional)

Option D — Summarization

Task: generate a short summary (1–2 sentences)
Dataset: small subset of CNN/DailyMail, XSum, or any short-news dataset
Metrics: ROUGE-1/2/L (and optionally BERTScore)

Pick one option and clearly document which dataset split you used.

Choose 1 open-source LLM (pick one)

Use any instruction-tuned model available on Hugging Face, for example:

Qwen/Qwen2-7B-Instruct
mistralai/Mistral-Nemo-Instruct-2407
google/gemma-2-9b-it (may require HF access token)

Use only one model for the whole experiment.

Required Experiments

You must run the same dataset under four prompting conditions:

Zero-shot (ZSL): no examples in the prompt
1-shot: include 1 labeled example
3-shot: include 3 labeled examples
5-shot: include 5 labeled examples

Important:

Use the same evaluation set for all conditions.
Keep decoding deterministic for fairness (do_sample=False).
Your prompt must force a strict output format (e.g., label only, or JSON only).

Prompt Design Requirements

Your prompt must include:

A short task instruction
The allowed labels / output schema
Output constraint: “Return only …”
For few-shot: examples formatted consistently (Input → Output)

You must explain your prompt design choices in your report.

Implementation Requirements

Load the dataset (train/dev/test or train/test).
Select a small evaluation set (e.g., 100–300 items).
For each of the 4 settings (ZSL/1/3/5-shot):
- Construct prompts
- Generate predictions
- Save outputs

Save predictions to disk in this format:

Classification/NLI: CSV with columns: id, text (or premise+hypothesis), gold, pred, shots
Summarization: JSONL with id, input, gold_summary, pred_summary, shots
Extraction: JSONL with id, input, gold_json, pred_json, shots

Evaluation Requirements

Compute task-appropriate metrics:

Classification/NLI: Accuracy + Macro-F1 + Confusion Matrix
Summarization: ROUGE-1/2/L (at minimum)
Extraction: Exact Match (per-field and overall); optional span-F1