Task 5.3 - Implementation Assignment: Prompting Strategies for an Open-Source LLM
Build a small, reproducible experiment that compares Zero-shot (ZSL), 1-shot, 3-shot, and 5-shot prompting for one NLP task using one open-source instruction-tuned LLM. Save predictions and evaluate them with task-appropriate metrics.
Choose 1 task and dataset (pick one option)
Option A — Sentiment Classification (recommended)
- Task: classify a text as positive / negative / neutral
- Dataset: a small subset (e.g., 100–500 examples) from a public dataset such as TweetEval Sentiment (three classes) or SST-2 (binary: positive / negative only, so drop the neutral label if you use it)
- Metrics: Accuracy, Macro-F1, Confusion Matrix
Option B — Natural Language Inference (NLI)
- Task: classify a pair of sentences as entailment / contradiction / neutral
- Dataset: small subset of MNLI or SNLI
- Metrics: Accuracy, Macro-F1
Option C — Extractive Information Extraction
- Task: extract structured fields (e.g., date, location, person) from short texts
- Dataset: create a small dataset yourself (50–150 examples) or use a public IE dataset subset
- Metrics: Exact Match for each field, overall EM, and partial-span F1 (optional)
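For Option C, per-field and overall Exact Match can be computed along the following lines. This is a minimal sketch, not a reference implementation: the field names, the `normalize` helper (lowercasing and whitespace collapsing), and the sample records are illustrative assumptions, and a missing field is treated as an empty string.

```python
from typing import Dict, List

# Assumed field names, taken from the examples in the task description.
FIELDS = ["date", "location", "person"]

def normalize(value) -> str:
    """Lowercase and collapse whitespace; None or a missing field becomes ''."""
    return " ".join(str(value or "").lower().split())

def exact_match_scores(golds: List[Dict], preds: List[Dict]) -> Dict[str, float]:
    """Return per-field EM plus 'overall' EM (all fields correct for an item)."""
    n = len(golds)
    if n == 0 or n != len(preds):
        raise ValueError("golds and preds must be non-empty and the same length")
    per_field = {f: 0 for f in FIELDS}
    overall = 0
    for gold, pred in zip(golds, preds):
        hits = [normalize(gold.get(f)) == normalize(pred.get(f)) for f in FIELDS]
        for f, hit in zip(FIELDS, hits):
            per_field[f] += hit
        overall += all(hits)
    scores = {f: per_field[f] / n for f in FIELDS}
    scores["overall"] = overall / n
    return scores

# Hypothetical gold and predicted records, for illustration only.
golds = [
    {"date": "2024-05-01", "location": "Lima", "person": "Ana Ruiz"},
    {"date": "2023-11-12", "location": "Quito", "person": None},
]
preds = [
    {"date": "2024-05-01", "location": "lima", "person": "Ana Ruiz"},
    {"date": "2023-11-12", "location": "Cuenca"},
]
scores = exact_match_scores(golds, preds)
```

Whether to normalize case and whitespace before comparing is a design choice worth stating in the report, since it changes the EM numbers.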
Option D — Summarization
- Task: generate a short summary (1–2 sentences)
- Dataset: small subset of CNN/DailyMail, XSum, or any short-news dataset
- Metrics: ROUGE-1/2/L (and optionally BERTScore)
Pick one option and clearly document which dataset split you used.
Choose 1 open-source LLM (pick one)
Use any instruction-tuned model available on Hugging Face, for example:
- Qwen/Qwen2-7B-Instruct
- mistralai/Mistral-Nemo-Instruct-2407
- google/gemma-2-9b-it (may require HF access token)
Use only one model for the whole experiment.
Required Experiments
You must run the same dataset under four prompting conditions:
- Zero-shot (ZSL): no examples in the prompt
- 1-shot: include 1 labeled example
- 3-shot: include 3 labeled examples
- 5-shot: include 5 labeled examples
Important:
- Use the same evaluation set for all conditions.
- Keep decoding deterministic for fairness (do_sample=False).
- Your prompt must force a strict output format (e.g., label only, or JSON only).
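Even with a strict output constraint, the model may add punctuation, casing, or extra words, so predictions usually need a small parsing step before scoring. The sketch below assumes the Option A label set and a deliberately strict rule (only the first word of the output counts). `parse_label` and `GENERATION_KWARGS` are hypothetical names, and the `max_new_tokens` value is an assumed budget for label-only outputs.

```python
import re

# Assumed label set (Option A); replace with the labels of your chosen task.
ALLOWED = ("positive", "negative", "neutral")

# Greedy, deterministic decoding settings. With Hugging Face transformers these
# would typically be passed as model.generate(**inputs, **GENERATION_KWARGS).
GENERATION_KWARGS = {"do_sample": False, "max_new_tokens": 8}

def parse_label(raw: str) -> str:
    """Return the first word of the output if it is an allowed label, else 'invalid'."""
    cleaned = re.sub(r"[^a-z]+", " ", raw.lower()).strip()
    first = cleaned.split(" ")[0] if cleaned else ""
    return first if first in ALLOWED else "invalid"
```

Counting unparseable outputs as `invalid` (and therefore as errors) keeps the comparison across the four conditions honest; reporting the invalid rate per condition is also informative.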
Prompt Design Requirements
Your prompt must include:
- A short task instruction
- The allowed labels / output schema
- Output constraint: “Return only …”
- For few-shot: examples formatted consistently (Input → Output)
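The four required elements can be assembled by a single function, so that the zero-shot and few-shot prompts differ only in the number of examples. The following is one possible sketch for Option A; the instruction wording, the `build_prompt` name, and the demonstration pool `DEMOS` are illustrative assumptions, not a prescribed template.

```python
from typing import List, Tuple

# Assumed label set (Option A).
LABELS = ["positive", "negative", "neutral"]

def build_prompt(text: str, examples: List[Tuple[str, str]]) -> str:
    """Combine instruction, allowed labels, output constraint, k examples, and the query."""
    lines = [
        "Classify the sentiment of the input text.",
        "Allowed labels: " + ", ".join(LABELS) + ".",
        "Return only the label, with no explanation.",
        "",
    ]
    # Every example uses the same Input/Output layout as the final query.
    for ex_text, ex_label in examples:
        lines += [f"Input: {ex_text}", f"Output: {ex_label}", ""]
    lines += [f"Input: {text}", "Output:"]
    return "\n".join(lines)

# Hypothetical demonstration pool; in practice, draw these from the training
# split so they never overlap with the evaluation set.
DEMOS = [
    ("I loved this movie.", "positive"),
    ("The service was terrible.", "negative"),
    ("The package arrived on Tuesday.", "neutral"),
    ("Best purchase I have made this year.", "positive"),
    ("I will never order from them again.", "negative"),
]

# One prompt per condition: 0, 1, 3, and 5 shots for the same input.
prompts = {k: build_prompt("The battery lasts about a day.", DEMOS[:k]) for k in (0, 1, 3, 5)}
```

Taking the first k items of a fixed pool means the 1-shot examples are a subset of the 3-shot examples, and so on, which keeps the conditions comparable.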
You must explain your prompt design choices in your report.
Implementation Requirements
- Load the dataset (train/dev/test or train/test).
- Select a small evaluation set (e.g., 100–300 items).
- For each of the 4 settings (ZSL/1/3/5-shot):
  - Construct prompts
  - Generate predictions
  - Save outputs
- Save predictions to disk in this format:
  - Classification/NLI: CSV with columns: id, text (or premise+hypothesis), gold, pred, shots
  - Summarization: JSONL with id, input, gold_summary, pred_summary, shots
  - Extraction: JSONL with id, input, gold_json, pred_json, shots
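The output formats above can be written with the Python standard library alone. This is a minimal sketch: the helper names, the file names, and the sample rows are assumptions for illustration, and the temporary directory stands in for whatever output folder you choose.

```python
import csv
import json
import os
import tempfile

def save_csv(path, rows):
    """Write classification/NLI predictions with the required column order."""
    fields = ["id", "text", "gold", "pred", "shots"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)

def save_jsonl(path, rows):
    """Write one JSON object per line (summarization or extraction outputs)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Hypothetical rows, for illustration only.
out_dir = tempfile.mkdtemp()
csv_rows = [{"id": 0, "text": "Great phone, terrible charger.",
             "gold": "neutral", "pred": "negative", "shots": 3}]
jsonl_rows = [{"id": 0, "input": "Short news text.", "gold_summary": "Gold.",
               "pred_summary": "Predicted.", "shots": 3}]

csv_path = os.path.join(out_dir, "preds_3shot.csv")
jsonl_path = os.path.join(out_dir, "preds_3shot.jsonl")
save_csv(csv_path, csv_rows)
save_jsonl(jsonl_path, jsonl_rows)
```

Writing one file per condition, or a single file with the `shots` column distinguishing the conditions, both satisfy the format; the `shots` column makes either option easy to evaluate later.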
Evaluation Requirements
Compute task-appropriate metrics:
- Classification/NLI: Accuracy + Macro-F1 + Confusion Matrix
- Summarization: ROUGE-1/2/L (at minimum)
- Extraction: Exact Match (per-field and overall); optional span-F1
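To make the classification metrics concrete, the sketch below computes accuracy, macro-F1, and a confusion matrix in plain Python. The function name `classification_metrics` and the sample labels are assumptions for illustration. In practice, scikit-learn's `accuracy_score`, `f1_score(average="macro")`, and `confusion_matrix` serve the same purpose.

```python
from collections import Counter
from typing import Dict, List

def classification_metrics(gold: List[str], pred: List[str], labels: List[str]) -> Dict:
    """Accuracy, macro-F1 over `labels`, and a confusion matrix (rows = gold, cols = pred)."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    pair_counts = Counter(zip(gold, pred))
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    # Predictions outside `labels` (e.g. unparseable outputs) get no column here,
    # but they still count as errors in accuracy and as false negatives in F1.
    matrix = [[pair_counts[(g, p)] for p in labels] for g in labels]
    f1_scores = []
    for label in labels:
        tp = pair_counts[(label, label)]
        fp = pred_counts[label] - tp
        fn = gold_counts[label] - tp
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return {
        "accuracy": accuracy,
        "macro_f1": sum(f1_scores) / len(f1_scores),
        "confusion_matrix": matrix,
    }

# Hypothetical gold labels and predictions, for illustration only.
labels = ["positive", "negative", "neutral"]
gold = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
pred = ["positive", "negative", "positive", "positive", "neutral", "neutral"]
report = classification_metrics(gold, pred, labels)
```

Running the same function once per `shots` value on the saved predictions yields the four-way comparison the assignment asks for.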