PLN-AVZ-LCD: Tarea 7 - Prompting Strategies with Open-Source LLMs

Abrió: miércoles, 26 de marzo de 2026, 00:00

Cierre: martes, 14 de abril de 2026, 12:00

Tarea: Prompting Strategies with Open-Source LLMs

Objective

Design, implement, and evaluate different prompting strategies for an NLP task e.g., Sentiment Analysis, Question Answering, etc., using open-source Large Language Models (LLMs).

You will explore how prompting techniques such as:

Direct prompting
Chain-of-Thought (CoT)
Self-consistency
Tree-of-Thought (ToT)

affect model performance.

Task Description

You must:

Select a public dataset (small portion)
Choose at least one open-source LLM
Design multiple prompting strategies
Run experiments and evaluate performance
Analyze results and explain behavior

Dataset

You may choose any dataset, for example:

IMDb (binary sentiment)
SST-2 (Stanford Sentiment Treebank)
Twitter sentiment datasets
Amazon reviews
OR create your own small dataset (minimum 5

Models

You can use:

Qwen (e.g., Qwen2.5-3B / 7B)
Mistral (e.g., Mistral-7B-Instruct)
Gemma
LLaMA (if accessible)
Any HuggingFace model

🧪 Required Experiments

You must implement at least 4 of the following:

1. 🔹 Direct Prompting (Baseline)

Example (could be different for other tasks):

Classify the sentiment:
Text: {text}
Answer: Positive / Negative / Neutral

2. 🔹 Role Prompting

Example:

You are a sentiment analysis expert...

3. 🔹 Zero-shot Chain-of-Thought (CoT)

Example:

Think step by step before answering.

4. 🔹 Few-shot Prompting

Provide 2–5 examples in the prompt.

5. 🔹 Few-shot CoT

Combine:

examples
reasoning steps

6. 🔹 Self-Consistency

Sample multiple outputs (e.g., 5–10)
Use majority voting

7. 🔹 Tree-of-Thought (ToT)

Generate multiple reasoning paths
Select best candidate

⚠️ Simple implementations are fine

⚙️ Evaluation

You must report:

Accuracy
Macro F1-score
Confusion Matrix
Or any relevant metric

📈 Analysis Questions (IMPORTANT)

Answer these in your report:

Which prompting strategy worked best? Why?
Did CoT improve performance? When and why?
Did self-consistency help? Or hurt?
When did prompting fail?
How sensitive were results to prompt design?
Did model size affect results?
Is reasoning actually needed for sentiment analysis?

📄 Deliverables

1. Code

Colab / Notebook
Well commented
Reproducible

2. Report

Include:

Dataset description
Model(s) used
Prompt designs
Results table
Error analysis
Answers to analysis questions

3. Results Table (example)

Method	Accuracy	Macro F1
Direct	0.72	0.70
CoT	0.74	0.72
Few-shot	0.81	0.80
Self-consistency	0.83	0.82

Bonus (optional)

Compare 2 different LLMs
Try multilingual sentiment
Try sarcasm / hard examples
Perform error clustering