Tarea 7 - Prompting Strategies with Open-Source LLMs
Tarea: Prompting Strategies with Open-Source LLMs
Objective
Design, implement, and evaluate different prompting strategies for an NLP task e.g., Sentiment Analysis, Question Answering, etc., using open-source Large Language Models (LLMs).
You will explore how prompting techniques such as:
-
Direct prompting
-
Chain-of-Thought (CoT)
-
Self-consistency
-
Tree-of-Thought (ToT)
affect model performance.
Task Description
You must:
-
Select a public dataset (small portion)
-
Choose at least one open-source LLM
-
Design multiple prompting strategies
-
Run experiments and evaluate performance
-
Analyze results and explain behavior
Dataset
You may choose any dataset, for example:
-
IMDb (binary sentiment)
-
SST-2 (Stanford Sentiment Treebank)
-
Twitter sentiment datasets
-
Amazon reviews
-
OR create your own small dataset (minimum 5
Models
You can use:
-
Qwen (e.g., Qwen2.5-3B / 7B)
-
Mistral (e.g., Mistral-7B-Instruct)
-
Gemma
-
LLaMA (if accessible)
-
Any HuggingFace model
🧪 Required Experiments
You must implement at least 4 of the following:
1. 🔹 Direct Prompting (Baseline)
Example (could be different for other tasks):
Classify the sentiment:
Text: {text}
Answer: Positive / Negative / Neutral
2. 🔹 Role Prompting
Example:
You are a sentiment analysis expert...
3. 🔹 Zero-shot Chain-of-Thought (CoT)
Example:
Think step by step before answering.
4. 🔹 Few-shot Prompting
Provide 2–5 examples in the prompt.
5. 🔹 Few-shot CoT
Combine:
-
examples
-
reasoning steps
6. 🔹 Self-Consistency
-
Sample multiple outputs (e.g., 5–10)
-
Use majority voting
7. 🔹 Tree-of-Thought (ToT)
-
Generate multiple reasoning paths
-
Select best candidate
⚠️ Simple implementations are fine
⚙️ Evaluation
You must report:
-
Accuracy
-
Macro F1-score
-
Confusion Matrix
- Or any relevant metric
📈 Analysis Questions (IMPORTANT)
Answer these in your report:
-
Which prompting strategy worked best? Why?
-
Did CoT improve performance? When and why?
-
Did self-consistency help? Or hurt?
-
When did prompting fail?
-
How sensitive were results to prompt design?
-
Did model size affect results?
-
Is reasoning actually needed for sentiment analysis?
📄 Deliverables
1. Code
-
Colab / Notebook
-
Well commented
-
Reproducible
2. Report
Include:
-
Dataset description
-
Model(s) used
-
Prompt designs
-
Results table
-
Error analysis
-
Answers to analysis questions
3. Results Table (example)
| Method | Accuracy | Macro F1 |
|---|---|---|
| Direct | 0.72 | 0.70 |
| CoT | 0.74 | 0.72 |
| Few-shot | 0.81 | 0.80 |
| Self-consistency | 0.83 | 0.82 |
Bonus (optional)
-
Compare 2 different LLMs
-
Try multilingual sentiment
-
Try sarcasm / hard examples
-
Perform error clustering