Master iterative prompt refinement to systematically improve prompts through testing, evaluation, and targeted modifications.
Iterative prompt refinement is the systematic practice of improving prompts through repeated cycles of testing, evaluation, and modification. Rather than treating prompt creation as a one-time activity, this skill recognizes that the first version of a prompt is rarely optimal and that deliberate iteration is necessary to discover formulations that consistently produce desired outputs.
The process begins with an initial prompt based on requirements and best practices. This prompt is tested against representative inputs. The outputs are evaluated against objective criteria to identify specific deficiencies: missing constraints, ambiguous instructions, insufficient examples, or structural issues. Each refinement cycle targets identified problems with precise modifications. The new prompt is tested again, and the cycle continues until performance stabilizes at an acceptable level.
Iterative refinement differs from casual tweaking in its systematic approach. Changes are hypothesis-driven rather than random. Each iteration addresses specific identified issues. Results are documented to track which modifications produce measurable improvements. The goal is not just to fix a single output but to discover prompt patterns that generalize across varied inputs.
Without iterative refinement, prompt engineering remains guesswork. A prompt is constructed based on intuition, tested on a few examples, and deployed if the results seem acceptable. This approach produces fragile prompts that work for some cases but fail unpredictably on others. The underlying issues remain unaddressed: unclear constraints, misinterpreted priorities, or missing context.
The problem compounds with task complexity. Simple tasks may tolerate imperfect prompts. Complex tasks involving multiple requirements, nuanced constraints, or specialized domains require precise instruction. Initial attempts at these prompts rarely capture all necessary dimensions. Without refinement, these gaps manifest as inconsistent outputs, edge case failures, or outputs that technically satisfy the prompt but violate intent.
More fundamentally, iterative refinement is necessary to discover what actually works. Prompt behavior is often counterintuitive. Rephrasing a constraint may have no visible effect, while adding a specific example dramatically changes outputs. Minor structural adjustments—changing the order of instructions, consolidating scattered requirements, adjusting the level of detail—can produce disproportionate improvements. These discoveries only emerge through systematic testing.
The absence of refinement also prevents learning. When a prompt succeeds or fails, the reason is often unclear without controlled variation. Did the prompt work because of the specific wording, the examples provided, or the context included? Without isolating variables through iteration, it is impossible to identify which elements contribute to success. This means each new prompt starts from scratch rather than building on accumulated insight.
Hypothesis-driven iteration means making specific predictions about what change will improve outcomes and testing those predictions. Instead of making arbitrary changes, identify a specific problem (e.g., "the model is ignoring negative constraints") and propose a targeted solution (e.g., "move negative constraints to the end of the prompt").
This approach separates effective modifications from noise. When each iteration tests a clear hypothesis, results provide actionable data about what works. When changes are arbitrary, positive results are coincidental and difficult to replicate.
Testing requires representative inputs that span the expected variation of the task. If the task involves handling multiple document types, testing only on one type provides false confidence. If the task includes edge cases (empty inputs, malformed data, unusual values), testing only on typical cases misses critical failure modes.
Representative sampling uncovers prompt weaknesses across the input space. It reveals whether a prompt that works for standard cases fails on edge cases, or whether a formulation optimized for one input type degrades performance on others.
Each iteration should be compared against a baseline to verify that changes represent genuine improvement rather than random variation. The baseline may be the previous prompt version, a simple heuristic, or a manual process. Without baseline comparison, it is impossible to know whether a new prompt is actually better or whether the test cases happened to be easier.
Baseline comparison requires consistent evaluation criteria. If the metric changes between iterations (e.g., switching from accuracy to fluency), comparisons become meaningless. Established criteria allow meaningful assessment of whether refinements are moving in the right direction.
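The comparison step can be sketched in a few lines of Python. The exact-match metric and the sample outputs below are illustrative assumptions, not a prescribed scoring method:

```python
# Sketch: compare a candidate prompt version against a baseline using
# ONE metric held constant across iterations. The exact-match metric
# and the sample outputs are illustrative assumptions.

def score(outputs, expected):
    """Fraction of outputs that exactly match the expected results."""
    return sum(o == e for o, e in zip(outputs, expected)) / len(expected)

def improvement_over_baseline(baseline_outputs, candidate_outputs, expected):
    """Candidate score minus baseline score (negative means regression)."""
    return score(candidate_outputs, expected) - score(baseline_outputs, expected)

expected = ["a", "b", "c", "d"]
baseline = ["a", "x", "c", "x"]   # 2/4 correct
candidate = ["a", "b", "c", "x"]  # 3/4 correct
delta = improvement_over_baseline(baseline, candidate, expected)
```

Because the same `score` function is applied to every version, a positive `delta` is attributable to the prompt change rather than to a shifting metric.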
When prompts fail, the response should not be immediate revision but systematic analysis. Identify the pattern of failure: does the prompt fail on specific input types, at certain complexity levels, or when particular constraints conflict? Examine the outputs to understand what the model interpreted versus what was intended.
Failure analysis distinguishes between surface issues and structural problems. A prompt that fails on a specific example may need a clarification. A prompt that fails systematically may require restructuring. Without analysis, iterations address symptoms rather than root causes.
Iterative refinement produces multiple prompt versions. Without tracking, it becomes impossible to remember which version worked best for which scenario, what changes were tried, and what the results were. Version control documents the evolution of the prompt.
Effective version control includes not just the prompt text but also the rationale for changes, test results, and observed behavior. This history enables reverting to previous versions, identifying which modifications contributed to success, and building a library of proven patterns.
Iterative prompt refinement begins with requirements. The first step is to articulate what success looks like: what the output should contain, what constraints it must satisfy, and what quality dimensions matter. These requirements guide both prompt construction and evaluation.
An initial prompt is constructed based on requirements and known best practices. This prompt is not expected to be optimal—it is a starting point for refinement. The prompt should incorporate the essential elements: clear task description, constraints, examples if relevant, and output format.
Testing begins with a small but diverse set of inputs. These inputs should include typical cases and obvious edge cases. The outputs are evaluated against requirements to identify gaps. The evaluation produces a prioritized list of issues: critical failures that prevent the task from working, moderate issues that degrade quality, and minor inconsistencies.
Refinement cycles target issues from highest to lowest priority. Each cycle focuses on a specific problem. The prompt is modified with intent, not randomly. If the issue is missing constraints, constraints are added. If the issue is ambiguous phrasing, language is clarified. If the issue is insufficient examples, examples are expanded.
After each modification, the prompt is retested on the same inputs used previously. This allows direct comparison to assess whether the change produced improvement. If the change helped, it is retained. If the change had no effect or made things worse, the prompt reverts and a different approach is tried.
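The keep-or-revert cycle above can be sketched as a loop over candidate versions. The `evaluate` function and the version scores here are hypothetical placeholders for whatever scoring you use on your fixed test set:

```python
# Sketch: retest every modification on the SAME inputs, keep it only if
# the score improves, and revert otherwise.

def refine(versions, evaluate):
    """Walk candidate prompt versions in order, keeping only changes that help."""
    best_prompt, best_score = versions[0], evaluate(versions[0])
    for candidate in versions[1:]:
        candidate_score = evaluate(candidate)
        if candidate_score > best_score:   # change helped: keep it
            best_prompt, best_score = candidate, candidate_score
        # otherwise: revert to the previous best and try another approach
    return best_prompt, best_score

# Hypothetical scores for three prompt versions on the same test set.
scores = {"v1": 0.60, "v2": 0.55, "v3": 0.80}
best, final_score = refine(["v1", "v2", "v3"], scores.get)
```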
As critical issues are resolved, testing expands to broader input sets. The prompt may work well on the initial test cases but fail on new variations. This reveals that the prompt was overfitted to specific examples rather than capturing general principles. Further refinement addresses these broader patterns.
The process continues until performance stabilizes. When additional iterations produce minimal improvement, or when the prompt meets requirements across representative inputs, refinement concludes. The final prompt is documented with test cases and known limitations.
A common mistake is refining a prompt until it works perfectly on a specific set of examples, then assuming it will generalize. The prompt may be memorizing specific test patterns rather than learning the underlying task. When deployed on new inputs, performance collapses.
The corrective approach is to hold out validation data. Refine on a training set, but regularly test on a separate validation set. If performance improves on training data but degrades on validation data, the prompt is overfitting.
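A minimal sketch of the hold-out idea, assuming each test case can be scored independently; the case IDs, split ratio, and score histories below are illustrative:

```python
# Sketch: hold out validation cases to catch overfitting. If training
# performance rises while validation performance falls, the prompt is
# memorizing its training cases.
import random

def split_cases(cases, holdout_fraction=0.2, seed=0):
    """Shuffle and split test cases into (training, validation) sets."""
    shuffled = cases[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]

def is_overfitting(train_scores, val_scores):
    """True if training improved overall while validation degraded."""
    return train_scores[-1] > train_scores[0] and val_scores[-1] < val_scores[0]

train, val = split_cases(list(range(10)))
overfit = is_overfitting([0.70, 0.90, 0.95], [0.80, 0.75, 0.70])
```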
When a prompt underperforms, the instinct may be to change everything—rephrase instructions, add examples, reorder sections, and adjust constraints all at once. The problem is that when multiple changes are made simultaneously, it becomes impossible to know which change produced which effect.
Effective refinement changes one variable at a time. If multiple issues exist, prioritize and address them sequentially. Isolated changes allow clear attribution of cause and effect.
Testing a prompt on two or three examples provides false confidence. A prompt may work well on simple cases but fail on complex ones. It may handle standard inputs but break on edge cases. Insufficient testing misses these failure modes.
Robust testing requires systematic coverage across the input space. Test cases should vary in complexity, include boundary conditions, and span the different categories of inputs the task will encounter.
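One way to make that coverage systematic is to enumerate the input space rather than sample it by hand. The categories, complexity levels, and edge cases below are invented placeholders for a real task's input space:

```python
# Sketch: cover the input space systematically instead of testing a few
# convenient cases. All category names here are illustrative assumptions.
from itertools import product

categories = ["invoice", "email", "report"]
complexities = ["simple", "typical", "complex"]
edge_cases = ["empty input", "malformed data", "unusual values"]

# One case per (category, complexity) pair, plus every edge case,
# lands inside a manageable 10-20 input range.
test_set = [f"{cat}/{level}" for cat, level in product(categories, complexities)]
test_set += edge_cases
```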
Refinement can continue indefinitely. There is always some minor improvement possible, some additional example to add, some phrasing to tweak. The question is whether the marginal benefit of additional refinement justifies the time cost.
The signal that refinement is reaching diminishing returns is when iterations produce minimal measurable improvement. When an hour of work yields a 1% performance gain, it may be time to accept the current prompt and move to deployment.
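The diminishing-returns signal can be made concrete with a simple stopping rule. The window of 3 iterations and the 5% threshold below are illustrative defaults, not fixed recommendations:

```python
# Sketch: stop refining when the total gain over the last few iterations
# falls below a threshold.

def should_stop(score_history, window=3, min_gain=0.05):
    """True once the gain over the last `window` iterations is below min_gain."""
    if len(score_history) <= window:
        return False
    return score_history[-1] - score_history[-1 - window] < min_gain

history = [0.60, 0.75, 0.85, 0.87, 0.88, 0.89]  # gains are flattening
```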
When outputs are incorrect or unsatisfactory, the reflex may be to immediately modify the prompt without understanding why the model produced that specific output. This treats symptoms without addressing root causes.
Effective refinement begins with analysis. Examine incorrect outputs to understand what the model interpreted. Was a constraint genuinely ambiguous? Did an example inadvertently suggest the wrong pattern? Was critical information buried? Analysis informs targeted modifications rather than random changes.
Ideal Scenarios: complex tasks with multiple requirements, nuanced constraints, or specialized domains, where success can be defined objectively.
Not Ideal For: tasks with ambiguous or conflicting requirements, or tasks that demand capabilities the system does not possess.
Decision Criteria: weigh the cost of iteration against the cost of fragile prompts that fail unpredictably in production.
Use iterative refinement when outputs must be consistent across varied inputs and the first prompt version underperforms.
Context: You need AI to extract structured data from unstructured text as JSON.
Challenge: First prompt produces inconsistent JSON structures—sometimes missing fields, sometimes using wrong data types.
Solution: Systematically test and refine prompt to ensure consistent structure.
Example Prompt:
Version 1 (Initial):
Extract name, price, category from product descriptions as JSON
[Results: inconsistent structure]
Version 2 (Added schema):
Extract as JSON with this exact schema:
{"name": "string", "price": "number", "category": "string"}
[Results: better but still missing fields]
Version 3 (Added negative constraints):
Extract as JSON with schema {"name": "string", "price": "number", "category": "string"}
- Include all three fields always
- Never add extra fields
- Price must be a number (no currency symbols)
[Results: 95% consistency]
Result: Final prompt produces reliable JSON structure across varied inputs.
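The schema from this case study can be checked mechanically rather than by eye, so "95% consistency" is measured instead of estimated. This sketch assumes the model's output arrives as a raw JSON string, with field types mirroring the Version 3 schema:

```python
# Sketch: validate each extraction against the exact case-study schema:
# all three fields present, no extras, price as a number.
import json

SCHEMA = {"name": str, "price": (int, float), "category": str}

def conforms(raw):
    """True if `raw` parses to JSON with exactly the schema's fields and types."""
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False
    if not isinstance(obj, dict) or set(obj) != set(SCHEMA):
        return False                        # missing or extra fields
    return all(isinstance(obj[k], t) for k, t in SCHEMA.items())

good = '{"name": "Widget", "price": 9.99, "category": "tools"}'
bad_type = '{"name": "Widget", "price": "$9.99", "category": "tools"}'
missing = '{"name": "Widget", "category": "tools"}'
```

Running `conforms` over every test output yields the consistency percentage directly.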
Context: Generating customer support responses that need consistent professional tone.
Challenge: Outputs vary between casual and formal, with some responses sounding abrupt.
Solution: Iterate with explicit tone examples and constraints.
Example Prompt:
Version 1:
Write a professional response to customer inquiries
[Results: inconsistent tone]
Version 2:
Write responses that are:
- Empathetic but professional
- Solution-focused
- Under 150 words
- No jargon
Example: "I understand your frustration with [issue]. Here's how we can resolve it..."
[Results: More consistent, but still some variability]
Version 3 (Refined):
Write responses following this pattern:
1. Acknowledge concern (empathy)
2. Explain solution clearly
3. Provide next steps
Constraints:
- Use "we" language
- Avoid apologies without solutions
- Maximum 150 words
3 examples demonstrating tone:
[Examples provided]
[Results: Consistent professional tone]
Result: Customer responses maintain consistent brand voice.
Context: Generating code that must follow specific architectural patterns.
Challenge: Generated code works but doesn't follow team's conventions.
Solution: Iterate with explicit style constraints and examples.
Example Prompt:
Version 1:
Write a function to [task]
[Results: Works but wrong style]
Version 2:
Write a function following these conventions:
- Use TypeScript strict types
- Error handling with try/catch
- JSDoc comments
- Max function length: 50 lines
[Results: Better but inconsistent]
Version 3:
Write [function description]
Requirements:
- TypeScript with strict types (no 'any')
- Error handling: try/catch with typed errors
- JSDoc with @param, @returns, @throws
- Single responsibility (max 50 lines)
- Naming: camelCase for variables, PascalCase for types
Example of desired style:
[Code example]
[Results: Consistent with team conventions]
Result: Generated code requires minimal refactoring.
Before creating your first prompt, establish clear, measurable success criteria.
What makes an output correct? Define what the output should contain, which constraints it must satisfy, and which quality dimensions matter.
Build your first prompt incorporating requirements and best practices.
Include the essential elements: a clear task description, constraints, examples if relevant, and the output format.
Don't expect perfection—this is a starting point for refinement.
Create a diverse test set that includes typical cases, obvious edge cases, and the different input categories the task will encounter.
Aim for 10-20 test inputs to uncover prompt weaknesses.
For each test output, assess against your success criteria.
Track the issues you find, and document results in a structured format:
| Test Input | Output Quality | Issues | Severity |
|---|---|---|---|
| Input 1 | 70% | Missing field X | High |
| Input 2 | 85% | Length exceeded | Medium |
| Input 3 | 95% | Minor inconsistency | Low |
Rank identified issues by impact and frequency:
Priority 1 (Critical): Issues that cause task failure
Priority 2 (Important): Issues that degrade quality
Priority 3 (Nice-to-have): Minor improvements
Address issues in priority order.
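The prioritization step can be sketched as a sort over logged issues. The severity ranks and the sample records (mirroring the results table earlier in this section) are illustrative:

```python
# Sketch: rank logged issues by severity, then frequency, so refinement
# targets critical problems first.

SEVERITY_RANK = {"High": 1, "Medium": 2, "Low": 3}  # 1 = address first

issues = [
    {"issue": "Minor inconsistency", "severity": "Low", "frequency": 1},
    {"issue": "Missing field X", "severity": "High", "frequency": 4},
    {"issue": "Length exceeded", "severity": "Medium", "frequency": 2},
]

# Critical issues first; within a severity level, the most frequent first.
queue = sorted(issues, key=lambda i: (SEVERITY_RANK[i["severity"]], -i["frequency"]))
```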
For each prioritized issue, develop a hypothesis about what change will help.
Bad hypothesis: "Make the prompt better."
Good hypothesis: "Moving negative constraints to the end of the prompt will reduce violations."
Common hypotheses include moving constraints to the end of the prompt, adding examples with the exact desired structure, and setting explicit length boundaries.
Modify your prompt to test your hypothesis.
Important: Change only one variable per iteration.
If you change multiple things simultaneously, it becomes impossible to attribute improvements or regressions to any single change.
Document each iteration: the hypothesis, the specific change, the test results, and the keep/revert decision.
After each modification, retest on the same inputs used previously.
Assess whether the targeted issue improved and whether any other outputs regressed.
Keep the change if it measurably improves results without causing regressions elsewhere.
Revert the change if it has no effect or makes other outputs worse, and try a different approach.
Once critical issues are resolved, test on broader inputs.
Add new input variations beyond the original test cases.
Watch for performance that holds on the initial examples but degrades on new inputs, a sign of overfitting.
Continue iterations until performance stabilizes or the prompt meets requirements across representative inputs.
Document the final prompt with its test cases and known limitations.
✅ Consistency: Prompt produces similar quality outputs across varied inputs
✅ Coverage: Success rate >90% on representative test set
✅ Constraints: All critical constraints satisfied in >95% of outputs
✅ Generalization: Performance similar on training and validation sets
✅ Robustness: Handles edge cases without catastrophic failures
✅ Efficiency: Generates outputs in acceptable time/without excessive verbosity
✅ Maintainability: Prompt is understandable and modifiable
✅ Documentation: Clear record of iterations, decisions, and test results
🚩 Overfitting: Works perfectly on test set, fails on new inputs
🚩 Diminishing Returns: Hours of work for less than 2% improvement
🚩 Complexity Creep: Prompt becoming too long/complex to maintain
🚩 Fragility: Small changes cause large performance swings
🚩 Narrow Optimization: Optimized for specific inputs, not general task
🚩 Tunnel Vision: Focusing on minor issues while ignoring bigger problems
🚩 Insufficient Testing: Confidence based on fewer than 10 test cases
🚩 Version Chaos: Lost track of what works and what doesn't
# Task Description
[Clear, concise statement of what you want]
# Output Format
[Specific structure or format requirements]
# Requirements
1. [Critical requirement 1]
2. [Critical requirement 2]
3. [Critical requirement 3]
# Constraints
- Must: [Positive constraint]
- Must not: [Negative constraint]
- Maximum/Minimum: [Boundaries]
# Examples
Example 1:
Input: [example]
Output: [desired output]
Example 2:
Input: [example]
Output: [desired output]
# Additional Context
[Relevant background, domain info, or clarifications]
## Iteration N
**Hypothesis**: [What you expect this change to improve]
**Change**: [Specific modification to prompt]
**Test Results**:
- Training set: [XX percent success]
- Validation set: [YY percent success]
- Specific issues addressed: [list]
**Decision**: ✓ Keep / ✗ Revert / ○ Modify further
**Notes**: [Observations, unexpected behaviors, next steps]
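The log template above maps naturally onto a structured record, which keeps hypotheses, results, and decisions queryable across a project. The field names and sample entries are illustrative assumptions:

```python
# Sketch: an iteration log as structured records rather than free text.
from dataclasses import dataclass, field

@dataclass
class Iteration:
    number: int
    hypothesis: str
    change: str
    train_success: float
    val_success: float
    decision: str = "keep"            # "keep" / "revert" / "modify"
    notes: list = field(default_factory=list)

log = [
    Iteration(1, "A schema example will fix structure", "added JSON schema", 0.70, 0.65),
    Iteration(2, "Negative constraints will stop extra fields",
              "added 'never add extra fields'", 0.95, 0.90),
]
kept = [it for it in log if it.decision == "keep"]
```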
| Issue | Try This | Why It Works |
|---|---|---|
| Missing constraints | Move to end, emphasize with "CRITICAL" | Recency effect, emphasis |
| Format inconsistency | Add 3+ examples with exact structure | Pattern learning |
| Length issues | Explicit word/character count | Concrete boundary |
| Ambiguity | Add negative examples ("NOT this") | Clarifies boundaries |
| Omissions | Checklist format with checkboxes | Visual completeness cue |
| Style drift | Tone examples with dos/don'ts | Demonstrates nuance |
Tip 1: Save every prompt version with timestamp in filename: prompt_v1_2025-01-20.md
Tip 2: Use a spreadsheet to track test results across iterations—columns = prompt versions, rows = test cases
Tip 3: Set a stopping rule before starting (e.g., "stop when less than 5% improvement over 3 iterations")
Tip 4: When stuck, try explaining your prompt to a colleague—articulation reveals assumptions
Tip 5: Keep a "pattern library" of refinements that worked for future reuse
Tip 6: Test on adversarial examples—inputs designed to break your prompt
Tip 7: If performance plateaus, try a radically different approach rather than incremental tweaks
Tip 8: Document not just what worked, but what DIDN'T work and why
Q: How do I know when to stop iterating?
A: Stop when additional iterations produce diminishing returns—typically when you see less than 5% improvement over 2-3 consecutive iterations. Most well-defined tasks converge in 5-10 iterations. If you're still seeing major changes after 20+ iterations, either your success criteria aren't clear or the task itself isn't well-defined.
Q: Should I use the same test set for every iteration?
A: Yes, use the same core test set for apples-to-apples comparison, BUT periodically test on new inputs to detect overfitting. An 80/20 split works well: refine on 80% of test cases, validate on the held-out 20%. If performance diverges dramatically between sets, you're overfitting.
Q: Is it a problem if my prompt keeps getting longer?
A: Length isn't inherently bad, but complexity is. If your prompt exceeds 2000 tokens, consider: (1) decomposing into sub-tasks with separate prompts, (2) consolidating redundant instructions, (3) moving reference material to separate documents referenced in the prompt, (4) using abstraction layers (simple prompt → complex prompt).
Q: How do I tell whether a failure is a prompt problem or a model limitation?
A: Test progressively simpler versions of your task. If the model fails even on simplified versions, it's likely a capability limitation. If it succeeds on simple versions but fails as you add complexity, it's a prompt design issue. Also try different models—if all models fail similarly, it's probably the task; if only one fails, it's model-specific.
Q: Do I need separate prompts for different models?
A: Generally, yes. Different models (GPT-4, Claude, etc.) respond differently to prompt structures. What works for one may not work for another. That said, start with a model-agnostic prompt, then add model-specific refinements. Document which version works for which model to avoid confusion.
Iterative prompt refinement builds on evaluation skills. Assessing whether a refinement improved outcomes requires the ability to judge output quality against objective criteria. Without evaluation, iteration cannot distinguish between helpful and harmful changes.
Decomposition supports refinement by allowing prompts to be built and tested component-wise. Instead of refining a complex monolithic prompt, individual elements (constraints, examples, instructions) can be tested in isolation and then integrated.
Context management enables fair testing. Comparing prompt versions requires controlling for context variables. If one prompt version benefits from different context or examples, the comparison is confounded. Context management ensures that differences are due to the prompt itself.
Instruction design informs what to refine. Understanding how to structure instructions, where to place constraints, and how to frame examples provides hypotheses about what improvements might help. Refinement tests these hypotheses systematically.
Documentation makes refinement cumulative. Documenting what works and what doesn't, recording failure patterns and successful formulations, creates a knowledge base that accelerates future prompt development. Each refinement project builds on previous insights.
Iterative prompt refinement cannot compensate for fundamentally unsuitable tasks. If a task requires capabilities that the system does not possess—real-time data access, precise calculation, or domain expertise beyond training—no amount of prompt refinement will bridge that gap.
Refinement cannot overcome ambiguous requirements. If the success criteria are unclear or conflicting, prompt optimization cannot identify what to optimize for. Requirements must be clarified before refinement can proceed meaningfully.
Prompt refinement has limits in the face of inherent randomness. When outputs vary significantly even with identical prompts, refinement can shift the distribution but cannot eliminate variability. At some point, the ceiling is set by system consistency rather than prompt quality.
Refinement cannot substitute for proper evaluation methodology. If the testing approach is flawed—biased samples, inconsistent criteria, or insufficient coverage—refinement will optimize for the wrong objectives. The refinement process is only as good as the evaluation framework.
Iterative refinement reaches diminishing returns. After major issues are resolved, additional iterations produce incremental gains. The time spent on marginal improvements may be better spent on other aspects of the system—data quality, post-processing, or user experience.
Note: This skill is not yet in the main relationship map. Relationships will be defined as the skill library evolves.
Iteration: Iterative prompt refinement is the application of iteration skills specifically to prompt optimization.
Evaluation: Refinement requires evaluating each prompt version to identify what needs improvement.
Output Validation: Validation results inform which aspects of prompts need refinement.