What you need to master Prompt Engineering
High-level principles and reusable techniques to get the best out of a large language model with clever prompting.
Prompt engineering is a nascent discipline of designing and optimizing prompts for different tasks, overcoming or navigating the many limitations that LLMs currently have. While some believe that prompt engineering will become less and less relevant the more advanced (and “smart”) LLMs get, I still think some basic principles and techniques will remain useful, simply because no matter how intelligent an agent is, there are better and worse ways to give instructions.
Thousands of prompt engineering guides and cheat sheets have been published online, many of which do not generalize or simply become irrelevant as new LLMs are released. However, most prompt hacks you’ll find online rest on a handful of fundamental principles. If you master these principles, you can design an optimized prompt for any task with some focused experimentation.
For this reason, this article will not focus too deeply on specific prompting templates for concrete tasks. Instead, I will tell you the most important high-level principles and general techniques for all tasks and domains.
This article is part of my work-in-progress book How to Train your Chatbot, which currently contains 140+ pages of theoretical and practical advice as well as 4 fully working demo applications using LLMs for all sorts of cool stuff. Get it today at half the usual price.
Principles for Effective Prompt Engineering
Prompt engineering aims to design an effective prompt for a given task. This is necessary because LLMs have inherent limitations and design caveats that make them brittle and prone to fail on an otherwise solvable task if given the wrong instructions. And wrong doesn’t necessarily mean wrong in any objective sense; it just means not being adjusted to the limitations of the LLM you are using.
Thus, to develop principles for practical prompt engineering, it will be worthwhile to briefly revisit some of the main limitations of LLMs we have discussed previously. Please go back and read our previous issues for a complete picture.
The principles below stem from the basic structure of statistical language modeling, next-token prediction: the understanding that an LLM is ultimately an autocompletion machine on steroids, built on word-context correlations learned from data.
Context matters
Since every new token generated is conditioned on the previously generated tokens, the response you get for a given prompt will be heavily influenced by the content of the prompt, down to the exact phrasing you use. Though LLMs are reasonably good at capturing the underlying semantics of text, the actual words, tone, style, and even the exact order in which you construct the prompt will influence the quality of the response.
Current research suggests, for example, that LLMs tend to focus more on the beginning and final parts of the prompt and less on the middle, although this may change rapidly as novel architectures are invented. But, regardless of the idiosyncrasies of specific models, the critical insight here is that whatever you put in the prompt will heavily influence the response, so everything important should be explicitly mentioned, and everything irrelevant should be left out.
Focus matters
For the same reason, it is notoriously hard for LLMs to perform many tasks simultaneously. The more focused the instructions are, the better—and more robust—output you’ll get. LLMs are weak reasoners and struggle with complicated conditional instructions. Some bigger models may be able to deal with a larger degree of flexibility. Still, you should generally avoid writing prompts with conditions and make them as straightforward as possible.
For example, you can try prompting, “if the user says X then reply Y; otherwise reply Z”, hoping the LLM will correctly classify the input and choose the correct response in the same API call. This might work, depending on how good the LLM is and how easy the task is. However, it may also fail, because you are asking the LLM to solve two problems at once: classifying the input and generating the response.
Instead, you can solve the same problem more robustly with two sequential calls: first, ask the LLM to classify the user query; then, in your own code, pick the prompt that corresponds to that category and send it in a second call. This way, the LLM never has to juggle both decisions at once.
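To make this concrete, here is roughly what the two-step flow could look like in code. It is a minimal sketch, not a production pattern: it assumes the OpenAI Python SDK, a placeholder model name ("gpt-4o-mini"), and a made-up pair of categories (REFUND / OTHER). Swap in whichever client, model, and categories fit your application.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    """Thin wrapper around a single chat-completion call."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: use whatever model you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Step 1: a focused classification call (one task only).
CLASSIFY = (
    "Classify the following user message as exactly one word, "
    "either REFUND or OTHER.\n\nMessage: {message}"
)

# Step 2: a dedicated prompt per category, chosen in plain Python.
PROMPTS = {
    "REFUND": "Write a polite reply explaining our refund policy to this customer: {message}",
    "OTHER": "Write a brief, helpful reply to this customer message: {message}",
}

def reply(message: str) -> str:
    category = ask(CLASSIFY.format(message=message)).strip().upper()
    template = PROMPTS.get(category, PROMPTS["OTHER"])  # fall back if the label is unexpected
    return ask(template.format(message=message))

print(reply("I want my money back for the broken headphones."))
```

Note that the choice between the two templates happens in ordinary Python, not inside the prompt, which is exactly what keeps each LLM call focused on a single task.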
Reasoning requires verbosity
As we saw in a previous article, LLMs perform a fixed amount of computation per token, including input and generated tokens. Thus, intuitively, a larger, more detailed prompt will produce a better response, especially when some complicated reasoning is involved.
But crucially, this also applies to the output generated by the LLM. The more it talks, the more computation is performed in total. For this reason, asking for very terse output is often not optimal. Instead, your prompt should encourage the LLM to be verbose: explain its reasoning, summarize its key points before reaching a conclusion, and so on.
But more is not always better
However, stuffing the prompt with redundant information or instructions is a bad idea. The information density of the context matters more than its raw length. You should strive for a minimum valuable prompt: the shortest prompt containing the information necessary to produce a successful response. To achieve this, make your instructions more intentional, use more precise wording, avoid vague terms, and provide informative examples where necessary.
But crucially, do not add irrelevant instructions, as LLMs are lousy at ignoring things. A typical issue is adding negative examples to correct some behavior and discovering the LLM doubles down on it. The reason is simple: Everything in the context will influence what the LLM generates, so negative examples will still bias the sampling process toward similar content. Instead of showing negative examples to the LLM, reframe the instructions and the examples to cover the most important positive cases.
Experimentation has the last word
All of the above being said, prompt engineering is still mostly artisanal and far from an established science. For this reason, no amount of theory can replace good old experimentation. You should try different orderings of the instructions, output formats, and writing styles to see which gives you better results.
While you should write prompts that are generally agnostic to the specific LLM you’re using, keep in mind that the optimally tuned prompt for one model, say GPT-4, might not be the universally best prompt. Different LLMs, trained on different datasets and tuned with different strategies, may have subtle differences that make one perform better with terse instructions while another prefers verbosity. This can come down to individual word choices: swapping a single word for an appropriate synonym may improve results significantly.
These principles are high-level insights that should inform how you approach prompt engineering. But keep in mind everything we know about large language models is changing very rapidly, and many of their current limitations could be fixed or at least reduced considerably with newer models, making some of these principles less relevant in the near future.
Prompt engineering techniques
In the following sections, we will explore specific prompt engineering techniques or patterns that are general enough to be valuable in many contexts and domains. These patterns and techniques are informed by the principles mentioned above, and we will discuss why they seem to work within this framework.
Zero-shot instructions
Zero-shot means providing a model with a single instruction and asking it to solve a problem without additional training data or examples. This should be the baseline for any new application and is useful for complex or novel tasks without existing data to draw from.
Zero-shot learning works by leveraging the model’s ability to generalize from a single instruction. Given a specific, well-phrased prompt, the model can draw on its internal knowledge to generate a solution without any additional training data, as long as it has seen similar tasks during training. A minimal code sketch follows the examples below.
Some examples of zero-shot learning are:
Generating product descriptions for new or unique products.
Translating text between languages without parallel data.
Summarizing long documents or articles into concise overviews.
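To make the last example concrete, here is a minimal zero-shot sketch for summarization. It assumes the OpenAI Python SDK, a placeholder model name, and placeholder article text; the only thing carrying the task is the instruction itself, with no examples attached.

```python
from openai import OpenAI

client = OpenAI()

# Placeholder: in practice this would be the full text of the document to summarize.
article = (
    "Your long article text goes here. "
    "For the sketch, imagine several paragraphs of content."
)

# A single, self-contained instruction: no examples, no demonstrations.
prompt = (
    "Summarize the following article in three sentences aimed at a general reader.\n\n"
    f"{article}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```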
Few-shot learning
Few-shot learning involves adding a few (e.g., five) examples to the model’s input to help it generalize better. This technique is particularly useful for rare or ambiguous queries, as it allows the model to learn from a small number of examples.
Few-shot learning works by showing the model examples of similar queries and their corresponding answers, which it can then generalize to new, unseen queries. The examples also reinforce the desired output format, improving fidelity. See the sketch after the list below.
Some examples of few-shot learning are:
Solving tasks that aren’t easy to explain concisely.
Reinforcing an output format or response style.
Generating recommendations of products, activities, etc., based on examples.
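Here is one way a few-shot prompt might look for a simple rewriting task. The sketch assumes the OpenAI Python SDK, a placeholder model name, and invented product notes; notice how the examples establish both the task and the exact output format the model should complete.

```python
from openai import OpenAI

client = OpenAI()

# A handful of worked examples both teach the task and pin down the output format.
prompt = """Rewrite each product note as a one-line marketing tagline.

Note: Stainless steel bottle, keeps drinks cold 24 hours.
Tagline: Ice-cold hydration, all day long.

Note: Wireless earbuds with 30-hour battery life.
Tagline: Your soundtrack, untethered for 30 hours.

Note: Ergonomic office chair with lumbar support.
Tagline:"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```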
Role Playing
Role-playing involves informing the model about the audience, tone, role, and other context-specific details to bias it towards a specific level of complexity, length, or style. This technique is useful for generating responses tailored to a specific audience or context.
Role-playing gives the model context-specific information that helps it generate more relevant and engaging responses for the target audience. By understanding who it is writing for and in what context, the model can tailor its responses to meet that audience’s needs and expectations. A short sketch follows the examples below.
Examples of role-playing include:
Writing dialogue for characters with distinct personalities and speaking styles.
Generating social media posts tailored to different demographics and platforms.
Drafting emails with appropriate tone and formality for different recipients.
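A role-playing prompt is often just a carefully written system message. The sketch below assumes the OpenAI Python SDK and a placeholder model name; the role, audience, and tone are all invented for the example.

```python
from openai import OpenAI

client = OpenAI()

# The system message sets the role, audience, and tone; the user message carries the task.
messages = [
    {
        "role": "system",
        "content": (
            "You are a patient science teacher writing for curious ten-year-olds. "
            "Use short sentences, everyday analogies, and a friendly tone."
        ),
    },
    {"role": "user", "content": "Explain why the sky is blue."},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=messages,
)
print(response.choices[0].message.content)
```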
Chain of Thought
Chain of thought involves asking the model to output a detailed reasoning process before providing the final answer. This technique is useful for complex queries that require multi-step reasoning or problem-solving.
Chain of thought works by forcing the model to explicitly lay out its reasoning process. This makes the final answer more likely to rest on sound reasoning and logic, and therefore more accurate and trustworthy. However, there is no formal guarantee the model won’t still hallucinate very plausible but completely mistaken reasoning steps. A code sketch follows the examples below.
Examples where chain of thought is useful include:
Solving logic puzzles and brain teasers by breaking down the steps.
Providing step-by-step instructions for complex procedures or recipes.
Analyzing data to draw insights and conclusions.
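In practice, chain of thought can be as simple as appending an explicit request for step-by-step reasoning before the final answer. The following sketch assumes the OpenAI Python SDK and a placeholder model name, and uses a classic puzzle as the query.

```python
from openai import OpenAI

client = OpenAI()

puzzle = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost?"
)

# Explicitly request intermediate reasoning before the final answer.
prompt = (
    f"{puzzle}\n\n"
    "Think through the problem step by step, showing each step of your reasoning. "
    "Then state the final answer on a new line starting with 'Answer:'."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```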
Structured Output
Structured output involves instructing the model to produce the output in a structured format, such as JSON. This technique is useful for applications that require structured data, such as database queries or data analysis.
Structured output simplifies response parsing and allows for easier integration with downstream applications. When instructed to follow a fixed schema, the model produces responses that your code can consume and act on directly. A small sketch follows the examples below.
Examples where structured output is useful:
Generating tabular data like schedules, calendars, or price lists.
Producing API responses in a standardized JSON format.
Extracting structured information like addresses, dates, or product details from text.
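A minimal way to get structured output is simply to specify the schema in the prompt and parse the reply. The sketch assumes the OpenAI Python SDK, a placeholder model name, and an invented reservation example.

```python
import json

from openai import OpenAI

client = OpenAI()

text = "Reservation for Ana Pérez, party of 4, on March 12 at 7:30 pm."

# Ask for a fixed JSON schema so the reply can be parsed directly.
prompt = (
    "Extract the reservation details from the text below. Respond with only a JSON "
    'object with the keys "name", "party_size", "date", and "time".\n\n'
    f"Text: {text}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)

reservation = json.loads(response.choices[0].message.content)
print(reservation["name"], reservation["party_size"])
```

In practice the model may occasionally wrap the JSON in extra text or code fences, so you may want to strip those before parsing, or use a provider’s dedicated JSON mode where one is available.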
Self-Reflection
Self-reflection involves asking the model to evaluate its own response and determine if, given the new context, it would change it. This technique is useful for identifying and correcting errors or inconsistencies in the model’s original output.
Self-reflection allows the model to assess its own responses and identify potential errors or inconsistencies. By reflecting on its own output, the model can refine its answers and improve their accuracy and fidelity. This aligns with the verbosity principle: it gives the model more chances to “change its mind” by performing more computation. See the sketch after the examples below.
Examples of using self-reflection include:
Identifying biased or unethical statements in the model’s own outputs.
Detecting logical inconsistencies or contradictions in the generated text.
Refining responses based on feedback or additional context provided.
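Self-reflection is typically implemented as a second pass: feed the model its own draft and ask it to critique and revise it. The sketch below assumes the OpenAI Python SDK, a placeholder model name, and an invented question.

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

question = "List three health benefits of intermittent fasting, with a brief justification for each."

# Pass 1: produce a first draft.
draft = ask(question)

# Pass 2: feed the draft back and ask the model to critique and revise it.
revised = ask(
    f"Question: {question}\n\n"
    f"Draft answer:\n{draft}\n\n"
    "Review the draft for factual errors, unsupported claims, or inconsistencies. "
    "Then write an improved final answer."
)
print(revised)
```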
Ensembling
Ensembling involves combining the output of several models and asking one final model to produce a consensus answer. This technique improves the overall accuracy and fidelity of the response.
Ensembling works by leveraging the strengths of multiple models to generate a more accurate and reliable response. Combining several outputs reduces the impact of individual errors and improves the overall quality of the answer, at the cost of extra computation and extra calls to reach it. A sketch follows the examples below.
Examples where ensembling makes sense include:
Combining outputs from models trained on diverse data to reduce biases.
Aggregating responses from weaker models to reach consensus.
Leveraging models with different skills to produce more complete answers than any single model could.
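A simple form of ensembling can be sketched as collecting several independent answers and asking for a final consensus. The example below assumes the OpenAI Python SDK and a placeholder model name, and samples the same model several times for brevity; in a real ensemble the candidates would ideally come from different models.

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, temperature: float = 1.0) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: could also be a different model per call
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content

question = "What are the main trade-offs between SQL and NoSQL databases for a small startup?"

# Collect several independent answers (sampled with some randomness here;
# in practice these could come from entirely different models).
candidates = [ask(question) for _ in range(3)]

numbered = "\n\n".join(f"Answer {i + 1}:\n{c}" for i, c in enumerate(candidates))

# A final call merges the candidates into a single consensus answer.
consensus = ask(
    f"Question: {question}\n\n{numbered}\n\n"
    "Combine these answers into a single, accurate response. Keep points the answers "
    "agree on, and drop claims that appear in only one answer unless clearly correct.",
    temperature=0.0,
)
print(consensus)
```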
Conclusions
Prompt engineering is a nascent discipline, and much is still unknown about its core principles, as well as the caveats and consequences of the current state of LLM development. This means you should be aware that many of these ideas or principles might not remain relevant in the mid-term as novel models and architectures are invented.
Prompt engineering is a powerful and qualitatively new software development pattern. You can now program a computer to solve novel problems with reasonable effectiveness using natural language! But there is no free lunch, as usual. The main limitations of large language models (LLMs) in prompt engineering stem from their inherent design caveats and constraints.
LLMs can be brittle and prone to failure on tasks if the instructions provided are not aligned with the model’s capabilities. This means that even solvable tasks can fail if the wrong instructions are given, highlighting the importance of crafting prompts that suit the specific LLM being used. LLMs also remain weak reasoners: they struggle with intricate conditional instructions, which makes it challenging to handle multiple instructions or conditions within a single prompt.
LLMs operate with a fixed amount of computation per token, which affects both input and output tokens. This constraint implies that a more detailed and verbose prompt produces better responses, especially for complex reasoning tasks. It also means that overly terse prompts may not yield optimal results. However, some balance is important because overly verbose prompts can be more confusing than informative.
Prompt engineering is still evolving and mostly artisanal, lacking a standardized approach. Experimentation is essential to determine the optimal prompt for a specific task and LLM. Different LLMs trained on diverse datasets may respond better to varying prompt styles, making it necessary to experiment with different strategies to find the most effective approach.
These limitations underscore the complexity and nuances involved in designing prompts for LLMs, highlighting the need for careful consideration and adaptation to maximize the performance of these models in various tasks and domains. In Part 3 of the book, we will apply these prompt engineering techniques to concrete problems and have plenty of time to explore the contexts in which they perform well.