A Deep Dive into Prompt Engineering Techniques: Part 1

OmbuLabs Blog (The Lean Software Boutique), 2024-05-21

Large Language Models (LLMs) are widely available, easily accessible, and increasingly a part of business. Whether you're interacting with an LLM via a provided interface or connecting via an API and integrating it into other systems, it's helpful to understand how to get the best possible results out of the model.

Prompt Engineering is a technique that focuses on perfecting your input to get the best possible output out of the language model. Of all the different techniques available to get LLMs to fit your use case best, it’s the most straightforward one to implement since it focuses primarily on improving the content of the input. In this Part I article, we’ll dive into different Prompt Engineering techniques and how to leverage them to write highly effective prompts, focusing on single prompt and chain techniques. In our following article, we’ll cover agents and multi-modal techniques.

For other available techniques to enhance LLM capabilities, check out our Techniques to Enhance the Capabilities of LLMs for your Specific Use Case article!

New to LLMs? Check out this article on the landscape by our friends over at Shift: Guest Post: Navigating the AI Chatbot Landscape.

A prompt is the input you provide to a generative model to produce an output. A successful prompt typically has four key components: instructions, context, input data, and an output indicator. These components ensure the model receives enough information in the input to provide the desired output.

Prompt engineering is the practice of refining the prompt to produce the desired output in specific scenarios without needing to update the actual model. As the tasks and scenarios handled by LLMs become increasingly complex, different techniques have emerged to help obtain the best results from an LLM. Some of these techniques can be applied directly while interacting with an LLM through a provided interface, such as ChatGPT (for GPT-3 and GPT-4), while others are best suited for integrating an LLM into new or existing systems.

Before we dive into these techniques, it’s important to keep two things in mind:

  • Different LLMs will respond differently to the same prompt. Techniques that might be effective with one model won’t necessarily perform as well with a different model (or even a different version of the same model).
  • Prompts are task-specific, as are some of the techniques covered here. Some are general ways to construct and use prompts and can be adapted to different scenarios, while others focus specifically on a type of task.

Let’s dive into some interesting prompt engineering techniques to help you construct powerful prompts.

Single Prompt Techniques

Single prompt techniques involve a single prompt being crafted to produce an output. The most common techniques in this category are zero-shot, few-shot, and chain of thought (CoT) prompting. These have been covered in the previous article in this series, Techniques to Enhance the Capabilities of LLMs for your Specific Use Case.

Below, we’ll cover two additional single prompt techniques.

Emotional Prompting

The idea behind this approach is to add an emotional stimulus to the prompt to elicit a better response from an LLM.

Emotion Prompt Diagram

Source: Figure 1 - Large Language Models Understand and Can Be Enhanced by Emotional Stimuli

The technique can be applied to a very simple prompt, for example:

Create a sales email for my new product

Inputting this into ChatGPT 4 yields:

To craft a compelling sales email for your new product, I'll need a bit more information. Could you please tell me about the product? What is it, its key features, the target audience, and any specific benefits or promotions you want to highlight?

It doesn't give a direct response; instead, it asks for more information to produce a better output. Now let's add an emotional stimulus:

Create a sales email for my new product. I need this for a meeting in half an hour. Do your best.

It now outputs:

Certainly! I can draft a general sales email template for you. Let's say your product is a new type of ergonomic office chair that is designed to provide exceptional comfort and support to professionals who spend a lot of time sitting. Here’s a draft email for this product: 


The draft provided is omitted for brevity. The important thing to note is that the emotional stimulus was enough to elicit a direct response, even if that response is a generic template with placeholders.

Li et al. evaluated different types of emotional stimuli and found that adding an emotional stimulus improved the performance of the LLMs evaluated.
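In an application, the stimulus can simply be appended to the base prompt. Here is a minimal sketch; the stimulus strings and the helper name are illustrative (Li et al. evaluate several specific phrasings):

```python
# Illustrative emotional stimuli; any phrasing that adds stakes or
# encouragement can work, per the EmotionPrompt findings.
EMOTIONAL_STIMULI = [
    "This is very important to my career.",
    "I need this for a meeting in half an hour. Do your best.",
    "Believe in your abilities and strive for excellence.",
]

def add_emotional_stimulus(prompt: str, stimulus_index: int = 0) -> str:
    """Return the prompt with an emotional stimulus appended."""
    return f"{prompt.rstrip('.')}. {EMOTIONAL_STIMULI[stimulus_index]}"

add_emotional_stimulus("Create a sales email for my new product", 1)
# -> "Create a sales email for my new product. I need this for a meeting in half an hour. Do your best."
```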

Directional Stimulus

This technique gives the model hints in the input to guide the language model towards the desired output.

Directional Stimulus Diagram

Source: Figure 1 - Guiding Large Language Models via Directional Stimulus Prompting

Let’s walk through an example from the original paper:

Article: Seoul (CNN) South Korea's Prime Minister Lee Wan-koo offered to resign on Monday amid a growing political scandal. Lee will stay in his official role until South Korean President Park Geun-hye accepts his resignation. He has transferred his role of chairing Cabinet meetings to the deputy prime minister for the time being, according to his office. Park heard about the resignation and called it "regrettable," according to the South Korean presidential office. Calls for Lee to resign began after South Korean tycoon Sung Woan-jong was found hanging from a tree in Seoul in an apparent suicide on April 9. Sung, who was under investigation for fraud and bribery, left a note listing names and amounts of cash given to top officials, including those who work for the President. Lee and seven other politicians with links to the South Korean President are under investigation. A special prosecutor's team has been established to investigate the case. Lee had adamantly denied the allegations as the scandal escalated: "If there are any evidence, I will give out my life. As a Prime Minister, I will accept Prosecutor Office's investigation first." Park has said that she is taking the accusations very seriously. Before departing on her trip to Central and South America, she condemned political corruption in her country. "Corruption and deep-rooted evil are issues that can lead to taking away people's lives. We take this very seriously." "We must make sure to set straight this issue as a matter of political reform. I will not forgive anyone who is responsible for corruption or wrongdoing. "Park is in Peru and is expected to arrive back to South Korea on April 27. CNN's Paula Hancocks contributed to this report.

Keywords: Lee Wan-koo; resign, South Korean tycoon; Sung Woan-jong; hanging from a tree; investigation; notes; top officials

Question: Write a short summary of the article in 2-4 sentences that accurately incorporates the provided keywords.

The prompt instructs the LLM to write a summary of the article provided, incorporating the provided keywords. Doing so helps the LLM focus on the most important areas and guides it towards the desired output.

This technique can also be used in combination with other techniques or in a chain of prompts. For example, an LLM can be instructed to extract keywords first and then write a summary.
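As a sketch, the hint can be injected programmatically. `build_directional_prompt` is a hypothetical helper that follows the Article / Keywords / Question layout from the example above; in a chain, the keywords could come from a preceding extraction prompt:

```python
def build_directional_prompt(article: str, keywords: list[str]) -> str:
    """Assemble a summarization prompt with a keyword hint acting as
    the directional stimulus."""
    return (
        f"Article: {article}\n\n"
        f"Keywords: {'; '.join(keywords)}\n\n"
        "Question: Write a short summary of the article in 2-4 sentences "
        "that accurately incorporates the provided keywords."
    )
```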

For more detail, check out the Guiding Large Language Models via Directional Stimulus Prompting paper by Li et al.


Prompt Chaining Techniques

Chaining prompts involves sequential task processing across multiple, distinct stages of interaction, where previous outputs are used to generate subsequent prompts. It allows for different combinations of techniques and language models and can be used to produce good outputs for very complex tasks.

Generated Knowledge Prompting

A common way to get LLMs to produce better outputs in specific scenarios is to augment a query with additional, relevant knowledge before sending it to the LLM. Generated knowledge prompting offers a way to do that without the need for an external system to retrieve information from (like a vector database, for example). Instead, it uses an LLM to generate its own knowledge and then incorporates this generated knowledge into the original prompt to improve the final output.

Generated knowledge prompting diagram

Source: Figure 1 - Generated Knowledge Prompting for Commonsense Reasoning

This approach offers a knowledge generation method that is not conditioned by templates, making it easier to generalize. The study by Liu et al. that proposed the approach found that it works quite well on commonsense reasoning tasks.

For example, let’s say we have a chatbot for a travel agency. A customer is interested in visiting Sydney in the summer. We can start with a few-shot prompt to the model to generate the knowledge:

Example: Generate a list of local dishes, popular summer activities and must-see places in Tokyo:

Local dishes to try:
- Sushi
- Ramen
- Tempura

Popular summer activities:
- Visiting the Firework Festivals
- Exploring beer gardens
- Experiencing traditional festivals

Must-see places:
- Shinjuku Gyoen National Garden
- Meiji Shrine
- Tokyo Skytree

Generate a list of local dishes, popular summer activities and must-see places in Sydney, Australia.

We can then get the output of the model and use it as input for a subsequent prompt, for example:

Based on this list of local dishes, popular summer activities, and must-see places in Sydney, Australia:

[previous response]

Generate a suggested travel plan for a person traveling for 5 days.

The initial few-shot prompt will generate the needed knowledge for the model to provide the desired output. The few-shot approach ensures consistency in the response and format. The subsequent prompt leverages the generated knowledge to produce a more assertive output.
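The two-step chain can be sketched as follows. Here `llm` is a placeholder for whatever completion call you use, and the prompt templates are condensed versions of the ones above (the Tokyo few-shot example is elided):

```python
from typing import Callable

# Condensed versions of the two prompts above.
KNOWLEDGE_PROMPT = (
    "Example: Generate a list of local dishes, popular summer activities "
    "and must-see places in Tokyo: [...]\n"
    "Generate a list of local dishes, popular summer activities and "
    "must-see places in {city}."
)

PLAN_PROMPT = (
    "Based on this list of local dishes, popular summer activities, and "
    "must-see places in {city}:\n\n{knowledge}\n\n"
    "Generate a suggested travel plan for a person traveling for {days} days."
)

def generated_knowledge_chain(llm: Callable[[str], str], city: str, days: int) -> str:
    """Step 1: have the model generate its own knowledge.
    Step 2: feed that knowledge back as context for the actual task."""
    knowledge = llm(KNOWLEDGE_PROMPT.format(city=city))
    return llm(PLAN_PROMPT.format(city=city, knowledge=knowledge, days=days))
```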

Please note that this is a simplified example to illustrate the idea behind generated knowledge prompting. For more details on the approach and what else it can accomplish, check out Generated Knowledge Prompting for Commonsense Reasoning by Liu et al.

Chain of Density (CoD)

The main goal of CoD is to improve an LLM’s ability to summarize content effectively. A good summary should strike the right balance of clarity and informativeness, that is, remain short with the proper information density.

Chain of Density works by iteratively increasing the number of entities in a summary produced by an LLM, raising its entity density without changing its length.

Chain of Density Example

Source: Figure 2 - From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting

The technique can be adapted as needed, or integrated with other prompting techniques to have further steps, such as asking the LLM to rate each summary and taking the highest rated one.
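The iterative loop might be sketched as below. `llm` stands in for any completion call, and the step prompt is a simplified paraphrase of the one used in the paper, not its exact wording:

```python
from typing import Callable

# Simplified densification step; the paper's actual prompt also asks for
# specific entity-selection rules and a fixed word budget.
COD_STEP = (
    "Here is the current summary:\n{summary}\n\n"
    "Identify 1-3 informative entities from the article that are missing "
    "from the summary, then rewrite the summary to cover them without "
    "increasing its length.\n\nArticle: {article}"
)

def chain_of_density(llm: Callable[[str], str], article: str, steps: int = 5) -> list[str]:
    """Iteratively densify the summary; every intermediate version is kept
    so a later step (e.g. asking the LLM to rate them) can pick the best."""
    summaries = [llm(f"Write a short, sparse summary of this article:\n{article}")]
    for _ in range(steps - 1):
        summaries.append(llm(COD_STEP.format(summary=summaries[-1], article=article)))
    return summaries
```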

For more details on results, check out the From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting paper by Adams et al.

Chain of Verification (CoVe)

The main purpose of CoVe is to reduce the possibility of hallucinations in LLM responses. Hallucination in a language model can be defined as a response that is plausible and “makes sense” but is factually incorrect.

The Chain of Verification method works in four steps:

  1. First, the LLM drafts an initial response to the input
  2. The LLM then plans verification questions to fact-check its own initial draft
  3. The model answers those verification questions independently so as to not bias the answers based on other responses
  4. The final, verified response is generated

Chain of Verification Diagram

Source: Figure 1 - Chain-of-Verification Reduces Hallucination in Large Language Models

The approach is very versatile, and each of these steps can be performed by prompting the same LLM in different ways or by prompting different LLMs. Let's run through an example using ChatGPT 4.

We start with an initial prompt:

Name five large cities in South America that are not country, state, or provincial capitals

We get the response:

Sure, here are some large cities in South America that are not capitals of countries, states, or provinces:
Santos, Brazil 
Campinas, Brazil 
Porto Alegre, Brazil (not a national capital but is the capital of Rio Grande do Sul state)
Curitiba, Brazil (capital of Paraná state)
Rosario, Argentina

It’s easy to see from the response itself that two of the cities in the list are, in fact, state capitals and should not be there. The next step is to generate the questions. We feed the LLM another prompt:

Write five questions, one for each of these cities, following the template:
Input: Rio de Janeiro 
Question: Is Rio de Janeiro a national, state, provincial or departmental capital? 
Input: Buenos Aires 
Question: Is Buenos Aires a national, state, provincial or departmental capital?



We get the response:

1. Is Santos a national, state, provincial, or departmental capital?
2. Is Campinas a national, state, provincial, or departmental capital?
3. Is Porto Alegre a national, state, provincial, or departmental capital?
4. Is Curitiba a national, state, provincial, or departmental capital?
5. Is Rosario a national, state, provincial, or departmental capital?

We now prompt the LLM to answer the questions:

Answer each one of the five questions below individually. Follow the template:

Question: Is Buenos Aires a national, state, provincial, or departmental capital?
Answer: Yes, Buenos Aires is the capital city of Argentina

Question: Is Rio de Janeiro a national, state, provincial, or departmental capital?
Answer: Yes, Rio de Janeiro is the capital of the state of Rio de Janeiro


And we get this response:

Question 1: Is Santos a national, state, provincial, or departmental capital?
Answer: No, Santos is not a capital city. It is a coastal city in the state of São Paulo, Brazil.

Question 2: Is Campinas a national, state, provincial, or departmental capital?
Answer: No, Campinas is not a capital city. It is a major city in the state of São Paulo, Brazil, known for its technological and educational institutions.

Question 3: Is Porto Alegre a national, state, provincial, or departmental capital?
Answer: Yes, Porto Alegre is the capital city of the state of Rio Grande do Sul, Brazil.

Question 4: Is Curitiba a national, state, provincial, or departmental capital?
Answer: Yes, Curitiba is the capital city of the state of Paraná, Brazil.

Question 5: Is Rosario a national, state, provincial, or departmental capital?
Answer: No, Rosario is not a capital city. It is a major city in the province of Santa Fe, Argentina.

Finally, we use the previous output to get the LLM to validate and update its initial response:

Here's an original list of large cities in South America that are not state, country or provincial capitals:

[initial list]

Based on these questions and answers, update the list to contain only cities that are not state, country or provincial capitals.

[verification questions and answers]

After you update the list, check how many cities are left. If the total is not five, add new cities that fit the guideline to the list until you reach five. Output only the final list and nothing else.

And we get the final response:

Santos, Brazil  
Campinas, Brazil  
Rosario, Argentina  
Vila Velha, Brazil  
Guarulhos, Brazil

That is a better answer, with only cities that fit our criteria.

While the steps can be performed with a single prompt, the verification step can also be performed with joint, 2-step, and factored prompts, with more sophisticated decomposition resulting in improved results.
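The four steps can be wired together as separate calls. This is a minimal sketch of the factored variant, where `llm` stands in for any completion function and the intermediate prompts are illustrative:

```python
from typing import Callable

def chain_of_verification(llm: Callable[[str], str], query: str) -> str:
    """The four CoVe steps as separate calls, so verification answers
    are not biased by the initial draft."""
    draft = llm(query)                              # 1. draft an initial response
    questions = llm(                                # 2. plan verification questions
        f"Write verification questions to fact-check this response:\n{draft}"
    ).splitlines()
    answers = [llm(q) for q in questions]           # 3. answer each independently
    qa = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    return llm(                                     # 4. generate the verified response
        f"Original question: {query}\nDraft answer: {draft}\n"
        f"Verification Q&A:\n{qa}\n"
        "Write a final answer consistent with the verified facts."
    )
```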

For more detailed information on the approach and performance results, check out the Chain-of-Verification Reduces Hallucination in Large Language Models paper by Dhuliawala et al.


Self-Consistency

The self-consistency technique offers a way to improve the effectiveness of Chain-of-Thought (CoT) prompting without needing to train an additional verifier or a re-ranker. It leverages only prompting and works with pre-trained language models without additional human annotation or external systems.

The method consists of three steps:

  1. Initial prompt using CoT to the language model
  2. Create multiple prompts exploring different reasoning paths to approach the original problem in different ways
  3. Run the prompts through the LLM and analyze the responses for consistency of the answer and common themes, among other relevant characteristics depending on the problem, to get the final answer.
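The aggregation step can be sketched as a simple majority vote, assuming each sampled response ends with a final line of the form "The answer is X." (the parsing convention here is an assumption, not part of the method itself):

```python
from collections import Counter
from typing import Callable

def self_consistency(llm: Callable[[str], str], cot_prompt: str, samples: int = 5) -> str:
    """Sample several reasoning paths and return the most common final answer.
    In practice each call is sampled with temperature > 0 so the paths differ."""
    answers = []
    for _ in range(samples):
        response = llm(cot_prompt)
        # Keep only the final answer, stripping the reasoning path.
        answers.append(response.rsplit("The answer is", 1)[-1].strip(" ."))
    return Counter(answers).most_common(1)[0][0]
```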

Self-Consistency Diagram

Source: Figure 1 - Self-Consistency Improves Chain of Thought Reasoning in Language Models

As stated by Wang et al. in the paper that describes the approach, self-consistency is built upon the intuition that complex reasoning problems can be approached in several different ways that all lead to the same, correct answer.

For more details on the approach and the results found, check out the Self-Consistency Improves Chain of Thought Reasoning in Language Models paper by Wang et al.

Active Prompting

Active prompting is another technique that improves upon the Chain-of-Thought (CoT) technique. The biggest limitation of CoT is that it relies on a fixed set of human-annotated examples. Active prompting leverages uncertainty-based active learning to adapt LLMs to different tasks.

Active prompting is implemented in four stages:

  1. The language model is queried \(k\) times for each question in an initial set of training questions, generating possible answers with intermediate steps. The uncertainty \(u\) is then calculated from the \(k\) answers using a given uncertainty metric
  2. Uncertainty is then evaluated, and the most uncertain questions are selected for annotation
  3. Humans then annotate the selected questions
  4. The annotated examples are then pre-pended to the questions per the CoT template, and the LLM is prompted with each question

Active Prompting Diagram

Source: Figure 1 - Active Prompting with Chain-of-Thought for Large Language Models

Different kinds of uncertainty metrics can be used in active prompting. Diao et al. propose four different metrics:


Disagreement

Consider the set \(A = \{a_1, a_2, ..., a_k\}\) of \(k\) sampled answers to a question. Disagreement is calculated by

\[u = \frac{h}{k}\]

where \(h\) is the number of unique answers in the set.

Questions with larger disagreement values can then be selected as candidates for annotation.
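This metric is straightforward to compute. For example, with five sampled answers of which three are unique:

```python
def disagreement(answers: list) -> float:
    """u = h / k, where h is the number of unique answers among k samples."""
    return len(set(answers)) / len(answers)

disagreement([38.70, 38, 38.70, 38.70, 39])  # -> 3/5 = 0.6
```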


Entropy

Entropy is calculated by:

\[u = \underset{i}{\mathrm{argmax}} - \sum_{j=1}^kP_\theta(a_j|q_i)\ln{P_\theta(a_j|q_i)}\]

where \(P_\theta(a_j \rvert q_i)\) is the frequency of a given answer \(j\) to a question \(i\) among all answers. The uncertainty \(u\) is the index \(i\) whose answer distribution has the maximum entropy (the negated summation in the equation).

Larger entropy means greater uncertainty and smaller entropy denotes smaller uncertainty; the goal is to find the questions with larger entropy.
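A sketch of entropy-based selection over the sampled answer sets, treating answer frequencies as the probabilities \(P_\theta(a_j \rvert q_i)\) (the function name is ours):

```python
from collections import Counter
from math import log

def most_uncertain_by_entropy(answer_sets: list[list]) -> int:
    """Return the index of the question whose sampled answers have the
    highest entropy, computed from the answers' relative frequencies."""
    def entropy(answers):
        k = len(answers)
        return -sum((c / k) * log(c / k) for c in Counter(answers).values())
    return max(range(len(answer_sets)), key=lambda i: entropy(answer_sets[i]))
```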


Variance

Variance can also be used as an uncertainty metric, calculated as:

\[u = \underset{i}{\mathrm{argmax}} \frac{\sum_{j=1}^k (a_j - \bar{a})^2}{k - 1} \Bigg|_{q=q_i}\]

where \(\bar{a} = \frac{1}{k}\sum_{j=1}^k a_j\).

Diao et al. hypothesize that variance might be more suitable for answers expressed in Arabic numerals, i.e., numeric answers.
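For numeric answers, the sample variance in Python's standard library matches the \(k - 1\) denominator in the formula above (the helper name is ours):

```python
from statistics import variance  # sample variance: divides by k - 1

def most_uncertain_by_variance(answer_sets: list[list[float]]) -> int:
    """Return the index of the question whose numeric answers vary most."""
    return max(range(len(answer_sets)), key=lambda i: variance(answer_sets[i]))
```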


Self-confidence

Self-confidence is assessed by querying the language model with a pre-defined template that instructs it to classify its confidence in each answer according to a pre-defined set of categories. The least confident questions are then selected by:

\[u = \underset{i}{\mathrm{argmax}}(1 - \underset{j}{\mathrm{max}}P_\theta(a_j|q_i)) = \underset{i}{\mathrm{argmin}} \ \underset{j}{\mathrm{max}}P_\theta(a_j|q_i)\]

where \(P_\theta(a_j \rvert q_i)\) is a categorical variable from the pre-defined set of categories.

The authors found that the first three metrics significantly outperform self-confidence and perform comparably with one another.

Let’s walk through a simple example. We start with a set \(S\) of \(n=10\) questions that may or may not be annotated:

\[S = \{q_1, q_2, ..., q_{10}\}\]

We prompt the LLM \(k=5\) times for each question in \(S\) to generate possible answers with intermediate steps. This can be a zero-shot prompt, for example:

A car runs 10 kilometers with a liter of gas. A liter of gas costs $1.29. How much does it cost to drive 300 kilometers?

Or a few-shot prompt, for example:

Question: A bag of supplies lasts 5 days and costs $10. How much does it cost to buy supplies for a 30-day trip?
Answer: A bag lasts 5 days. The trip lasts 30 days. 30/5 = 6, so 6 bags are needed. Each bag costs $10. 6 bags times $10 = $60. The answer is $60.
Question: A car runs 10 kilometers with a liter of gas. A liter of gas costs $1.29. How much does it cost to drive 300 kilometers?

For each run of the prompt, we get an answer, resulting in a set \(A_i\) of \(k\) answers for each question \(q_i\) of index \(i\):

\[A_i = \{a_1, a_2, a_3, a_4, a_5\}\]

Let’s assume the set of answers for \(q_1\) is \(A = \{38.70, 38, 38.70, 38.70, 39\}\). Using disagreement as the uncertainty metric, we calculate it by taking the number of unique answers \(h=3\) and dividing it by the number of answers \(k=5\). The uncertainty measure of this set of answers is \(u = \frac{3}{5} = 0.6\). The process is repeated for all answer sets, resulting in a set of uncertainty metrics \(U\), one metric for each question:

\[U = \{u_1, u_2, ..., u_{10}\}\]

where \(u_1\) is the uncertainty measure of the answer set for \(q_1\).

We can now fetch the questions with the highest uncertainty. This can be done based on different criteria, such as taking a fixed number of the most uncertain questions or taking all questions with uncertainty higher than a threshold \(x\). This creates a set of \(p\) questions that meet the criteria and need to be annotated.
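The selection step might be sketched as a small helper supporting both criteria (the function and parameter names are ours):

```python
def select_for_annotation(uncertainties, top=None, threshold=None):
    """Return question indices to annotate: either the `top` most
    uncertain questions, or every question above `threshold`."""
    indexed = sorted(enumerate(uncertainties), key=lambda iu: iu[1], reverse=True)
    if top is not None:
        return [i for i, _ in indexed[:top]]
    return [i for i, u in indexed if u > threshold]

U = [0.6, 0.2, 1.0, 0.4]
select_for_annotation(U, top=2)          # -> [2, 0]
select_for_annotation(U, threshold=0.5)  # -> [2, 0]
```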

The set of selected questions is annotated by humans. The annotated examples are pre-pended to the questions following the CoT template, and the LLM can be prompted with each question.

For more details on the approach and experiments, check out the Active Prompting with Chain-of-Thought for Large Language Models paper by Diao et al.

Tree of Thoughts (ToT)

The Tree of Thoughts approach enhances a language model's ability to problem-solve and perform more complex tasks by enabling LLMs to explore different reasoning paths over "thoughts" (coherent units of text). In the CoT approach, the LLM progresses linearly in its "reasoning" towards problem-solving; if an error occurs along the way, the model tends to carry it forward. ToT proposes an alternative approach in which the LLM evaluates itself at each node (or state) of thought, allowing it to abandon inefficient approaches early and switch to alternative methods.

The problem is framed as a search over a tree, with a specific instantiation of ToT answering four questions, as stated by Yao et al.:

  1. How to decompose the intermediate process into thought steps
  2. How to generate potential thoughts from each state
  3. How to heuristically evaluate states
  4. What search algorithm to use

Tree of Thoughts Diagram

Source: Figure 1 - Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Let’s walk through a creative writing example outlined in the original paper by Yao et al.:

Tree of Thoughts Example

Source: Figure 4 - Tree of Thoughts: Deliberate Problem Solving with Large Language Models

A typical CoT prompt for a creative writing task would look like:

Input: Use these two sentences to write two paragraphs, each ending in one of the provided sentences: [TWO SENTENCES]
Input: Use these four sentences to write four paragraphs, each ending in one of the provided sentences: [FOUR SENTENCES]

As you can see, CoT samples the thoughts without explicitly decomposing the steps, and the results on this kind of task aren't as effective. ToT, by contrast, leverages the problem's properties to design and decompose intermediate thought steps. A thought should be "small" enough that the language model can generate good, diverse samples, but "big" enough that the model can evaluate the thought's prospects toward problem-solving.

For this same task, we’d instead start by prompting the LLM to create \(n\) plans to accomplish the task. For this example, let’s consider \(n=5\):

I have a creative writing problem. I need to write a coherent passage of four paragraphs, each ending in one of these sentences [FOUR SENTENCES]. Brainstorm five distinct step by step plans to accomplish this task. Consider factors such as [LIST OF FACTORS TO CONSIDER].

The output would be a set of \(n=5\) plans to accomplish the task. We then get the LLM to vote on each plan to accomplish the task and assign a measure of confidence in the plan’s success:

For each of the five proposed solutions to accomplish this task, evaluate their potential and give a probability of success and confidence in the result. Consider these factors when evaluating each option: [LIST OF FACTORS TO CONSIDER]

Now that we have a best-rated plan, we prompt the LLM to generate \(k\) different passages following that plan. For this stage, let's assume \(k=3\):

Following this step-by-step plan [BEST VOTED PLAN] write 3 different passages that are four paragraphs long each. The end sentence of each paragraph must be, in this order: [FOUR SENTENCES]

Finally, the language model votes on the best passage:

Analyze the choices below and conclude which one is the most promising one given the instruction.

The best-voted passage is our result.

This is a simple implementation with one branch. The approach can be expanded to handle increasingly complex tasks and yield more refined results.
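The one-branch walkthrough above can be sketched as a pipeline. `llm` is a placeholder completion call, the prompts are heavily condensed, and in practice the plans and drafts would come from sampling one brainstorm prompt with temperature > 0 rather than separate calls:

```python
from typing import Callable

def tot_creative_writing(llm: Callable[[str], str], task: str,
                         n_plans: int = 5, k_drafts: int = 3) -> str:
    """One-branch ToT: brainstorm plans, vote for the best plan,
    write several passages from it, then vote for the best passage."""
    plans = [llm(f"Plan {i + 1}: brainstorm a step-by-step plan for: {task}")
             for i in range(n_plans)]
    best_plan = llm("Vote for the most promising plan:\n" + "\n".join(plans))
    drafts = [llm(f"Draft {i + 1}: write the passage following this plan: {best_plan}")
              for i in range(k_drafts)]
    return llm("Vote for the most promising passage:\n" + "\n".join(drafts))
```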


Conclusion

Prompt engineering techniques go far beyond the typical input-output prompt and can be very powerful and effective for various tasks. Even in scenarios where other techniques to enhance LLMs are required, prompt engineering will feature either as part of intermediate steps (RAG, for example) or to elicit desired outputs (interacting with a fine-tuned model, for example).

Looking to leverage generative AI in your business? Let’s talk!

This blog post is part of a series with our friends over at Shift Interactive. Stay tuned for more!

Techniques to Enhance the Capabilities of LLMs for your Specific Use Case

2024-04-22

With the advent of widely available Large Language Models (LLMs), businesses everywhere have sought to leverage these models to increase the productivity of their teams, automate specific tasks, and expand the abilities of chatbots, among a variety of other things.

However, LLMs are not great at handling domain-specific tasks out of the box. In this article, we’ll explore a few different techniques to enhance the capabilities of LLMs and help them perform well for your specific use case.

Why Bother?

Large Language Models are available in a variety of different ways. Some are available through APIs (paid or not), like OpenAI's or Mistral's models. Others are available to run locally, like Meta's Llama 2 model. Regardless of which model you choose, these LLMs provide a solid foundation for our AI needs, but more often than not, you need to build upon that foundation.

These models are pre-trained on very large datasets, and are designed to generate responses based on a broad understanding of language and knowledge. They are not designed to handle highly specific or specialized scenarios, and might lack that extra depth in responses needed for some use cases, like technical support.

LLMs are also trained on a dataset that is fixed at a certain point in time, and thus don’t have access to up-to-date information. In a large portion of real world use cases, when language models need to interact with users (internal or external) to provide information, it is important for them to evolve as the information evolves and new information is generated.

The reliability of LLMs is also not always optimal. These models hallucinate, and LLM hallucinations can be a big issue, especially in use cases where an external user who needs reliable information is involved.

Hallucinations are defined as the model producing outputs that are coherent and grammatically correct, but are factually incorrect or nonsensical. That is, the model is “confidently incorrect”.

These hallucinations can happen due to a variety of reasons, including training data limitation and quality, the interpretative limits of the model, the inherent complexity of language, biases in the model, among others. Techniques to help LLMs perform better on smaller, specific scenarios can help reduce that risk and improve their reliability.

Finally, cost and scalability play a role in using LLMs as well. Direct queries to an LLM can become expensive, and training (and re-training) your own model is even more costly. Techniques to help these models “focus” on a specific scenario and pre-filter relevant information can help reduce the computational load on the LLM and the associated cost.

These are all good reasons to implement techniques to enhance the capabilities of LLMs, and help them perform better in specific scenarios. So let’s take a look at a few of these techniques.

Prompt Engineering Techniques

The simplest of the techniques we’ll cover, prompt engineering focuses on perfecting the input to elicit the best possible outputs from the model. Unlike the other methods in this article, Prompt Engineering does not involve modifying the model or integrating anything new into it.

Large language models are tuned to follow instructions and, having been trained on very large and diverse datasets, can generalize from just a few examples based on the patterns they encountered during training. Prompt engineering leverages these capabilities to improve the responses obtained directly from the LLM.

Zero-shot prompting

Zero-shot prompting is the simplest, most common form of prompting. It involves prepending a specific instruction to the query without providing the model with any direct examples or additional information.

Let’s take, for example, a restaurant looking to classify reviews as positive, neutral or negative. A zero-shot prompt would look like this:

Classify the following restaurant review as positive, negative, or neutral:
"The menu was extensive, but the flavors were just okay. Nothing stood out as a must-try."

And the language model would output a classification.

Few-shot prompting

Few-shot prompting involves giving the language model a few examples to work with, helping guide it towards the desired response. It’s useful when you want a specific response format, or when the output you’d like is hard to describe.

For the same scenario as above, if you’d like to, instead of getting a general sentiment for the review, break it down into specific aspects, you could construct a prompt like this:

Classify the sentiment of the following restaurant review focusing on food, service, and ambiance:
Review: "The steak was cooked to perfection, but the service was a bit slow. The rooftop view was breathtaking, though."
Food: Positive
Service: Negative
Ambiance: Positive

Classify the sentiment of the following restaurant review focusing on food, service, and ambiance:
Review: "The dishes were uninspired and bland. However, the staff was friendly, and the interior decor was charming."
Food: Negative
Service: Positive
Ambiance: Positive

Now, classify the sentiment of this restaurant review focusing on food, service, and ambiance:
Review: "Amazing cocktails and appetizers, but the main courses were disappointing. The place was noisy, making it hard to enjoy the meal."

The model can then output a response following the desired format, since it has a frame of reference.
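Few-shot prompts like the one above are easy to assemble programmatically. `few_shot_prompt` is a hypothetical helper that repeats the instruction with each labeled example and leaves the new review unlabeled so the model completes the pattern:

```python
def few_shot_prompt(instruction: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a few-shot prompt from (review, labels) example pairs,
    placing the unlabeled query last."""
    blocks = [f'{instruction}\nReview: "{review}"\n{labels}'
              for review, labels in examples]
    blocks.append(f'{instruction}\nReview: "{query}"')
    return "\n\n".join(blocks)
```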

Chain-of-thought prompting

The idea behind chain-of-thought prompting is to guide the model through intermediate steps to allow for more complex problem-solving, by “guiding” the model through the reasoning steps.

A good example of how chain-of-thought prompting can enhance the model’s response is presented by Wei et al. (2022) in their paper introducing the method. Let’s take this standard prompt:

Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis
balls does he have now?

Answer: The answer is 11.

Question: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?

The model output for the prompt in their experiment was:

Answer: The answer is 27.

It’s easy to see that this answer is incorrect: it should be 9, not 27.

Let’s apply chain-of-thought prompting instead:

Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis
balls does he have now?

Answer: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Question: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?

The model output now was:

Answer: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more 
apples, so they have 3 + 6 = 9. The answer is 9.

Now, the model produced the correct response.

It’s also possible to combine zero-shot and few-shot prompting with chain-of-thought prompts, for even more powerful prompting.
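
For example, a common combination is zero-shot chain-of-thought, which appends a reasoning trigger phrase instead of hand-written worked examples. A small sketch (the trigger phrase comes from Kojima et al., 2022; the helper is ours):

```ruby
# Zero-shot chain-of-thought: append a reasoning trigger so the model spells
# out intermediate steps on its own, without any worked examples.
def zero_shot_cot(question)
  "Question: #{question}\n\nAnswer: Let's think step by step."
end

puts zero_shot_cot("The cafeteria had 23 apples. If they used 20 to make " \
                   "lunch and bought 6 more, how many apples do they have?")
```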

Other Techniques

In addition to the three techniques mentioned above, there are a variety of other prompting techniques that can be used to enhance results, such as:

  • Contrastive prompting: providing examples of both correct and incorrect responses to help the model identify desired and undesirable outputs
  • Role-based prompting: assigning a role or persona to the model to influence the tone, style, and other general characteristics of the response
  • Analogical prompting: using similar problems or scenarios to guide the model’s response

Ultimately, the most important thing to keep in mind is prompting is an iterative exercise. Refining your prompts to get the desired results and trying multiple techniques to see which ones perform better is the best way to find what works for your specific use case.

For a large portion of use cases, prompt engineering techniques are “good enough” to get the model to perform at the desired level. For those where it isn’t, we have the techniques below in the toolbox.

Transfer Learning

Transfer learning is a strategy that employs a model developed for one task as the starting point for a model on a second task. It enables models to leverage pre-existing knowledge to solve new but related problems, improving the learning efficiency.

In the context of language models, this means we can take a model trained on a large corpus of text (an LLM) and use these learned word and context representations to solve new language tasks, such as sentiment analysis or text classification.

There are several approaches that can be applied to transfer learning, such as fine-tuning, multi-task learning, and feature extraction, to name a few.

Full fine-tuning

Full fine-tuning (also known as instruction fine-tuning) aims to enhance a model’s performance across a variety of tasks by training the already pre-trained LLM on a smaller, specific, labeled dataset of examples and instructions that guide its responses to queries.

LLM fine-tuning cycle diagram

It involves training the entire model on this new dataset, adjusting all layers of the language model during the training process, meaning all model parameters are updated. The model learns from the specific examples in your instruction dataset, which should include example prompts and completions to guide the model.

LLM full fine-tuning overview diagram Source: Coursera: Learn Generative AI with LLMs

Fully fine-tuning a model involves creating a dataset specific to what you’d like to fine-tune the model on, pre-processing the data into something the model can consume, training (fine-tuning) the model on the new dataset, and evaluating its performance. This process is iterated as many times as needed, with adjustments along the way, until the desired performance is achieved.
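
Instruction datasets are commonly stored as JSON Lines, one example per line. A sketch of preparing that format in Ruby (the prompt/completion field names are illustrative; each fine-tuning platform defines its own schema):

```ruby
require "json"

# One training example per line. The "prompt"/"completion" field names are
# illustrative; check your platform's documented dataset schema.
pairs = [
  { prompt: "Classify this review: \"Great food, slow service.\"",
    completion: "Food: Positive\nService: Negative" },
  { prompt: "Classify this review: \"Lovely patio, bland pasta.\"",
    completion: "Food: Negative\nAmbiance: Positive" }
]

jsonl = pairs.map { |pair| JSON.generate(pair) }.join("\n")
puts jsonl
```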

Multi-Task Learning

Multi-task learning is also a form of tuning. Its key goal is to train a model to perform multiple, related tasks simultaneously. The logic behind it is that learning one task should enhance learning of the other, related tasks.

For example, we could train a model to perform sentiment analysis on reviews for a restaurant while also training it to classify the reviews according to their main focus (food, service, location, etc.).

Multi-task learning is accomplished in a very similar way to fine-tuning; however, the dataset would now include instructions for all tasks you’d like to train the model on.

It is important to keep in mind some of the things that can go wrong, though! For proper multi-task learning, we need to ensure tasks are balanced during training; we don’t want the model to become a specialist in one task at the expense of all the others. The tasks being related also matters: the idea is that learning one task will enhance learning of the others. Training a model to classify restaurant reviews and predict tomorrow’s weather will likely not yield very good results.
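
One naive way to keep tasks balanced is to interleave examples and truncate to the smallest task. A sketch (a real training pipeline would sample more carefully):

```ruby
# Interleave examples from each task, truncated to the smallest task's size,
# so no single task dominates the training mix.
def balance_tasks(task_datasets)
  min_size = task_datasets.map(&:size).min
  task_datasets.map { |examples| examples.first(min_size) }.transpose.flatten(1)
end

sentiment_examples = ["s1", "s2", "s3"]
focus_examples     = ["f1", "f2"]
balance_tasks([sentiment_examples, focus_examples])
```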

Parameter-efficient fine-tuning (PEFT)

PEFT is a form of instruction fine-tuning that focuses on only a subset of the LLM’s parameters. Unlike full fine-tuning, which adjusts all of the model’s parameters (effectively producing a full modified copy of the LLM), PEFT updates only a specific subset of parameters and “freezes” the rest.

By doing this, PEFT allows for more manageable memory requirements while also helping prevent catastrophic forgetting.

Catastrophic forgetting happens when an LLM fine-tuned to perform one specific task forgets previously learned information, performing well only on the new data.

PEFT avoids the loss of previously learned information by preserving the original LLM weights.

PEFT Diagram

There are multiple techniques to achieve PEFT. Some prioritize training select portions of the original model, altering specific subsets of its parameters; others integrate and train smaller additional components, such as adapter layers, without ever modifying the LLM’s original structure.

Two of the most widely used and effective PEFT methods are LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation). We’ll cover these methods in more detail in the next article in this series.

Feature Extraction

Feature extraction involves using a pre-trained model to extract meaningful features from data, then using those features as input for a new model or task.

In order to do this, we need to remove the output layer of the LLM (which is specific to the task it was trained on, and yields a prediction) to access the last hidden layer of the model, which will output a feature vector rather than a prediction. The feature vector is what we’re interested in.

Feature extraction diagram

The extracted features will then be used as input for a new model that is trained to perform a different task. Now, only this new model needs to be trained from scratch.

This new model is usually much smaller and less complex, and thus more computationally efficient. This is possible because the heavy lifting of extracting meaningful features from data has already been done by the pre-trained model.
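
As a toy illustration of that downstream model, here is a nearest-centroid classifier over feature vectors. In practice the vectors would come from the model’s last hidden layer; the hard-coded 3-dimensional vectors below are stand-ins:

```ruby
# Average a set of feature vectors dimension by dimension.
def centroid(vectors)
  vectors.transpose.map { |dimension| dimension.sum.to_f / dimension.size }
end

# Nearest-centroid classification: pick the label whose centroid is closest
# (by squared Euclidean distance) to the extracted features.
def classify(features, centroids)
  centroids.min_by { |_label, center| features.zip(center).sum { |a, b| (a - b)**2 } }.first
end

# Stand-in features; real ones would be extracted by the pre-trained model.
centroids = {
  positive: centroid([[0.9, 0.1, 0.2], [0.8, 0.2, 0.1]]),
  negative: centroid([[0.1, 0.9, 0.8], [0.2, 0.8, 0.9]])
}

classify([0.85, 0.15, 0.1], centroids)
```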

Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation is a completely different way to augment LLMs. Unlike Transfer Learning techniques, it doesn’t alter the original model at all, but rather integrates it with a retrieval mechanism that can help the language model “look up” information.

RAG doesn’t have to be an alternative to fine-tuning though. A combination of the two can be incredibly powerful in creating specialized LLMs with access to up-to-date, specific, niche information.

RAG Diagram Source: Gradient Flow

The core idea behind RAG is to use a retrieval mechanism to fetch additional, relevant information before sending a query to the LLM, so that this additional information can be leveraged by the LLM to produce a better response.

For language models, since we’re dealing with textual information, the most common setup is a vector database that stores the domain-specific dataset as embedded vectors. However, it is also possible to incorporate other kinds of data stores into the retrieval mechanism, depending on your specific use case.

The standard flow of a simple RAG implementation could look like this:

  1. User submits a query
  2. Query is embedded
  3. Semantic search is used to retrieve relevant information from the vector storage based on similarity
  4. This relevant data is combined with the original query into a carefully crafted prompt
  5. The prompt is sent to the LLM to produce a response
  6. The response is sent to the user
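
The steps above can be sketched as follows, with simple word overlap standing in for real embeddings and a vector store (the documents, helper names, and prompt wording are illustrative):

```ruby
# A stubbed RAG flow. In a real system, semantic_search would embed the
# query and run a similarity search against a vector database.
DOCS = [
  "Refunds are available within 30 days of purchase.",
  "Support hours are 9am to 5pm EST, Monday through Friday."
]

# Crude retrieval: rank documents by shared words with the query.
def semantic_search(query, docs, top_k: 1)
  query_words = query.downcase.scan(/\w+/)
  docs.max_by(top_k) { |doc| (doc.downcase.scan(/\w+/) & query_words).size }
end

# Combine the retrieved context with the original query (step 4).
def rag_prompt(query)
  context = semantic_search(query, DOCS).join("\n")
  "Answer using only the context below.\n\nContext:\n#{context}\n\nQuestion: #{query}"
end

# The resulting prompt is what gets sent to the LLM (step 5).
puts rag_prompt("Can I get a refund within 30 days?")
```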

There are multiple ways to enhance this flow, and multiple advanced RAG techniques that will be covered in future articles in this series. For now, at a high level, RAG is a great way to integrate a retrieval mechanism that leverages new, specific, and up-to-date information to enhance an LLM’s capabilities, yielding better responses in specific use cases.

Which one is the best?

The one that fits your use case.

Which technique is best depends on the specific characteristics of your use case, restrictions around cost, computational power, and time, among other factors.

In simpler use cases, where the model doesn’t need access to very specific or up-to-date information, prompt engineering is often enough. Getting the desired output is mostly about leveraging the model’s capacity to generalize, draw parallels, and follow examples. However, prompt engineering alone is not robust or reliable enough for use cases that require additional background knowledge or domain-specific knowledge.

For use cases where knowledge of a specific domain and its nuances is required, or where the foundational model simply doesn’t perform your specific task well, fine-tuning is a good option. It is especially beneficial in areas with specialized jargon, concepts or structures, for example in handling legal documents, medical research or financial reports. Fine-tuning enhances the accuracy and robustness of the model by exposing it to more examples, edge cases, and less common scenarios in the domain-specific dataset, while leveraging the general language constructs it has already learned in original training.

The downsides of fine-tuning are its high computational costs, since updating the parameters of a large language model can be quite expensive, along with large memory requirements, a significant time investment, and a high level of expertise.

Finally, if the model needs access to a broad range of information, and especially if it needs up-to-date information and documents that weren’t part of the training set (without retraining it, of course), RAG is the best option. For example, a technical support chatbot that relies on the company’s knowledge base and policies needs access to up-to-date information, and retraining the model every time an update is made is not cost-effective.

RAG also has limitations, however. The idea behind it is that it enhances the LLM’s information retrieval capabilities by drawing context from the provided external data, but additional context is not always enough. If the pre-trained LLM is not good at the specific task (for example, summarizing financial data), providing additional context in the form of extra information won’t help much. RAG also requires an upfront investment in preparing the external dataset, so that the most relevant information can be retrieved reliably and doesn’t confuse the model rather than enhancing it.

Finally, it’s important to be careful when using AI, especially to interface with external customers, and to be aware of its limitations. All language models can hallucinate, and none of the techniques mentioned eliminates that risk completely. Therefore, always evaluate the potential risks of integrating AI into your use case and how to mitigate them. A good example of these risks materializing is a chatbot that went rogue earlier this year, making up a refund policy that didn’t exist. Check out the Air Canada Has to Honor a Refund Policy Its Chatbot Made Up article by Wired.


LLMs are incredibly powerful and can be used to enhance a wide variety of existing use cases, as well as enable new ones. Integrating LLMs into a product, internal flow or website can be a really good way to automate repetitive tasks, increase productivity, enhance user experience and take the capabilities of your product to the next level, unlocking new value for users (and for your company).

These techniques provide ways to enhance an LLM’s capabilities and tailor it to your specific use case, ensuring it performs at the desired level in whichever task you want to accomplish. Looking to leverage generative AI to solve specific problems? Need help figuring out how AI can help your company? Let’s talk!

This blog post is part of a series with our friends over at Shift Interactive. Stay tuned for more!

Guest Post: Navigating the AI Chatbot Landscape2024-04-17T13:23:01-04:002024-04-17T13:23:01-04:00 We often partner with our friends at Shift Interactive when we need an extra set of hands or expertise to complement our own. Recently we’ve been collaborating with them on interesting Artificial Intelligence and Machine Learning projects. Check out their recent blog post below or here to get an overview of the AI chatbot landscape.

You can also check out the next article in this series: Techniques to Enhance the Capabilities of LLMs for your Specific Use Case.

Your company has a web application, mobile app, and/or website that gets thousands of users every single day. Questions and support from your users are getting difficult to manage with the current number of employees you have on staff. You think about how to offset some of the demand from your users for simple tasks like changing settings and resetting passwords so that your employees can focus on the more difficult requests and support. You decide to add a chatbot.

However, you don’t just want this chatbot to provide canned answers that will frustrate your users. Instead, you want a chatbot that is smarter and more personalized, so that the conversation feels like talking to a real person. And you’ve heard that AI can do that. Would it make sense to use here?

How do you start?

We got you.

AI Models

How does an AI chatbot work? It starts with a model.

A model is ultimately what generates a response from the information passed to it. For example, in a chatbot, whatever question a user types into the chat window is fed into the model, and the model produces a response that is passed back to the user.

Models exist and can be created to solve all kinds of problems. When a model is used to generate content, such as a response to a question, it is considered, at a high level, to be Generative AI because it generates something. More specifically, it’s also considered a Large Language Model (LLM) when it can comprehend and/or generate human language text.

Foundation Models

We have two options when it comes to establishing a starting point for an AI model. We can start from scratch, building and training a model entirely within our organization based only on data we provide, or we can start with a Foundation Model (FM).

A Foundation Model is a model that was built and trained by another organization on a large and broad amount of data so that it can be a generalized model that can be used for a variety of purposes. OpenAI would be an example of an organization that creates foundation models.

Building and training a model from scratch is typically not cost-effective because the costs involved can easily be in the millions of dollars and take a long time. This process is also resource-intensive and takes a specialized team of data scientists and engineers to effectively train the model with a large amount of data that you may not have available.

Foundation Models are a more cost-effective entry to getting started and require much fewer resources.

Picking a Foundation Model

When building an AI-powered chatbot, we know a couple of specifications that we want for our foundation model. We know we want it to be a Generative AI model, an LLM, and able to specifically generate text.

Finding a model can also depend on the platform we want to use; there are many AI platforms available to support getting started with an FM.

These platforms all have pros and cons, and choosing one over another might come down to what foundation models are available and how well they suit your needs. The ecosystem, support, and experience of developing the AI model within the platform are also things to consider.

In this case, we’re going to focus on the Google Cloud Vertex AI Studio. The following is a list of foundation model groups that we can pick from at the time this article was published:

  • Gemini
  • PaLM
  • Codey
  • Imagen

Each of the groups has specific models that combine specific features and are designed for specific use cases. The following are two examples of models from the Gemini and PaLM model groups:

Gemini 1.0 Pro (gemini-1.0-pro)

Designed to handle natural language tasks, multiturn text and code chat, and code generation. Use Gemini 1.0 Pro for prompts that only contain text.

PaLM 2 for Chat (chat-bison)

Fine-tuned for multi-turn conversation use cases.

In this particular use case, the “Gemini 1.0 Pro” model offers more functionality than we actually need. So instead, we should start with the “PaLM 2 for Chat” model and work to customize it.

Configuring the Model

One thing to consider when creating an AI chatbot for your users is making sure the conversation stays grounded in a discussion about your web application, company, support-related questions, etc.

You don’t want to create a chatbot for your web application that supports the user asking it about things like a sports team, evaluating code, or anything else that is unrelated to your company. At the same time, you do want the chatbot to be aware of specific prompts and responses that would be appropriate for users to request so that the chatbot can provide the user with a grounded experience.

For the PaLM 2 for Chat model, we have the following options to configure the model:

Context allows us to put guardrails on the model and to establish the style and tone of the response. Some guardrails that we might put in place are things like topics to focus on or avoid, what to do when the model doesn’t know the answer, and what words the model can’t use.

Examples provide the model with ideal responses to questions that may be asked to demonstrate to the model what is expected.

Grounding helps make sure the responses are focused on specific information such as company support features and frequently asked questions and answers.
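
To make these three options concrete, here is an illustrative configuration expressed as a Ruby hash. The field names and the “ExampleCo” content are assumptions for illustration, not the exact Vertex AI request schema:

```ruby
# Illustrative model configuration. The keys mirror the concepts above
# (context, examples, grounding) rather than any platform's exact schema.
config = {
  context: "You are a support assistant for ExampleCo. Only discuss " \
           "ExampleCo products, accounts, and support topics. If you do " \
           "not know an answer, say so and offer to escalate to a human.",
  examples: [
    { input: "How do I reset my password?",
      output: "Open Settings, choose Account, then select Reset Password." }
  ],
  grounding_sources: ["company-faq", "support-articles"]
}
```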

Tuning the Model

There are different techniques for ensuring a model adapts to custom data and/or sources. Fine-tuning is just one technique. Additional articles in this series will highlight the different techniques along with their specific use cases. For this example though, we are going to focus on just fine-tuning the model in order to handle a custom dataset. Something to keep in mind is that fine-tuning can be rather expensive and resource-intensive depending on the amount of fine-tuning that is necessary.

In this step, we will first need to prepare a dataset to be used for tuning the model.

Preparing the dataset involves developing examples of a conversation that might occur for a user with our chatbot. The more examples we can provide in the dataset, the better-tuned our model will be.

Once we have a dataset ready, the next step is running the model through a fine-tuning process with the dataset. Depending on the platform and process you are taking with building your AI model, this will look different. For Google Cloud Vertex AI Studio, this involves storing the tuning dataset in a Google Cloud Storage bucket and then kicking off a text model supervised tuning job.
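
A tuning dataset for a chat model is typically a file of example conversations, one per line. A sketch of the shape (the field names here are illustrative; check your platform’s documented dataset schema before uploading):

```ruby
require "json"

# One tuning example per line, each a short conversation. The "messages",
# "author", and "content" field names are illustrative only.
conversations = [
  { messages: [
      { author: "user", content: "How do I change my email address?" },
      { author: "assistant", content: "Open Settings, choose Account, then select Change Email." }
    ] },
  { messages: [
      { author: "user", content: "Can I pause my subscription?" },
      { author: "assistant", content: "Yes. Go to Billing and choose Pause Subscription." }
    ] }
]

tuning_jsonl = conversations.map { |ex| JSON.generate(ex) }.join("\n")
puts tuning_jsonl
```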

Evaluating the Model

Once we have completed the process of tuning our AI model, the final step to prepare the model for production is to evaluate the model with a test dataset. This will determine if the model is responding appropriately to our questions in a chat context.

The simplest way to evaluate the tuned model is to compare it with the pre-trained model. This involves preparing a dataset to be used for evaluation that contains questions that are representative of what our users might ask.

We would want to run the dataset through the pre-trained model to determine the responses. We would then run the dataset through our fine-tuned model and then compare the responses between the two result sets.

Specific metrics we would be looking for are the following:

  • Length of response
  • Whether the response had a positive or negative sentiment
  • Answer quality, coherence, relevance, fluency

We should be looking to establish a threshold percentage for each metric that we want to target. Meeting these defined thresholds will indicate that our model is ready for deployment. If we feel the model is not quite ready for production and needs further fine-tuning then we should continue to tune the model until it reaches the threshold for deployment.
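
As a sketch of that threshold check, suppose each response is scored between 0.0 and 1.0 on a quality rubric (how the scores are produced is up to you):

```ruby
# Fraction of responses meeting the target score for a given metric.
def pass_rate(scores, threshold)
  scores.count { |score| score >= threshold }.to_f / scores.size
end

base_scores  = [0.4, 0.7, 0.5, 0.6]   # pre-trained model
tuned_scores = [0.8, 0.9, 0.7, 0.85]  # fine-tuned model

puts pass_rate(base_scores, 0.7)
puts pass_rate(tuned_scores, 0.7)
```

If the tuned model’s pass rate meets the threshold you defined for each metric, it’s ready for deployment; otherwise, continue tuning.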

Utilizing Your Model in a Solution

We have fine-tuned an AI model and evaluated it so that it is ready for deployment. Now, we need to deploy the model so it can be utilized behind API endpoints. Most cloud platforms make this straightforward; the choice will come down to your preference and where you likely already have infrastructure in the cloud.

Once we have the model API deployed, we are ready to update our web application, mobile app, and/or website to have a chat interface that directly interacts with our AI model API endpoints.

Since we have thousands of users hitting our site every day, an isolated roll-out of the feature would likely be warranted so that we can ensure the AI model is effective in production before rolling it out to all of our users.

Additional metrics we likely want to measure and evaluate once our AI model is being used in production are the following:

  • Satisfaction Rate
  • Non-response Rate
  • Average Chat Time
  • Bounce Rate
  • Performance Rate
  • Self-service Rate

There will likely be additional metrics that you will want to determine as well that will be specific to your organization.

What’s Next

It’s not difficult to create a custom AI chatbot for your organization. It takes some time, preparation of datasets for fine-tuning and evaluation, and measurement of the effectiveness of the AI model before and after deployment.

Once you have the chatbot being utilized within your organization, it is important to continue to evaluate the AI model regularly to ensure it maintains a threshold for specific metrics identified by your organization.

Additionally, as new content, questions & answers, and services & offerings change within your organization, a combination of techniques may be necessary to ensure the AI model continues to provide relevant and up-to-date information to the user through chat conversations.

Next up we go more in-depth on enhancing the capabilities of an LLM for your use case. Check out the next article in this series: Techniques to Enhance the Capabilities of LLMs for your Specific Use Case.

How to Build a Discord Bot in Ruby on Rails2024-04-10T09:54:37-04:002024-04-10T09:54:37-04:00 At OmbuLabs, we recently had the opportunity to develop a Discord bot from scratch. We found the experience both rewarding and insightful, prompting us to create this tutorial to share our approach with you.

In this guide, we’ll walk you through the process of creating a Discord bot using Ruby on Rails, leveraging the efficiency and ease of development provided by this powerful framework.

Why Ruby on Rails

We opted for Ruby on Rails for this project for several reasons. Firstly, we’re big fans of the framework and appreciate its productivity and ease of use.

Additionally, we needed a solution that allowed us to build the bot quickly without compromising on functionality or maintainability, and Ruby on Rails proved to be the perfect fit.

While we ended up settling on Ruby, it is not the only language a Discord bot can be built in; other popular choices include Python and JavaScript.

Getting Started

Before we begin, it’s best to have a basic understanding of Ruby and Rails. If you’re new to either language or framework, we recommend familiarizing yourself with the fundamentals before diving into this tutorial.

With that said, let’s get started on creating your Discord bot with Ruby on Rails!

Prerequisites
  • Ruby and Rails
  • Discord Developer account

Setting up the Rails Project

  1. Create a new Rails project: rails new discord_bot_project
  2. Set up necessary gems:
    • discordrb: A Ruby library for interacting with the Discord API
    • dotenv-rails: For managing environment variables

    In your Gemfile add the following and run bundle install:

     gem 'discordrb'
     gem 'dotenv-rails'
  3. Discord bot setup
  4. Continue setting up the Rails Application
    • Create a .env file and add DISCORD_BOT_TOKEN=your_token_here.
    • Add .env to your .gitignore to keep the token secret.
  5. Create a Discord bot client in your Rails application.

There are a number of ways that the bot can be set up and run; one of the simplest would be to do something like this:

    In config/discord_bot.rb

       require 'discordrb'

       bot = Discordrb::Bot.new(token: ENV['DISCORD_BOT_TOKEN'])

       bot.message(content: 'Ombu!') do |event|
         event.respond 'Labs!'
       end

       bot.run

    For our purposes we needed a setup that was a bit more involved. We had two main problems happening.

    The first issue was that our Puma configuration was spinning up 5 threads and this was causing multiple instances of the bot to be running at the same time. This in turn would cause multiple instances of the bot to acknowledge events that were fired, leading to multiple messages being sent in response to the same event, for example.

    We also had an issue where sometimes listeners were not being turned on by the time the event happened, and therefore we occasionally had no response. We decided the simplest solution would be to turn on all the listeners when starting up the bot.

    To solve these issues we created a Discord::Bot class that took care of setting up and starting the bot. We placed this in our app/lib directory because it made autoloading easier via Zeitwerk.

    Moving our bot to a class and starting the bot with a rake task allowed us to separate our web server and our bot. This prevented our issue of having multiple instances of the bot running at the same time, and therefore the listeners only acknowledged events once as we intended.

    Using the rake task to configure and run the bot had the added functionality of allowing us to register all of our slash commands at the same time. This meant that we didn’t have to re-register commands, but we could update commands if necessary. When we were first developing the app we found that we spent a lot of time waiting for the app to start up when it was registering the slash commands every time.

  6. Testing your discord bot

    To interact with your bot, you’ll need to add it to a Discord server. Follow these steps:

    • Go to the Discord Developer Portal and select the application you previously created.
    • In the sidebar, navigate to OAuth2 -> URL Generator.
    • Choose the scopes you want to generate for your application. For testing purposes, we selected bot in the first list and Administrator in the second. However, the scopes can be decided depending on what your needs are. We went with broad scopes so we wouldn’t have issues during testing.
    • Copy the generated URL.
    • Paste the URL into your browser and authorize the bot to join a server that you have administrative access to. If needed, you can create a test server for this purpose.

Now, your bot should be successfully added to the Discord server, allowing you to test its functionality.

Future Steps

Now that you have your bot up and running you can start adding features such as:

  • Responding to specific commands
  • Managing server events
  • Interacting with external APIs

Although we won’t delve into all the details in this post, we’re currently working on additional posts that will provide further insights. However, we’ll leave you with a glimpse of how to register a slash command.

To register the command we can grab the server_id from Discord. We can then use register_application_command and pass in the correct arguments to register the command. Later we can add listeners to handle the event of when the command is called.

Here’s an example of registering a command called say_hello:

bot.register_application_command(:say_hello, "Say hello to the server", server_id: ENV.fetch("DISCORD_SERVER_ID", nil))

When you register a slash command like this, you’re essentially informing Discord about a new slash command that your bot can handle. In this case, the command say_hello will enable users to greet others within the server.
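
Once registered, the bot also needs a listener for when the command is invoked. In recent discordrb versions that is done with application_command; in the sketch below the reply text lives in a plain method so the logic can be exercised without a connected bot:

```ruby
# Builds the reply for the say_hello command; kept as a plain method so it
# is testable without a running bot.
def say_hello_reply(username)
  "Hello, #{username}!"
end

# In the bot setup (requires a live Discordrb::Bot instance):
# bot.application_command(:say_hello) do |event|
#   event.respond(content: say_hello_reply(event.user.username))
# end
```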


Creating a Discord bot with Ruby on Rails opens up a world of possibilities for community engagement and management. By integrating the discordrb gem, you can easily develop powerful bots that automate tasks, moderate discussions, and provide entertainment to your Discord server.

We loved building this project in Ruby on Rails, but of course Discord bots can be built in multiple other languages. There are very robust libraries to build Discord bots in Python and JavaScript, for example.

Looking to build a Discord bot with Ruby, Python or JavaScript? Get in touch and see how we can help!

Defining, Measuring, and Managing Technical Debt at Google2024-03-19T16:41:56-04:002024-03-19T16:41:56-04:00 In 2023, Ciera Jaspan and Collin Green, as part of the Engineering Productivity Research Team at Google, published a paper titled Defining, Measuring, and Managing Technical Debt based on five years of research on tech debt across different teams in their company.

In this article, I will describe the most interesting findings from that paper and how you can apply them at your company to define, measure, and manage technical debt.


Before the team designed their survey, they interviewed a number of subject matter experts at the company to understand the main components of technical debt as those experts perceived it:

"We took an empirical approach to understand what engineers mean when they
refer to technical debt. We started by interviewing subject matter experts
at the company, focusing our discussions to generate options for two survey
questions: one asked engineers about the underlying causes of the technical
debt they encountered, and the other asked engineers what mitigation would
be appropriate to fix this debt. We included these questions in the next
round of our quarterly engineering survey and gave engineers the option to
select multiple root causes and multiple mitigations. Most engineers selected
several options in response to each of the items. We then performed a factor
analysis to discover patterns in the responses, and we reran the survey the
next quarter with refined response options, including an “other” response
option to allow engineers to write in descriptions. We did a qualitative
analysis of the descriptions in the “other” bucket, included novel concepts
in our list, and iterated until we hit the point where <2% of the engineers
selected “other.” This provided us with a collectively exhaustive and
mutually exclusive list of 10 categories of technical debt."

As you can read, this was an iterative approach focused on narrowing the broad concept of technical debt down into distinct categories.

Technical Debt Categories

The 10 categories of technical debt that they detected were:

Migration is needed or in progress

This might be related to architectural decisions that were made in the past, which worked fine for a while, but then eventually started causing problems.

"This may be motivated by the need to scale, due to mandates, to reduce
dependencies, or to avoid deprecated technology."

You could think about this as an integration with a third party service which is no longer maintained and/or improved. The team knows that they will need to switch to a different service, but they haven’t had the time yet to execute the migration.

Documentation on project and application programming interfaces (APIs)

This might be related to documentation that is no longer up to date. When documentation is not actively maintained, read, and improved, it tends to fall out of date quickly.

"Information on how your project works is hard to find, missing or incomplete, or may include documentation on APIs or inherited code."

Every project has some sort of documentation. In the most basic format, it could be a file in the project that tells you how to properly set up the application for development purposes.

Poor test quality or coverage
"Poor test quality or coverage, such as missing tests or poor test data,
results in fragility, flaky tests, or lots of rollbacks."

Even at Google, teams are complaining about the lack of tests, the flakiness of test suites, and/or test cases that do not cover important edge cases.

This means that having a test suite is not enough. The tests have to be stable, they have to be thorough, and they have to help your team avoid regressions.

Code quality

"Product architecture or code within a project was not well designed. It may
have been rushed or a prototype/demo."

We have all been in this situation. An initial experiment/prototype/demo is successful and we tend to prioritize features/patches before we take a moment to adjust its architecture.

Improving the architecture of the product becomes something that will be done at some point down the line, but that moment never comes. It usually needs non-technical manager buy-in before it can happen.

Dead and/or abandoned code

"Code/features/projects were replaced or superseded but not removed."

Every now and then pieces of code become unreachable, which can create a false sense of complexity. Modules might seem too big and complex, but maybe only half of that code is actually getting used.

There are open source tools out there to help you remove dead code, but doing this takes time. Teams that report these issues often do not have time to stop and remove dead code before they continue shipping features and patching bugs.

Code degradation

"The code base has degraded or not kept up with changing standards over time.
The code may be in maintenance mode, in need of refactoring or updates."

This might be related to a change in one of the core dependencies of your application (e.g. React.js) which means that new code is expected to be written using functions instead of classes.

Open source moves fast. Using one library (e.g. Angular.js) or another library (React.js) will save us time when we are starting a new project. However, the team behind these libraries can decide to change the entire interface and core concepts from one major release to the next.

No matter what library or framework you choose, this will happen. The key to avoiding this problem is to quickly (or gradually) adapt your code to comply with the new way of doing things.

Team lacks necessary expertise

"This may be due to staffing gaps and turnover or inherited orphaned

Depending on the job market, key contributors to a codebase might find jobs at other companies (or on other teams within the same company), which creates a vacuum in the existing team.

If teams don’t take the necessary precautions, there may be gaps where a team is waiting for the next senior hire (while still being expected to ship features and patches to production).

Dependencies
"Dependencies are unstable, rapidly changing, or trigger rollbacks."

Once again, open source moves fast. Tools like Dependabot or Depfu can help you stay up to date, but they are only good for small releases. Upgrading major releases of a framework (e.g. Rails) can take days, weeks, or even several developer months.

Non-trivial upgrades usually get postponed for a better time. Oftentimes, this better time never comes. We have seen this firsthand with our productized services.

We have built a couple of profitable services on top of this particular issue, so we know that even the best teams struggle to keep up. It’s not because they don’t want to upgrade, it’s because other priorities get in the way.

Migration was poorly executed or abandoned

"This may have resulted in maintaining two versions."

This might happen due to a combination of the previous issues. The team started a migration project, but then an emergency forced a shift in focus, and focus never returned to the migration.

Another potential scenario is when a team expects certain promises to hold true after a migration and then realizes that won’t be the case. Rolling back the migration might sit on the back burner for months before it actually happens.

Release process

"The rollout and monitoring of production needs to be updated, migrated, or

This might be related to the way the software development lifecycle is being managed. In the past we have encountered teams that deploy to production only once a month (due to environmental factors) which causes unnecessary friction.

As much as we enjoy being an agile software development agency, every now and then we have to work with clients who are not deploying changes to production every week. This is very often the case with our clients in highly-regulated industries (e.g. finance, national security, or healthcare).

Measuring Technical Debt

Google’s Engineering Productivity Research Team explored different ways to use metrics to detect problems before they happened:

"We sought to develop metrics based on engineering log data that capture the presence of technical debt of different types, too. Our goal was then to figure out if there are any metrics we can extract from the code or development process that would indicate technical debt was forming *before* it became a significant hindrance to developer productivity."

They decided to focus on three of the 10 types of technical debt: code degradation, teams lacking expertise, and migrations being needed or in progress.

"For these three forms of technical debt, we explored 117 metrics that were proposed as indicators of one of these forms of technical debt. In our initial analysis, we used a linear regression to determine whether each metric could predict an engineer’s perceptions of technical debt."

They then put all of their candidate metrics into a random forest model to see whether the combination of metrics could forecast developers’ perceptions of tech debt.

Unfortunately their results were not positive:

"The results were disappointing, to say the least. No single metric predicted reports of technical debt from engineers; our linear regression models predicted less than 1% of the variance in survey responses."

This might be related to the way developers envision the ideal state of a system, process, architecture, and flow, and perhaps also to the difficulty of estimating how bad the situation is now and how bad it will be at the end of the quarter (when the quarterly surveys are answered).

Managing Technical Debt

As a way to help teams that struggle with technical debt, Google formed a coalition to “help engineers, managers, and leaders systematically manage and address technical debt within their teams through education, case studies, processes, artifacts, incentives, and tools.”

This coalition started efforts to improve the situation:

  1. Creating a technical debt management framework to help teams establish good practices.
  2. Creating a technical debt management maturity model and accompanying technical debt maturity assessment.
  3. Organizing classroom instruction and self-guided courses to evangelize best practices and community forums to drive continual engagement and sharing of resources.
  4. Tooling that supports the identification and management of technical debt (for example, indicators of poor test coverage, stale documentation, and deprecated dependencies).

In my opinion, the most interesting effort of this coalition is defining a maturity model around technical debt. This is similar to CMMI (a framework defined at Carnegie Mellon University) which provides a comprehensive integrated set of guidelines for developing products and services.

This defines a new way to approach the subject. Instead of relying on developers’ gut feelings and environmental factors, this maturity model has tracking at its core. This means that measurable metrics will play a key part in informing an engineering team’s decisions around technical debt.

Technical Debt Management Maturity Model

This model defines four different levels. From most basic to most advanced:

Reactive Level

"Teams with a reactive approach have no real processes for managing technical
debt (even if they do occasionally make a focused effort to eliminate it, for
example, through a “fixit”)."

In my experience, most engineering teams have the best intentions to make the right decisions, to ship good enough code, and to take on a reasonable amount of technical debt.

They understand that technical debt does not mean it is okay to ship bad code to production. They analyze the trade-offs of their decisions and they make their calls with that in mind.

Every now and then they will take some time (maybe a sprint or two) to pay off technical debt. When doing this, they usually address issues that they are familiar with because they’ve been hindered by those issues.

Non-technical leaders usually don’t understand the significance of taking on too much technical debt. They start to care once problems start popping up as a consequence. It might take a production outage, a security vulnerability, or extremely low development velocity to get them to react.

Proactive Level

"Teams with a proactive approach deliberately identify and track technical debt and make decisions about its urgency and importance relative to other work."

These teams understand that “if you can’t measure it, you can’t improve it.” So they have been actively identifying technical debt issues. They might have metrics related to the application, the development workflow, the release phase, and/or the churn vs. complexity in their application.

They understand that some of the metrics they’ve been tracking show potential issues moving forward. They might notice that their code coverage percentage has been steadily declining which could signal a slippage in their testing best practices.

They care about certain metrics that might help them improve their development workflow and they know that they need to first inventory their tech debt before taking action. They know that addressing some of these issues might improve their DORA metrics.

Strategic Level

"Teams with a strategic approach have a proactive approach to managing technical debt (as in the preceding level) but go further: designating specific champions to improve planning and decision making around technical debt and to identify and address root causes."

These teams have an inventory of technical debt issues. They build on top of the previous level. For example: They proactively address flaky tests in their test suite.

They might assign one person to one of the issues that they detected. They likely know how to prioritize the list of technical debt issues and focus on the most pressing ones.

Structural Level

"Teams with a structural approach are strategic (as in the preceding level) and also take steps to optimize technical debt management locally—embedding technical debt considerations into the developer workflow—and standardize how it is handled across a larger organization."

Improving the situation is a team effort. Non-technical managers treat tech debt remediation like any other task in the sprint. They likely reserve a few hours of every sprint for paying off technical debt.


After reading this paper, I wish the research team had shared more about the different maturity levels. I believe the software engineering community could greatly benefit from a “Technical Debt Management Maturity Model.”

It’s proof that while technical debt metrics may not be perfect indicators, they can allow teams who already believe they have a problem to track their progress toward fixing it.

The goal is not to have zero technical debt. It has never been the goal. The real goal is to understand the trade-offs, to identify what is and what is not debt, and to actively manage it to keep it at levels that allow your team to not be hindered by it.

Need help assessing the technical debt in your application? Need to figure out how mature you are when it comes to managing technical debt? We would love to help! Send us a message and let’s see how we can help!

Pecas: Machine Learning Problem Shaping and Algorithm Selection (published 2024-01-22)

In our previous article, Machine Learning Aided Time Tracking Review: A Business Case, we introduced the business case behind Pecas, an internal tool designed to help us analyse and classify time tracking entries as valid or invalid.

This series will walk through the process of shaping the original problem as a machine learning problem and building the Pecas machine learning model and the Slackbot that makes its connection with Slack.

In this first article, we’ll talk through shaping the problem as a machine learning problem and gathering the data available to analyse and process.


This series will consist of 6 posts focusing on the development of the Pecas machine learning model:

  1. Machine Learning Problem Shaping and Algorithm Selection <- You are here
  2. Data Preparation - Data Cleaning, Feature Engineering and Pre-processing
  3. Model Selection and Training - Training a Random Forest classifier
  4. Model Selection and Training - Training a Gradient Boosting classifier
  5. Model Evaluation - Cross-Validation and Fine-Tuning
  6. Model Deployment and Integration

Recap of the Business Problem

Before we dive into the machine learning aspect of the problem, let’s briefly recap the business problem that led to the solution being built.

OmbuLabs is a software development agency providing specialized services to a variety of different customers. Accurate time tracking is an important aspect of our business model and a vital part of our work. Still, we have faced several time tracking issues over the years related to the accuracy, quality, and timeliness of entries.

This came to a head at the end of 2022, when a report indicated we lost approximately one million dollars largely due to poor time tracking, which affected our invoicing and decision-making negatively. Up to this point, several different approaches had been taken to try to solve the problems, mostly related to different time tracking policies. All of these approaches ended up having significant flaws or negative side effects that led to policies being rolled back. This time, we decided to try to solve the problem differently.

There were a variety of time tracking issues, including time left unlogged, time logged to the wrong project, billable time logged as unbillable, incorrect time allocation, vague entries, among others. Measures put in place to try to mitigate the quality-related issues also led to extensive and time-consuming manual review processes, which were quite costly.

In other words, we needed to:

  • Ensure the timeliness and quality of time entries;
  • Do it with a process that wasn’t quite as costly (and therefore not scalable) as the existing manual process;
  • Do it in a way that was fair to the team, and effective.

Our main idea was to replace (or largely replace) the manual process with an automated one. However, although the process was very repetitive, the complexity of the task (interpreting text) meant we needed a tool powerful enough to deal with that kind of data. Hence the idea to use machine learning to automate the time entry review process.

It is worth noting that machine learning powers one aspect of the solution: evaluating the quality and correctness of time entries. Other aspects such as timeliness of entries and completeness of the tracking for a given day or week are very easily solvable without a machine learning approach. Pecas is a combination of both, so it can be as effective as possible in solving the business problem as a whole.

Shaping the Machine Learning problem

The first thing we need to do is identify what part of the problem will be solved with the help of machine learning and how to properly frame that as a machine learning problem.

The component of the problem that is suitable for machine learning is the one that involves “checking” time entries for quality and accuracy, that is, the one that involves “interpreting” text. Ultimately, the goal is to understand if an entry meets the required standards or not and, if not, notify the team member who logged it to correct it.

Therefore, we have a classification problem in our hands. But what type of classification problem?

Our goal is to be able to classify entries according to pre-defined criteria. There are, in essence, two clear ways we can approach the classification:

  • Final classification of the entry as valid or invalid
  • Intermediate classification of the entry as belonging to a pre-defined category which is then checked against pre-defined criteria so validity or invalidity can be determined

Which one we want depends on a few different factors, perhaps the most important one being the existence of a finite, known number of ways in which an entry can be invalid.

If there is a finite, known number of classes an entry can belong to and a known number of ways in which each entry can be invalid, the machine learning model can be used to classify the entry as belonging to a specific category and that entry can then be checked against the specific criteria to determine validity or invalidity.

However, we don’t have that.

Time entries can belong to a wide range of categories based on a mix of specific keywords in the description, the project they’re logged to, tags applied to the entry, the user who logged it, the day the entry was logged, among many others. Too many. Therefore, intermediate classification might not be the best approach. Instead, we can use the entry’s characteristics to teach the model to identify entries that seem invalid, and let it determine validity or invalidity of the entry directly.

Thus we have in our hands a binary classification problem, whose objective is to classify time entries as valid or invalid.

Data Extraction and Initial Analysis

Now we know what kind of problem we have in our hands, but there are a wide variety of different algorithms that can help solve this problem. The decision of which one to use is best informed by the data itself. So let’s take a look at that.

The first thing we need is, of course, the time tracking data. We use Noko for time tracking, and it offers a friendly API for us to work with.

A Noko time entry as inputted by a user has a few different characteristics:

  • A duration in minutes;
  • A date the time tracked by the entry refers to;
  • A project it is logged to;
  • A description of the work performed;
  • Tags that can be associated with it;
  • A user who logged the entry.

There is also one relative characteristic of a time entry that is very important: whether it is billable or unbillable. This is controlled by one of two entities: project or tag. Projects can be billable or unbillable. By default, all entries logged to an unbillable project are unbillable and all entries logged to a billable project are billable. However, entries logged to a billable project can be unbillable when a specific tag (the #unbillable tag) is added to the entry.
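
The billability rule described above is simple enough to express directly. Here is a minimal sketch in Python; the Entry shape and the is_billable helper are illustrative names for this post, not Noko’s actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    minutes: int
    project_billable: bool  # billable flag inherited from the project
    tags: list = field(default_factory=list)

def is_billable(entry: Entry) -> bool:
    # Entries logged to unbillable projects are always unbillable.
    if not entry.project_billable:
        return False
    # On billable projects, the #unbillable tag overrides the default.
    return "#unbillable" not in entry.tags
```

This is exactly the kind of relative characteristic we want the model to have access to, since it cannot be read off any single field of the entry.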

There is also some metadata and information that comes from user interaction with the system that can be associated with the entry, the most relevant ones being:

  • A unique ID that identifies the entry in the system;
  • The date the entry was created in the system (can be different from the date to which the entry refers);
  • Whether the entry has been invoiced yet or not;
  • Whether the entry has been approved by an admin user or not.

Of the entities associated with an entry, one is of particular interest: projects. As mentioned above, projects can indicate whether an entry is billable or unbillable. And, as you can imagine, billable work mistakenly logged to an unbillable project goes uninvoiced, and we lose money in the process.

A project also has a unique ID that identifies it, a name and a flag that indicates whether it is a billable or unbillable project. The flag and the ID are what matters to us for the classification, the ID because it allows us to link the project to the entry and the flag because it is the project characteristic we want to associate with the data.

There are other data sources that have relevant data that can be used to gain context on time entries, for example calendars, GitHub pull requests, Jira tickets. For now, let’s keep it simple, and use a dataset of time entries enriched with project data, all coming from Noko.

Initial Exploration

In order to make it easier to work and explore the data, we extracted all time entries from Noko logged between January 1st, 2022 and June 30th, 2023. In addition to entries, projects, tags and users were also extracted from Noko, and the data was loaded into a Postgres database, making it easy to explore with SQL.
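
As a sketch of that extraction step, this is roughly how entries can be paged out of Noko’s v2 API and flattened into rows for Postgres. The endpoint, header, and field names are assumptions based on Noko’s documented entry shape, so treat this as illustrative rather than production code:

```python
import json
import urllib.parse
import urllib.request

NOKO_ENTRIES_URL = "https://api.nokotime.com/v2/entries"

def fetch_entries(token, from_date, to_date):
    """Page through Noko's entries endpoint for a date range."""
    entries, page = [], 1
    while True:
        params = urllib.parse.urlencode(
            {"from": from_date, "to": to_date, "page": page, "per_page": 100}
        )
        req = urllib.request.Request(
            f"{NOKO_ENTRIES_URL}?{params}", headers={"X-NokoToken": token}
        )
        with urllib.request.urlopen(req) as resp:
            batch = json.load(resp)
        if not batch:  # empty page means we've read everything
            break
        entries.extend(batch)
        page += 1
    return entries

def entry_to_row(entry):
    """Flatten one entry JSON object into a tuple for a Postgres INSERT."""
    return (
        entry["id"],
        entry["date"],
        entry["minutes"],
        entry["description"],
        entry["project"]["id"],
        entry["user"]["id"],
        [tag["name"] for tag in entry.get("tags", [])],
        entry["billable"],
    )
```

Loading the flattened rows into Postgres then makes the whole set queryable with plain SQL, which is what enabled the exploration below.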

Pecas ML ERD

We then extracted a few key characteristics from the set:

property stat
total_entries 49451
min_value 0
max_value 720
duration_q1 30
duration_q3 90
average_duration_iq 49.39
average_duration_overall 71.33
median_duration 45
max_word_count 162
min_word_count 1
avg_word_count 9.89
word_count_q1 4
word_count_q3 11
entries_in_word_count_iq 29615
average_word_count_iq 6.63
least_used_tag: ops-client 1
most_used_tag: calls 12043
unbillable_entries 33987
billable_entries 15464
pct_unbillable_entries 68.73
pct_billable_entries 31.27
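
Stats like these are straightforward to compute once the entries are extracted. A sketch using only the standard library (summarize is an illustrative helper written for this post, not part of Pecas):

```python
import statistics

def summarize(durations, word_counts):
    """Descriptive stats of the kind shown in the table above."""
    d_q1, d_median, d_q3 = statistics.quantiles(durations, n=4)
    # Entries whose duration falls inside the interquartile range
    in_iq = [d for d in durations if d_q1 <= d <= d_q3]
    return {
        "total_entries": len(durations),
        "min_value": min(durations),
        "max_value": max(durations),
        "duration_q1": d_q1,
        "duration_q3": d_q3,
        "median_duration": d_median,
        "average_duration_overall": round(statistics.mean(durations), 2),
        "average_duration_iq": round(statistics.mean(in_iq), 2),
        "avg_word_count": round(statistics.mean(word_counts), 2),
    }
```

Comparing the overall average against the interquartile average is a quick way to see how much the outliers (like that 720-minute entry) skew the mean.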

Data Interpretation

The table above allows us to get a good initial insight into the data and derive a few early conclusions:

  • Data Size: For the problem at hand, our dataset is fairly large (over 49,000 entries), providing a substantial amount of data for analysis and training;
  • Entry Duration: Duration values suggest a pretty wide range of time logged to an individual entry (between 0 and 720 minutes). Most entries tend towards short durations though (median of 45 minutes).
  • Word Count: There’s significant variation in word count per entry (between 1 and 162), although it tends to be on the lower side (average of 9.89), indicating longer entries are the outlier.
  • Tag Distribution: The usage count of the least used tag indicates there are outliers in the data (tags used only once).
  • Billable vs Unbillable Entries: The majority of entries are unbillable. The table does not include enough information to let us properly infer why.

This initial set of considerations already tells us something about our data. We have a fairly large dataset, with a mix of numerical and categorical variables. There are also outliers in several features, and the range of values in duration and word count could indicate their relationship with validity is not strictly linear. Our empirical knowledge confirms this assumption: although entries with longer durations are generally expected to have longer descriptions, there are several legitimate cases where long entries have small word counts.

Other characteristics we looked at in similar fashion include:

  • Number of entries per project;
  • Most frequent duration per project;
  • Number of entries per tag;
  • Most frequent duration per tag;
  • Occurrence of tags per project.

This gave us a good initial idea of what we were dealing with.

Algorithm Selection

By this point, we know we’re dealing with a binary classification problem and that we have a fairly large dataset with outliers and non-linear relationships in data. The dataset also has a mix of numerical and categorical variables. The problem we have at hand is made more complex by the presence of text data that requires interpretation.

There are a number of algorithms to choose from for binary classification, perhaps the most common being:

  • Logistic Regression;
  • K-Nearest Neighbors;
  • Naive Bayes;
  • Support Vector Machines;
  • Tree-based models (such as Random Forest or Gradient Boosting).

A quick comparison of their strengths and weaknesses shows that tree-based models are most likely the right choice for our use case:

Logistic Regression

Logistic regression’s strengths lie in its simplicity:

  • Coefficients of the model can be quite easily interpreted;
  • Relatively low computational requirements, making it efficient.

However, some of its weaknesses make it clearly not a good candidate for our use case:

  • Assumes a linear relationship between features and the log odds (logarithm, usually natural logarithm, of the odds of an event occurring) of the outcome;
  • Might not perform well with complex, non-linear relationships in the data.
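
That linearity assumption can be written out explicitly. Logistic regression models the log odds of the outcome as a linear combination of the features:

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n
```

where p would be the probability that an entry is invalid and x_1, ..., x_n its features. In our data, validity does not vary linearly with features like duration or word count, so this assumption breaks down.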

K-Nearest Neighbors

Another example of a simple algorithm, with strengths associated with its simplicity:

  • It’s quite easy to understand and to implement;
  • Works really well with small datasets.

However, some of its weaknesses also make it immediately not a good choice for our problem:

  • Performs poorly on large datasets;
  • The impact of irrelevant features or noise in the data is very significant.

Naive Bayes

Naive Bayes’ core strengths are:

  • Fast training times, making it a good choice for baseline models;
  • Performs quite well with smaller datasets and limited resources.

However, two key weaknesses make it yet another unsuitable choice for our use case:

  • It assumes feature independence, which is not the case for us (and is quite rarely the case with any real-world data);
  • Performs poorly with complex, non-linear relationships between features.

Support Vector Machine (SVM)

Unlike the previous algorithms, two of SVM’s core strengths apply to our use case:

  • It is effective in high-dimensional spaces;
  • Allows flexibility in modeling various types of relationships.

However, two core weaknesses make it second to tree-based models as a choice:

  • Training time on large datasets can be quite long;
  • Requires careful tuning of hyperparameters.

Tree-based Models

We have arrived at the most suitable type of algorithm for our problem at hand! The core strengths of these algorithms that make them a good choice are:

  • Good with mixed data types (numerical vs categorical variables);
  • Can effectively capture complex, non-linear relationships in data;
  • Tolerant to outliers and able to handle missing values in the set.

Some weaknesses related to them are:

  • They can be computationally intensive, especially when using gradient boosting with a large dataset;
  • Have a potential to overfit (learn too much about the training data and lose the ability to generalize), requiring careful tuning to avoid this.

Therefore, we’ll pick ensemble tree-based models as our starting point.
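
As a taste of what that looks like, here is a minimal Random Forest sketch on toy data shaped like ours. This is illustrative only, not the actual Pecas model: the features and the stand-in "invalid" rule are made up, and it assumes scikit-learn is available:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 500

# Toy features: [duration_minutes, word_count, project_is_billable]
X = np.column_stack([
    rng.integers(0, 721, n),   # duration in minutes
    rng.integers(1, 40, n),    # words in the description
    rng.integers(0, 2, n),     # billable project flag
])
# Toy rule standing in for "invalid": very short descriptions
y = (X[:, 1] < 4).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(round(clf.score(X, y), 2))
```

Note that the score here is measured on the training data, which is exactly the overfitting trap mentioned above; what actually matters is performance on held-out data.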

But which one? That’s a tale for the next post. We’ll do some more analysis of our data, pre-process it, and train a few different models to pick the best one.


Framing your business problem (or business question) as a machine learning problem is a first and necessary step in understanding what kind of problem you’re dealing with and where to start solving it. It helps guide our data exploration and allows us to choose which machine learning algorithm (or family of algorithms) to start with.

A good understanding of the data available to you, the business context around the problem, and the characteristics that matter can help guide your exploration of the dataset and validate some initial questions, such as whether we have enough data and whether the available data conveys the information we need. It’s important not to be tied to these initial assumptions during your analysis though, as exploring the data might reveal additional, useful insights.

With a good understanding of the problem and dataset, you can make an informed algorithm selection, and start processing your data and engineering your features so your model can be trained. This second step is what we’ll look at in the next post.

Need help leveraging machine learning to solve complex business problems? Send us a message and let’s see how we can help!

Machine Learning Aided Time Tracking Review: A Business Case (published 2023-12-12)

As an agency, our business model revolves around time. Our client activities rely on a dedicated number of hours per week worked on a project, and our internal activities follow the same pattern. As such, time tracking is a vital part of our work. Ensuring time is tracked correctly, and that time entries meet a minimum quality standard, allows us to be more data-driven in our decisions, provide detailed invoices to our clients, and better manage our own projects and initiatives.

Despite being a core activity, we had been having several issues with it not being completed or not being completed properly. A report we ran at the end of 2022 showed our time tracking issues were actually quite severe. We lost approximately one million dollars in 2022 due to time tracking issues that led to decisions made on poor data. It was imperative that we solved the problem.

To help with this issue, we created an evolution of our Pecas project. We turned Pecas into a machine learning powered application capable of alerting users of issues in their time entries. In this article, we’ll talk through the business case behind it and the expected benefits to our company.

The Time Tracking Problem

Our time tracking issues pre-dated the 2022 end of year report. By that point, we had been having problems for a couple of years; it just wasn’t a big priority. As the company grew, however, the issues multiplied, and got to a point where we needed to prioritize solving the problem.

A detailed analysis of our time tracking data revealed several different issues, both issues that were typically caught by internal processes relying on this information, such as invoicing, and issues that typically remained hidden:

  • Billable entries logged to non-billable projects;
  • Time that “disappeared” (not logged to a client or internal project) in a given week;
  • Potentially billable time that went to internal, non-billable projects instead (time that had been “sold” but wasn’t used as such);
  • Vague, non-descriptive entries that harm internal decision making and generate back and forth on invoicing;
  • Incorrect entries making it seem like specific activities took much longer (or much less time) than they actually did;
  • Delays in our invoicing process caused by entries needing to be fixed;
  • More time invested in manual reviews and correcting entries that should be invested elsewhere.

These were some of the main issues we were facing, and as a small company, their impact was even more significant to our projects and our operation overall. We knew it was a problem, and we attempted a few different solutions, including implementing policies around time tracking. They ended up having serious flaws that caused us to reconsider and eventually retract them. But we still had a problem to solve.

At the end of 2022, when we looked at our numbers for the year, we decided to dive deeper into this data. And the cost of the issues mentioned above became very clear: we lost $1,000,000 due to these issues and their consequences. What this meant is that we had a million-dollar problem to solve.

Time tracking issues (timing and quality of entries) are one aspect of a complex problem. Improving time tracking quality was one of the problems we had to solve, and one of significant impact. There were, however, multiple root causes that led to the loss we identified (process problems, service management, communication). Those are being addressed separately and are beyond the scope of this article.

The Solution to Time Tracking Quality

Our main issue was that specific time tracking policies we had implemented didn’t account for nuance. Although delays in time entries being entered into the system and entries logged to incorrect projects decreased, addressing some of the most costly problems we had, honest mistakes were treated in the same way as more serious issues, and the policy was found to be unfair in some cases.

This went against our core values and led us to look for a different solution. The main issue was that there was no way to be alerted of honest mistakes in entries before the information was needed, someone reviewed and found the issue manually, or we ran another comprehensive report.

Manual processes for these kinds of tasks are not great. They are expensive and take people away from other activities. We wanted an automated way to monitor and flag entries. We knew from the beginning there was always going to be a human component to it, but if we could reduce the time we spent every week running reports and reviewing and fixing entries, that was already a win.

That’s when we decided to build an internal tool to help with this. Our goal was to reduce the time invested on time tracking by our operations team by automating the bulk of the work to find these issues, leaving human reviews to a much smaller set of entries.

This solution would need to be able to:

  • Evaluate each time entry;
  • Decide whether it is valid or invalid;
  • Notify the user of issues found.

The complexity lay in the fact that we were dealing with free-form text data (the entry description) combined with several other properties (project, labels, date, billable or non-billable, duration). Accounting for all possible scenarios and issues with hard rules would not work. That’s where machine learning comes into play.

We split the entry classification part of the solution in two:

  • Hard-rules checking common entry types for commonly seen issues;
  • Binary classification model for the more complex ones.

No solution is perfect, and we knew there were going to be issues that still slipped through the cracks, as well as a need for human review. Our goal was to minimize both.
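To make the two-stage split concrete, here is a minimal sketch of how hard rules for common issues can be combined with a binary classifier fallback for the complex cases. The specific rules, field names, and the `model.predict` interface are illustrative assumptions, not the actual Pecas implementation:

```python
# Minimal sketch of a two-stage time entry check: hard rules first,
# then an optional binary classifier for entries the rules can't decide.

def check_entry(entry, model=None):
    """Return (is_valid, reason) for a time entry dict."""
    # Stage 1: hard rules for commonly seen issues.
    if len(entry["description"].strip()) < 10:
        return False, "description too vague"
    if entry["billable"] and not entry["project_billable"]:
        return False, "billable entry on a non-billable project"
    if not 0 < entry["duration_hours"] <= 16:
        return False, "implausible duration"
    # Stage 2: binary classifier for the more complex cases
    # (hypothetical interface: predict() returns 1 for valid entries).
    if model is not None and model.predict(entry) != 1:
        return False, "flagged by classifier"
    return True, "ok"

entry = {
    "description": "Fixed flaky CI test in billing spec",
    "billable": True,
    "project_billable": True,
    "duration_hours": 2.5,
}
print(check_entry(entry))  # → (True, 'ok')
```

Only the entries that pass the cheap rules ever reach the model, which keeps inference cost down and keeps the obvious failure reasons human-readable.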

That’s how the Pecas project was born.

The Business Case

When we decided to build the solution, we were spending between 3 and 5 hours every week on time tracking reports. That meant spending between $30,000 and $50,000 every year just on these reports. As the company grew and we had more people joining the team, the time spent on this was also going to increase significantly.

In summary, we had one million dollars in losses in 2022 alone, and were looking at a current cost of $30,000 to $50,000 per year to run the process manually, increasing every time our team grew. We had a pretty solid case to invest in a solution.

Additional factors that contributed to our decision to go ahead with the project were:

  • Good opportunity for experimentation: Pecas is our first tool built with machine learning integration;
  • Learning opportunities:
    • It provided a good learning opportunity for team members looking to work with machine learning;
    • It gave our more junior engineers an opportunity to work on building a product from the ground up;
    • It gave interested team members an opportunity to work with different technologies, namely Python and Django.
  • Knowledge share: it gave our senior team members an opportunity to knowledge share and mentor more junior team members.

In order to properly evaluate whether building a solution was the right move, we also had to consider implementation and maintenance costs. We had the expertise needed within our team, so we didn’t really need to bring in external help to accomplish what we needed; and even with the added complexity of a machine learning model, we were looking at a small application. To put it into perspective:

Assuming we did nothing, we would continue to incur significant losses year after year. From our data analysis and root cause evaluation, we believe the solution could help reduce the loss by approximately 60%, saving us $600,000. Similarly, the solution can reduce the time spent reviewing time entry reports by 80%, meaning our costs would reduce to $6,000 to $10,000 per year, saving us between $24,000 and $40,000 every year, not accounting for potential growth.
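The savings arithmetic above checks out with a few lines (using the same assumed percentages: a 60% reduction of the $1M loss and an 80% reduction in review time):

```python
# Sanity-check of the estimates quoted above, using integer percentages
# to keep the arithmetic exact.
loss_2022 = 1_000_000
loss_reduction_pct = 60
manual_cost_low, manual_cost_high = 30_000, 50_000
time_reduction_pct = 80

loss_savings = loss_2022 * loss_reduction_pct // 100
remaining_low = manual_cost_low * (100 - time_reduction_pct) // 100
remaining_high = manual_cost_high * (100 - time_reduction_pct) // 100

print(loss_savings)                      # → 600000
print(remaining_low, remaining_high)     # → 6000 10000
print(manual_cost_low - remaining_low,
      manual_cost_high - remaining_high) # → 24000 40000
```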

Building the solution would cost approximately 50% of the total we expected to save, and maintaining it, once built, certainly wouldn’t cost as much as we were losing. Pretty good case to build it!

Add to that the knowledge and learning gains, and preserving our culture of team first, and the decision was easy.

Current State

The Pecas app’s first version went live in March and, at that time, supported only hard-rule classification. That allowed us to measure user interaction with it and see how (or if) things would improve. It also got us thinking about new ways to leverage the app.

A version of the app with machine learning integrated went live in August, and we have been monitoring it and collecting data. The number of common issues in entries identified has decreased significantly, and timeliness of the logging has greatly improved.

We have found additional use cases for the bot, and created additional alerts for project teams, project managers and our overall Operations team. This has allowed us to identify issues faster and react to them immediately, saving us time, money, and headache in the long run.

We’re still monitoring data and working through results, but a preliminary analysis shows that the number of billable time entries logged to non-billable projects in Q3 2023 was 95% smaller than in the same period of 2022, so we’re calling this a win for now as we continue to expand the machine learning and other functionalities.


Machine learning isn’t a magic bullet to all of our problems. In fact, in many cases, it isn’t quite the right solution, and you can go very far with hard rules. There are situations, however, where it is the ideal solution. In those cases, it is a powerful tool to solve very complex problems.

As previously mentioned, an automated tool to aid time tracking quality wasn’t the only solution to this problem. Changes in process were also required, and each case was examined, separately and in conjunction with others, and addressed. But it was a core piece in the strategy, and the results are positive and quite promising.

We specialize in solving complex problems for companies looking to build interesting tools that provide meaningful results. We take a holistic look at the problem, advise on all aspects of the problem, and can help you improve your processes and build the right tool for the right problem.

Got some difficult problems you’d like to solve with software but not quite sure where to start? Unsure if machine learning is the right solution to your problem? Send us a message.

Hacktoberfest 2023: How We Merged Open Source Contributions with Learning Objectives (2023-11-21)

As a company, one of our core values is to be “open by default.” At the same time, one of our goals is to use our open source investment time as a way to improve our skills as engineers and designers.

In that spirit, this year we decided to organize our open source contribution time in a way that wasn’t limited to our own open source projects. This is a short post to explain how we aligned our open source contributions with our learning goals, what contributions we made, and why it mattered.

Our Motivation

Last year, as a company, we did an exercise in participating in Hacktoberfest with our team. There were positive and negative notes but, overall, feedback around the exercise itself was positive.

This year we had specific goals and topics we wanted to focus on as a team. We decided to use open source projects as a way to learn and practice while also contributing to the community.

Therefore, this year we aligned our open source contributions with our learning purposes. As part of our company processes, we conduct monthly one-on-one calls with our full-time employees. In those calls, we learn about areas and skills that our direct reports would like to improve.

The problem is that sometimes client work doesn’t give us the opportunities we need to work on said skills.

That’s why we decided to use the month of October to contribute to open source projects with the following intentions:

  • For senior engineers: We wanted them to improve their upgrading and debugging skills so that they could get better at fixing medium- to high-complexity bugs.

  • For mid-level engineers: We wanted them to work on features so that they could improve their skills when it came to greenfield-like projects.

Hacktoberfest Plan

This year we decided not to restrict contributions to repositories that were officially participating in Hacktoberfest.

We asked everyone to suggest repositories before we started and we quickly came up with a list of approved projects.

Senior engineers were asked to work on two kinds of issues: technical debt and bugs.

Mid-level engineers were asked to work on any kind of issue they found interesting, with a focus on new features or feature changes.

To organize that:

  • We divided everyone into pairs.
  • Each pair was asked to select issues from the list of approved projects.

Our Teams

This time we decided to split into teams:

Our Own Open Source Projects

When it came to our own projects, we decided to have only Ariel and Ernesto’s team work on open source projects maintained by OmbuLabs.

We focused on these projects:

Approved Projects

We wanted to make sure that our teams focused on projects that were approved by our engineering management team. The list included some well-known and really useful tools that we’ve been using for years:


In terms of contributions, we considered activity on pull requests and issues as a valid contribution. We understand that sometimes you are looking to add value to an open source project, and after hours of research and trying many different things, all you can add is a comment to an existing issue. In our exercise, and in general, that counts as a contribution too!


Here are all the issues where we added value:

Pull requests

Here are all the pull requests we submitted:

Investment Time

In total during the month of October we invested 392 hours in our open source contributions. That represents an investment of $79,000 into open source by 10 of our senior and mid-level engineers.

Next Steps

We plan to take all of our contributions across the finish line, using our regular, monthly and paid open source investment time. Outside of Hacktoberfest, on average, as a team we invest 38 hours per month on open source contributions.

We look forward to continuing our investment in the open source projects that add value to the world and our communities. We believe this is the way to hone our craft, learn new things faster, and become better professionals.

Running Airflow on Google Kubernetes Engine without Helm (2023-11-17)

Google Cloud Platform (GCP) can be a very good option for Airflow and, although it offers its own managed deployment of Airflow, Cloud Composer, managing our own deployment gives us more granular control over the underlying infrastructure, impacting choices such as what Python version to run and even when to upgrade Airflow itself.

The Airflow community maintains a Helm chart for Airflow deployment on a Kubernetes cluster. The Helm chart comes with a lot of resources, as it contains a full Airflow deployment with all the capabilities. We didn’t need all of that, and we wanted granular control over the infrastructure. Therefore, we chose not to use Helm, although it provides a very good starting point for the configuration.


The Airflow installation consists of five different components that interact with each other, as illustrated below:

Airflow Components (Source: Official Airflow Documentation)

In order to configure our Airflow deployment on GCP, we used a few different services:

  • Google Kubernetes Engine (GKE) for the cluster hosting the scheduler and webserver deployments
  • Postgres instance in CloudSQL for the metadata database
  • Git-sync to store DAG files in an ephemeral volume, syncing directly from GitHub

NOTE: The steps below assume you have both the Google Cloud SDK and kubectl installed, and a GCP project set up.

CloudSQL and Cluster Set Up

Before deploying Airflow, we need to configure a CloudSQL instance for the metadata database and the GKE cluster that will host the Airflow deployment. We opted to use a Virtual Private Cloud (VPC) to allow the connection between GKE and CloudSQL.

Setting up a CloudSQL instance

To create a CloudSQL instance for the Airflow database:

gcloud sql instances create airflow_metadb \
    --database-version=POSTGRES_15 \
    --tier=db-n1-standard-2 \
    --region=us-east1 \
    --network=<your-vpc-network>

Customize the database version, tier, region, and network to your needs. If you don’t plan on using a VPC, you don’t need the network argument. Check out the gcloud sql instances create documentation for a full list of what’s available.

Connect to the newly created instance to create a database to serve as the Airflow metadata database. Here, we’ll create a database called airflow_meta:

gcloud beta sql connect airflow_metadb

This will open a Postgres shell, where you can create the database.

CREATE DATABASE airflow_meta;

Finally, get the instance’s IP address and port to construct the database connection URL, which will be needed for the Airflow set up. You’ll need the IP address listed as PRIVATE:

gcloud sql instances describe airflow_metadb

Your connection URL should follow the format:

postgresql+psycopg2://<user>:<password>@<private-ip>:<port>/<database>

for a Postgres instance.
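As an illustration, the SQLAlchemy-style connection string Airflow expects can be assembled like this (all credentials and the IP below are placeholder values, not real ones):

```python
# Placeholder values for illustration only; substitute your instance's
# private IP and the credentials you configured.
user, password = "airflow", "example-password"
host, port, database = "10.0.0.5", 5432, "airflow_meta"

conn_url = f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"
print(conn_url)
# → postgresql+psycopg2://airflow:example-password@10.0.0.5:5432/airflow_meta
```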

Setting up a GKE Cluster

Before initializing a new Kubernetes cluster on GKE, make sure you have the right project set in the gcloud CLI:

gcloud config set project airflow

Create a new cluster on GKE:

gcloud container clusters create airflow-cluster \
    --machine-type e2-standard-2 \
    --num-nodes 1 \
    --region "us-east1" \
    --scopes "cloud-platform"

Choose the correct machine type for your needs. If your cluster ends up requesting more resources than you need, you’ll end up overpaying for Airflow. Conversely, if you have fewer resources than required, you will run into issues such as memory pressure. Also choose the number of nodes to start and the region according to your needs. The --scopes argument set to cloud-platform allows the GKE cluster to communicate with other GCP resources. If that is not needed or desired, remove it.

For a full list of the options available, check the gcloud container clusters create documentation.

Authenticate kubectl against your newly created cluster:

gcloud container clusters get-credentials airflow-cluster --region "us-east1"

and create a Kubernetes namespace for the Airflow deployment. Although not necessary, this is a good practice, and it’ll allow for the grouping and isolating of resources, enabling, for example, separation of a production and staging deployment within the same cluster.

kubectl create namespace airflow

The cluster should now be set up and ready.

Cluster Preparation

Our goal was to have Airflow deployed to a GKE cluster and the Airflow UI exposed via a friendly subdomain. In order to do that, we need to obtain and use a certificate.

To make the process of obtaining, renewing, and using certificates as easy as possible, we decided to use cert-manager, a native Kubernetes certificate management controller. For that to work, though, we need to ensure that traffic is routed to the correct service, so requests made to the cert-manager solver to confirm domain ownership reach the right service, and requests made to access the Airflow UI also reach the right service.

In order to do that, an nginx ingress controller was needed.

NGINX Ingress Controller Configuration

Unlike an Ingress, an Ingress Controller is an application running inside the cluster that configures a load balancer according to multiple ingress resources. The NGINX ingress controller is deployed in a pod along with such a load balancer.

To help keep the ingress controller resources separate from the rest, let’s create a namespace for it:

kubectl create namespace ingress-nginx

The easiest way to deploy the ingress controller to the cluster is through the official Helm Chart. Make sure you have helm installed, then add the nginx Helm repository and update your local Helm chart repository cache:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

Install the ingress-nginx Helm chart in the cluster:

helm install nginx-ingress ingress-nginx/ingress-nginx -n ingress-nginx

where nginx-ingress is the name we’re assigning to the instance of the Helm chart we’re deploying, ingress-nginx/ingress-nginx is the chart to be installed (the ingress-nginx chart in the ingress-nginx Helm repository) and -n ingress-nginx specifies the namespace within the Kubernetes cluster in which to install the chart.

With the controller installed, run:

kubectl get services -n ingress-nginx

and look for the EXTERNAL IP of the ingress-nginx-controller service. That is the IP address of the load balancer. To expose the Airflow UI via a subdomain, we configured an A record pointing to this IP address.

Cert-manager Configuration

Now that the controller is in place, we can proceed with the installation of the cert-manager. First, apply the CRD (CustomResourceDefinition) resources:

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.crds.yaml

The cert-manager relies on its own custom resource types to work; this step ensures those resources are installed in the cluster.

Like with the controller, we’ll also create a separate namespace for the cert-manager resources:

kubectl create namespace cert-manager

And install cert-manager using the Helm chart maintained by Jetstack:

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager --namespace cert-manager --version v1.13.0

With cert-manager installed, we now need two additional resources to configure it: a ClusterIssuer and Certificate.

The ClusterIssuer creates a resource to represent a certificate issuer within Kubernetes, i.e., it defines a Kubernetes resource to tell cert-manager who the certificate issuing entity is and how to connect to it. You can create a simple ClusterIssuer for Let’s Encrypt as follows:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt
spec:
  acme:
    # Let's Encrypt production endpoint
    server: https://acme-v02.api.letsencrypt.org/directory
    email: <your-email>  # placeholder
    privateKeySecretRef:
      name: letsencrypt
    solvers:
    - http01:
        ingress:
          class: nginx

The Certificate resource then defines the certificate to issue:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: airflow-certificate
  namespace: airflow
spec:
  secretName: cert-tls-secret
  dnsNames:
    - <your-subdomain>  # placeholder for the subdomain the UI is exposed on
  issuerRef:
    name: letsencrypt
    kind: ClusterIssuer
Apply both resources to the cluster to get the certificate issued. Assuming everything went well and the DNS records are set up correctly, when you run:

kubectl describe certificate airflow-certificate -n airflow

you should see Status: True at the bottom of the certificate’s description, indicating the certificate has been issued.

Now our cluster is ready to receive the Airflow deployment.

Deploying Airflow

The Airflow deployment includes a few different pieces needed to get Airflow working properly. The Airflow installation in Kubernetes ends up looking more like this:

Airflow on Kubernetes (Source: Official Airflow Documentation)

Our complete Airflow deployment resources ended up looking like this:

|---- airflow.cfg
|---- secrets.yaml
|---- persistent_volumes
      |---- airflow-logs-pvc.yaml
|---- rbac
      |---- cluster-role.yaml
      |---- cluster-rolebinding.yaml
|---- scheduler
      |---- scheduler-deployment.yaml
      |---- scheduler-serviceaccount.yaml
|---- statsd
      |---- statsd-deployment.yaml
      |---- statsd-service.yaml
|---- webserver
      |---- webserver-deployment.yaml
      |---- webserver-ingress.yaml
      |---- webserver-service.yaml
      |---- webserver-serviceaccount.yaml

Secrets and Configuration

In order to successfully deploy Airflow, we need to make sure the airflow.cfg file is available in the relevant pods. Airflow allows you to configure a variety of different things through this file (check the Configuration Reference for more detailed information).

In Kubernetes, this kind of configuration is stored in a ConfigMap, which is a special kind of “volume” you can mount inside your pods and use to make configuration files available to them. The ConfigMap works together with Kubernetes secrets, meaning you can reference a Secret directly inside a ConfigMap or pass the Secret as an environment variable and reference that.

Of note: Kubernetes secrets are somewhat unsafe, considering they just contain a base64-encoded string that can be easily decoded. If secrets need to be versioned or committed somewhere, it’s better to use GCP’s Secret Manager instead.
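A quick demonstration of why base64 offers no real protection — anyone with access to the manifest can recover the plaintext (the connection string below is a made-up example, not a real credential):

```python
import base64

secret_value = "postgresql+psycopg2://airflow:example-password@10.0.0.5:5432/airflow_meta"

# This is what goes into the Secret manifest's data field...
encoded = base64.b64encode(secret_value.encode()).decode()

# ...and this one-liner is all it takes to read it back.
decoded = base64.b64decode(encoded).decode()
print(decoded == secret_value)  # → True
```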

A ConfigMap for the airflow.cfg file can be created running:

kubectl create configmap airflow-config --from-file=airflow.cfg -n airflow

where airflow-config is the name of the ConfigMap created and the -n airflow flag is necessary to create the resource in the correct namespace.

Kubernetes secrets can be created using a secrets.yaml manifest file to declare individual secrets:

apiVersion: v1
kind: Secret
metadata:
  name: airflow-metadata
type: Opaque
data:
  connection: "your-base64-encoded-connection-string"
  fernet-key: "your-base64-encoded-fernet-key"
---
apiVersion: v1
kind: Secret
metadata:
  name: git-sync-secrets
type: Opaque
data:
  username: "your-base64-encoded-username"
  token: "your-base64-encoded-token"

If you decide to go with plain Kubernetes secrets, keep this yaml file private (don’t commit it to a repository). To apply it to your cluster and create all the defined secrets, run:

kubectl apply -f secrets.yaml -n airflow

This command will apply the secrets.yaml file to the Kubernetes cluster, in the airflow namespace. If secrets.yaml is a valid Kubernetes manifest file and the secrets are properly defined, all Kubernetes secrets specified within the file will be created in the cluster and namespace.

Persistent Volumes

What volumes (and how many volumes) you’ll need will depend on how you decide to store Airflow logs and how your DAGs are structured. There are, in essence, two ways to store DAG information:

  • Store DAGs in a persistent volume
  • Sync them from a git repository into an ephemeral volume mounted inside the cluster

The key point to keep in mind is that the folder the Airflow scheduler and webserver are watching to retrieve DAGs from and fill in the DagBag needs to contain built DAGs Airflow can process. In our case, our DAGs are static, built directly into DAG files. Therefore, we went with a simple git-sync approach, syncing our DAG files into an ephemeral volume and pointing the webserver and scheduler there.

This means the only persistent volume we needed was to store Airflow logs.

A PersistentVolume is a cluster resource that exists independently of a Pod, meaning the disk and data stored there will persist as the cluster changes, and Pods are deleted and created. These can be dynamically created through a PersistentVolumeClaim, which is a request for and claim to a PersistentVolume resource:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: airflow-logs-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard

This creates an airflow-logs-pvc resource we can use to store Airflow logs.

Role-Based Access Control (RBAC)

Kubernetes RBAC is a security feature allowing us to manage access to resources within the cluster through defined roles. A Role is a set of rules that defines the actions allowed within a specific namespace. A RoleBinding is a way to associate a specific Role with a user or, in our case, a service account.

To define roles that apply cluster-wide rather than specific to a namespace, you can use a ClusterRole and an associated ClusterRoleBinding instead.

In the context of our Airflow deployment, a ClusterRole is required to allow the relevant service account to manage Pods. Therefore, we created an airflow-pod-operator role:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: airflow-pod-operator
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["create", "delete", "get", "list", "patch", "watch"]

with an associated role binding:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: airflow-pod-operator
subjects:
- kind: ServiceAccount
  name: airflow-service-account
  namespace: airflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: airflow-pod-operator

Scheduler Deployment

The scheduler is a critical component of the Airflow application, and it needs to be deployed to its own Pod inside the cluster. At its core, the scheduler is responsible for ensuring DAGs run when they are supposed to, and tasks are scheduled and ordered accordingly.

The scheduler deployment manifest file that comes with the Helm chart (you can find it inside the scheduler folder) is a good starting point for the configuration. You’ll only need to tweak it a bit to match your namespace and any specific configuration you might have around volumes.

In our case, we wanted to sync our DAGs from a GitHub repository, so we needed to configure a git-sync container. An easy way to get started is to configure the connection with a username and token, although for a production deployment it’s best to configure the connection via SSH. With git-sync configured, our scheduler deployment looked like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-scheduler
  namespace: airflow
  labels:
    tier: airflow
    component: scheduler
    release: airflow
spec:
  replicas: 1
  selector:
    matchLabels:
      tier: airflow
      component: scheduler
      release: airflow
  template:
    metadata:
      labels:
        tier: airflow
        component: scheduler
        release: airflow
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"  # assumed annotation; the original key was not preserved
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      serviceAccountName: airflow-service-account
      volumes:
        - name: config
          configMap:
            name: airflow-config
        - name: dags-volume
          emptyDir: {}
        - name: logs-volume
          persistentVolumeClaim:
            claimName: airflow-logs-pvc
      initContainers:
        - name: run-airflow-migrations
          image: apache/airflow:2.7.1-python3.11
          imagePullPolicy: IfNotPresent
          args: ["bash", "-c", "airflow db migrate"]
          env:
            - name: AIRFLOW__CORE__FERNET_KEY
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: fernet-key
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: connection
            - name: AIRFLOW_CONN_AIRFLOW_DB
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: connection
          volumeMounts:
            - name: config
              mountPath: "/opt/airflow/airflow.cfg"
              subPath: airflow.cfg
              readOnly: true
      containers:
        - name: git-sync
          image: registry.k8s.io/git-sync/git-sync:v4.0.0  # assumed git-sync v4 image; the original was not preserved
          args:
            - --repo=
            - --depth=1
            - --period=60s
            - --link=current
            - --root=/git
            - --ref=main
          env:
            - name: GITSYNC_USERNAME
              valueFrom:
                secretKeyRef:
                  name: git-username
                  key: username
            - name: GITSYNC_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: git-token
                  key: token
          volumeMounts:
            - name: dags-volume
              mountPath: /git

        - name: scheduler
          image: <your-artifact-registry-image>  # placeholder for the custom image in Artifact Registry
          imagePullPolicy: Always
          args:
            - scheduler
          env:
            - name: AIRFLOW__CORE__DAGS_FOLDER
              value: "/git/current"
            - name: AIRFLOW__CORE__FERNET_KEY
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: fernet-key
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: connection
            - name: AIRFLOW_CONN_AIRFLOW_DB
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: connection
          livenessProbe:
            failureThreshold: 15
            periodSeconds: 30
            exec:
              command:
                - python
                - -Wignore
                - -c
                - |
                  import os
                  os.environ['AIRFLOW__CORE__LOGGING_LEVEL'] = 'ERROR'
                  os.environ['AIRFLOW__LOGGING__LOGGING_LEVEL'] = 'ERROR'
                  from airflow.jobs.scheduler_job import SchedulerJob
                  from airflow.utils.net import get_hostname
                  import sys
                  job = SchedulerJob.most_recent_job()
                  sys.exit(0 if job.is_alive() and job.hostname == get_hostname() else 1)
          volumeMounts:
            - name: config
              mountPath: "/opt/airflow/airflow.cfg"
              subPath: airflow.cfg
              readOnly: true
            - name: dags-volume
              mountPath: /git
            - name: logs-volume
              mountPath: "/opt/airflow/logs"

The scheduler deployment is divided into two “stages”: the initContainers and the containers. When Airflow starts, it needs to run database migrations in the metadata database, and that is what the init container does. It runs as soon as the scheduler pod starts and ensures the database migration is completed before the main application containers start. Once the init container finishes its start-up task, the git-sync and scheduler containers can run.

Notice that the scheduler container references a custom image in Artifact Registry. Given our pipeline set up and choice of executor, we replaced the official Airflow image in the deployment with our own image.

Webserver Deployment

The webserver is another critical Airflow component: it exposes the Airflow UI and manages user interaction with Airflow. Its deployment is very similar to that of the scheduler, with minor differences, so we won’t go into it in detail. The manifest file looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-webserver
  namespace: airflow
  labels:
    tier: airflow
    component: webserver
    release: airflow
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 3
      maxUnavailable: 1
  selector:
    matchLabels:
      tier: airflow
      component: webserver
      release: airflow
  template:
    metadata:
      labels:
        tier: airflow
        component: webserver
        release: airflow
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      serviceAccountName: default
      volumes:
        - name: config
          configMap:
            name: airflow-config
        - name: dags-volume
          emptyDir: {}
        - name: logs-volume
          persistentVolumeClaim:
            claimName: airflow-logs-pvc
      initContainers:
        - name: run-airflow-migrations
          image: apache/airflow:2.7.1-python3.11
          imagePullPolicy: IfNotPresent
          args: ["bash", "-c", "airflow db migrate"]
          env:
            - name: AIRFLOW__CORE__FERNET_KEY
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: fernet-key
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: connection
            - name: AIRFLOW_CONN_AIRFLOW_DB
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: connection
          volumeMounts:
            - name: config
              mountPath: "/opt/airflow/airflow.cfg"
              subPath: airflow.cfg
              readOnly: true
      containers:
        - name: git-sync
          args:
            - --repo=
            - --depth=1
            - --period=60s
            - --link=current
            - --root=/git
            - --ref=main
          env:
            - name: GITSYNC_USERNAME
              valueFrom:
                secretKeyRef:
                  name: git-username
                  key: username
            - name: GITSYNC_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: git-token
                  key: token
          volumeMounts:
            - name: dags-volume
              mountPath: /git
        - name: webserver
          imagePullPolicy: Always
          args:
            - webserver
          env:
            - name: AIRFLOW__CORE__FERNET_KEY
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: fernet-key
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: connection
            - name: AIRFLOW_CONN_AIRFLOW_DB
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: connection
            - name: AIRFLOW__WEBSERVER__AUTH_BACKEND
              value: "airflow.api.auth.backend.basic_auth"
          volumeMounts:
            - name: config
              mountPath: "/opt/airflow/airflow.cfg"
              subPath: airflow.cfg
              readOnly: true
            - name: dags-volume
              mountPath: /git
            - name: logs-volume
              mountPath: "/opt/airflow/logs"
          ports:
            - name: airflow-ui
              containerPort: 8080
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 15

Perhaps the most notable thing here is the AIRFLOW__WEBSERVER__AUTH_BACKEND environment variable, which allows us to use a basic authentication backend with Airflow. As part of this deployment, we didn’t configure the creation of a root user, meaning one needs to be created from within the container by the first person trying to access the UI. If you find yourself in the same situation, run:


kubectl exec -it <webserver-pod-name> -n airflow -c webserver -- /bin/sh

to access the shell within the webserver container. By default, running the command without the -c webserver flag will access the git-sync container, which is not what we want. Once inside the shell, run:

su airflow

to switch to the airflow user, which is needed to run airflow commands. Now you can run:

airflow users create --username <your_username> --firstname <first_name> --lastname <last_name> --role <the-user-role> --email <your-email> --password <your-password>

This will create a user with the specified role. It only needs to be run once, to create the first admin user after a fresh deployment; additional users can be created directly from within the interface.

Services and Ingresses

Having the webserver deployed to a pod is not enough to be able to access the UI. It needs a Service resource associated with it to allow access to the workload running inside the cluster. From our webserver manifest file, we defined an airflow-ui port name for the 8080 container port. Now we need a service that exposes this port so that network traffic can be directed to the correct pod:

kind: Service
apiVersion: v1
metadata:
  name: webserver-svc
  namespace: airflow
spec:
  type: ClusterIP
  selector:
    tier: airflow
    component: webserver
    release: airflow
  ports:
    - name: airflow-ui
      protocol: TCP
      port: 80
      targetPort: 8080

Kubernetes offers several Service types, with ClusterIP being the default. A ClusterIP service gets an internal IP and DNS name, making it accessible only within the cluster. This means we now have a service associated with the webserver, but we still can’t access the UI through a friendly subdomain like a regular application.

For that, we’ll configure an ingress next. An Ingress is an API object that defines the rules and configurations to manage external access to our cluster’s services.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: airflow-ingress
  namespace: airflow
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt"
spec:
  ingressClassName: "nginx"
  tls:
    - hosts:
      secretName: cert-tls-secret
  rules:
    - host:
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: webserver-svc
                port:
                  number: 80

The key configuration here is the tls section, which defines the settings for secure HTTPS connections. There, we list the hosts for which to enable HTTPS and the name of the Kubernetes Secret holding the TLS certificate and private key used to secure the connection. This secret is created automatically by cert-manager.
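For reference, the "letsencrypt" issuer that cert-manager uses to obtain that certificate is typically defined as a ClusterIssuer. A minimal sketch, assuming an ACME HTTP-01 challenge through the nginx ingress class (the email address and secret name here are illustrative placeholders, not taken from our setup):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt
spec:
  acme:
    # Let's Encrypt production endpoint; use the staging URL while testing
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com          # hypothetical contact email
    privateKeySecretRef:
      name: letsencrypt-account-key   # hypothetical secret for the ACME account key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx
```

With this in place, cert-manager watches Ingresses annotated with cert-manager.io/cluster-issuer, solves the ACME challenge, and stores the resulting certificate in the secret named by the ingress tls section.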

Service Accounts

Finally, to ensure our resources have the necessary permissions to spawn and manage pods, we need to configure service accounts for them. You can configure an individual service account for each resource or a single service account shared by all resources, depending on your security requirements.

The ServiceAccount resource can be configured as:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: airflow
  labels:
    tier: airflow
    component: scheduler
    release: airflow
automountServiceAccountToken: true

Since we wanted users to be able to manage workflows directly from the UI, we also configured a service account for the webserver.
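Note that a ServiceAccount by itself grants no permissions; it is the RBAC objects bound to it that do. As a sketch of what granting pod management rights in the airflow namespace could look like (the Role and binding names here are illustrative, and the rules should match what your executor actually needs):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airflow-pod-manager        # hypothetical name
  namespace: airflow
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "list", "watch", "delete"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airflow-pod-manager-binding  # hypothetical name
  namespace: airflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: airflow-pod-manager
subjects:
  - kind: ServiceAccount
    name: default
    namespace: airflow
```

A namespaced Role (rather than a ClusterRole) keeps the grant scoped to the airflow namespace, which is usually enough for an executor that only spawns task pods alongside the scheduler.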

The StatsD Application

This is an optional component that collects metrics from the Airflow application. Its deployment is similar to the other two, so we won’t dive into it.


Airflow is now deployed to a GKE cluster and accessible via our chosen subdomain. This allows us to have a higher level of control over our infrastructure, while still leveraging GKE’s built-in resources to auto-scale as needed.

Introducing the Account Advocate - A Dedicated Partner for Success
2023-08-16T09:00:00-04:00

As a company, we are committed to ensuring our clients’ success and believe that maintaining strong relationships with the people who trust us with their projects is a driving force of success. One of our core values is, in fact, Continuous Improvement, and we make an effort to live it every day.

In that spirit, we are excited to announce a new role in our organization, the Account Advocate, a key role in our team fully dedicated to championing client interests, collaboration and ensuring successful partnerships that go above and beyond.

What is the Account Advocate?

The Account Advocate is a key, strategic role focused on ensuring that our clients are happy with our partnership not only from a technical and delivery perspective, but also from a business perspective. They are an advocate and representative for your business stakeholders inside our team, dedicated to connecting your vision with our delivery and ensuring your goals are met and any potential concerns are heard and addressed.

The Account Advocate works closely with the Project Manager to ensure success. While the Project Manager focuses on delivery and the success of the existing project, the Account Advocate focuses on the overall relationship with your business: making sure value-delivery expectations are met, your team is being heard, and we’re delivering value to your company at every opportunity.

They also facilitate communication with senior leadership on both ends, ensuring that you have all the support you need for a successful collaboration.

How does the relationship work?

Communication is key to everything we do. We value open and honest communication with our clients and between our teams. As such, you will have plenty of contact and checkpoints with our delivery team.

The Account Advocate is focused on more strategic goals and higher-level partnership priorities, so they will aim to meet with business stakeholders quarterly. If a different frequency is preferred, we will happily adapt, but we believe at least quarterly contact is important to ensure success and happiness on both ends of the partnership.

While communicating and collaborating with you, the Account Advocate will focus on:

  1. Client Happiness: We are committed to understanding your goals, challenges and opportunities. Client happiness is at the core of our business, and the Account Advocate is your voice within our organization, ensuring your feedback is heard and any concerns you might have are understood and addressed swiftly.

  2. Strong Partnership and Collaboration: Ongoing collaboration makes partnerships grow stronger, and we are interested in delivering as much value to your organization as we can. They will collaborate closely with your business team to foster trust and open communication and facilitate collaboration at the higher levels of leadership.

  3. New Opportunities: We are invested in your success and believe in going above and beyond in everything we do. The Account Advocate wants to hear what other problems we can help solve, what other challenges we can help you overcome, and other ways we can contribute cost-effective solutions that solve real problems and generate real value for you and your team.

  4. Problem Resolution: We believe in Challenging Projects over Profitable Projects; that’s why we are so passionate about every project we work on. It also means we understand that challenges arise and are part of every successful collaboration. The Account Advocate is focused on resolving any issues swiftly and transparently, ensuring minimal disruption.

Collaboration is Key to Success

As we introduce the Account Advocate role to our team and to our partnership, we are excited to see how it will contribute to an even more successful and strong relationship with our clients. This role strengthens our commitment to client happiness and success and our interest in building long-lasting relationships based on trust, open communication and transparency.

We look forward to working with you and your team on our next successful project! Contact us to get your next project started!