In this article, I will describe the most interesting findings from that paper and how you can apply them at your company to define, measure, and manage technical debt.
Before the team designed their survey, they interviewed a number of subject matter experts at the company to understand what, in those experts' view, the main components of technical debt were:
"We took an empirical approach to understand what engineers mean when they
refer to technical debt. We started by interviewing subject matter experts
at the company, focusing our discussions to generate options for two survey
questions: one asked engineers about the underlying causes of the technical
debt they encountered, and the other asked engineers what mitigation would
be appropriate to fix this debt. We included these questions in the next
round of our quarterly engineering survey and gave engineers the option to
select multiple root causes and multiple mitigations. Most engineers selected
several options in response to each of the items. We then performed a factor
analysis to discover patterns in the responses, and we reran the survey the
next quarter with refined response options, including an “other” response
option to allow engineers to write in descriptions. We did a qualitative
analysis of the descriptions in the “other” bucket, included novel concepts
in our list, and iterated until we hit the point where <2% of the engineers
selected “other.” This provided us with a collectively exhaustive and
mutually exclusive list of 10 categories of technical debt."
As you can read, this was an iterative approach that focused on narrowing the concept of technical debt down into distinct categories.
The 10 categories of technical debt that they detected were:
This might be related to architectural decisions that were made in the past, which worked fine for a while, but then eventually started causing problems.
"This may be motivated by the need to scale, due to mandates, to reduce
dependencies, or to avoid deprecated technology."
You could think about this as an integration with a third party service which is no longer maintained and/or improved. The team knows that they will need to switch to a different service, but they haven’t had the time yet to execute the migration.
This might be related to documentation that is no longer up to date. When documentation is not regularly used, read, and improved, it tends to fall out of date quickly.
"Information on how your project works is hard to find, missing or incomplete, or may include documentation on APIs or inherited code."
Every project has some sort of documentation. In the most basic format, it could be a README.md file in the project that tells you how to properly set up the application for development purposes.
"Poor test quality or coverage, such as missing tests or poor test data,
results in fragility, flaky tests, or lots of rollbacks."
Even at Google, teams are complaining about the lack of tests, the flakiness of test suites, and/or test cases that do not cover important edge cases.
This means that having a test suite is not enough. The tests have to be stable, they have to be thorough, and they have to help your team avoid regressions.
"Product architecture or code within a project was not well designed. It may
have been rushed or a prototype/demo."
We have all been in this situation: an initial experiment/prototype/demo is successful, and we prioritize features and patches before taking a moment to adjust its architecture.
Improving the architecture of the product becomes something that will be done at some point down the line, but that moment never comes. It usually needs non-technical manager buy-in before it can happen.
"Code/features/projects were replaced or superseded but not removed."
Every now and then pieces of code become unreachable, which can create a false sense of complexity. Modules might seem too big and complex, but maybe only half of that code is actually getting used.
There are open source tools out there to help you remove dead code, but doing this takes time. Teams that report these issues often do not have time to stop and remove dead code before they continue shipping features and patching bugs.
"The code base has degraded or not kept up with changing standards over time.
The code may be in maintenance mode, in need of refactoring or updates."
This might be related to a change in one of the core dependencies of your application (e.g. React.js) which means that new code is expected to be written using functions instead of classes.
Open source moves fast. Using one library (e.g. Angular.js) or another library (React.js) will save us time when we are starting a new project. However, the team behind these libraries can decide to change the entire interface and core concepts from one major release to the next.
No matter what library or framework you choose, this will happen. The key to avoiding this problem is to quickly (or gradually) adapt your code to comply with the new way of doing things.
"This may be due to staffing gaps and turnover or inherited orphaned
code/projects."
Depending on the job market, key contributors to a codebase might find jobs at other companies (or on other teams within the same company), which creates a vacuum in the existing team.
If teams don't take the necessary precautions, there may be gaps where a team is waiting for the next senior hire (while still being expected to ship features and patches to production).
"Dependencies are unstable, rapidly changing, or trigger rollbacks."
Once again, open source moves fast. Tools like Dependabot or Depfu can help you stay up to date, but they are only good for small releases. Upgrading major releases of a framework (e.g. Rails) can take days, weeks, or even several developer months.
Non-trivial upgrades usually get postponed for a better time. Oftentimes, this better time never comes. We have seen this firsthand at our productized services:
UpgradeJS: We help teams upgrade their React Native, React, Vue, or Angular applications.
FastRuby.io: We help teams upgrade their Ruby & Rails applications. We have invested over 30,000 developer/hours upgrading applications!
We have built a couple of profitable services on top of this particular issue, so we know that even the best teams struggle to keep up. It’s not because they don’t want to upgrade, it’s because other priorities get in the way.
"This may have resulted in maintaining two versions."
This might happen due to a combination of the previous issues. The team started a migration project, but then suddenly there was an emergency and the team had to shift focus. Then that focus never came back to the migration of the system.
Another potential scenario is when a team expects certain promises to hold after a migration and then suddenly realizes that won't be the case. Rolling back the migration might sit on the back burner for months before it actually happens.
"The rollout and monitoring of production needs to be updated, migrated, or
maintained."
This might be related to the way the software development lifecycle is being managed. In the past we have encountered teams that deploy to production only once a month (due to environmental factors) which causes unnecessary friction.
As much as we enjoy being an agile software development agency, every now and then we have to work with clients who are not deploying changes to production every week. This is very often the case with our clients in highly-regulated industries (e.g. finance, national security, or healthcare).
Google’s Engineering Productivity Research Team explored different ways to use metrics to detect problems before they happened:
"We sought to develop metrics based on engineering log data that capture the presence of technical debt of different types, too. Our goal was then to figure out if there are any metrics we can extract from the code or development process that would indicate technical debt was forming *before* it became a significant hindrance to developer productivity."
They decided to focus on three of the 10 types of technical debt: code degradation, teams lacking expertise, and migrations being needed or in progress.
"For these three forms of technical debt, we explored 117 metrics that were proposed as indicators of one of these forms of technical debt. In our initial analysis, we used a linear regression to determine whether each metric could predict an engineer’s perceptions of technical debt."
They also put all of their candidate metrics into a random forest model to see if the combination of metrics could forecast developers' perceptions of tech debt.
Unfortunately, their results were not positive:
"The results were disappointing, to say the least. No single metric predicted reports of technical debt from engineers; our linear regression models predicted less than 1% of the variance in survey responses."
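To make this concrete, here is a minimal sketch of that kind of per-metric analysis, assuming a hypothetical DataFrame with one row per engineer, columns for candidate metrics, and a numeric survey response. None of the file or column names come from the paper:

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("survey_with_metrics.csv")  # hypothetical export
target = df["reported_tech_debt"]  # hypothetical numeric survey response

for metric in ["code_churn", "test_coverage", "build_time"]:  # hypothetical metrics
    X = df[[metric]]
    model = LinearRegression().fit(X, target)
    # model.score returns R^2, the fraction of variance explained;
    # the paper reports this stayed below 1% for every metric tried.
    print(metric, model.score(X, target))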
This might be related to the way developers envision the ideal state of a system, process, architecture, and flow. It may also stem from the difficulty of estimating how bad the situation is, and how bad it will be at the end of the quarter (when the quarterly surveys are answered).
As a way to help teams that struggle with technical debt, Google formed a coalition to “help engineers, managers, and leaders systematically manage and address technical debt within their teams through education, case studies, processes, artifacts, incentives, and tools.”
This coalition started efforts to improve the situation:
In my opinion, the most interesting effort of this coalition is defining a maturity model around technical debt. This is similar to CMMI (a framework defined at Carnegie Mellon University) which provides a comprehensive integrated set of guidelines for developing products and services.
This defines a new way to approach the subject. Instead of relying on developers' gut feelings and environmental factors, this maturity model has tracking at its core. This means there are measurable metrics that play a key part in informing an engineering team's decisions around technical debt.
This model defines four different levels. From most basic to most advanced:
"Teams with a reactive approach have no real processes for managing technical
debt (even if they do occasionally make a focused effort to eliminate it, for
example, through a “fixit”)."
In my experience, most engineering teams have the best intentions to make the right decisions, to ship good enough code, and to take on a reasonable amount of technical debt.
They understand that technical debt does not mean it is okay to ship bad code to production. They analyze the trade-offs of their decisions and they make their calls with that in mind.
Every now and then they will take some time (maybe a sprint or two) to pay off technical debt. When doing this, they usually address issues that they are familiar with because they’ve been hindered by those issues.
Non-technical leaders usually don't understand the significance of taking on too much technical debt. They start to care once problems start popping up. It might take a production outage, a security vulnerability, or extremely low development velocity to get them to react.
"Teams with a proactive approach deliberately identify and track technical debt and make decisions about its urgency and importance relative to other work."
These teams understand that “if you can’t measure it, you can’t improve it.” So they have been actively identifying technical debt issues. They might have metrics related to the application, the development workflow, the release phase, and/or the churn vs. complexity in their application.
They understand that some of the metrics they’ve been tracking show potential issues moving forward. They might notice that their code coverage percentage has been steadily declining which could signal a slippage in their testing best practices.
They care about certain metrics that might help them improve their development workflow and they know that they need to first inventory their tech debt before taking action. They know that addressing some of these issues might improve their DORA metrics.
"Teams with a strategic approach have a proactive approach to managing technical debt (as in the preceding level) but go further: designating specific champions to improve planning and decision making around technical debt and to identify and address root causes."
These teams have an inventory of technical debt issues. They build on top of the previous level; for example, they proactively address flaky tests in their test suite.
They might assign one person to one of the issues that they detected. They likely know how to prioritize the list of technical debt issues and focus on the most pressing ones.
"Teams with a structural approach are strategic (as in the preceding level) and also take steps to optimize technical debt management locally—embedding technical debt considerations into the developer workflow—and standardize how it is handled across a larger organization."
Improving the situation is a team effort. Non-technical managers treat tech debt remediation like any other task in the sprint. They likely reserve a few hours of every sprint for paying off technical debt.
After reading this paper, I wish the research team had shared more about the different maturity levels. I believe the software engineering community could greatly benefit from a “Technical Debt Management Maturity Model.”
It’s proof that while technical debt metrics may not be perfect indicators, they can allow teams who already believe they have a problem to track their progress toward fixing it.
The goal is not to have zero technical debt. It has never been the goal. The real goal is to understand the trade-offs, to identify what is and what is not debt, and to actively manage it to keep it at levels that allow your team to not be hindered by it.
Need help assessing the technical debt in your application? Need to figure out how mature you are when it comes to managing technical debt? We would love to help! Send us a message and let’s see how we can help!
This series will walk through the process of shaping the original problem as a machine learning problem and building the Pecas machine learning model and the Slackbot that connects it to Slack.
In this first article, we’ll talk through shaping the problem as a machine learning problem and gathering the data available to analyse and process.
This series will consist of 6 posts focusing on the development of the Pecas machine learning model:
Before we dive into the machine learning aspect of the problem, let’s briefly recap the business problem that led to the solution being built.
OmbuLabs is a software development agency providing specialized services to a variety of different customers. Accurate time tracking is an important aspect of our business model, and a vital part of our work. Still, we faced several time tracking related issues over the years, related to accuracy, quality and timeliness of entries.
This came to a head at the end of 2022, when a report indicated we lost approximately one million dollars largely due to poor time tracking, which affected our invoicing and decision-making negatively. Up to this point, several different approaches had been taken to try to solve the problems, mostly related to different time tracking policies. All of these approaches ended up having significant flaws or negative side effects that led to policies being rolled back. This time, we decided to try to solve the problem differently.
There were a variety of time tracking issues, including time left unlogged, time logged to the wrong project, billable time logged as unbillable, incorrect time allocation, vague entries, among others. Measures put in place to try to mitigate the quality-related issues also led to extensive and time-consuming manual review processes, which were quite costly.
In other words, we needed to:
Our main idea was to replace (or largely replace) the manual process with an automated one. However, although the process was very repetitive, the complexity of the task (interpreting text) meant we needed a tool powerful enough to deal with that kind of data. Hence the idea to use machine learning to automate the time entry review process.
It is worth noting that machine learning powers one aspect of the solution: evaluating the quality and correctness of time entries. Other aspects such as timeliness of entries and completeness of the tracking for a given day or week are very easily solvable without a machine learning approach. Pecas is a combination of both, so it can be as effective as possible in solving the business problem as a whole.
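To illustrate how simple the non-ML checks can be, here is a hedged sketch of a completeness check; the names and threshold are our assumptions, not Pecas internals:

from datetime import date, timedelta

MIN_MINUTES_PER_WORKDAY = 360  # assumption: six tracked hours expected per day

def incomplete_days(minutes_by_day: dict[date, int], start: date, end: date) -> list[date]:
    # Return the workdays in [start, end] with less than the expected time logged.
    flagged = []
    day = start
    while day <= end:
        if day.weekday() < 5 and minutes_by_day.get(day, 0) < MIN_MINUTES_PER_WORKDAY:
            flagged.append(day)
        day += timedelta(days=1)
    return flagged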
The first thing we need to do is identify what part of the problem will be solved with the help of machine learning and how to properly frame that as a machine learning problem.
The component of the problem that is suitable for machine learning is the one that involves “checking” time entries for quality and accuracy, that is, the one that involves “interpreting” text. Ultimately, the goal is to understand if an entry meets the required standards or not and, if not, notify the team member who logged it to correct it.
Therefore, we have a classification problem in our hands. But what type of classification problem?
Our goal is to be able to classify entries according to pre-defined criteria. There are, in essence, two clear ways we can approach the classification: classify each entry into an intermediate category and then check it against that category's validity criteria, or have the model classify each entry as valid or invalid directly.
Which one we want depends on a few different factors, perhaps the most important one being the existence of a finite, known number of ways in which an entry can be invalid.
If there is a finite, known number of classes an entry can belong to and a known number of ways in which each entry can be invalid, the machine learning model can be used to classify the entry as belonging to a specific category and that entry can then be checked against the specific criteria to determine validity or invalidity.
However, we don’t have that.
Time entries can belong to a wide range of categories, defined by a mix of specific keywords in the description, the project they're logged to, the tags applied to the entry, the user who logged it, the day the entry was logged, among many others. Too many. Therefore, intermediate classification might not be the best approach. Instead, we can use the entry's characteristics to teach the model to identify entries that seem invalid, and let it determine validity or invalidity directly.
Thus we have in our hands a binary classification problem, whose objective is to classify time entries as valid or invalid.
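Concretely, each entry becomes a feature vector paired with a binary label. The feature names below are illustrative assumptions, not the actual Pecas schema:

entry_features = {
    "description": "Sprint planning call",  # free-form text, needs interpretation
    "project": "Client A",
    "tags": ["calls"],
    "minutes": 60,
    "user": "jane",
    "weekday": 2,
}
label = 1  # 1 = valid, 0 = invalid; labels would come from past manual reviews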
Now we know what kind of problem we have in our hands, but there are a wide variety of different algorithms that can help solve this problem. The decision of which one to use is best informed by the data itself. So let’s take a look at that.
The first thing we need is, of course, the time tracking data. We use Noko for time tracking, and it offers a friendly API for us to work with.
A Noko time entry as inputted by a user has a few different characteristics: a free-form description, a duration in minutes, the project it is logged to, the tags applied to it, the user who logged it, and the date it was logged.
There is also one relative characteristic of a time entry that is very important: whether it is billable or unbillable. This is controlled by one of two entities: the project or the tags. Projects can be billable or unbillable. By default, all entries logged to an unbillable project are unbillable and all entries logged to a billable project are billable. However, entries logged to a billable project can be unbillable when a specific tag (the #unbillable tag) is added to the entry.
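The billability rule described above is simple enough to express directly. A minimal sketch (the function name is ours, not Noko's):

def entry_is_billable(project_is_billable: bool, tags: list[str]) -> bool:
    # Entries on unbillable projects are always unbillable; entries on
    # billable projects are billable unless tagged #unbillable.
    return project_is_billable and "#unbillable" not in tags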
There is also some metadata and information that comes from user interaction with the system that can be associated with the entry, the most relevant ones being:
Of the entities associated with an entry, one is of particular interest: projects. As mentioned above, projects can indicate whether an entry is billable or unbillable. And, as you can imagine, when an entry that belongs to a billable project is logged to an unbillable project by mistake, the entry goes uninvoiced, and we lose money in the process.
A project also has a unique ID that identifies it, a name and a flag that indicates whether it is a billable or unbillable project. The flag and the ID are what matters to us for the classification, the ID because it allows us to link the project to the entry and the flag because it is the project characteristic we want to associate with the data.
There are other data sources that have relevant data that can be used to gain context on time entries, for example calendars, GitHub pull requests, Jira tickets. For now, let’s keep it simple, and use a dataset of time entries enriched with project data, all coming from Noko.
In order to make it easier to work and explore the data, we extracted all time entries from Noko logged between January 1st, 2022 and June 30th, 2023. In addition to entries, projects, tags and users were also extracted from Noko, and the data was loaded into a Postgres database, making it easy to explore with SQL.
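As a rough idea of what that extraction can look like, here is a hedged sketch against Noko's v2 REST API, assuming token-header authentication and eliding pagination and error handling (check Noko's API documentation for the exact parameters):

import requests

NOKO_TOKEN = "your-noko-token"  # placeholder

response = requests.get(
    "https://api.nokotime.com/v2/entries",
    headers={"X-NokoToken": NOKO_TOKEN},
    params={"from": "2022-01-01", "to": "2023-06-30", "per_page": 100},
)
entries = response.json()  # list of entry dicts to load into Postgres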
We then extracted a few key characteristics from the set:
property | stat |
---|---|
total_entries | 49451 |
min_value | 0 |
max_value | 720 |
duration_q1 | 30 |
duration_q3 | 90 |
average_duration_iq | 49.39 |
average_duration_overall | 71.33 |
median_duration | 45 |
max_word_count | 162 |
min_word_count | 1 |
avg_word_count | 9.89 |
word_count_q1 | 4 |
word_count_q3 | 11 |
entries_in_word_count_iq | 29615 |
average_word_count_iq | 6.63 |
least_used_tag: ops-client | 1 |
most_used_tag: calls | 12043 |
unbillable_entries | 33987 |
billable_entries | 15464 |
pct_unbillable_entries | 68.73 |
pct_billable_entries | 31.27 |
The table above allows us to get a good initial insight into the data and derive a few early conclusions:
This initial set of considerations already tells us something about our data. We have a fairly large dataset, with a mix of numerical and categorical variables. There are also outliers in several features of the data and the range of values in durations and word count could indicate their relationship with validity or invalidity is not strictly linear. Our empirical knowledge confirms this assumption. Although longer entries in duration are generally expected to have longer descriptions, there are several use cases for long entries in duration to have small word counts.
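For reference, stats like the ones in the table above are straightforward to compute with pandas. A hedged sketch, assuming a hypothetical CSV export with minutes, description, and billable columns:

import pandas as pd

entries = pd.read_csv("noko_entries.csv")  # hypothetical export of the dataset
entries["word_count"] = entries["description"].str.split().str.len()

q1, q3 = entries["minutes"].quantile([0.25, 0.75])
in_iqr = entries[entries["minutes"].between(q1, q3)]
print("duration_q1:", q1, "duration_q3:", q3)
print("median_duration:", entries["minutes"].median())
print("average_duration_iq:", round(in_iqr["minutes"].mean(), 2))
print("pct_unbillable_entries:", round(100 * (~entries["billable"]).mean(), 2))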
Other characteristics we looked at (in similar fashion) to get a good initial idea of what we were dealing with include:
This gave us a good initial idea of what we were dealing with.
By this point, we know we’re dealing with a binary classification problem and that we have a fairly large dataset with outliers and non-linear relationships in data. The dataset also has a mix of numerical and categorical variables. The problem we have at hand is made more complex by the presence of text data that requires interpretation.
There are a number of algorithms to choose from for binary classification, perhaps the most common being:
A quick comparison of their strengths and weaknesses shows that tree-based models are most likely the right choice for our use case:
Logistic regression’s strengths lie in its simplicity:
However, some of its weaknesses make it clearly not a good candidate for our use case:
Another example of a simple algorithm, with strengths associated with its simplicity:
However, some of its weaknesses also make it immediately not a good choice for our problem:
Naive Bayes’ core strengths are:
However, two key weaknesses make it yet another unsuitable choice for our use case:
Unlike the previous algorithms, two of SVMs' core strengths apply to our use case:
However, two core weaknesses make it second to tree-based models as a choice:
We have arrived at the most suitable type of algorithm for our problem at hand! The core strengths of these algorithms that make them a good choice are:
Some weaknesses related to them are:
Therefore, we’ll pick ensemble tree-based models as our starting point.
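As a preview of what that looks like in practice, here is a minimal, hedged sketch of training one such model with scikit-learn. The column names are illustrative assumptions, and real feature engineering (including the text features) is the subject of the next post:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("labeled_entries.csv")  # hypothetical labeled export
X = pd.get_dummies(df[["project", "minutes", "word_count"]], columns=["project"])
y = df["valid"]  # 1 = valid, 0 = invalid

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))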
But which one? That's a tale for the next post. We'll do some more analysis of our data, pre-process it, and train a few different models to pick the best one.
Framing your business problem (or business question) as a machine learning problem is a first and necessary step in understanding what kind of problem you’re dealing with and where to start solving it. It helps guide our data exploration and allows us to choose which machine learning algorithm (or family of algorithms) to start with.
A good understanding of the data available to you, the business context around the problem, and the characteristics that matter can help guide your exploration of the dataset to validate some initial questions, such as whether we have enough data or whether the available data conveys the information we need. It's important not to be tied to these initial assumptions and this initial knowledge in your analysis, though, as exploring the data might reveal additional, useful insights.
With a good understanding of the problem and dataset, you can make an informed algorithm selection, and start processing your data and engineering your features so your model can be trained. This second step is what we’ll look at in the next post.
Need help leveraging machine learning to solve complex business problems? Send us a message and let’s see how we can help!
Despite being a core activity, we had been having several issues with time tracking not being completed or not being completed properly. A report we ran at the end of 2022 showed our time tracking issues were actually quite severe. We lost approximately one million dollars in 2022 due to time tracking issues that led to decisions made on poor data. It was imperative that we solved the problem.
To help with this issue, we created an evolution of our Pecas project. We turned Pecas into a machine learning powered application capable of alerting users of issues in their time entries. In this article, we'll talk through the business case behind it and the expected benefits to our company.
Our time tracking issues pre-dated the 2022 end of year report. By that point, we had been having problems for a couple of years, it just wasn’t a big priority. As the company grew, however, the issues multiplied, and got to a point where we needed to prioritize solving the problem.
A detailed analysis of our time tracking data revealed several different issues, both issues that were typically caught by internal processes relying on this information, such as invoicing, and issues that typically remained hidden:
These were some of the main issues we were facing, and as a small company, their impact was even more significant to our projects and our operation overall. We knew it was a problem, and we attempted a few different solutions, including implementing policies around time tracking. They ended up having serious flaws that caused us to reconsider and eventually retract them. But we still had a problem to solve.
At the end of 2022, when we looked at our numbers for the year, we decided to dive deeper into this data. And the cost of the issues mentioned above became very clear: we lost $1,000,000 due to these issues and their consequences. What this meant is that we had a million-dollar problem to solve.
Time tracking issues (timing and quality of entries) are one aspect of a complex problem. Improving time tracking quality was one of the problems we had to solve, and one of significant impact. There were, however, multiple root causes that led to the loss we identified (process problems, service management, communication). Those are being addressed separately and are beyond the scope of this article.
Our main issue was that the specific time tracking policies we had implemented didn't account for nuance. Delays in entries being entered into the system and entries logged to incorrect projects did decrease, addressing some of the most costly problems we had. However, honest mistakes were treated the same way as more serious issues, and the policy was found to be unfair in some cases.
This went against our core values and led us to look for a different solution. The main issue was that there was no way to be alerted of honest mistakes in entries before the information was needed, someone reviewed and found the issue manually, or we ran another comprehensive report.
Manual processes for these kinds of tasks are not great. They are expensive and take people away from other activities. We wanted an automated way to monitor and flag entries. We knew from the beginning there was always going to be a human component to it, but if we could reduce the time we spent every week running reports and reviewing and fixing entries, that was already a win.
That's when we decided to build an internal tool to help with this. Our goal was to reduce the time our operations team invested in time tracking by automating the bulk of the work to find these issues, leaving human review to a much smaller set of entries.
This solution would need to be able to:
The complexity lay in the fact that we're dealing with free-form text data (the entry description) combined with several other properties (project, labels, date, billable or non-billable, duration). Accounting for all possible scenarios and issues with hard rules would not work. That's where machine learning comes into play.
We split the entry classification part of the solution in two:
No solution is perfect, and we knew there were going to be issues that still slipped through the cracks, as well as a need for human review. Our goal was to minimize both.
That’s how the Pecas project was born.
When we decided to build the solution, we were spending between 3 and 5 hours every week on time tracking reports. That meant spending between $30,000 and $50,000 every year just on these reports. As the company grew and we had more people joining the team, the time spent on this was also going to increase significantly.
In summary, we had one million dollars in losses in 2022 alone, and were looking at a current cost of $30,000 to $50,000 per year to run the process manually, increasing every time our team grew. We had a pretty solid case to invest in a solution.
Additional factors that contributed to our decision to go ahead with the project were:
In order to properly evaluate whether building a solution was the right move, we also had to consider implementation and maintenance costs. We had the expertise needed within our team, so we didn’t really need to bring in external help to accomplish what we needed; and even with the added complexity of a machine learning model, we were looking at a small application. To put it into perspective:
Assuming we did nothing, we would continue to incur significant losses year after year. From our data analysis and root cause evaluation, we believe the solution could help reduce the loss by approximately 60%, saving us $600,000. Similarly, the solution can reduce the time spent reviewing time entry reports by 80%, meaning our costs would reduce to $6,000 to $10,000 per year, saving us between $24,000 and $40,000 every year, not accounting for potential growth.
Building the solution would cost approximately 50% of the total we expected to save, and maintaining it, once built, certainly wouldn’t cost as much as we were losing. Pretty good case to build it!
Add to that the knowledge and learning gains, and preserving our culture of team first, and the decision was easy.
The Pecas app’s first version went live in March and, at that time, supported only filter classification with hard rules. That allowed us to measure user interaction with it and see how (or if) things would improve. It also got us thinking about new ways to leverage the app.
A version of the app with machine learning integrated went live in August, and we have been monitoring it and collecting data. The number of common issues in entries identified has decreased significantly, and timeliness of the logging has greatly improved.
We have found additional use cases for the bot, and created additional alerts for project teams, project managers and our overall Operations team. This has allowed us to identify issues faster and react to them immediately, saving us time, money, and headache in the long run.
We're still monitoring data and working through results, but a preliminary analysis shows that the number of billable time entries logged to non-billable projects in Q3 2023 was 95% smaller than for the same period in 2022, so we're calling this a win for now as we continue to expand the machine learning and other functionalities.
Machine learning isn’t a magic bullet to all of our problems. In fact, in many cases, it isn’t quite the right solution, and you can go very far with hard rules. There are situations, however, where it is the ideal solution. In those cases, it is a powerful tool to solve very complex problems.
As previously mentioned, an automated tool to aid time tracking quality wasn’t the only solution to this problem. Changes in process were also required, and each case was examined, separately and in conjunction with others, and addressed. But it was a core piece in the strategy, and the results are positive and quite promising.
We specialize in solving complex problems for companies looking to build interesting tools that provide meaningful results. We take a holistic look at the problem, advise on all aspects of the problem, and can help you improve your processes and build the right tool for the right problem.
Got some difficult problems you’d like to solve with software but not quite sure where to start? Unsure if machine learning is the right solution to your problem? Send us a message.
In that spirit, this year we decided to organize our open source contribution time in a way that wasn't limited to our own open source projects. This is a short post to explain how we aligned our open source contributions with our learning goals, what contributions we made, and why it mattered.
Last year, as a company, we did an exercise in participating in Hacktoberfest with our team. There were positive and negative notes but, overall, feedback around the exercise itself was positive.
This year we had specific goals and topics we wanted to focus on as a team. We decided to use open source projects as a way to learn and practice while also contributing to the community.
Therefore, this year we aligned our open source contributions with our learning goals. As part of our management process, we conduct monthly one-on-one calls with our full-time employees. In those calls, we learn about areas and skills that our direct reports would like to improve.
The problem is that sometimes client work doesn’t give us the opportunities we need to work on said skills.
That’s why we decided to use the month of October to contribute to open source projects with the following intentions:
For senior engineers: We wanted them to improve their upgrading and debugging skills, so that they could get better at fixing medium to high complexity bugs.
For mid-level engineers: We wanted them to work on features so that they could improve their skills when it came to greenfield-like projects.
This year we decided not to restrict contributions to repositories that were officially participating in Hacktoberfest.
We asked everyone to suggest repositories before we started and we quickly came up with a list of approved projects.
Senior engineers were asked to work on two kinds of issues: technical debt and bugs.
Mid-level engineers were asked to work on any kind of issue they found interesting, with a focus on new features or feature changes.
To organize that:
This time we decided to split into teams:
When it came to our own projects, we decided to have only Ariel and Ernesto’s team work on open source projects maintained by OmbuLabs.
We focused on these projects:
We wanted to make sure that our teams focused on projects that were approved by our engineering management team. The list included some well-known and really useful tools that we’ve been using for years:
In terms of contributions, we considered activity on pull requests and issues as a valid contribution. We understand that sometimes you are looking to add value to an open source project, and after hours of research and trying many different things, all you can add is a comment to an existing issue. In our exercise, and in general, that counts as a contribution too!
Here are all the issues where we added value:
Here are all the pull requests we submitted:
In total during the month of October we invested 392 hours in our open source contributions. That represents an investment of $79,000 into open source by 10 of our senior and mid-level engineers.
We plan to take all of our contributions across the finish line, using our regular, monthly and paid open source investment time. Outside of Hacktoberfest, on average, as a team we invest 38 hours per month on open source contributions.
We look forward to continuing our investment in the open source projects that add value to the world and our communities. We believe this is the way to hone our craft, learn new things faster, and become better professionals.
The Airflow community maintains a Helm chart for Airflow deployment on a Kubernetes cluster. The Helm chart comes with a lot of resources, as it contains a full Airflow deployment with all the capabilities. We didn't need all of that, and we wanted granular control over the infrastructure. Therefore, we chose not to use Helm, although it provides a very good starting point for the configuration.
The Airflow installation consists of five different components that interact with each other, as illustrated below:
(Source: Official Airflow Documentation)
In order to configure our Airflow deployment on GCP, we used a few different services:
NOTE: The steps below assume you have both the Google Cloud SDK and kubectl installed, and a GCP project set up.
Before deploying Airflow, we need to configure a CloudSQL instance for the metadata database and the GKE cluster that will host the Airflow deployment. We opted to use a Virtual Private Cloud (VPC) to allow the connection between GKE and CloudSQL.
To create a CloudSQL instance for the Airflow database:
gcloud sql instances create airflow-metadb \
  --database-version=POSTGRES_15 \
  --tier=db-n1-standard-2 \
  --region=us-east1 \
  --network=airflow-network \
  --root-password=admin
Customize the database version, tier, region, and network to your needs. If you don’t plan on using a VPC, you don’t need the network argument. Check out the gcloud sql instances create documentation for a full list of what’s available.
Connect to the newly created instance to create a database to serve as the Airflow metadata database. Here, we'll create a database called airflow_meta:
gcloud beta sql connect airflow-metadb
This will open a Postgres shell, where you can create the database.
CREATE DATABASE airflow_meta;
Finally, get the instance's IP address and port to construct the database connection URL, which will be needed for the Airflow setup. You'll need the IP address listed as PRIVATE:
gcloud sql instances describe airflow-metadb
Your connection URL for a Postgres instance should follow the format:
postgresql+psycopg2://username:password@instance-ip-address:port/db-name
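For example, with a private IP of 10.1.2.3, the default Postgres port, and the database created above, the URL would look like this (all values here are placeholders):

postgresql+psycopg2://postgres:admin@10.1.2.3:5432/airflow_meta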
Before initializing a new Kubernetes cluster on GKE, make sure you have the right project set in the gcloud CLI:
gcloud config set project airflow
Create a new cluster on GKE:
gcloud container clusters create airflow-cluster \
--machine-type e2-standard-2 \
--num-nodes 1 \
--region "us-east1" \
--scopes "cloud-platform"
Choose the correct machine type for your needs. If your cluster ends up requesting more resources than you need, you'll end up overpaying for Airflow. Conversely, if you have fewer resources than required, you will run into issues such as memory pressure. Also choose the number of nodes to start with and the region according to your needs. The --scopes argument set to cloud-platform allows the GKE cluster to communicate with other GCP resources. If that is not needed or desired, remove it.
For a full list of the options available, check the gcloud container clusters create documentation.
Authenticate kubectl against your newly created cluster:
gcloud container clusters get-credentials airflow-cluster --region "us-east1"
and create a Kubernetes namespace for the Airflow deployment. Although not necessary, this is a good practice, and it’ll allow for the grouping and isolating of resources, enabling, for example, separation of a production and staging deployment within the same cluster.
kubectl create namespace airflow
The cluster should now be set up and ready.
Our goal was to have Airflow deployed to a GKE cluster and the Airflow UI exposed via a friendly subdomain. In order to do that, we need to obtain and use a certificate.
To make the process of obtaining, renewing, and using certificates as easy as possible, we decided to use cert-manager, a native Kubernetes certificate management controller. For that to work, though, we need to ensure that traffic is routed to the correct service, so that requests made to the cert-manager solver to confirm domain ownership reach the right service, and requests made to access the Airflow UI also reach the right service.
In order to do that, an nginx ingress controller was needed.
Unlike an Ingress, an Ingress Controller is an application running inside the cluster that configures a load balancer according to multiple ingress resources. The NGINX ingress controller is deployed in a pod along with such a load balancer.
To help keep the ingress controller resources separate from the rest, let’s create a namespace for it:
kubectl create namespace ingress-nginx
The easiest way to deploy the ingress controller to the cluster is through the official Helm Chart.
Make sure you have helm installed, then add the nginx Helm repository and update your local Helm chart repository cache:
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
Install the ingress-nginx Helm chart in the cluster:
helm install nginx-ingress ingress-nginx/ingress-nginx -n ingress-nginx
where nginx-ingress is the name we're assigning to the instance of the Helm chart we're deploying, ingress-nginx/ingress-nginx is the chart to be installed (the ingress-nginx chart in the ingress-nginx Helm repository), and -n ingress-nginx specifies the namespace within the Kubernetes cluster in which to install the chart.
With the controller installed, run:
kubectl get services -n ingress-nginx
and look for the EXTERNAL IP of the ingress-nginx-controller service. That is the IP address of the load balancer.
To expose the Airflow UI via a subdomain, we configured an A record pointing to this IP address.
Now that the controller is in place, we can proceed with the installation of cert-manager. First, apply the CRD (CustomResourceDefinition) resources:
kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.13.0/cert-manager.crds.yaml
cert-manager relies on its own custom resource types to work; this step ensures those resources are installed.
Like with the controller, we'll also create a separate namespace for the cert-manager resources:
kubectl create namespace cert-manager
And install cert-manager using the Helm chart maintained by Jetstack:
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager --namespace cert-manager --version v1.13.0
With cert-manager installed, we now need two additional resources to configure it: a ClusterIssuer and a Certificate.
The ClusterIssuer creates a resource to represent a certificate issuer within Kubernetes, i.e., it defines a Kubernetes resource that tells cert-manager who the certificate issuing entity is and how to connect to it. You can create a simple ClusterIssuer for Let's Encrypt as follows:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: my_email@my_domain.com
privateKeySecretRef:
name: letsencrypt
solvers:
- http01:
ingress:
class: nginx
The Certificate resource then defines the certificate to issue:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: airflow-certificate
namespace: airflow
spec:
secretName: cert-tls-secret
issuerRef:
name: letsencrypt
kind: ClusterIssuer
commonName: airflow.my_domain.com
dnsNames:
- airflow.my_domain.com
Apply both resources to the cluster to get the certificate issued. Assuming everything went well and the DNS records are set up correctly, when you run:
kubectl describe certificate airflow-certificate -n airflow
you should see Status: True at the bottom of the certificate's description, indicating the certificate has been issued.
Now our cluster is ready to receive the Airflow deployment.
The Airflow deployment includes a few different pieces, so we can get Airflow to properly work. The Airflow installation in Kubernetes ends up looking more like this:
(Source: Official Airflow Documentation)
Our complete Airflow deployment resources ended up looking like this:
resources
|---- airflow.cfg
|---- secrets.yaml
|---- persistent_volumes
|---- airflow-logs-pvc.yaml
|---- rbac
|---- cluster-role.yaml
|---- cluster-rolebinding.yaml
|---- scheduler
|---- scheduler-deployment.yaml
|---- scheduler-serviceaccount.yaml
|---- statsd
|---- statsd-deployment.yaml
|---- statsd-service.yaml
|---- webserver
|---- webserver-deployment.yaml
|---- webserver-ingress.yaml
|---- webserver-service.yaml
|---- webserver-serviceaccount.yaml
In order to successfully deploy Airflow, we need to make sure the airflow.cfg file is available in the relevant pods. Airflow allows you to configure a variety of different things through this file (check the Configuration Reference for more detailed information).
In Kubernetes, this kind of configuration is stored in a ConfigMap, which is a special kind of "volume" you can mount inside your pods and use to make configuration files available to them. The ConfigMap works together with Kubernetes secrets, meaning you can reference a Secret directly inside a ConfigMap or pass the Secret as an environment variable and reference that.
Of note: Kubernetes secrets are somewhat unsafe, considering they just contain a base64-encoded string that can be easily decoded. If secrets need to be versioned or committed somewhere, it's better to use GCP's Secret Manager instead.
A ConfigMap for the airflow.cfg file can be created by running:
kubectl create configmap airflow-config --from-file=airflow.cfg -n airflow
where airflow-config is the name of the ConfigMap created and the -n airflow flag is necessary to create the resource in the correct namespace.
Kubernetes secrets can be created using a secrets.yaml manifest file to declare individual secrets:
apiVersion: v1
kind: Secret
metadata:
name: airflow-metadata
type: Opaque
data:
  connection: "your-base64-encoded-connection-string"
  fernet-key: "your-base64-encoded-fernet-key"
---
apiVersion: v1
kind: Secret
metadata:
name: git-sync-secrets
type: Opaque
data:
  username: "your-base64-encoded-username"
  token: "your-base64-encoded-token"
If you decide to go with plain Kubernetes secrets, keep this yaml file private (don't commit it to a repository). To apply it to your cluster and create all the defined secrets, run:
kubectl apply -f secrets.yaml -n airflow
This command will apply the secrets.yaml file to the Kubernetes cluster, in the airflow namespace. If secrets.yaml is a valid Kubernetes manifest file and the secrets are properly defined, all Kubernetes secrets specified within the file will be created in the cluster and namespace.
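As a convenience, the secret values themselves can be produced with a few lines of Python. This is a hedged sketch (the connection string is a placeholder) that uses the cryptography package to generate a Fernet key, the approach the Airflow documentation recommends, and then base64-encodes both values for the manifest:

import base64
from cryptography.fernet import Fernet  # pip install cryptography

fernet_key = Fernet.generate_key()  # already the urlsafe-base64 string Airflow expects
connection = "postgresql+psycopg2://postgres:admin@10.1.2.3:5432/airflow_meta"  # placeholder

# Kubernetes Secret `data` values must themselves be base64-encoded:
print(base64.b64encode(fernet_key).decode())
print(base64.b64encode(connection.encode()).decode())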
What volumes (and how many volumes) you’ll need will depend on how you decide to store Airflow logs and how your DAGs are structured. There are, in essence, two ways to store DAG information:
The key point to keep in mind is that the folder the Airflow scheduler and webserver are watching to retrieve DAGs from and fill in the DagBag needs to contain built DAGs Airflow can process. In our case, our DAGs are static, built directly into DAG files. Therefore, we went with a simple git-sync approach, syncing our DAG files into an ephemeral volume and pointing the webserver and scheduler there.
This means the only persistent volume we needed was to store Airflow logs.
A PersistentVolume is a cluster resource that exists independently of a Pod, meaning the disk and data stored there will persist as the cluster changes and Pods are deleted and created. PersistentVolumes can be dynamically created through a PersistentVolumeClaim, which is a request for and claim to a PersistentVolume resource:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: airflow-logs-pvc
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
storageClassName: standard
This creates an airflow-logs-pvc resource we can use to store Airflow logs.
Kubernetes RBAC is a security feature allowing us to manage access to resources within the cluster through defined roles.
A Role is a set of rules that defines the actions allowed within a specific namespace. A RoleBinding is a way to associate a specific Role with a user or, in our case, a service account.
To define roles that apply cluster-wide rather than within a specific namespace, you can use a ClusterRole and an associated ClusterRoleBinding instead.
In the context of our Airflow deployment, a ClusterRole is required to allow the relevant service account to manage Pods. Therefore, we created an airflow-pod-operator role:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
namespace: airflow
name: airflow-pod-operator
rules:
- apiGroups: [""]
resources: ["pods"]
verbs: ["create", "delete", "get", "list", "patch", "watch"]
with an associated role binding:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: airflow-pod-operator
subjects:
- kind: ServiceAccount
name: airflow-service-account
namespace: airflow
roleRef:
  kind: ClusterRole
name: airflow-pod-operator
apiGroup: rbac.authorization.k8s.io
The scheduler is a critical component of the Airflow application, and it needs to be deployed to its own Pod inside the cluster. At its core, the scheduler is responsible for ensuring DAGs run when they are supposed to, and tasks are scheduled and ordered accordingly.
The scheduler deployment manifest file that comes with the Helm chart (you can find it inside the scheduler folder) is a good starting point for the configuration. You'll only need to tweak it a bit to match your namespace and any specific configuration you might have around volumes.
In our case, we wanted to sync our DAGs from a GitHub repository, so we needed to configure a git-sync container. An easy way to get started is to configure the connection with a username and token, although for a production deployment it’s best to configure the connection via SSH. With git-sync configured, our scheduler deployment looked like this:
apiVersion: apps/v1
kind: Deployment
metadata:
name: airflow-scheduler
namespace: airflow
labels:
tier: airflow
component: scheduler
release: airflow
spec:
replicas: 1
selector:
matchLabels:
tier: airflow
component: scheduler
release: airflow
template:
metadata:
labels:
tier: airflow
component: scheduler
release: airflow
annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
spec:
restartPolicy: Always
terminationGracePeriodSeconds: 10
serviceAccountName: airflow-service-account
volumes:
- name: config
configMap:
name: airflow-config
- name: dags-volume
emptyDir: {}
- name: logs-volume
persistentVolumeClaim:
claimName: airflow-logs-pvc
initContainers:
- name: run-airflow-migrations
image: apache/airflow:2.7.1-python3.11
imagePullPolicy: IfNotPresent
args: ["bash", "-c", "airflow db migrate"]
env:
        - name: AIRFLOW__CORE__FERNET_KEY
valueFrom:
secretKeyRef:
name: airflow-metadata
key: fernet-key
- name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
valueFrom:
secretKeyRef:
name: airflow-metadata
key: connection
- name: AIRFLOW_CONN_AIRFLOW_DB
valueFrom:
secretKeyRef:
name: airflow-metadata
key: connection
volumeMounts:
- name: config
mountPath: "/opt/airflow/airflow.cfg"
subPath: airflow.cfg
readOnly: true
containers:
- name: git-sync
image: registry.k8s.io/git-sync/git-sync:v4.0.0-rc5
args:
- --repo=https://github.com/ombulabs/airflow-pipelines
- --depth=1
- --period=60s
- --link=current
- --root=/git
- --ref=main
env:
- name: GITSYNC_USERNAME
valueFrom:
secretKeyRef:
name: git-username
key: username
- name: GITSYNC_PASSWORD
valueFrom:
secretKeyRef:
name: git-token
key: token
volumeMounts:
- name: dags-volume
mountPath: /git
- name: scheduler
image: us-east1-docker.pkg.dev/my_project/airflow-images/airflow-deployment:latest
imagePullPolicy: Always
args:
- scheduler
env:
- name: AIRFLOW__CORE__DAGS_FOLDER
value: "/git/current"
- name: AIRFLOW__CORE__FERNET_KEY
valueFrom:
secretKeyRef:
name: airflow-metadata
key: fernet-key
- name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
valueFrom:
secretKeyRef:
name: airflow-metadata
key: connection
- name: AIRFLOW_CONN_AIRFLOW_DB
valueFrom:
secretKeyRef:
name: airflow-metadata
key: connection
livenessProbe:
failureThreshold: 15
periodSeconds: 30
exec:
command:
- python
- -Wignore
- -c
- |
import os
os.environ['AIRFLOW__CORE__LOGGING_LEVEL'] = 'ERROR'
os.environ['AIRFLOW__LOGGING__LOGGING_LEVEL'] = 'ERROR'
from airflow.jobs.scheduler_job import SchedulerJob
from airflow.utils.net import get_hostname
import sys
job = SchedulerJob.most_recent_job()
sys.exit(0 if job.is_alive() and job.hostname == get_hostname() else 1)
volumeMounts:
- name: config
mountPath: "/opt/airflow/airflow.cfg"
subPath: airflow.cfg
readOnly: true
- name: dags-volume
mountPath: /git
- name: logs-volume
mountPath: "/opt/airflow/logs"
The scheduler deployment is divided into two "stages": the initContainers and the containers. When Airflow starts, it needs to run database migrations in the metadata database. That is what the init container is doing: it runs as soon as the scheduler pod starts, and ensures the database migration is completed before the main application containers start. Once the init container is done with the startup task, the git-sync and scheduler containers can run.
Notice that the scheduler container references a custom image in Artifact Registry. Given our pipeline setup and choice of executor, we replaced the official Airflow image in the deployment with our own image.
The webserver is another critical Airflow component: it exposes the Airflow UI and manages user interaction with Airflow. Its deployment is very similar to that of the scheduler, with minor differences, so we won't go into it in detail. The manifest file looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
name: airflow-webserver
namespace: airflow
labels:
tier: airflow
component: webserver
release: airflow
spec:
replicas: 1
strategy:
type: RollingUpdate
    rollingUpdate:
      maxSurge: 3
      maxUnavailable: 1
selector:
matchLabels:
tier: airflow
component: webserver
release: airflow
template:
metadata:
labels:
tier: airflow
component: webserver
release: airflow
annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
spec:
restartPolicy: Always
terminationGracePeriodSeconds: 10
serviceAccountName: default
volumes:
- name: config
configMap:
name: airflow-config
- name: dags-volume
emptyDir: {}
- name: logs-volume
persistentVolumeClaim:
claimName: airflow-logs-pvc
initContainers:
- name: run-airflow-migrations
image: apache/airflow:2.7.1-python3.11
imagePullPolicy: IfNotPresent
args: ["bash", "-c", "airflow db migrate"]
env:
- name: AIRFLOW__CORE__FERNET_KEY
valueFrom:
secretKeyRef:
name: airflow-metadata
key: fernet-key
- name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
valueFrom:
secretKeyRef:
name: airflow-metadata
key: connection
- name: AIRFLOW_CONN_AIRFLOW_DB
valueFrom:
secretKeyRef:
name: airflow-metadata
key: connection
volumeMounts:
- name: config
mountPath: "/opt/airflow/airflow.cfg"
subPath: airflow.cfg
readOnly: true
containers:
- name: git-sync
image: registry.k8s.io/git-sync/git-sync:v4.0.0-rc5
args:
- --repo=https://github.com/ombulabs/airflow-pipelines
- --depth=1
- --period=60s
- --link=current
- --root=/git
- --ref=main
env:
- name: GITSYNC_USERNAME
valueFrom:
secretKeyRef:
name: git-username
key: username
- name: GITSYNC_PASSWORD
valueFrom:
secretKeyRef:
name: git-token
key: token
volumeMounts:
- name: dags-volume
mountPath: /git
- name: webserver
image: us-east1-docker.pkg.dev/my_project/airflow-images/ombu-airflow-deployment:latest
imagePullPolicy: Always
args:
- webserver
env:
- name: AIRFLOW__CORE__FERNET_KEY
valueFrom:
secretKeyRef:
name: airflow-metadata
key: fernet-key
- name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
valueFrom:
secretKeyRef:
name: airflow-metadata
key: connection
- name: AIRFLOW_CONN_AIRFLOW_DB
valueFrom:
secretKeyRef:
name: airflow-metadata
key: connection
- name: AIRFLOW__WEBSERVER__AUTH_BACKEND
value: "airflow.api.auth.backend.basic_auth"
volumeMounts:
- name: config
mountPath: "/opt/airflow/airflow.cfg"
subPath: airflow.cfg
readOnly: true
- name: dags-volume
mountPath: /git
- name: logs-volume
mountPath: "/opt/airflow/logs"
ports:
- name: airflow-ui
containerPort: 8080
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 15
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 15
Perhaps the most notable thing here is the presence of the AIRFLOW__WEBSERVER__AUTH_BACKEND environment variable, which allows us to use a basic authentication backend with Airflow. As part of this deployment, we didn’t configure the creation of a root user, meaning one needed to be created from within the container by the first person trying to access the UI. If you find yourself in the same situation:
Run
kubectl exec -it <webserver-pod-name> -n airflow -c webserver -- /bin/sh
to access the shell within the webserver container. By default, running the command without the -c webserver flag will access the git-sync container, which is not what we want. Once inside the shell, run:
su airflow
to switch to the airflow user, which is needed to run airflow commands. Now you can run:
airflow users create --username <your_username> --firstname <first_name> --lastname <last_name> --role <the-user-role> --email <your-email> --password <your-password>
This will create a user with the specified role. It only needs to be run once, to create the first admin user after a fresh deployment; additional users can be created directly from within the interface.
Having the webserver deployed to a pod is not enough to be able to access the UI. It needs a Service resource associated with it to allow access to the workload running inside the cluster. In our webserver manifest file, we defined an airflow-ui port name for the 8080 container port. Now we need a service that exposes this port so that network traffic can be directed to the correct pod:
kind: Service
apiVersion: v1
metadata:
  name: webserver-svc
  namespace: airflow
spec:
  type: ClusterIP
  selector:
    tier: airflow
    component: webserver
    release: airflow
  ports:
    - name: airflow-ui
      protocol: TCP
      port: 80
      targetPort: 8080
There are five types of Kubernetes services that can be defined, with the ClusterIP type being the default. It provides an internal IP and DNS name, making the service only accessible within the cluster. This means that we now have a service associated with the webserver, but we still can’t access the UI through a friendly subdomain like a regular application.
For that, we’ll configure an ingress next. An Ingress is an API object that defines the rules and configurations to manage external access to our cluster’s services.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: airflow-ingress
  namespace: airflow
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt"
spec:
  ingressClassName: "nginx"
  tls:
    - hosts:
        - airflow.my_domain.com
      secretName: cert-tls-secret
  rules:
    - host: airflow.my_domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: webserver-svc
                port:
                  number: 80
The key configuration here that allows us to define the settings for secure HTTPS connections is the tls section. There, we can list all hosts for which to enable HTTPS and the name of the Kubernetes Secret that holds the TLS certificate and private key used to secure the connection. This secret is automatically created by cert-manager.
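For completeness, the "letsencrypt" issuer referenced in the annotation is defined by a cert-manager ClusterIssuer resource. A sketch of what that might look like (the contact email and solver settings are assumptions, not our exact configuration):
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt
spec:
  acme:
    # Let's Encrypt production endpoint
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@my_domain.com # assumed contact email
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - http01:
          ingress:
            class: nginx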
Finally, to ensure our resources have the necessary permissions to spawn and manage pods, we need to configure service accounts for them. You can choose to configure individual service accounts for each resource or a single service account for all resources, depending on your security requirements. The ServiceAccount resource can be configured as:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: airflow
  labels:
    tier: airflow
    component: scheduler
    release: airflow
automountServiceAccountToken: true
Since we wanted users to be able to manage workflows directly from the UI, we also configured a service account for the webserver.
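Note that a ServiceAccount by itself grants no permissions; the ability to create and manage pods comes from RBAC rules bound to it. A minimal sketch of what such a Role and RoleBinding could look like (the names and verbs are illustrative, not our exact rules):
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airflow-pod-manager
  namespace: airflow
rules:
  # Allow managing pods in the airflow namespace
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airflow-pod-manager-binding
  namespace: airflow
subjects:
  - kind: ServiceAccount
    name: default
    namespace: airflow
roleRef:
  kind: Role
  name: airflow-pod-manager
  apiGroup: rbac.authorization.k8s.io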
There is also an optional component that collects metrics from the Airflow application. Its deployment is similar to the other two, so we won’t dive into it.
Airflow is now deployed to a GKE cluster and accessible via our chosen subdomain. This allows us to have a higher level of control over our infrastructure, while still leveraging GKE’s built-in resources to auto-scale as needed.
In that spirit, we are excited to announce a new role in our organization: the Account Advocate, a key member of our team fully dedicated to championing client interests, fostering collaboration, and ensuring successful partnerships that go above and beyond.
The Account Advocate is a key, strategic role focused on ensuring that our clients are happy with our partnership not only from a technical and delivery perspective, but also from a business perspective. They are an advocate and representative for your business stakeholders inside our team, dedicated to connecting your vision with our delivery and ensuring your goals are met and any potential concerns are heard and addressed.
The Account Advocate works closely with the Project Manager to ensure success, but while the Project Manager focuses on delivery and the success of the existing project, the Account Advocate focuses on the overall relationship with the business, makes sure value delivery expectations are met, your team is being heard and ensures we’re delivering value to your company at every opportunity.
They also facilitate communication with senior leadership on both ends, ensuring that you have all the support you need for a successful collaboration.
Communication is key to everything we do. We value open and honest communication with our clients and between our teams. As such, you will have plenty of contact and checkpoints with our delivery team.
The Account Advocate is focused on more strategic goals and higher-level partnership priorities, so they will aim to meet with business stakeholders quarterly. If a different frequency is preferred, we will most definitely adapt, but we believe at least quarterly contact is important to ensure success and happiness on both ends of the partnership.
While communicating and collaborating with you, the Account Advocate will focus on:
Client Happiness: We are committed to understanding your goals, challenges and opportunities. Client happiness is at the core of our business, and they are your voice within our organization, ensuring your feedback is being heard and any concerns you might have are understood and addressed swiftly.
Strong Partnership and Collaboration: Ongoing collaboration makes partnerships grow stronger, and we are interested in delivering as much value to your organization as we can. They will collaborate closely with your business team to foster trust and open communication and facilitate collaboration at the higher levels of leadership.
New Opportunities: We are vested in your success and believe in going above and beyond in everything we do. The Account Advocate is interested in hearing what other problems we can help solve, other challenges we can help you overcome and overall other ways in which we can contribute to deliver cost-effective solutions that solve real problems and generate actual value for you and your team.
Problem Resolution: We believe in Challenging Projects over Profitable Projects; that’s why we are so passionate about every project we work on. That also means we understand challenges arise and are a part of every successful collaboration. The Account Advocate is focused on solving any issues swiftly and transparently, ensuring minimal disruption.
As we introduce the Account Advocate role to our team and to our partnership, we are excited to see how it will contribute to an even more successful and strong relationship with our clients. This role strengthens our commitment to client happiness and success and our interest in building long-lasting relationships based on trust, open communication and transparency.
We look forward to working with you and your team on our next successful project! Contact us to get your next project started!
Day 5 of the design sprint is about testing your prototype and getting feedback on your ideas. That way, you can quickly learn what is or isn’t working about the concept. Yesterday, the interviewer spent time putting together a list of questions for the interview sessions. Earlier this week, your team recruited 5 participants for Friday’s research. Now you are ready to do the dang thing.
We test early with a low-fidelity prototype because it’s smart and far less expensive than waiting until something is built. It’s important to try to find test participants who are outside of your organization, or at least unfamiliar with the product. The Design Sprint can’t be considered complete before research is done, so get ready to find out how other people feel about what you’ve been working on all week.
What does the team hope to learn from these interviews? A high level goal of “Do people like this?” might become something like “What do people think about the solution? What are the positives and negatives? What do people like or dislike about our solution vs our competitor’s solution?”
Start with easy open-ended interview questions that align with your research goals, such as “How long have you been doing…”.
Only ask open-ended questions, no “yes/no” questions, nor “multiple choice” questions like “would you do x?”.
You can ask things like “What was the most useful part of this prototype? What was the least useful?”.
Avoid asking any questions that might lead a participant to a particular answer. You want to learn as much as possible in the sessions, so keep the questions open-ended. You’ll be surprised how much you learn.
When you have finished writing your questions, run a pilot version of the session with a team member.
Adjust as needed if you notice any hiccups.
A laptop with a video-enabled virtual meeting tool, like Zoom, Webex, or Google Hangouts, that enables your participants to share their screen and your team to observe from their computers.
A link to the prototype (like a Google slides link or something like that).
People will try to please you and will generally be kind in interviews, so assure them that you’d like them to be honest with their feedback. Make sure that your interviewees understand that you are not testing them, but rather that they are helping you test the prototype. Tell the interviewee that they are not under any scrutiny and that all difficulties or issues are useful information for the team and will help make the solution better. Plan for each interview to take about 30 minutes or so, depending on how many questions will be asked. Give yourself about 20 minutes between each interview to organize your notes and prepare for the next session.
While the interviews are happening, the rest of the team should be paying attention and watching the interviews remotely. While observing, they should be taking notes on post-its of any notable comments, behaviors, or other observations. These notes will be used to determine the next course of action in terms of adjustments and fixes. Don’t worry about taking overlapping notes. The notes will be organized later and duplication will not affect the quality of the work at all.
Once the interviews are complete, the team reviews their notes together, grouping like notes into themes. The team will discuss these themes. You’ll learn what went well, what didn’t go so well, and what direction or changes you should try in the next iteration. Any changes should be prioritized by the team, and then used to determine the next steps for your fledgling product.
At this point, you have successfully completed the Design Sprint! Bravo!
Using environment variables to store information in the environment itself is one of the most used techniques to address some of these issues. However, if not done properly, the developer experience can deteriorate over time, making it difficult to onboard new team members. Security vulnerabilities can even be introduced if secrets are not handled with care.
In this article, we’ll talk about a few tools that we like to use at OmbuLabs and ideas to help you manage your environment variables efficiently.
In 2011, Heroku created The Twelve-Factor App methodology aimed at providing good practices to simplify the development and deployment of web applications.
As the name suggests, the methodology includes twelve factors, and the third factor states that the configuration of the application should be stored in the environment.
The idea of storing configuration in the environment was not created by Heroku, but Heroku’s popularity for Ruby and Rails applications made this approach widespread.
The main benefit is that our code doesn’t have to store secrets or configuration values that can vary depending on where or how the application is run. Our code simply assumes that those values are available and correct.
The idea of storing configuration in the environment is simple for a single-app production environment: it is easy to set environment variables for the whole system.
Hosting providers like Heroku or Render have a configuration panel to manage the environment variables. However, when many applications have to run on the same system, each of them may need different values for a given environment variable, and then the “environment” depends on the current project and not only on the system.
One of many tools to assist with this is the dotenv gem, which wraps our application with specific environment values based on hidden files that can be loaded independently for each app without polluting the system’s environment variables.
The way dotenv works is that it reads environment variable names and values from a file named .env and populates the ENV hash with them.
By default, dotenv will NOT override variables if they are already present in the ENV hash, but that can be changed by using overload instead of load when initializing the gem.
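As a minimal sketch of that difference (assuming a .env file exists in the project root):
require "dotenv"

# Merges the .env values into the ENV hash, keeping any variables
# that are already set in the process environment.
Dotenv.load

# Same parsing, but values from .env replace existing ENV entries.
Dotenv.overload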
Since the .env file holds information that is specific to a given environment, this file is not meant to be included in the git repository.
How do we let new engineers know that we make use of a .env file, or what the required environment variables are? The dotenv gem provides a good solution: a template feature to generate a .env.template file with the same environment variables but without actual values. Another common practice is to use a file called .env.sample with similar content.
When a new developer clones the repository, they can copy the .env.template or .env.sample file as .env (or any of the variants, we’ll talk about this in a moment) and replace the values as needed.
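For example, a hypothetical sample file (the variable names and values are illustrative):
# .env.sample — committed to the repository, with no real secrets
DATABASE_URL=postgres://localhost:5432/my_app_development
SOME_API_KEY=replace_me
A new developer would then run cp .env.sample .env and edit the values as needed.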
One issue that we have faced in many projects is when a new developer needs to know the environment variables (listed in a .env.sample file), but doesn’t know what values make sense.
In many cases any value works, when the code doesn’t depend on the actual format of the value. However, when the data type or format does matter, things can go wrong.
One example we had of this issue was a third-party gem that required an API secret: the gem would verify the format of the secret against a regular expression, and some actions would fail with an invalid secret format error.
To prevent this, we created and open-sourced the dotenv-validator gem, which leverages a .env.sample file with comments for every environment variable to provide extra information about the expected format of each value. This gem includes a mechanism to warn an engineer about missing or incorrect environment variables when the application starts.
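As an illustration, the annotations in the .env.sample file look roughly like this (the exact comment syntax and supported formats are documented in the gem’s README; the variable names here are hypothetical):
# .env.sample annotated for dotenv-validator
SOME_API_SECRET=replace_me # required
MAX_THREADS=5 # format=int
Then the validator can be called when the application boots, for example:
# e.g. in config/application.rb (a sketch)
DotenvValidator.check   # prints warnings for missing/invalid variables
DotenvValidator.check!  # or raise an error instead of warning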
By default, dotenv only looks for a file named .env, but dotenv-rails provides some naming conventions that we can adopt to further differentiate the environment variables we use, not only per app but also per Rails environment.
When running a Rails app with dotenv-rails, environment variable files are looked up in this order:
root.join(".env.#{Rails.env}.local"),
(root.join(".env.local") unless Rails.env.test?),
root.join(".env.#{Rails.env}"),
root.join(".env")
Using this convention, we can specify different environment variables for the same application when we run the application with rails s or when we run the tests.
Note that all the files listed above are loaded and processed by dotenv in that specific order. This means you can have generic environment variables in a .env file and be more specific, overriding/defining only some of them in a file for the current Rails environment, without having to copy all the variables to the new file.
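For instance, with hypothetical values:
# .env — shared defaults for the app
REDIS_URL=redis://localhost:6379/0
QUEUE_NAME=default

# .env.test — loaded before .env in the test environment, so this value wins
REDIS_URL=redis://localhost:6379/1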
New Rails applications come with a bin/dev script that uses the foreman gem to run multiple processes at once. foreman is aware of the .env file and will load it before our application loads it. However, there’s one important difference: the way foreman parses the .env file is not the same as the way dotenv processes the same file.
The dotenv gem understands comments, and they are ignored when setting the values in the ENV hash, while foreman does not ignore them. So, a .env file that looks like this:
MY_ENV="my value" # some comment here
Will produce different values for ENV["MY_ENV"] depending on how the application is run:
- With rails s, the comment is ignored by dotenv and ENV["MY_ENV"] returns the string "my value".
- With foreman, the comment is not ignored, so ENV["MY_ENV"] returns the string '"my value" # some comment here' (then, when the Rails app loads, the .env file is parsed again by dotenv, but since the variable was already defined by foreman, it is not replaced).
One workaround for this is to rely on the naming convention of alternative files: if, for example, we use .env.development and .env.test files, these will only be parsed by dotenv, thanks to the dotenv-rails convention, and not by foreman.
Another option is to configure the initialization of dotenv to use overload instead of load.
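A minimal sketch of that option, loading dotenv manually early in the boot process (with dotenv-rails the gem normally handles loading for us, so the right hook depends on your version; treat this as illustrative):
# e.g. at the very top of config/application.rb
require "dotenv"
Dotenv.overload(".env") # replaces the values foreman already put in ENV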
Docker is a really popular solution for containerizing applications, and Docker-related files are created by Rails for new apps (since Rails 7.1).
When using docker-compose, it will look for a .env file and, in some cases, it may not ignore comments, or it may process the values differently than dotenv does. You can check the docs here.
If environment variables are not populated correctly by docker-compose compared to dotenv, the workarounds used for foreman can be used here too.
Sometimes we have to run applications that are not aware of the .env file but do expect some configuration in the ENV hash. For example, a background job process running a worker that reads some information from the ENV hash.
In that case, instead of changing our job-runner code to load dotenv, we can use the dotenv executable to wrap any command. For example:
dotenv -f ".env.local" bundle exec rake sidekiq
This wrapper can then be used in a Procfile to ensure dotenv works as expected when using foreman, for example, if we don’t use a .env file.
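For example, a hypothetical Procfile (the process names and file names are illustrative):
web: dotenv -f ".env.development" bundle exec rails s
worker: dotenv -f ".env.development" bundle exec sidekiq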
Another popular gem with similar functionality is the figaro gem. Compared to dotenv, figaro is focused more on Ruby on Rails applications and provides some features like ensuring the presence of specific environment variables (one of the features of dotenv-validator).
dotenv is not focused on Ruby on Rails applications (but can be used with them with no issues) and its development has been more active.
Because of the work we do at OmbuLabs with multiple clients, handling environment variables with a .env file is key for us to quickly switch between projects locally without polluting the system’s environment variables.
For our projects we don’t use a .env file in production, since we define the environment variables in the Heroku dashboard, but we still use dotenv-validator to ensure that the application has all the variables with correct values to avoid unexpected issues.
We try to keep the .env.sample file with development-ready values, but that’s not always possible when some variables are specific to a machine or developer, so adding format validation can help the developer set the correct value.
Feel free to reach out to OmbuLabs for help with your project; we offer many types of services.
Now that the executives, sales team, and lawyers have signed off on the project, how do you get off to a quick start to accomplish your business goals?
Provide access for the external agency before the project kick-off call.
Access provisioning can take anywhere from one to seven days, even in best-case scenarios. Our projects are a fixed retainer, and we begin billing from the project start date, whether we have working access or not.
Be sure to select your project start date while keeping in mind how long it will take to grant the agency full access.
As soon as contracts are signed, one of our project managers will reach out to provide information and begin the access process. Clients who have documentation around external contractor access provisioning and are proactive about onboarding our team are able to start projects without delay.
Update your readme and communicate QA process workflow.
When was the last time you set up your application locally? What does your QA process look like? How long does it normally take to review PRs?
Our PM and engineers ask these questions at the start of any new project. We’ve found that projects begin swiftly when clients have recently updated their readme, and can explain their QA process clearly.
If it’s been a while since you updated your documentation, or if you don’t have any, now is the time to whip something up to support the collaborative effort!
Appoint a clear decision maker and escalation point to facilitate seamless communication.
Depending on your organization’s size, the decision makers could be the same people who are about to collaborate with us. In many cases, our initial communication is with executives or lead engineers.
When a contract is signed, it is important to inform the agency who they will interact with daily, who is needed to conduct check-in calls, and who is a decision maker in the case of code or other types of project changes.
Communicate the business case to the development team.
As you prepare your developers and engineers to work with us or another agency, it is useful to explain the business case, applicable scope of work information, how you foresee the workflow changing (if at all), and expectations around collaboration with external stakeholders.
Teams that understand their roles and responsibilities clearly can collaborate best.
While hiring an external agency may present some challenges, there are easy steps you can take to prepare your organization and team, mitigate those risks, and start projects without a hitch.
You can hit the ground running by documenting and clearly communicating your access provisioning, setup, and QA processes. Having a clear project POC and preparing your team with internal communication will make for a smooth and quick transition process.
Are you interested in working with our agency? We provide many services including staff augmentation, Ruby on Rails Upgrades, and JavaScript Upgrades. You can also check out some of our case studies if you want to know more about past companies who have worked with us.
Day 4 is a little different from the other days of the Design Sprint. Instead of a series of workshops, we will spend most of the day each working on one part of the prototype.
Towards the end of the day, we will do a test run to check on our progress and adjust from there.
Using the storyboard from Wednesday as our map, we will divide and conquer the prototype.
The team will be split into 5 roles:
Makers will create the various sections of the prototype.
How many? 2 to 3 Makers.
Makers will split the storyboard (or storyboards!) into sections.
Each maker is responsible for creating the prototype for their sections of storyboard.
Asset Collectors will gather images/icons and other assets that the makers will need.
How many? 2-3 Asset Collectors.
The asset collectors will make sure that the makers have the assets they need to continue their work. This means finding images, icons, illustrations, sounds, or anything else, so that the makers can stay focused and leave those decisions to someone else.
The Writer provides the text for all the parts of the prototype.
How many? 1 Writer.
The writer fine tunes all the copy from the storyboard and provides that copy to the makers.
This might include fake text for an article about whatever you’re prototyping, an email about the product, an advertisement etc, as well as the copy in the prototype.
The stitcher is responsible for taking the sections of the prototype or prototypes and attaching them together.
Their job is to make sure that the whole experience makes sense from end to end.
How many? 1 Stitcher.
The stitcher puts it all together and makes sure that all the pieces fit together into a seamless prototype.
The Interviewer writes the interview script for Friday.
How many? 1 Interviewer.
The interviewer will write questions for the interviews tomorrow based on the storyboard and the prototype.
A prototype is a tool for research and discovery – not a functional app.
A good prototype feels real enough to closely replicate your desired experience and helps your interviewees get into the headspace of the problem you’re asking them to think about.
The goal for THIS prototype is to make something that does those things AND can be ready to test after 8 hours of work.
The prototype is not a blueprint for a product. It’s a way to get feedback on an idea from people who might use a product like yours in the future, and then apply that feedback so that you have a really good idea of where to look next as you continue the process of making an idea into a service.
You don’t need any fancy design software to do this because this is intended to be accessible for everyone. You could lean on Keynote or PowerPoint.
I recommend these tools because they are basic, not designer-only, and relatively easy to use. You can even use the transition features in both of them to show the flow of the prototype.
You need images, you need text, you need to transition between scenarios and steps, and you need a way to set up your starting scenario. You can (and absolutely should) fake things if you need to.
Does there need to be an email? Fake it.
Does an automated phone call play into your scenario? Fake that, too.
The reason for this is to keep the prototype simple and to prevent the team from conflating this exercise with a normal design process. It’s certainly part of that process, though.
Feel empowered to use a bit of hand-waving during the interview if the prototype takes a little more imagination in some areas, too. Of course you can make this prototype using design software like Sketch, Figma, Balsamiq, XD, or whatever else you like, but beware of letting your Design Sprint prototype become something bigger than it needs to be.
Don’t get too precious about the prototype. Focus on picking a tool that will be easy to use, then use the heck out of it.
If you’re working on an iPad or iPhone app, Apple provides free iOS interface elements for Keynote. There are also lots of free UI kits for PowerPoint. You can also use images of elements you need and Frankenstein your way to a functional (read: testable) prototype.
At about 3pm or so, or about 5 hours into your prototype day, try the thing out.
The prototype should be in a strong rough-draft place. Quickly put your sections together (e.g. just have each Maker play through their sections in order) so you can walk through what you have.
Walk through the full scenario and see how it looks and feels. Make note of any rough spots, then take the notes back with you as you round the corner on getting the prototype to a testable place.
Finally, the stitcher will take all the files and put them together into one whole prototype. Make sure that the interview questions line up with the prototype, and you’re ready for your interviews on Day 5.
Need to see this in action? Contact us to validate your idea with a one-week Design Sprint! 🚀