This series walks through the process of framing the original business problem as a machine learning problem, building the Pecas machine learning model, and building the Slackbot that connects it to Slack.
In this first article, we’ll talk through shaping the problem as a machine learning problem and gathering the data available to analyse and process.
This series will consist of 6 posts focusing on the development of the Pecas machine learning model:
Before we dive into the machine learning aspect of the problem, let’s briefly recap the business problem that led to the solution being built.
OmbuLabs is a software development agency providing specialized services to a variety of different customers. Accurate time tracking is an important aspect of our business model, and a vital part of our work. Still, over the years we faced several time tracking issues related to the accuracy, quality, and timeliness of entries.
This came to a head at the end of 2022, when a report indicated we lost approximately one million dollars largely due to poor time tracking, which affected our invoicing and decision-making negatively. Up to this point, several different approaches had been taken to try to solve the problems, mostly related to different time tracking policies. All of these approaches ended up having significant flaws or negative side effects that led to policies being rolled back. This time, we decided to try to solve the problem differently.
There were a variety of time tracking issues, including time left unlogged, time logged to the wrong project, billable time logged as unbillable, incorrect time allocation, vague entries, among others. Measures put in place to try to mitigate the quality-related issues also led to extensive and time-consuming manual review processes, which were quite costly.
In other words, we needed to:
Our main idea was to replace (or largely replace) the manual process with an automated one. However, although the process was very repetitive, the complexity of the task (interpreting text) meant we needed a tool powerful enough to deal with that kind of data. Hence the idea to use machine learning to automate the time entry review process.
It is worth noting that machine learning powers one aspect of the solution: evaluating the quality and correctness of time entries. Other aspects such as timeliness of entries and completeness of the tracking for a given day or week are very easily solvable without a machine learning approach. Pecas is a combination of both, so it can be as effective as possible in solving the business problem as a whole.
The first thing we need to do is identify what part of the problem will be solved with the help of machine learning and how to properly frame that as a machine learning problem.
The component of the problem that is suitable for machine learning is the one that involves “checking” time entries for quality and accuracy, that is, the one that involves “interpreting” text. Ultimately, the goal is to understand if an entry meets the required standards or not and, if not, notify the team member who logged it to correct it.
Therefore, we have a classification problem on our hands. But what type of classification problem?
Our goal is to be able to classify entries according to pre-defined criteria. There are, in essence, two clear ways we can approach the classification:
Which one we want depends on a few different factors, perhaps the most important one being the existence of a finite, known number of ways in which an entry can be invalid.
If there is a finite, known number of classes an entry can belong to and a known number of ways in which each entry can be invalid, the machine learning model can be used to classify the entry as belonging to a specific category and that entry can then be checked against the specific criteria to determine validity or invalidity.
However, we don’t have that.
Time entries can fall into a wide range of categories based on a mix of specific keywords in the description, the project they’re logged to, the tags applied to the entry, the user who logged it, the day the entry was logged, and more. Too many. Therefore, intermediate classification might not be the best approach. Instead, we can use the entry’s characteristics to teach the model to identify entries that seem invalid, and let it determine the validity or invalidity of the entry directly.
Thus we have a binary classification problem on our hands, whose objective is to classify time entries as valid or invalid.
Now we know what kind of problem we have on our hands, but there is a wide variety of algorithms that can help solve it. The decision of which one to use is best informed by the data itself. So let’s take a look at that.
The first thing we need is, of course, the time tracking data. We use Noko for time tracking, and it offers a friendly API for us to work with.
A Noko time entry as inputted by a user has a few different characteristics:
There is also one relative characteristic of a time entry that is very important: whether it is billable or unbillable. This is controlled by one of two entities: project or tag. Projects can be billable or unbillable. By default, all entries logged to an unbillable project are unbillable and all entries logged to a billable project are billable. However, entries logged to a billable project can be unbillable when a specific tag (the #unbillable tag) is added to the entry.
There is also some metadata and information that comes from user interaction with the system that can be associated with the entry, the most relevant ones being:
Of the entities associated with an entry, one is of particular interest: projects. As mentioned above, projects indicate whether an entry is billable or unbillable. And, as you can imagine, when an entry that belongs in a billable project is logged to an unbillable project by mistake, that entry goes uninvoiced, and we lose money in the process.
A project also has a unique ID that identifies it, a name and a flag that indicates whether it is a billable or unbillable project. The flag and the ID are what matters to us for the classification, the ID because it allows us to link the project to the entry and the flag because it is the project characteristic we want to associate with the data.
There are other data sources that have relevant data that can be used to gain context on time entries, for example calendars, GitHub pull requests, Jira tickets. For now, let’s keep it simple, and use a dataset of time entries enriched with project data, all coming from Noko.
In order to make it easier to work and explore the data, we extracted all time entries from Noko logged between January 1st, 2022 and June 30th, 2023. In addition to entries, projects, tags and users were also extracted from Noko, and the data was loaded into a Postgres database, making it easy to explore with SQL.
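The extraction step can be sketched roughly as follows. This is a hypothetical sketch: the endpoint, the `X-NokoToken` header, and the entry field names are assumptions about Noko’s API, so check the official API documentation for the real schema.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical sketch: endpoint, auth header, and field names are assumptions
# about Noko's v2 API -- verify against the official API documentation.
NOKO_API = "https://api.nokotime.com/v2/entries"


def fetch_entries_page(token, from_date, to_date, page=1):
    """Fetch one page of time entries for a date range."""
    query = urllib.parse.urlencode({"from": from_date, "to": to_date, "page": page})
    req = urllib.request.Request(f"{NOKO_API}?{query}", headers={"X-NokoToken": token})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def entry_to_row(entry):
    """Flatten a raw entry dict into a tuple for a Postgres INSERT."""
    project = entry.get("project") or {}
    return (
        entry["id"],
        entry["date"],
        entry["minutes"],
        entry["description"],
        project.get("id"),
        entry.get("billable", False),
    )
```

Each flattened row can then be bulk-inserted into a Postgres `entries` table (for example with psycopg2’s `executemany`).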
We then extracted a few key characteristics from the set:
| property | stat |
| --- | --- |
| total_entries | 49451 |
| min_value | 0 |
| max_value | 720 |
| duration_q1 | 30 |
| duration_q3 | 90 |
| average_duration_iq | 49.39 |
| average_duration_overall | 71.33 |
| median_duration | 45 |
| max_word_count | 162 |
| min_word_count | 1 |
| avg_word_count | 9.89 |
| word_count_q1 | 4 |
| word_count_q3 | 11 |
| entries_in_word_count_iq | 29615 |
| average_word_count_iq | 6.63 |
| least_used_tag: ops-client | 1 |
| most_used_tag: calls | 12043 |
| unbillable_entries | 33987 |
| billable_entries | 15464 |
| pct_unbillable_entries | 68.73 |
| pct_billable_entries | 31.27 |
The table above allows us to get a good initial insight into the data and derive a few early conclusions:
This initial set of considerations already tells us something about our data. We have a fairly large dataset with a mix of numerical and categorical variables. There are also outliers in several features, and the range of values in duration and word count suggests their relationship with validity is not strictly linear. Our empirical knowledge confirms this: although entries with longer durations are generally expected to have longer descriptions, there are several legitimate cases where long entries have low word counts.
Other characteristics we looked at (in similar fashion) to get a good initial idea of what we were dealing with include:
Together, these gave us a good initial picture of the dataset.
By this point, we know we’re dealing with a binary classification problem and that we have a fairly large dataset with outliers and non-linear relationships in data. The dataset also has a mix of numerical and categorical variables. The problem we have at hand is made more complex by the presence of text data that requires interpretation.
There are a number of algorithms to choose from for binary classification, perhaps the most common being:
A quick comparison of their strengths and weaknesses shows that tree-based models are most likely the right choice for our use case:
Logistic regression’s strengths lie in its simplicity:
However, some of its weaknesses make it clearly not a good candidate for our use case:
Another example of a simple algorithm, with strengths associated with its simplicity:
However, some of its weaknesses also make it immediately not a good choice for our problem:
Naive Bayes’ core strengths are:
However, two key weaknesses make it yet another unsuitable choice for our use case:
Unlike the previous algorithms, two of SVMs core strengths apply to our use case:
However, two core weaknesses make it second to tree-based models as a choice:
We have arrived at the most suitable type of algorithm for our problem at hand! The core strengths of these algorithms that make them a good choice are:
Some weaknesses related to them are:
Therefore, we’ll pick ensemble tree-based models as our starting point.
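To make the “non-linear thresholds” point concrete, here is a hand-written caricature of the kind of rules a tree-based model can learn on its own. All thresholds and feature names here are purely illustrative, not our actual model:

```python
def looks_invalid(entry):
    """Illustrative threshold rules of the kind a decision tree learns.

    All thresholds and feature names below are hypothetical examples.
    """
    if entry["word_count"] < 3:
        # very short descriptions tend to be too vague
        return True
    if entry["minutes"] > 480 and entry["word_count"] < 5:
        # a full day of work needs more than a few words of context
        return True
    if entry["billable_project"] and "#unbillable" in entry["tags"]:
        # explicitly tagged as unbillable on a billable project: a valid override
        return False
    return False
```

A tree-based ensemble discovers interactions like these (duration × word count, project × tag) from labeled data instead of requiring us to enumerate them by hand.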
But which one? That’s a tale for the next post. We’ll do some more analysis of our data, pre-process it, and train a few different models to pick the best one.
Framing your business problem (or business question) as a machine learning problem is a first and necessary step in understanding what kind of problem you’re dealing with and where to start solving it. It helps guide our data exploration and allows us to choose which machine learning algorithm (or family of algorithms) to start with.
A good understanding of the data available to you, the business context around the problem, and the characteristics that matter can help guide your exploration of the dataset to validate some initial questions, such as whether you have enough data and whether the available data conveys the information you need. It’s important not to be tied to these initial assumptions in your analysis, though, as exploring the data might reveal additional, useful insights.
With a good understanding of the problem and dataset, you can make an informed algorithm selection, and start processing your data and engineering your features so your model can be trained. This second step is what we’ll look at in the next post.
Need help leveraging machine learning to solve complex business problems? Send us a message and let’s see how we can help!
Despite being a core activity, we had been having several issues with it not being completed or not being completed properly. A report we ran at the end of 2022 showed our time tracking issues were actually quite severe. We lost approximately one million dollars in 2022 due to time tracking issues that led to decisions made on poor data. It was imperative that we solved the problem.
To help with this issue, we created an evolution of our Pecas project. We turned Pecas into a machine learning powered application capable of alerting users of issues in their time entries. In this article, we’ll talk through the business case behind it and the expected benefits to our company.
Our time tracking issues pre-dated the 2022 end of year report. By that point, we had been having problems for a couple of years, it just wasn’t a big priority. As the company grew, however, the issues multiplied, and got to a point where we needed to prioritize solving the problem.
A detailed analysis of our time tracking data revealed several different issues, both issues that were typically caught by internal processes relying on this information, such as invoicing, and issues that typically remained hidden:
These were some of the main issues we were facing, and as a small company, their impact was even more significant to our projects and our operation overall. We knew it was a problem, and we attempted a few different solutions, including implementing policies around time tracking. They ended up having serious flaws that caused us to reconsider and eventually retract them. But we still had a problem to solve.
At the end of 2022, when we looked at our numbers for the year, we decided to dive deeper into this data. And the cost of the issues mentioned above became very clear: we lost $1,000,000 due to these issues and their consequences. What this meant is that we had a million dollar problem to solve.
Time tracking issues (timing and quality of entries) are one aspect of a complex problem. Improving time tracking quality was one of the problems we had to solve, and one of significant impact. There were, however, multiple root causes that led to the loss we identified (process problems, service management, communication). Those are being addressed separately and are beyond the scope of this article.
Our main issue was that the specific time tracking policies we had implemented didn’t account for nuance. Although delays in entries being entered into the system and entries logged to incorrect projects both decreased, addressing some of the most costly problems we had, honest mistakes were treated the same way as more serious issues, and the policy was found to be unfair in some cases.
This went against our core values and led us to look for a different solution. The main issue was that there was no way to be alerted of honest mistakes in entries before the information was needed, someone reviewed and found the issue manually, or we ran another comprehensive report.
Manual processes for these kinds of tasks are not great. They are expensive and take people away from other activities. We wanted an automated way to monitor and flag entries. We knew from the beginning there was always going to be a human component to it, but if we could reduce the time we spent every week running reports and reviewing and fixing entries, that was already a win.
That’s when we decided to build an internal tool to help with this. Our goal was to reduce the time our operations team invested in time tracking by automating the bulk of the work to find these issues, leaving human review to a much smaller set of entries.
This solution would need to be able to:
The complexity lay in the fact that we were dealing with free-form text data (the entry description) combined with several other properties (project, labels, date, billable or unbillable status, duration). Accounting for all possible scenarios and issues with hard rules would not work. That’s where machine learning comes into play.
We split the entry classification part of the solution in two:
No solution is perfect, and we knew there were going to be issues that still slipped through the cracks, as well as a need for human review. Our goal was to minimize both.
That’s how the Pecas project was born.
When we decided to build the solution, we were spending between 3 and 5 hours every week on time tracking reports. That meant spending between $30,000 and $50,000 every year just on these reports. As the company grew and we had more people joining the team, the time spent on this was also going to increase significantly.
In summary, we had one million dollars in losses in 2022 alone, and were looking at a current cost of $30,000 to $50,000 per year to run the process manually, increasing every time our team grew. We had a pretty solid case to invest in a solution.
Additional factors that contributed to our decision to go ahead with the project were:
In order to properly evaluate whether building a solution was the right move, we also had to consider implementation and maintenance costs. We had the expertise needed within our team, so we didn’t really need to bring in external help to accomplish what we needed; and even with the added complexity of a machine learning model, we were looking at a small application. To put it into perspective:
Assuming we did nothing, we would continue to incur significant losses year after year. From our data analysis and root cause evaluation, we believe the solution could help reduce the loss by approximately 60%, saving us $600,000. Similarly, the solution can reduce the time spent reviewing time entry reports by 80%, meaning our costs would reduce to $6,000 to $10,000 per year, saving us between $24,000 and $40,000 every year, not accounting for potential growth.
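The back-of-the-envelope math behind those figures, using the numbers from this article (integer math keeps the arithmetic exact):

```python
# Figures stated in the article.
loss_2022 = 1_000_000
loss_reduction_pct = 60           # expected reduction in tracking-related losses
review_cost_low, review_cost_high = 30_000, 50_000
review_reduction_pct = 80         # expected reduction in manual review time

loss_savings = loss_2022 * loss_reduction_pct // 100                   # $600,000
review_savings_low = review_cost_low * review_reduction_pct // 100     # $24,000
review_savings_high = review_cost_high * review_reduction_pct // 100   # $40,000
new_review_cost_low = review_cost_low - review_savings_low             # $6,000
new_review_cost_high = review_cost_high - review_savings_high          # $10,000
```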
Building the solution would cost approximately 50% of the total we expected to save, and maintaining it, once built, certainly wouldn’t cost as much as we were losing. Pretty good case to build it!
Add to that the knowledge and learning gains, and preserving our culture of team first, and the decision was easy.
The Pecas app’s first version went live in March and, at that time, supported only filter classification with hard rules. That allowed us to measure user interaction with it and see how (or if) things would improve. It also got us thinking about new ways to leverage the app.
A version of the app with machine learning integrated went live in August, and we have been monitoring it and collecting data. The number of common issues in entries identified has decreased significantly, and timeliness of the logging has greatly improved.
We have found additional use cases for the bot, and created additional alerts for project teams, project managers and our overall Operations team. This has allowed us to identify issues faster and react to them immediately, saving us time, money, and headache in the long run.
We’re still monitoring data and working through results, but a preliminary analysis shows that the number of billable time entries logged to non-billable projects in Q3 2023 was 95% lower than in the same period of 2022, so we’re calling this a win for now as we continue to expand the machine learning and other functionalities.
Machine learning isn’t a magic bullet to all of our problems. In fact, in many cases, it isn’t quite the right solution, and you can go very far with hard rules. There are situations, however, where it is the ideal solution. In those cases, it is a powerful tool to solve very complex problems.
As previously mentioned, an automated tool to aid time tracking quality wasn’t the only solution to this problem. Changes in process were also required, and each case was examined, separately and in conjunction with others, and addressed. But it was a core piece in the strategy, and the results are positive and quite promising.
We specialize in solving complex problems for companies looking to build interesting tools that provide meaningful results. We take a holistic look at the problem, advise on all aspects of the problem, and can help you improve your processes and build the right tool for the right problem.
Got some difficult problems you’d like to solve with software but not quite sure where to start? Unsure if machine learning is the right solution to your problem? Send us a message.
In that spirit, this year we decided to organize our open source contribution time in a way that wasn’t limited to our own open source projects. This is a short post to explain how we aligned our open source contributions with our learning goals, what contributions we made, and why it mattered.
Last year, as a company, we did an exercise in participating in Hacktoberfest with our team. There were positive and negative notes but, overall, feedback around the exercise itself was positive.
This year we had specific goals and topics we wanted to focus on as a team. We decided to use open source projects as a way to learn and practice while also contributing to the community.
Therefore, this year we aligned our open source contributions with our learning goals. At our company, we conduct monthly one-on-one calls with our full-time employees. In those calls, we learn about the areas and skills our direct reports would like to improve.
The problem is that sometimes client work doesn’t give us the opportunities we need to work on said skills.
That’s why we decided to use the month of October to contribute to open source projects with the following intentions:
For senior engineers: We wanted them to improve their upgrading and debugging skills, so that they could improve their skills when it comes to fixing medium to high complexity bugs.
For mid-level engineers: We wanted them to work on features so that they could improve their skills when it came to greenfield-like projects.
This year we decided not to restrict contributions to repositories that were officially participating in Hacktoberfest.
We asked everyone to suggest repositories before we started and we quickly came up with a list of approved projects.
Senior engineers were asked to work on two kinds of issues: technical debt and bugs.
Mid-level engineers were asked to work on any kind of issue they found interesting, with a focus on new features or feature changes.
To organize that:
This time we decided to split into teams:
When it came to our own projects, we decided to have only Ariel and Ernesto’s team work on open source projects maintained by OmbuLabs.
We focused on these projects:
We wanted to make sure that our teams focused on projects that were approved by our engineering management team. The list included some well-known and really useful tools that we’ve been using for years:
In terms of contributions, we considered activity on pull requests and issues as a valid contribution. We understand that sometimes you are looking to add value to an open source project, and after hours of research and trying many different things, all you can add is a comment to an existing issue. In our exercise, and in general, that counts as a contribution too!
Here are all the issues where we added value:
Here are all the pull requests we submitted:
In total during the month of October we invested 392 hours in our open source contributions. That represents an investment of $79,000 into open source by 10 of our senior and mid-level engineers.
We plan to take all of our contributions across the finish line, using our regular, monthly and paid open source investment time. Outside of Hacktoberfest, on average, as a team we invest 38 hours per month on open source contributions.
We look forward to continuing our investment in the open source projects that add value to the world and our communities. We believe this is the way to hone our craft, learn new things faster, and become better professionals.
The Airflow community maintains a Helm chart for Airflow deployment on a Kubernetes cluster. The Helm chart comes with a lot of resources, as it contains a full Airflow deployment with all the capabilities. We didn’t need all of that, and we wanted granular control over the infrastructure. Therefore, we chose not to use Helm, although it provides a very good starting point for the configuration.
The Airflow installation consists of five different components that interact with each other, as illustrated below:
(Source: Official Airflow Documentation)
In order to configure our Airflow deployment on GCP, we used a few different services:
NOTE: The steps below assume you have both the Google Cloud SDK and kubectl installed, and a GCP project set up.
Before deploying Airflow, we need to configure a CloudSQL instance for the metadata database and the GKE cluster that will host the Airflow deployment. We opted to use a Virtual Private Cloud (VPC) to allow the connection between GKE and CloudSQL.
To create a CloudSQL instance for the Airflow database:
gcloud sql instances create airflow_metadb \
--database-version=POSTGRES_15 \
--tier=db-n1-standard-2 \
--region=us-east1 \
--network=airflow_network \
--root-password=admin
Customize the database version, tier, region, and network to your needs. If you don’t plan on using a VPC, you don’t need the network argument. Check out the gcloud sql instances create documentation for a full list of what’s available.
Connect to the newly created instance to create a database to serve as the Airflow metadata database. Here, we’ll create a database called airflow_meta:
gcloud beta sql connect airflow_metadb
This will open a Postgres shell, where you can create the database.
CREATE DATABASE airflow_meta;
Finally, get the instance’s IP address and port to construct the database connection URL, which will be needed for the Airflow setup. You’ll need the IP address listed as PRIVATE:
gcloud sql instances describe airflow_metadb
For a Postgres instance, your connection URL should follow the format:
postgresql+psycopg2://username:password@instance-ip-address:port/db-name
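For example, the URL could be assembled like this in Python (the IP address, port, and credentials below are hypothetical placeholders; substitute the values reported by `gcloud sql instances describe`):

```python
# Hypothetical values -- substitute the PRIVATE IP, port, and credentials
# from your own Cloud SQL instance.
user, password = "postgres", "admin"
host, port, db_name = "10.0.0.5", 5432, "airflow_meta"

conn_url = f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{db_name}"
```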
Before initializing a new Kubernetes cluster on GKE, make sure you have the right project set in the gcloud CLI:
gcloud config set project airflow
Create a new cluster on GKE:
gcloud container clusters create airflow-cluster \
--machine-type e2-standard-2 \
--num-nodes 1 \
--region "us-east1" \
--scopes "cloud-platform"
Choose the correct machine type for your needs. If your cluster ends up requesting more resources than you need, you’ll end up overpaying for Airflow. Conversely, if you have fewer resources than required, you will run into issues such as memory pressure. Also choose the number of nodes to start with and the region according to your needs. The --scopes argument set to cloud-platform allows the GKE cluster to communicate with other GCP resources. If that is not needed or desired, remove it.
For a full list of the options available, check the gcloud container clusters create documentation.
Authenticate kubectl against your newly created cluster:
gcloud container clusters get-credentials airflow-cluster --region "us-east1"
and create a Kubernetes namespace for the Airflow deployment. Although not necessary, this is a good practice, and it’ll allow for the grouping and isolating of resources, enabling, for example, separation of a production and staging deployment within the same cluster.
kubectl create namespace airflow
The cluster should now be set up and ready.
Our goal was to have Airflow deployed to a GKE cluster and the Airflow UI exposed via a friendly subdomain. In order to do that, we need to obtain and use a certificate.
To make the process of obtaining, renewing, and using certificates as easy as possible, we decided to use cert-manager, a native Kubernetes certificate management controller. For that to work, though, we need to ensure that traffic is routed to the correct service, so that requests made to the cert-manager solver to confirm domain ownership reach the right service, and requests made to access the Airflow UI also reach the right service.
In order to do that, an nginx ingress controller was needed.
Unlike an Ingress, an Ingress Controller is an application running inside the cluster that configures a load balancer according to multiple ingress resources. The NGINX ingress controller is deployed in a pod along with such a load balancer.
To help keep the ingress controller resources separate from the rest, let’s create a namespace for it:
kubectl create namespace ingress-nginx
The easiest way to deploy the ingress controller to the cluster is through the official Helm Chart.
Make sure you have helm installed, then add the nginx Helm repository and update your local Helm chart repository cache:
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
Install the ingress-nginx Helm chart in the cluster:
helm install nginx-ingress ingress-nginx/ingress-nginx -n ingress-nginx
where nginx-ingress is the name we’re assigning to the instance of the Helm chart we’re deploying, ingress-nginx/ingress-nginx is the chart to be installed (the ingress-nginx chart in the ingress-nginx Helm repository), and -n ingress-nginx specifies the namespace within the Kubernetes cluster in which to install the chart.
With the controller installed, run:
kubectl get services -n ingress-nginx
and look for the EXTERNAL-IP of the ingress-nginx-controller service. That is the IP address of the load balancer.
To expose the Airflow UI via a subdomain, we configured an A record pointing to this IP address.
Now that the controller is in place, we can proceed with the installation of cert-manager. First, apply the CRD (CustomResourceDefinition) resources:
kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.13.0/cert-manager.crds.yaml
cert-manager relies on its own custom resource types to work; this step ensures those resources are installed.
Like with the controller, we’ll also create a separate namespace for the cert-manager resources:
kubectl create namespace cert-manager
And install cert-manager using the Helm chart maintained by Jetstack:
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager --namespace cert-manager --version v1.13.0
With cert-manager installed, we now need two additional resources to configure it: a ClusterIssuer and a Certificate.
The ClusterIssuer creates a resource to represent a certificate issuer within Kubernetes, i.e., it defines a Kubernetes resource that tells cert-manager who the certificate-issuing entity is and how to connect to it. You can create a simple ClusterIssuer for Let’s Encrypt as follows:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: my_email@my_domain.com
    privateKeySecretRef:
      name: letsencrypt
    solvers:
      - http01:
          ingress:
            class: nginx
The Certificate resource then defines the certificate to issue:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: airflow-certificate
  namespace: airflow
spec:
  secretName: cert-tls-secret
  issuerRef:
    name: letsencrypt
    kind: ClusterIssuer
  commonName: airflow.my_domain.com
  dnsNames:
    - airflow.my_domain.com
Apply both resources to the cluster to get the certificate issued. Assuming everything went well and the DNS records are set up correctly, when you run:
kubectl describe certificate airflow-certificate -n airflow
you should see Status: True at the bottom of the certificate’s description, indicating the certificate has been issued.
Now our cluster is ready to receive the Airflow deployment.
The Airflow deployment includes a few different pieces needed to get Airflow working properly. The Airflow installation in Kubernetes ends up looking more like this:
(Source: Official Airflow Documentation)
Our complete Airflow deployment resources ended up looking like this:
resources
|---- airflow.cfg
|---- secrets.yaml
|---- persistent_volumes
|---- airflow-logs-pvc.yaml
|---- rbac
|---- cluster-role.yaml
|---- cluster-rolebinding.yaml
|---- scheduler
|---- scheduler-deployment.yaml
|---- scheduler-serviceaccount.yaml
|---- statsd
|---- statsd-deployment.yaml
|---- statsd-service.yaml
|---- webserver
|---- webserver-deployment.yaml
|---- webserver-ingress.yaml
|---- webserver-service.yaml
|---- webserver-serviceaccount.yaml
In order to successfully deploy Airflow, we need to make sure the airflow.cfg file is available in the relevant pods. Airflow allows you to configure a variety of different things through this file (check the Configuration Reference for more detailed information).
In Kubernetes, this kind of configuration is stored in a ConfigMap, which is a special kind of “volume” you can mount inside your pods and use to make configuration files available to them. The ConfigMap works together with Kubernetes secrets, meaning you can reference a Secret directly inside a ConfigMap or pass the Secret as an environment variable and reference that.
Of note: Kubernetes secrets are somewhat unsafe, since they just contain a base64-encoded string that can be easily decoded. If secrets need to be versioned or committed somewhere, it’s better to use GCP’s Secret Manager instead.
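To make that point concrete, here is a quick illustration (using Ruby’s standard Base64 module, purely for demonstration) showing that a Secret value is trivially reversible:

```ruby
require "base64"

# A Kubernetes Secret value is base64-encoded, not encrypted: anyone who
# can read the manifest can recover the original value without any key.
encoded = Base64.strict_encode64("postgresql://user:secret@host/db")
decoded = Base64.strict_decode64(encoded)

puts encoded  # an opaque-looking string
puts decoded  # the original connection string, recovered trivially
```

Base64 is an encoding meant for safe transport of binary data, not a secrecy mechanism, which is why committed manifests containing secrets are a risk.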
A ConfigMap for the airflow.cfg file can be created by running:
kubectl create configmap airflow-config --from-file=airflow.cfg -n airflow
where airflow-config is the name of the ConfigMap created and the -n airflow flag is necessary to create the resource in the correct namespace.
Kubernetes secrets can be created using a secrets.yaml manifest file to declare individual secrets:
apiVersion: v1
kind: Secret
metadata:
  name: airflow-metadata
type: Opaque
data:
  connection: "your-base64-encoded-connection-string"
  fernet-key: "your-base64-encoded-fernet-key"
---
apiVersion: v1
kind: Secret
metadata:
  name: git-sync-secrets
type: Opaque
data:
  username: "your-base64-encoded-username"
  token: "your-base64-encoded-token"
If you decide to go with plain Kubernetes secrets, keep this yaml file private (don’t commit it to a repository). To apply it to your cluster and create all the defined secrets, run:
kubectl apply -f secrets.yaml -n airflow
This command will apply the secrets.yaml file to the Kubernetes cluster, in the airflow namespace. If secrets.yaml is a valid Kubernetes manifest file and the secrets are properly defined, all Kubernetes secrets specified within the file will be created in the cluster and namespace.
What volumes (and how many volumes) you’ll need will depend on how you decide to store Airflow logs and how your DAGs are structured. There are, in essence, two ways to store DAG information:
The key point to keep in mind is that the folder the Airflow scheduler and webserver are watching to retrieve DAGs from and fill in the DagBag needs to contain built DAGs Airflow can process. In our case, our DAGs are static, built directly into DAG files. Therefore, we went with a simple git-sync approach, syncing our DAG files into an ephemeral volume and pointing the webserver and scheduler there.

This means the only persistent volume we needed was to store Airflow logs.
A PersistentVolume is a cluster resource that exists independently of a Pod, meaning the disk and data stored there will persist as the cluster changes and Pods are deleted and created. These can be dynamically created through a PersistentVolumeClaim, which is a request for and claim to a PersistentVolume resource:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: airflow-logs-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard
This creates an airflow-logs-pvc resource we can use to store Airflow logs.
Kubernetes RBAC is a security feature allowing us to manage access to resources within the cluster through defined roles.
A Role is a set of rules that defines the actions allowed within a specific namespace. A RoleBinding is a way to associate a specific Role with a user or, in our case, a service account.
To define roles that apply cluster-wide rather than to a specific namespace, you can use a ClusterRole and an associated ClusterRoleBinding instead.
In the context of our Airflow deployment, a ClusterRole is required to allow the relevant service account to manage Pods. Therefore, we created an airflow-pod-operator role:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  namespace: airflow
  name: airflow-pod-operator
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["create", "delete", "get", "list", "patch", "watch"]
with an associated role binding:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: airflow-pod-operator
subjects:
- kind: ServiceAccount
  name: airflow-service-account
  namespace: airflow
roleRef:
  kind: ClusterRole
  name: airflow-pod-operator
  apiGroup: rbac.authorization.k8s.io
The scheduler is a critical component of the Airflow application, and it needs to be deployed to its own Pod inside the cluster. At its core, the scheduler is responsible for ensuring DAGs run when they are supposed to, and tasks are scheduled and ordered accordingly.
The scheduler deployment manifest file that comes with the Helm chart (you can find it inside the scheduler folder) is a good starting point for the configuration. You’ll only need to tweak it a bit to match your namespace and any specific configuration you might have around volumes.
In our case, we wanted to sync our DAGs from a GitHub repository, so we needed to configure a git-sync container. An easy way to get started is to configure the connection with a username and token, although for a production deployment it’s best to configure the connection via SSH. With git-sync configured, our scheduler deployment looked like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-scheduler
  namespace: airflow
  labels:
    tier: airflow
    component: scheduler
    release: airflow
spec:
  replicas: 1
  selector:
    matchLabels:
      tier: airflow
      component: scheduler
      release: airflow
  template:
    metadata:
      labels:
        tier: airflow
        component: scheduler
        release: airflow
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      serviceAccountName: airflow-service-account
      volumes:
        - name: config
          configMap:
            name: airflow-config
        - name: dags-volume
          emptyDir: {}
        - name: logs-volume
          persistentVolumeClaim:
            claimName: airflow-logs-pvc
      initContainers:
        - name: run-airflow-migrations
          image: apache/airflow:2.7.1-python3.11
          imagePullPolicy: IfNotPresent
          args: ["bash", "-c", "airflow db migrate"]
          env:
            - name: AIRFLOW__CORE__FERNET_KEY
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: fernet-key
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: connection
            - name: AIRFLOW_CONN_AIRFLOW_DB
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: connection
          volumeMounts:
            - name: config
              mountPath: "/opt/airflow/airflow.cfg"
              subPath: airflow.cfg
              readOnly: true
      containers:
        - name: git-sync
          image: registry.k8s.io/git-sync/git-sync:v4.0.0-rc5
          args:
            - --repo=https://github.com/ombulabs/airflow-pipelines
            - --depth=1
            - --period=60s
            - --link=current
            - --root=/git
            - --ref=main
          env:
            - name: GITSYNC_USERNAME
              valueFrom:
                secretKeyRef:
                  name: git-sync-secrets
                  key: username
            - name: GITSYNC_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: git-sync-secrets
                  key: token
          volumeMounts:
            - name: dags-volume
              mountPath: /git
        - name: scheduler
          image: us-east1-docker.pkg.dev/my_project/airflow-images/airflow-deployment:latest
          imagePullPolicy: Always
          args:
            - scheduler
          env:
            - name: AIRFLOW__CORE__DAGS_FOLDER
              value: "/git/current"
            - name: AIRFLOW__CORE__FERNET_KEY
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: fernet-key
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: connection
            - name: AIRFLOW_CONN_AIRFLOW_DB
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: connection
          livenessProbe:
            failureThreshold: 15
            periodSeconds: 30
            exec:
              command:
                - python
                - -Wignore
                - -c
                - |
                  import os
                  os.environ['AIRFLOW__CORE__LOGGING_LEVEL'] = 'ERROR'
                  os.environ['AIRFLOW__LOGGING__LOGGING_LEVEL'] = 'ERROR'
                  from airflow.jobs.scheduler_job import SchedulerJob
                  from airflow.utils.net import get_hostname
                  import sys
                  job = SchedulerJob.most_recent_job()
                  sys.exit(0 if job.is_alive() and job.hostname == get_hostname() else 1)
          volumeMounts:
            - name: config
              mountPath: "/opt/airflow/airflow.cfg"
              subPath: airflow.cfg
              readOnly: true
            - name: dags-volume
              mountPath: /git
            - name: logs-volume
              mountPath: "/opt/airflow/logs"
The scheduler deployment is divided into two “stages”, the initContainers and the containers. When Airflow starts, it needs to run database migrations in the metadata database. That is what the init container is doing. It runs as soon as the scheduler pod starts, and ensures the database migration is completed before the main application containers start. Once the init container is done with the startup task, the git-sync and scheduler containers can run.

Notice that the scheduler container references a custom image in Artifact Registry. Given our pipeline setup and choice of executor, we replaced the official Airflow image in the deployment with our own image.
The webserver is another critical Airflow component: it exposes the Airflow UI and manages user interaction with Airflow. Its deployment is very similar to that of the scheduler, with minor differences, so we won’t go into it in detail. The manifest file looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-webserver
  namespace: airflow
  labels:
    tier: airflow
    component: webserver
    release: airflow
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 3
      maxUnavailable: 1
  selector:
    matchLabels:
      tier: airflow
      component: webserver
      release: airflow
  template:
    metadata:
      labels:
        tier: airflow
        component: webserver
        release: airflow
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      serviceAccountName: default
      volumes:
        - name: config
          configMap:
            name: airflow-config
        - name: dags-volume
          emptyDir: {}
        - name: logs-volume
          persistentVolumeClaim:
            claimName: airflow-logs-pvc
      initContainers:
        - name: run-airflow-migrations
          image: apache/airflow:2.7.1-python3.11
          imagePullPolicy: IfNotPresent
          args: ["bash", "-c", "airflow db migrate"]
          env:
            - name: AIRFLOW__CORE__FERNET_KEY
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: fernet-key
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: connection
            - name: AIRFLOW_CONN_AIRFLOW_DB
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: connection
          volumeMounts:
            - name: config
              mountPath: "/opt/airflow/airflow.cfg"
              subPath: airflow.cfg
              readOnly: true
      containers:
        - name: git-sync
          image: registry.k8s.io/git-sync/git-sync:v4.0.0-rc5
          args:
            - --repo=https://github.com/ombulabs/airflow-pipelines
            - --depth=1
            - --period=60s
            - --link=current
            - --root=/git
            - --ref=main
          env:
            - name: GITSYNC_USERNAME
              valueFrom:
                secretKeyRef:
                  name: git-sync-secrets
                  key: username
            - name: GITSYNC_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: git-sync-secrets
                  key: token
          volumeMounts:
            - name: dags-volume
              mountPath: /git
        - name: webserver
          image: us-east1-docker.pkg.dev/my_project/airflow-images/ombu-airflow-deployment:latest
          imagePullPolicy: Always
          args:
            - webserver
          env:
            - name: AIRFLOW__CORE__FERNET_KEY
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: fernet-key
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: connection
            - name: AIRFLOW_CONN_AIRFLOW_DB
              valueFrom:
                secretKeyRef:
                  name: airflow-metadata
                  key: connection
            - name: AIRFLOW__WEBSERVER__AUTH_BACKEND
              value: "airflow.api.auth.backend.basic_auth"
          volumeMounts:
            - name: config
              mountPath: "/opt/airflow/airflow.cfg"
              subPath: airflow.cfg
              readOnly: true
            - name: dags-volume
              mountPath: /git
            - name: logs-volume
              mountPath: "/opt/airflow/logs"
          ports:
            - name: airflow-ui
              containerPort: 8080
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 15
Perhaps the most notable thing here is the presence of the AIRFLOW__WEBSERVER__AUTH_BACKEND environment variable. This allows us to use a basic authentication backend with Airflow. As part of this deployment, we didn’t configure the creation of a root user, meaning one needed to be created from within the container by the first person trying to access the UI. If you find yourself in the same situation:
Run

kubectl exec -it <webserver-pod-name> -n airflow -c webserver -- /bin/sh

to access the shell within the webserver container. By default, running the command without the -c webserver flag will access the git-sync container, which is not what we want. Once inside the shell, run:
su airflow
to switch to the airflow user. This is needed to run airflow commands. Now you can run:
airflow users create --username <your_username> --firstname <first_name> --lastname <last_name> --role <the-user-role> --email <your-email> --password <your-password>
This will create a user with the specified role. This only needs to be run to create the first admin user after a fresh deployment; additional users can be created directly from within the interface.
Having the webserver deployed to a pod is not enough to be able to access the UI. It needs a Service resource associated with it to allow access to the workload running inside the cluster. In our webserver manifest file, we defined an airflow-ui port name for the 8080 container port. Now we need a service that exposes this port so that network traffic can be directed to the correct pod:
kind: Service
apiVersion: v1
metadata:
  name: webserver-svc
  namespace: airflow
spec:
  type: ClusterIP
  selector:
    tier: airflow
    component: webserver
    release: airflow
  ports:
    - name: airflow-ui
      protocol: TCP
      port: 80
      targetPort: 8080
There are several types of Kubernetes services that can be defined, with the ClusterIP type being the default. It provides an internal IP and DNS name, making the service only accessible within the cluster. This means that we now have a service associated with the webserver, but we still can’t access the UI through a friendly subdomain like a regular application.
For that, we’ll configure an ingress next. An Ingress is an API object that defines the rules and configurations to manage external access to our cluster’s services.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: airflow-ingress
  namespace: airflow
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt"
spec:
  ingressClassName: "nginx"
  tls:
    - hosts:
        - airflow.my_domain.com
      secretName: cert-tls-secret
  rules:
    - host: airflow.my_domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: webserver-svc
                port:
                  number: 80
The key configuration here that allows us to define the settings for secure HTTPS connections is the tls section. There, we can list all hosts for which to enable HTTPS and the name of the Kubernetes Secret that holds the TLS certificate and private key used to secure the connection. This secret is automatically created by cert-manager.
Finally, in order to ensure our resources have the necessary permissions to spawn and manage pods, we need to configure service accounts for them. You can choose to configure individual service accounts for each resource or a single service account for all resources, depending on your security requirements.
The ServiceAccount resource can be configured as:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: airflow
  labels:
    tier: airflow
    component: scheduler
    release: airflow
automountServiceAccountToken: true
Since we wanted users to be able to manage workflows directly from the UI, we also configured a service account for the webserver.
This is an optional component that collects metrics inside the Airflow application. The deployment is similar to the other two, so we won’t dive into it.
Airflow is now deployed to a GKE cluster and accessible via our chosen subdomain. This allows us to have a higher level of control over our infrastructure, while still leveraging GKE’s built-in resources to auto-scale as needed.
In that spirit, we are excited to announce a new role in our organization, the Account Advocate: a key role in our team fully dedicated to championing client interests, fostering collaboration, and ensuring successful partnerships that go above and beyond.
The Account Advocate is a key, strategic role focused on ensuring that our clients are happy with our partnership not only from a technical and delivery perspective, but also from a business perspective. They are an advocate and representative for your business stakeholders inside our team, dedicated to connecting your vision with our delivery and ensuring your goals are met and any potential concerns are heard and addressed.
The Account Advocate works closely with the Project Manager to ensure success, but while the Project Manager focuses on delivery and the success of the existing project, the Account Advocate focuses on the overall relationship with the business: they make sure value delivery expectations are met, that your team is being heard, and that we’re delivering value to your company at every opportunity.
They also facilitate communication with senior leadership on both ends, ensuring that you have all the support you need for a successful collaboration.
Communication is key to everything we do. We value open and honest communication with our clients and between our teams. As such, you will have plenty of contact and checkpoints with our delivery team.
The Account Advocate is focused on more strategic goals and higher-level partnership priorities, so they will aim to meet with business stakeholders quarterly. If a different frequency is preferred, we will most definitely adapt, but we believe at least quarterly contact is important to ensure success and happiness on both ends of the partnership.
While communicating and collaborating with you, the Account Advocate will focus on:
Client Happiness: We are committed to understanding your goals, challenges and opportunities. Client happiness is at the core of our business, and they are your voice within our organization, ensuring your feedback is being heard and any concerns you might have are understood and addressed swiftly.
Strong Partnership and Collaboration: Ongoing collaboration makes partnerships grow stronger, and we are interested in delivering as much value to your organization as we can. They will collaborate closely with your business team to foster trust and open communication and facilitate collaboration at the higher levels of leadership.
New Opportunities: We are vested in your success and believe in going above and beyond in everything we do. The Account Advocate is interested in hearing what other problems we can help solve, other challenges we can help you overcome and overall other ways in which we can contribute to deliver cost-effective solutions that solve real problems and generate actual value for you and your team.
Problem Resolution: We believe in Challenging Projects over Profitable Projects; that’s why we are so passionate about every project we work on. That also means we understand challenges arise and are a part of every successful collaboration. The Account Advocate is focused on solving any issues swiftly and transparently, ensuring minimal disruption.
As we introduce the Account Advocate role to our team and to our partnership, we are excited to see how it will contribute to an even more successful and strong relationship with our clients. This role strengthens our commitment to client happiness and success and our interest in building long-lasting relationships based on trust, open communication and transparency.
We look forward to working with you and your team on our next successful project! Contact us to get your next project started!
Day 5 of the design sprint is about testing your prototype and getting feedback on your ideas. That way, you can quickly learn what is or isn’t working about the concept. Yesterday, the interviewer spent time putting together a list of questions for the interview sessions. Earlier this week, your team recruited 5 participants for Friday’s research. Now you are ready to do the dang thing.
We test early with a low-fidelity prototype because it’s smart and far less expensive than waiting until something is built. It’s important to try to find test participants who are outside of your organization, or at least unfamiliar with the product. The Design Sprint can’t be considered complete before research is done, so get ready to find out how other people feel about what you’ve been working on all week.
What does the team hope to learn from these interviews? A high level goal of “Do people like this?” might become something like “What do people think about the solution? What are the positives and negatives? What do people like or dislike about our solution vs our competitor’s solution?”
Start with easy open-ended interview questions that align with your research goals, such as “How long have you been doing…”.
Only ask open-ended questions, no “yes/no” questions, nor “multiple choice” questions like “would you do x?”.
You can ask things like “What was the most useful part of this prototype? What was the least useful?”.
Avoid asking any questions that might lead a participant to a particular answer. You want to learn as much as possible in the sessions, so keep the questions open-ended. You’ll be surprised how much you learn.
When you have finished writing your questions, run a pilot version of the session with a team member.
Adjust as needed if you notice any hiccups.
What you’ll need:

- A laptop
- A video-enabled virtual meeting tool, like Zoom, Webex, or Google Hangouts, that enables your participants to share their screen and your team to observe from their computers
- A link to the prototype (like a Google Slides link or something like that)
People will try to please you and will generally be kind in interviews, so assure them that you’d like them to be honest with their feedback. Make sure that your interviewees understand that you are not testing them, but rather that they are helping you test the prototype. Tell the interviewee that they are not under any scrutiny and that all difficulties or issues are useful information for the team and will help make the solution better. Plan for each interview to take about 30 minutes or so, depending on how many questions will be asked. Give yourself about 20 minutes between each interview to organize your notes and prepare for the next session.
While the interviews are happening, the rest of the team should be paying attention and watching the interviews remotely. While observing, they should be taking notes on post-its of any notable comments, behaviors, or other observations. These notes will be used to determine the next course of action in terms of adjustments and fixes. Don’t worry about taking overlapping notes. The notes will be organized later and duplication will not affect the quality of the work at all.
Once the interviews are complete, the team reviews their notes together, grouping like notes into themes. The team will discuss these themes. You’ll learn what went well, what didn’t go so well, and what direction or changes you should try again in the next iteration. Any changes should be prioritized by the team, and then used to determine the next steps for your fledgling product.
At this point, you have successfully completed the Design Sprint! Bravo!
Using environment variables to store information in the environment itself is one of the most used techniques to address some of these issues. However, if not done properly, the developer experience can deteriorate over time, making it difficult to onboard new team members. Security vulnerabilities can even be introduced if secrets are not handled with care.
In this article, we’ll talk about a few tools that we like to use at OmbuLabs and ideas to help you manage your environment variables efficiently.
In 2011, Heroku created The Twelve-Factor App methodology aimed at providing good practices to simplify the development and deployment of web applications.
As the name suggests, the methodology includes twelve factors, and the third factor states that the configuration of the application should be stored in the environment.
The idea of storing configuration in the environment was not created by Heroku, but Heroku’s popularity for Ruby and Rails applications made this approach widespread.
The main benefit is that our code doesn’t have to store secrets or configuration values that can vary depending on where or how the application is run. Our code simply assumes that those values are available and correct.
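In Ruby, that assumption typically shows up as reads from the ENV hash. A hypothetical illustration of the pattern (variable names are examples, not from any specific project):

```ruby
# Hypothetical illustration: the app reads configuration from the
# environment instead of hard-coding it. In a real deployment these
# variables would already be set by the platform (e.g. Heroku).
ENV["DATABASE_URL"] ||= "postgres://localhost/dev_db"

database_url = ENV.fetch("DATABASE_URL")       # required: raises KeyError if unset
log_level    = ENV.fetch("LOG_LEVEL", "info")  # optional, with a default

puts database_url
puts log_level
```

Using ENV.fetch for required values makes a missing variable fail loudly at boot instead of surfacing as a confusing error later.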
The idea of storing configuration in the environment is simple for a single-app production environment: it is easy to set environment variables for the whole system.
Hosting providers like Heroku or Render have a configuration panel to manage the environment variables. However, when many applications have to run in the same system each of them may need different values for a given environment variable, and then the “environment” depends on the current project and not only on the system.
One of many tools to assist with this is the dotenv gem, which wraps our application with specific environment values based on hidden files that can be loaded independently for each app without polluting the system’s environment variables.
The way dotenv works is that it will read environment variable names and values from a file named .env and will populate the ENV hash with them.
By default, dotenv will NOT override variables if they are already present in the ENV hash, but that can be changed using overload instead of load when initializing the gem.
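As a mental model, the loading behavior looks roughly like this (a simplified sketch for illustration, not the gem’s actual implementation):

```ruby
require "tempfile"

# Simplified sketch of dotenv-style loading: read KEY=value pairs from a
# file and populate ENV, skipping keys that are already set (the default,
# like `load`). Pass overload: true to replace existing values instead,
# mimicking `overload`.
def load_env_file(path, overload: false)
  File.foreach(path) do |line|
    line = line.strip
    next if line.empty? || line.start_with?("#")
    key, value = line.split("=", 2)
    next if key.nil? || value.nil?
    value = value.sub(/\s+#.*\z/, "").gsub(/\A"|"\z/, "")
    ENV[key] = value if overload || ENV[key].nil?
  end
end

# Demo: a variable already set in the shell wins under plain load.
ENV["FROM_SHELL"] = "shell-value"
Tempfile.create("env") do |f|
  f.write(%(FROM_SHELL="file-value" # ignored comment\nONLY_IN_FILE="hello"\n))
  f.flush
  load_env_file(f.path)
  puts ENV["FROM_SHELL"]    # still "shell-value"
  puts ENV["ONLY_IN_FILE"]  # "hello"
end
```

With overload: false a value exported in the shell wins over the .env file; with overload: true the file wins, which is exactly the choice between dotenv’s load and overload.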
Since the .env file holds information that is specific to a given environment, this file is not meant to be included in the git repository.
How do we let new engineers know that we make use of a .env file or what the required environment variables are? The dotenv gem provides a good solution.
The dotenv gem provides a template feature to generate a .env.template file with the same environment variables but without actual values. Another common practice is to use a file called .env.sample with similar content.
When a new developer clones the repository, they can copy the .env.template or .env.sample file as .env (or any of the variants, we’ll talk about this in a moment) and replace the values as needed.
One issue that we have faced in many projects is when a new developer would need to know the environment variables (listed in a .env.sample file), but wouldn’t know what values would make sense.
In many cases any value works, as long as the code doesn’t depend on the actual format of the value. However, when the data type or format does matter, things can go wrong.
One example we had for this issue was a third-party gem that required an API secret: the gem would verify the format of the secret against a regular expression, and some actions would fail with an invalid secret format error.
To prevent this, we created and open-sourced the dotenv-validator gem, which leverages the use of a .env.sample file with comments for every environment variable to provide extra information about the expected format of the value for each variable. This gem includes a mechanism to warn an engineer about missing or incorrect environment variables when the application starts.
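The idea can be sketched like this (a simplified illustration of format validation, not dotenv-validator’s actual API; the formats and variable names here are made up):

```ruby
# Simplified illustration: each variable has an expected format. In the
# real gem, the format comes from a comment next to the variable in the
# .env.sample file; here we just build the mapping by hand.
def invalid_env_vars(required_formats, env)
  required_formats.reject { |key, format| env[key].to_s.match?(format) }.keys
end

formats = {
  "PORT"       => /\A\d+\z/,           # must be numeric
  "API_SECRET" => /\A[a-f0-9]{32}\z/,  # must be a 32-character hex string
}
env = { "PORT" => "3000", "API_SECRET" => "not-a-hex-secret" }

puts invalid_env_vars(formats, env).inspect  # => ["API_SECRET"]
```

A missing variable fails the same check, since env[key].to_s is an empty string that won’t match any meaningful format.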
By default, dotenv only looks for a file named .env, but, when using dotenv-rails, it will provide some naming conventions that we can adopt to further differentiate the environment variables we use, not only per app but also per Rails environment. When running a Rails app with dotenv-rails, environment variable files are looked up in this order:
root.join(".env.#{Rails.env}.local"),
(root.join(".env.local") unless Rails.env.test?),
root.join(".env.#{Rails.env}"),
root.join(".env")
Using this convention we can specify different environment variables for the same application when we run the application with rails s or when we run the tests.
Note that all the files listed above are loaded and processed by dotenv in that specific order. This means you can have generic environment variables in a .env file and be more specific, overriding/defining only some of them in a file for the current Rails environment, without having to copy all the variables to the new file.
New Rails applications come with a bin/dev script that uses the foreman gem to run multiple processes at once. foreman is aware of the .env file and will load it before our application loads it. However, there’s one important difference: the way foreman parses the .env file is not the same as the way dotenv processes the same file.
The dotenv gem understands comments, which are ignored when setting the values in the ENV hash, while foreman does not ignore them. So, a .env file that looks like this:
MY_ENV="my value" # some comment here
will produce different values for ENV["MY_ENV"] depending on how the application is run:

- With rails s, the comment is ignored by dotenv and ENV["MY_ENV"] returns the string "my value"
- With foreman, the comment is not ignored, so ENV["MY_ENV"] returns the string '"my value" # some comment here' (then, when the Rails app loads, the .env file is parsed again by dotenv, but since the variable was already defined by foreman, it is not replaced)

One workaround for this is to rely on the naming convention of alternative files: if, for example, we use .env.development and .env.test files, these will only be parsed by dotenv, thanks to the dotenv-rails convention, and not by foreman.
Another option is to configure the initialization of dotenv to use overload instead of load.
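The parsing difference described above can be reproduced with a few lines of Ruby (simplified stand-ins for both parsers, for illustration only):

```ruby
line = 'MY_ENV="my value" # some comment here'
raw = line.split("=", 2).last

# dotenv-style: drop the trailing comment, then the surrounding quotes
dotenv_value = raw.sub(/\s+#.*\z/, "").gsub(/\A"|"\z/, "")

# foreman-style (simplified): everything after "=" is taken verbatim
foreman_value = raw

puts dotenv_value   # my value
puts foreman_value  # "my value" # some comment here
```

The same source line yields two different strings, which is exactly the mismatch that makes the variable’s value depend on which tool loaded the file first.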
Docker is a really popular solution for containerizing applications, and Docker-related files will be created by Rails for new apps (since Rails 7.1).
When using docker-compose, it will look for a .env file and, in some cases, it may not ignore comments or may process the values differently than dotenv. You can check the docs here.
If environment variables are not populated correctly by docker-compose compared to dotenv, the workarounds used for foreman can be used here too.
Sometimes we have to run applications that are not aware of the .env file but do expect some configuration in the ENV hash. For example, a background job process running a worker that reads some information from the ENV hash.
In that case, instead of changing our job-runner code to load dotenv, we can use the dotenv executable to wrap any command. For example:
dotenv -f ".env.local" bundle exec rake sidekiq
This wrapper can then be used in a Procfile to ensure dotenv works as expected when using foreman, for example, if we don’t use a .env file.
Another popular gem with similar functionality is the figaro gem. Compared to dotenv, figaro is focused more on Ruby on Rails applications and provides some features like ensuring the presence of specific environment variables (one of the features of dotenv-validator). dotenv is not focused on Ruby on Rails applications (but can be used with no issues) and its development has been more active.
Because of the work we do at OmbuLabs with multiple clients, handling environment variables with a .env file is key for us to quickly change between projects locally without polluting the system’s environment variables.
For our projects we don’t use a .env file in production, since we define the environment variables in the Heroku dashboard, but we still use dotenv-validator to ensure that the application has all the variables with correct values to avoid unexpected issues.
We try to keep the .env.sample file with development-ready values, but it’s not always possible when some variables are specific to a machine or developer, so adding format validation can help the developer set the correct value.
Feel free to reach out to OmbuLabs for help in your project, we offer many types of services.
Now that the executives, sales team, and lawyers have signed off on the project, how do you get off to a quick start to accomplish your business goals?
Provide access for the external agency before the project kick-off call.
Access provisioning can take anywhere from 1-7 days in best case scenarios. Our projects are a fixed retainer, and we begin billing from the project start date, whether we have working access or not.
Be sure to select your project start date while keeping in mind how long it will take to grant the agency full access.
As soon as contracts are signed, one of our project managers will reach out to provide information and begin the access process. Clients who have documentation around external contractor access provisioning and are proactive about onboarding our team are able to start projects without delay.
Update your readme and communicate QA process workflow.
When was the last time you set up your application locally? What does your QA process look like? How long does it normally take to review PRs?
Our PM and engineers ask these questions at the start of any new project. We’ve found that projects begin swiftly when clients have recently updated their readme, and can explain their QA process clearly.
If it’s been a while since you updated your documentation, or if you don’t have any, now is the time to whip something up to support the collaborative effort!
Appoint a clear decision maker and escalation point to facilitate seamless communication.
Depending on your organization’s size, the decision makers could be the same people who are about to collaborate with us. In many cases, our initial communication is with executives or lead engineers.
When a contract is signed, it is important to inform the agency who they will interact with daily, who is needed to conduct check-in calls, and who is a decision maker in the case of code or other types of project changes.
Communicate the business case to the development team.
As you prepare your developers and engineers to work with us or another agency, it is useful to explain the business case, applicable scope of work information, how you foresee the workflow changing (if at all), and expectations around collaboration with external stakeholders.
Teams that understand their roles and responsibilities clearly can collaborate best.
While hiring an external agency may have some challenges, there are easy steps to take to prepare your organization and team to mitigate those risks and start projects without a hitch.
You can hit the ground running by documenting and clearly communicating your access provisioning, setup, and QA processes. Having a clear project POC, and preparing your team with internal communication, will make for a smooth and quick transition process.
Are you interested in working with our agency? We provide many services including staff augmentation, Ruby on Rails Upgrades, and JavaScript Upgrades. You can also check out some of our case studies if you want to know more about companies who have worked with us in the past.
Day 4 is a little different from the other days of the Design Sprint. Instead of a series of workshops, we will spend most of the day each working on one part of the prototype.
Towards the end of the day, we will do a test run to check on our progress and adjust from there.
Using the storyboard from Wednesday as our map, we will divide and conquer the prototype.
The team will be split into 5 roles:
Makers will create the various sections of the prototype.
How many? 2 to 3 Makers.
Makers will split the storyboard (or storyboards!) into sections.
Each maker is responsible for creating the prototype for their sections of storyboard.
Asset Collectors will gather images/icons and other assets that the makers will need.
How many? 2-3 Asset Collectors.
The asset collectors will make sure that the makers have the assets they need to continue their work. This means finding images, icons, illustrations, sounds, or anything else, so that the makers can stay focused on building and leave these decisions to someone else.
Writer to provide the text for all the parts of the prototype.
How many? 1 Writer.
The writer fine tunes all the copy from the storyboard and provides that copy to the makers.
This might include fake text for an article about whatever you’re prototyping, an email about the product, an advertisement, etc., as well as the copy in the prototype itself.
The stitcher is responsible for taking the sections of the prototype or prototypes and attaching them together.
Their job is to make sure that the whole experience makes sense from end to end.
How many? 1 Stitcher.
The stitcher puts it all together and makes sure that all the pieces fit together into a seamless prototype.
Interviewer, who will write the interview script for Friday.
How many? 1 Interviewer.
The interviewer will write questions for the interviews tomorrow based on the storyboard and the prototype.
A prototype is a tool for research and discovery – not a functional app.
A good prototype feels real enough to closely replicate your desired experience and help your interviewees get into the headspace of the problem you’re asking them to think about.
The goal for THIS prototype is to make something that does those things AND can be ready to test after 8 hours of work.
The prototype is not a blueprint for a product. It’s a way to get feedback on an idea from people who might use a product like yours in the future, and then apply that feedback so that you have a really good idea of where to look next as you continue the process of making an idea into a service.
You don’t need any fancy design software to do this because this is intended to be accessible for everyone. You could lean on Keynote or Powerpoint.
I recommend these tools because they are basic, they are not designer-only, and because they are relatively easy to use. You can even use the transition features in both of them to show the flow of the prototype.
You need images, you need text, you need to transition between scenarios and steps, and you need a way to set up your starting scenario. You can (and absolutely should) fake things if you need to.
Does there need to be an email? Fake it.
Does an automated phone call play into your scenario? Fake that, too.
The reason for this is to keep the prototype simple and to prevent the team from conflating this exercise with a normal design process. It’s certainly part of that process, though.
Feel empowered to use a bit of hand-waving during the interview if the prototype takes a little more imagination in some areas, too. Of course you can make this prototype using design software like Sketch, Figma, Balsamiq, XD, or whatever else you like, but beware of letting your Design Sprint prototype become something bigger than it needs to be.
Don’t get too precious about the prototype. Focus on picking a tool that will be easy to use, then use the heck out of it.
If you’re working on an iPad or iPhone app, Apple provides free iOS interface elements for Keynote. There are also lots of free UI kits for PowerPoint. You can also use images of elements you need and Frankenstein your way to a functional (read: testable) prototype.
At about 3pm or so, or about 5 hours into your prototype day, try the thing out.
The prototype should be in a strong rough-draft place. Quickly put your sections together (e.g. just have each Maker play through their sections in order) so you can walk through what you have.
Walk through the full scenario and see how it looks and feels. Make note of any rough spots, then take those notes back with you as you round the corner on getting the prototype to a testable place.
Finally, the stitcher will take all the files and put them together into 1 whole prototype. Make sure that the interview questions line up with the prototype, and you’re ready for your interviews on Day 5.
Need to see this in action? Contact us to validate your idea with a one-week Design Sprint! 🚀
This is a clear sign that we need another layer of abstraction. We need something that can hold our maze data and take care of placing the rooms and connecting them according to the rules we establish. After some research, I think I found the right alternative: the Graph.
Generally speaking, a graph consists of a set of vertices or nodes that can be interconnected by a set of edges. There are many types of graphs, but, as a data type, graphs usually implement the concepts of undirected graphs and directed graphs.
Undirected graphs are those whose edges don’t have a specific direction. As such, if nodes 1 and 2 of a graph are connected this way, it means we have a path going from 1 to 2 and that same path would allow us to go back, from 2 to 1.
In a directed graph, however, edges do have directions, and an edge that goes from 1 to 2 won’t allow you to move back, you’ll need to add a directed edge from 2 to 1.
The only other concepts concerning graphs that are of interest to us are the definitions of adjacency and a path.
Adjacency, as the name says, is the characteristic that two nodes can have which means that they are connected by an edge.
Finally, a path merely represents a sequence of edges that connect any two nodes of our graph.
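To make these definitions concrete, here is a tiny Ruby sketch (with made-up node labels) of an undirected graph where nodes 1 and 2, and 2 and 3, are connected:

```ruby
# Tiny undirected graph with edges 1--2 and 2--3. Each edge is recorded
# in both directions, so adjacency is symmetric.
edges = { 1 => [2], 2 => [1, 3], 3 => [2] }

adjacent = ->(x, y) { edges.fetch(x, []).include?(y) }

adjacent.call(1, 2) # => true  (edge 1--2 exists)
adjacent.call(2, 1) # => true  (undirected: same edge, other direction)
adjacent.call(1, 3) # => false (no direct edge; only the path 1-2-3)
```

Nodes 1 and 3 are not adjacent, but a path between them exists through node 2.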
In general, graphs are the go-to data structure when we care not so much about how data is stored but more about how it’s connected.
In our specific case, we want to generate a set of rooms that should be connected to each other. Indeed, if we wanted a really quirky experience, we could just have them all connect to other rooms without any regard for positioning and, while I do kind of like that idea, I want to be able to have a more structured approach.
Basically, my requirements are:
Rooms (nodes) must have a limit to the number of doors (edges) they can have (4, for starters)
I want them to be in a square grid. This implies that some pairs of rooms can’t have a door connecting them
I want to be able to “grow” the maze by starting with a single room and then randomly adding adjacent rooms until I have my maze
Also, since my doors are edges, it makes sense to me that the graph we implement is undirected, since I want the player to be able to just cross back and forth through any given door.
Now, graphs don’t implement all the rules that satisfy these requirements, but they make it way easier to encode this information. That’s what we call separation of concerns. Our little maze builder will take care of determining the rules of generating rooms, walls and doors, and the Graph class will be responsible for holding the connection information itself.
Now that we know what a graph is and why we’d want to use it, I want to briefly discuss what we need our graph to do.
Like other data structures, a graph implementation must provide us with a minimal set of operations that will allow us to use it. Since this list may vary according to one’s needs, I decided to go with this list:
adjacent(x, y) - Tests whether there’s an edge between nodes x and y
neighbors(x) - Lists all nodes adjacent to x
add_node(x) - Adds node x, if it isn’t present
remove_node(x) - Removes node x, if it is present
add_edge(x, y) - Adds an edge between nodes x and y
remove_edge(x, y) - Removes the edge between nodes x and y
get_node_value(x) - Returns the value associated with node x
set_node_value(x, v) - Sets the value v to node x
get_edge_value(x, y) - Returns the value associated with the edge between x and y
set_edge_value(x, y, v) - Sets the value v to the edge between x and y
Not all graphs are equal, however. There are a few different ways to implement the functionality listed above and which one is best will depend on what it is we’ll be using this graph for.
The main decision comes to how we’ll represent this graph in memory (and in code, for that matter).
There are two common ways to do so:
Adjacency list: Nodes are stored as records or objects and each node holds a list of adjacent nodes. If we desire to store data on edges, each node must also hold a list of their edges and each edge will store its incident nodes.
Adjacency matrix: A two-dimensional matrix where the rows represent source nodes and columns represent the destination nodes. The data for nodes or edges must be stored separately.
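To make the difference concrete, here is the same small graph (edges 1--2 and 2--3) in both representations, sketched as plain Ruby data:

```ruby
# Adjacency list: each node maps to the list of nodes it touches.
list = {
  1 => [2],
  2 => [1, 3],
  3 => [2]
}

# Adjacency matrix: matrix[i][j] == 1 means an edge from node i+1 to node j+1.
# For an undirected graph the matrix is symmetric.
matrix = [
  [0, 1, 0],
  [1, 0, 1],
  [0, 1, 0]
]

# Both can answer "are 1 and 3 adjacent?" -- the matrix with a single
# lookup, the list with a scan of node 1's neighbors.
matrix[0][2] == 1   # => false
list[1].include?(3) # => false
```

Note how the matrix spends a cell on every possible pair of nodes, whether or not an edge exists, which is where its memory cost comes from.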
Each approach has pros and cons to it.
The adjacency list is usually the one used for most applications, since it’s faster to add new nodes and edges and, if the graph is not too big, removing nodes and edges takes time proportional to the number of nodes or edges. However, if your main use case is adding or removing edges, or looking up whether two nodes are adjacent, the adjacency matrix is best. Its main drawback is that it consumes the most memory.
The rule of thumb is: if you have a lot of edges compared to nodes (a dense graph), the adjacency matrix is preferred. If your graph is sparse (way more nodes than edges), the adjacency list is the way to go.
In our case, our graph can be either. But since mazes are usually better if we have more paths (i.e. more edges), we’ll usually have more dense graphs than not, which is why I intend to implement our graph using the adjacency matrix.
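As a rough preview of where this is headed (the real implementation comes in Part 2; the class and method names here are my own sketch, not final code), an adjacency-matrix graph supporting a few of the operations listed earlier might look like this:

```ruby
# Minimal undirected graph backed by an adjacency matrix.
# Sketch only -- the real implementation lands in Part 2.
class Graph
  def initialize(size)
    @matrix = Array.new(size) { Array.new(size, 0) }
  end

  def add_edge(x, y)
    @matrix[x][y] = 1
    @matrix[y][x] = 1 # undirected: record the edge in both directions
  end

  def remove_edge(x, y)
    @matrix[x][y] = 0
    @matrix[y][x] = 0
  end

  def adjacent?(x, y)
    @matrix[x][y] == 1
  end

  def neighbors(x)
    @matrix[x].each_index.select { |y| @matrix[x][y] == 1 }
  end
end

maze = Graph.new(4) # four rooms, say
maze.add_edge(0, 1) # a door between rooms 0 and 1
maze.add_edge(1, 2)
maze.adjacent?(1, 0) # => true (doors work both ways)
maze.neighbors(1)    # => [0, 2]
```

Because every edge is written symmetrically, the matrix stays symmetric and the player can cross any door in either direction, matching the requirement above.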
I’ll end this article here. I felt that the subject matter was too extensive to cover graph theory, our design decisions and our implementation. In Part 2 I’ll write about the code related to graphs.