Let’s walk through the four phases of a bug, zoom in a little deeper, and explain why and how some of our processes came to be, the technologies we use, and ultimately how you can apply our flow to your own work.
Phase 1: Identification and Escalation
The brunt of the work of this phase falls onto our Quality Assurance team. They are tasked with properly identifying and categorizing a reported issue. You can imagine with a feature set as large as ours that it becomes critically important to save engineering time by debugging and answering some of the most common questions an engineer would ask when faced with an issue. These are the required fields on which we collect data before escalating the ticket to engineers:
- Priority Level: Critical, Major, Minor
- Defcon Level: Internal level detailing not only severity but impact and timeline needed to address
- Environment: OS version, browser version, system type
- Issue Affects: Single account, small number of accounts, all accounts
- Component: What pieces of the app are affected
- Account ID: Reporting account ID
- Description: Detailed description of the bug and steps to reproduce the issue
- Subject: Brief description of the bug, used for release notes
- User: Log-in that reported the problem
- Version: Version number of the app where problem occurred (This is important because we may have already fixed the issue and, if an older version is reported, it makes it easier to see if it is still occurring.)
If any of these fields cannot be collected, QA reports back to Customer Support to gather the necessary information. If questions still remain after consulting Customer Support, then the questions go back to the client.
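As a rough illustration of that completeness check, here is a minimal Python sketch. The class and function names (BugReport, missing_fields) are hypothetical and only mirror the checklist above; they are not taken from our actual tooling.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class BugReport:
    """One escalation candidate; field names mirror the checklist above."""
    priority: Optional[str] = None       # Critical, Major, Minor
    defcon: Optional[str] = None         # internal severity/impact/timeline level
    environment: Optional[str] = None    # OS version, browser version, system type
    issue_affects: Optional[str] = None  # single account, a few accounts, all accounts
    component: Optional[str] = None
    account_id: Optional[str] = None
    description: Optional[str] = None
    subject: Optional[str] = None
    user: Optional[str] = None
    version: Optional[str] = None


def missing_fields(report: BugReport) -> List[str]:
    """Return the names of required fields that are empty or whitespace-only."""
    return [
        f.name
        for f in fields(report)
        if not (getattr(report, f.name) or "").strip()
    ]


report = BugReport(priority="Major", component="Email", subject="Broken link")
print(missing_fields(report))  # the fields QA still has to chase down
```

An empty result would mean the report is ready to escalate; anything else is the list of questions to send back to Customer Support.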
This checklist may seem excessive, but as with many things in business, hindsight gives you the opportunity and clarity to improve on previous mistakes. Through past experience we’ve learned that the closer to the initial report you capture this information, the easier the debugging, the lower the cost to the team, the less back and forth you have between the client, customer service, support and engineers, and ultimately the faster we can solve the issue.
Once QA is confident about the information received, they create a JIRA ticket. Based on the component, it is automatically put on one of our developers’ dashboards. This begins Phase 2.
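The idea of routing by component can be sketched in a few lines of Python. This is only an illustration of the concept, not how JIRA is configured; the component names, owner names and the fallback queue are all made up for the example.

```python
from typing import Dict

# Hypothetical component-to-owner mapping; real component names and
# dashboards would come from the ticket tracker's configuration.
COMPONENT_OWNERS: Dict[str, str] = {
    "email": "backend-team",
    "forms": "frontend-team",
    "billing": "payments-team",
}
DEFAULT_OWNER = "triage"  # assumed fallback for unmapped components


def route_ticket(component: str) -> str:
    """Pick the dashboard for a ticket, falling back to a triage queue."""
    return COMPONENT_OWNERS.get(component.strip().lower(), DEFAULT_OWNER)


print(route_ticket("Email"))
print(route_ticket("Reporting"))
```

Normalizing the component name before the lookup keeps small differences in how QA types it from sending a ticket to the wrong place.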
Phase 2: Fixing the Problem
Let’s introduce two pieces of software that are critically important for our organization: JIRA and Confluence.
JIRA – JIRA is ticket-tracking software that allows for some pretty cool customization around automatic escalation, great plug-in support, and a good permission system.
Confluence – Confluence is a tool we use for shared collaboration on processes, specs, meeting notes, files and dashboards around JIRA ticket filters.
At its heart, the primary value of project management (PM) software, such as JIRA, is to provide a centralized place for conversations, documents and tickets to live and for employees to manage them. We’ve tried Basecamp, Redmine, Google Docs and SmartSheet. All of them have their pros and cons, but ultimately what matters most is getting your team to buy in and making sure your chosen system provides value rather than frustration.
JIRA has some key features which set it apart:
- Automatic ticket escalation based on component
- Ability to make fields required when creating a ticket (Simple I know, but you’d be surprised how many bad tickets you get without that auto enforcement.)
- Flexibility around the workflow (We went from Sprints to Kanban board and then ultimately to just a dashboard, based on priority.)
- Customizable escalation paths depending on ticket type
These aside, when re-assessing our PM software needs, we really focused on the experience for developers using the tools.
- Is it easy to use and understand? For both the project managers and engineers?
- Does it encompass everything?
- Is everyone trained to use it properly?
- Who can everyone go to when they have an issue with the system?
If you can satisfactorily address these concerns during the infancy of adoption, it really goes a long way when training new people to use the system or dealing with challenges as they come up.
I digress. Back to Phase 2.
When engineers at Ontraport arrive in the morning, the first thing they do, besides grab some complimentary breakfast and a tall mug of coffee, is open up Confluence and log into their personalized dashboard.
This dashboard, which is integrated with JIRA, lays out the highest priority items for the individual to address. We’ve found that developers want to come in and know what’s on their plate and their highest priority item for the day/week/month.
Once a ticket is picked up by an engineer, they transition the ticket to “Work in Progress.” This gives other engineers context (this ticket cannot be taken from this person), and then the work begins.
Aided by all the information provided in Phase 1, most tickets don’t need much input from other engineers to reach a resolution, but when an engineer is working on an unfamiliar area of the app, they may need some help from another team member. This is where HipChat steps in.
Communication is one of the most challenging issues that a company faces as it grows. Communication between three guys working out of a yurt turns into chaos when you have dozens of engineers in the same room.
Shouting across the room, instead of being THE method to get a problem solved, becomes a distraction for others. This is where a tool like HipChat comes into play. When you boil it down, it’s a chat room. Prior to using this tool, we had other chat rooms (Skype, Gmail, AIM, etc.). So what sets this apart?
Far and away the biggest benefit of HipChat is the ability to create functional rooms with discussions that have the history archived and readily available for review. Here is a scenario that played out before the days of HipChat:
A critical issue arrives in the evening — when engineers are not present. Usually our Systems Operation team, via some automated alerts, or someone from our Customer Support team escalates a ticket. If it requires a code change, an engineer is called/texted and brought online. Usually a gchat is initiated and the SysOp person explains the issue to the engineer, and they try solving it together. If the engineer is unable to solve it, he or she reaches out to another engineer who then comes online and has to be briefed on the entire scenario before work begins. The next morning, the issue may be resolved but others on the team don’t know what happened, what the solution was, and maybe even if it was the right solution.
Always-available archived history changes the entire scenario above. People can now join the conversation at any point and scroll up to get briefed on the issue, what was tried, and where everyone is. There is a clear audit trail that is shared among everyone, so others can join in, look at the solution, and propose alternatives if it was not the most elegant. Finally, there is a nice recorded history that anyone can reference at any point, and future action items can be created going forward.
Another thing that makes a system like this ideal is dedicated rooms for specific topics. For example, we have a NOC (Network Operation Channel), Front End Team, Back End Team, Quality Assurance and also a dedicated channel for each project we work on.
HipChat also has a very flexible API that allows us to send messages from our application for various things such as system outages, deployment statuses and others.
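To give a feel for what sending a message from an application looks like, here is a minimal Python sketch that builds an HTTP notification request without sending it. The URL, token and payload fields are placeholders invented for the example, not HipChat’s actual API; any real chat service defines its own endpoint, authentication and message format.

```python
import json
import urllib.request

# Hypothetical endpoint and token; a real chat service defines its own URL,
# authentication scheme, and payload fields.
CHAT_URL = "https://chat.example.invalid/rooms/noc/notification"
CHAT_TOKEN = "replace-me"


def build_notification(message: str, color: str = "red") -> urllib.request.Request:
    """Build (but do not send) a JSON POST announcing an application event."""
    body = json.dumps({"message": message, "color": color}).encode("utf-8")
    return urllib.request.Request(
        CHAT_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {CHAT_TOKEN}",
        },
    )


request = build_notification("Deployment to staging finished", color="green")
# urllib.request.urlopen(request) would send it once CHAT_URL points at a real service.
```

Keeping the request construction separate from the sending makes it easy to check what would be posted before wiring it to a live room.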
During Phase 2, when an issue requires some collaboration or expertise to get unblocked, it’s recommended to post a question in HipChat.
When the issue is finally resolved, it’s time to commit that code into our code repository.
Note: If you’re not currently using a code repository at your company, you should do so immediately. I don’t care if you’re building a website or it’s your own personal code, you should always be using a repository.
At Ontraport we use Stash (Git) for our repository. We moved to Git from SVN about three years ago. Git affords us a number of advantages over older engines:
- Better, more lightweight branching model
- No need to be connected to the primary repo to do commits, branches, etc.
- More modern repository system
When a ticket comes to this stage, we have a process around how to get code back into Stash.
First an engineer is required to pull down the latest version of the development build, and then create a dedicated Git branch for the ticket.
For example: “git checkout -b ONTRA-1234”
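The two steps above (update the development build, then branch for the ticket) can be sketched as a small Python helper that produces the git commands. The base branch name "dev", the remote name "origin" and the ticket-key pattern are assumptions made for this example.

```python
import re
from typing import List

# Assumed ticket-key shape (PROJECT-number), inferred from the ONTRA-1234 example.
TICKET_KEY = re.compile(r"[A-Z]+-\d+")


def branch_commands(ticket_key: str, base_branch: str = "dev") -> List[List[str]]:
    """Return the git commands for starting work on a ticket.

    base_branch and the "origin" remote are assumptions; substitute the real names.
    """
    if not TICKET_KEY.fullmatch(ticket_key):
        raise ValueError(f"not a ticket key: {ticket_key!r}")
    return [
        ["git", "checkout", base_branch],        # switch to the development build
        ["git", "pull", "origin", base_branch],  # bring it up to date
        ["git", "checkout", "-b", ticket_key],   # dedicated branch for the ticket
    ]


for command in branch_commands("ONTRA-1234"):
    print(" ".join(command))
```

Each inner list can be passed to subprocess.run(command, check=True) inside a working copy to actually run it.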
Next, they do the work and fix the issue. If it’s a back end bug, a unit test is required along with the fix. If it’s a front end bug, QA must be notified if an interface change is needed to fix the bug. That way when Selenium runs, they can get ahead of the failures and account for the new UX.
Once the work is done, they commit the code locally. We have a variety of Smart Tags we use when committing code into the code base. These Smart Tags allow an automated scripting bot we call “Sir Walter Raleigh” to read the commit message and do a variety of automated tasks, such as transitioning tickets in JIRA to code review, setting the time and comments on what the fix was, or even creating a pull-request to be merged into a particular branch.
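To illustrate how a bot might read a commit message, here is a minimal Python sketch of a parser. The tag syntax shown (a ticket key followed by “#name argument” pairs) and the tag names are hypothetical; they are invented for this example and are not our actual Smart Tag format.

```python
import re
from typing import Dict, Optional, Tuple

# Assumed ticket-key shape, e.g. ONTRA-1234.
TICKET = re.compile(r"\b([A-Z]+-\d+)\b")
# A tag is "#name" followed by free text up to the next tag or end of message.
TAG = re.compile(r"#(\w+)\s*([^#]*)")


def parse_commit(message: str) -> Tuple[Optional[str], Dict[str, str]]:
    """Extract the ticket key and a {tag: argument} map from a commit message."""
    ticket = TICKET.search(message)
    tags = {name: arg.strip() for name, arg in TAG.findall(message)}
    return (ticket.group(1) if ticket else None), tags


ticket, tags = parse_commit(
    "ONTRA-1234 #review #time 2h #comment Fixed null check in importer"
)
print(ticket, tags)
```

A bot could then map each tag to an action, such as transitioning the ticket or logging time, and ignore commits that carry no ticket key.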
At Ontraport, we operate off of three main branches: Dev, Staging and Master. Dev is primarily for developers to have a local latest build with everyone’s latest changes. Staging is for release candidate builds that have to go through the rigmarole of the testing framework. Finally, Master is for what is out in our production environments.
Typically, the pull-request is created against Dev. When creating the pull-request, we require the engineer to put in some information about the ticket, specifically the “what”, and the “fix.” The “what” should briefly describe the code issue causing the bug, and the “fix” should talk about what the changes are and how it solves the problem.
Adding this step to the process dramatically improved our code review performance. We require at least two other engineers to sign off on every pull-request, and the context around what happened and the implemented fix lets the reviewers more quickly grasp the changes and the impact they may have.
Along with checking that the change solves the problem, the code reviewer also looks for syntax errors, code style infractions, code reuse and good programming practices.
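The pull-request rules described here (a “what”, a “fix”, and sign-off from at least two other engineers) can be expressed as a simple gate. This is a minimal Python sketch with hypothetical names (PullRequest, merge_blockers), not the check our repository tooling actually runs.

```python
from dataclasses import dataclass, field
from typing import List, Set

REQUIRED_APPROVALS = 2  # from the text: at least two other engineers


@dataclass
class PullRequest:
    author: str
    what: str = ""  # the code issue causing the bug
    fix: str = ""   # what changed and how it solves the problem
    approvers: Set[str] = field(default_factory=set)


def merge_blockers(pr: PullRequest) -> List[str]:
    """List the reasons a pull-request cannot be merged yet (empty means ready)."""
    blockers = []
    if not pr.what.strip():
        blockers.append('missing "what" description')
    if not pr.fix.strip():
        blockers.append('missing "fix" description')
    # The author's own approval does not count toward the two sign-offs.
    others = pr.approvers - {pr.author}
    if len(others) < REQUIRED_APPROVALS:
        blockers.append(f"needs {REQUIRED_APPROVALS - len(others)} more approval(s)")
    return blockers


pr = PullRequest(
    author="ana",
    what="Importer dereferenced a missing record",
    fix="Skip missing records and log them",
    approvers={"ana", "bo"},
)
print(merge_blockers(pr))
```

Returning the list of blockers, rather than a bare yes or no, tells the engineer exactly what is still missing.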
If all is good, the code is merged in, and Phase 3 begins.
Phase 3: Testing
We try to automate our testing as much as possible. Selenium and Jenkins are our go-tos. So when the previous build successfully finishes, QA then uses Bamboo (Atlassian deployment software) to deploy the latest version of the code base into a staging environment.
Once on the staging servers, we fire up our Selenium suites, which have 1000+ tests to automatically test our app. Selenium is an open source framework that lets you automate a browser. Using that, it navigates around your app, simulating actions that an end user would take.
If tests are unsuccessful, QA works to figure out whether it was an issue with the tests they used or an actual bug. If it’s a bug, a staging blocker bug is sent up and put at the top of an engineer’s stack to fix.
If all the tests are successful, QA decides if there is anything new to test (new features or changes to existing features) and works to get those tests in before deployment.
Once that is buttoned up, it goes to Phase 4.
Phase 4: Deployment
In the deployment phase, as in the previous phase, QA goes into Bamboo and deploys the code to the production servers. This involves sending the code to a senior engineer to “code review” the changes, and then, once merged, they start working on the release notes.
Once all this is approved, the code is released, trust.ontraport.com is updated with the latest release notes, and QA follows up on all tickets included in the build.
At this stage the bug is considered fixed!