This post follows Refactor your Team Part 1, which I encourage you to read to understand the current situation and where we’re heading.
We continue our trek to improve the practices and processes of our development team. Three months have gone by since Part 1: everybody has had time to get familiar with the new process, and we leveled up our test suite with acceptance testing. It is time to implement the second wave of changes and hopefully jump to another level of resiliency, quality and, with luck, work enjoyment for the team!
We’re now a team with acceptance tests for the main business use-cases, a full day of manual review by the team, deployment scripts that do the job, and a clear, documented work cycle from writing code to production deployment (vagrant -> test -> staging -> prod).
The hardest parts were:
- Writing tests for each new functionality to prevent regressions
- Respecting the “Definition of Done”. Bad habits die hard, and in those times it was more than necessary to have an understanding but firm boss enforce the Definition of Done and make sure the developers respected it, for their own good and for the product’s sake.
The staging environment was created as a strict replica of production (thanks to AWS, a copy-paste and some configuration did the trick) and is launched on demand every Wednesday for the review day. The review workflow is as follows:
- We present the new features of the week at 10am in front of the whole team
- Developers test the features from 10:30am to 3pm
- From 3pm, the developer responsible for a feature in which a bug has been discovered has to fix it by 5:30pm.
- We deploy once again at 6pm to validate the fix and mark the bug as resolved.
- If he couldn’t fix it, the developer has to explain why and, ideally, propose an action plan
- The team votes to decide whether we reschedule the feature or deploy it with the known flaw
For the team it became a real bug-hunting quest, and they rapidly became quite strict with themselves without blaming each other, which is really impressive for such a “young” agile team!
However, we could already feel the limits of this first stage, and we had to move on to the next step.
One of the recurring criticisms of this new workflow is the heavy load put on the person in charge of the release, called here the “Deploy Master”. He is indeed in charge of the whole review phase, from preparation and merging all the way to the production deployment. Typically, a Deploy Master is responsible for the following:
- Create the “release notes” document where developers list the features they want to deploy this week
- Merge feature branches into the “staging” branch
- Launch the “staging” environment
- Deploy onto the “staging” environment with the staging branch
- Check features by running the acceptance tests
- Create the review spreadsheet (summarize the review results)
- Merge the “staging” branch into the “release” branch
- Deploy onto Production
- And facilitate the whole review process
Yeah, it is a goddamn lot to do alone in one day. What’s more, the rest of the time the “Deploy Master” is a regular developer and might have his own feature to look after during the process. We therefore decided to adopt Continuous Integration tooling to automate as much of the boring work as possible. To do that, we simply built a server from spare parts and installed Jenkins. It is mainly used to clone the repository and launch the tests at fixed times, so that at any given moment we know whether the codebase still works properly.
Jenkins is also in charge of deploying the code onto the test and staging environments if the tests pass. For that, we created deployment scripts (a mix of Ansible and bash) which take care of setting everything up right. So deployment to any environment can basically be done in one line with the correct parameters!
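To give an idea, here is a minimal sketch of such a one-line wrapper; the playbook name, inventory layout and variable name are illustrative, not our actual ones:

```shell
#!/bin/sh
# deploy.sh -- one-line deployment to any environment (illustrative sketch).
# Usage: deploy <test|staging|prod> <branch>

deploy() {
    env="$1"
    branch="$2"

    # Only known environments are allowed; fail loudly otherwise.
    case "$env" in
        test|staging|prod) ;;
        *) echo "unknown environment: $env" >&2; return 1 ;;
    esac

    # The real script runs Ansible; here we just print the command it would run.
    echo "ansible-playbook -i inventories/$env site.yml -e app_branch=$branch"
}

deploy staging staging-branch
```

The wrapper is what Jenkins calls after a green test run, so humans and the CI server deploy exactly the same way.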
We also launched a project called “Butler” which consists of automating all the tasks related to the review process:
- Create release documents (release note, review sheets)
- Create on-demand environments (servers and databases) such as staging
- Notify Jenkins to launch the tests and deployment
- Emulate a pull-request system and internal code review (à la Gerrit) to validate the features
- Display a dashboard to monitor environments and show logs without having to log in to the machines
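As a taste of what Butler automates, here is a minimal sketch of the release-note generation, assuming the notes are a simple Markdown checklist; the branch names and layout are hypothetical:

```shell
#!/bin/sh
# butler_notes.sh -- sketch of Butler's release-note generation.
# The document layout and branch names are made up for illustration.

make_release_note() {
    week="$1"; shift
    printf '# Release notes -- week %s\n\n' "$week"
    printf 'Features staged for review:\n'
    for branch in "$@"; do
        # One checklist entry per feature branch to merge into "staging".
        printf -- '- [ ] %s\n' "$branch"
    done
}

make_release_note 42 feature/login-captcha feature/csv-export
```

Developers then only tick the features they want in this week’s train instead of writing the document by hand.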
For now it’s still an ongoing project, but the objective is clearly to make the “Deploy Master”’s life simpler and, in the end, make everything automated yet configurable. The “Deploy Master” should be facilitating the process and spending his time improving it rather than doing repetitive tasks. And for that, he needs metrics and feedback!
Creating Metrics and Feedback
A rather big problem until now was the lack of metrics on projects. Indeed, the only metrics in the company were the number of sessions and various Google Analytics counters. There were no quantitative or qualitative measures of the projects’ code. That leaves developers little to no chance to improve themselves or the quality of their work. It’s as if Toyota only looked at sales figures without considering any other aspect of their production process…
To solve that, we added several components to help improve the overall output: Jenkins plugins, PHP_CodeSniffer, phploc and Checkstyle reports for PHP. This gave us much better visibility on the quality of the code we were writing and improved readability. Don’t forget we had a lot of (inherited) legacy code, and being able to quantify it is already a big step!
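For the curious, here is roughly what the quality step boils down to. The tool invocations in the comments are from memory and may differ on your installed versions, and the sample report is made up, but counting Checkstyle `<error>` entries is what the Jenkins trend graphs are built from:

```shell
#!/bin/sh
# Sketch of the quality step. The real Jenkins job runs something like:
#   phpcs --report=checkstyle --report-file=checkstyle.xml src/
#   phploc src/
# (flags are indicative -- check your installed versions).

# Count the <error> entries in a Checkstyle-format XML report.
count_violations() {
    grep -c '<error ' "$1"
}

# Tiny made-up report so the sketch is self-contained.
cat > /tmp/checkstyle-sample.xml <<'EOF'
<?xml version="1.0"?>
<checkstyle>
  <file name="src/User.php">
    <error line="3" severity="warning" message="Line exceeds 120 characters"/>
    <error line="9" severity="error" message="Missing function doc comment"/>
  </file>
</checkstyle>
EOF

count_violations /tmp/checkstyle-sample.xml   # prints 2
```

Jenkins plots that number over time per project, which is exactly the “curve of problems” mentioned below.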
An unexpected side effect is that developers got excited about fixing the detected problems right away! Everybody hates red alerts, and suddenly they all set out to fix code and readability issues to turn Jenkins green as soon as possible!
The first quick win was to set up formatting rules in everybody’s IDE; that alone cut the errors in half once the projects were reformatted. Watching the curve of problems drop by half was a really good motivation for the team and made them believe they could do it! I think we can build a lot of things around this theme, such as a Bug Day or other events that double as team building and improve the work environment at the same time.
As the next step in improving feedback, we set up interviews with business actors inside our organization. The team opened up to caring not only about the engineering side of a feature but also about the business side, to understand what was actually needed. A company needs money to run, and even engineers need to understand that a feature should answer either a user need or a business requirement. Ideally, of course, both.
While doing these interviews, we soon noticed that the business side was also eager to have more metrics about the products. They expressed the wish to have their own analytics platform to see how users use the products, which features are the most popular, and so on. Metrics were needed at the business level too, and since the solution would be technological it would de facto be handled by engineering. We decided to pool our needs and look for a solution that would work for operations AND for business metrics!
Of course, everything is not perfect: business actors still ask for impossible things with short deadlines and don’t grasp the complexity involved. However, the dialogue is now open and each side is taking a step toward the other! Little by little, IT is changing its communication, and therefore its environment too. That’s a really interesting phenomenon to watch!
Extend the code retreat
We also kept going with the “code retreat” activities. I was the one who started them, running all the first sessions to bootstrap the process and motivate people, but I soon passed the baton to the members of the team. We had great times with a lot of different subjects: documentation handling, SEO, social marketing, and soon we’ll move back to technical topics with “Git operations” and “Operations for Developers” to bootstrap the DevOps culture inside the organization.
You have to know that, contrary to France, companies here have no obligation to train their employees or send them to trainings. Culturally, too, it’s very rare to spend money sending people away for a week of training. They mostly learn by themselves, internally, or bring in a consultant to help with specific problems. That said, they do a lot of internal training and study events held by the engineers themselves. They often present what they studied at home or share tips they discovered. Usually we organize a 1-2 hour session per month, which is not too long, does not interfere with work, and makes for nice team-building events.
Share our values with business stakeholders
We’ve also seen this practice of internal training extend to other departments quite fast: for instance, salesmen attending management lessons, or marketing attending SQL trainings to run their analytics on their own. I’m not saying we alone caused this culture shift, but it’s really thrilling to see them adopt ideas like trainings or daily stand-up meetings to synchronize their teams. I have to say the middle management did an awesome job stepping up and changing things!
As for Kaizen (continuous improvement), everybody already knows what it is and how important it is. As everywhere, people are frustrated at having to repeat the same tasks without being listened to or having any hope of a solution. Business actors are still pretty shy about asking developers for solutions or automation, and on the other side developers always consider themselves too busy to go looking for more work. We try to get past that by taking everybody’s needs into account, or at least listing them to keep track, explaining them, and when possible handling them through to release.
The business stakeholders also taught me A LOT about the actual business logic. Yes, we are developers, but usually we serve a particular business. In software we have a business logic layer that WE MUST MASTER at any cost!! You can’t call yourself a professional if you code with no purpose! For yourself, your product and your company, you must know what people are doing and how they’re doing it.
I know, these days everybody wants to think outside the box and break the rules. But first look at the box, learn the rules and understand what the business is about! Then you can challenge it your own way. That’s how you’ll discover how customers perceive you, why they like your product, or why they leave it.
We’re drifting a bit from the development subject, but a development team should at least take part in internal interviews or orientations to know what the other departments are doing and how. For me, that’s part of building a culture, and in the long run it’s the best way to improve communication and understanding inside your company. This will never be wasted time! What’s more, it will let you build business logic in line with the players and the market!
First pass on security
Another topic that had been put aside for a while, for lack of skills, was security. After fainting a few times while reading the code (yeah, I am a security engineer), I decided to get to it and challenge the current system. It doesn’t necessarily require a lot of skill: I conducted an audit using the ASVS methodology from OWASP. Completed with my home-made checklist, it let me produce a first report on the security level of the applications and the next steps to implement.
I had already run a code retreat about security when I joined the company, but following this review I also decided to create a Hacker Day where I’ll train people to attack their own application and present tools that help them detect security breaches or risks. Since we created the staging environment, we can safely break everything and rebuild it from scratch in minutes! I’ll also present the risks they should pay attention to when writing code in a specific language, in this case PHP. I’ll write a longer post dedicated to this topic in the near future 🙂
Dealing with logs
With the multiplication of environments and the hacking trainings, we noticed pretty quickly how REALLY important logs become on one hand, and how hard they are to search on the other. Since we scripted the deployment and configuration of the machines, logs are the only remaining reason anyone would connect to the production servers. An attack? A failure? A bug? Someone touched the configuration? Only the logs can tell us.
It quickly became important to group and filter all those logs to make them usable, and also to restrict all access to the production servers. The main reason is to prevent human error, the “big thumbs syndrome” when we run sudo commands on the servers. It is also to standardize how we handle software and to be able to track any change made to the system by removing dirty shortcuts.
To fulfill this purpose, as a huge fan of open-source solutions, I integrated a solution based on Logstash + Kibana: first because it’s awesome, and second because it fits exactly what I needed. We receive logs through syslog, filter and index them with Logstash into Elasticsearch, and display metrics and results through Kibana dashboards. Once that was done, we added a sh*t load of metrics at the infrastructure and application levels. We gained global visibility on the environments, and it became much easier to predict failures, detect problems immediately after a deployment, and so on, all of it automated. Honestly, this was the change with the biggest impact on the whole IT operation. It allowed us to stay compliant, follow changes on the system, and build automated detection of bugs.
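For reference, a hypothetical minimal Logstash pipeline in that spirit (the port, hosts and filters are illustrative, not our production config):

```
input {
  syslog { port => 5514 }                         # receive logs from the servers
}
filter {
  # parse and enrich events here (grok, date, etc.)
}
output {
  elasticsearch { hosts => ["localhost:9200"] }   # indexed, then browsed in Kibana
}
```

Everything that used to require an SSH session now ends up searchable in one place.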
The time of the 12 tabs opened with tail -f on production servers is finally OVER!!
Reappropriate legacy code
With a trustworthy team you can challenge anything! Even fight infamous code such as… PHP3 legacy code with spaghetti HTML baked into it: no objects, no routing, no namespaces! The kind of code you wrote when you were 13 and learning web development.
We decided to launch a crusade to reappropriate this monster no one wanted to touch anymore, for fear it would explode…
And by reappropriate I don’t mean only coding. We started small, by listing who was still using what, and how. The first option was to re-code only the strict minimum so we could delete this code from hell as soon as possible without bothering anyone.
Unfortunately the world is not ideal: this application was still heavily used, and replacing it in the near future was unthinkable. The second option was then to re-create the features one by one and redirect the old screens to the new application as each feature becomes ready. The objective is to migrate people to the new application as soon as possible: we would much rather fix bugs on the new, maintainable version than touch the old one.
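The redirection itself can be as simple as a web-server rule; a hypothetical nginx example (the paths and host are made up):

```
# Once a feature is re-implemented in the new app, the legacy screen
# just redirects to it (hypothetical URLs).
location = /legacy/invoice_list.php {
    return 301 https://newapp.example.internal/invoices;
}
```

Users keep their old bookmarks, and each screen switches over the day its replacement ships.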
Of course it creates some tension when you deliver prematurely, but this application was strictly restricted to employees, so the best solution we found to avoid friction was to offer candies and chocolate when apologizing for a mistake 🙂 When a user discovers a bug (sometimes a big one), the developer goes to see them face to face and brings some chocolates 😉 Sometimes it even ends over a beer at night!
Reappropriating code can be painful, and sometimes we need to touch the incriminated code itself; in those cases it’s really important to have management supporting you and fully aware that it will generate friction and headaches. But as I said, with a trustworthy team you can do anything; we are samurai and everything is possible! Even migrating from PHP3 to PHP 5.6!
There we go, our second phase is under way. We automated the whole process; we focus on the code, metrics and continuous improvement. We take business requirements into account and we make our contributions more visible: release notes shared with all collaborators, metrics at every level, analytics on business features and inter-department sharing.
In the next phase we still have challenges to tackle:
- Reinforce the DevOps culture. A lot has been automated, but developers aren’t yet aware of the end goal and get stressed really easily when it comes to touching the infrastructure. (Code retreat?)
- Reinforce the security knowledge. I created beautiful reports on the state of the applications, but building the full security process is still far ahead and will require big changes to the codebase.
- Finally be able to focus on the features and innovations we can bring to our products. To have time to innovate, you can’t be crashing production all the time or creating regressions. Personally, I really think these steps were necessary to allow the whole team to finally take risks without the constant fear of breaking something. (Deployment, acceptance tests)
- Formalize internal training. And by formalize I mean make it part of the culture, or even write it into everyone’s contract!
So tell me, what about you? Where are you at?
Do not hesitate to leave a comment or share your own experience on the subject!
You can follow with Part 3 just here !