How we took a very unconventional approach to product management that felt right at its core and solved our platform stability problem
In Dec 2018, when I joined AppitSimple, there was an internal concern regarding CallHippo (one of our products).
The platform was going down frequently, which hurt our uptime and left customers unable to make calls.
At 2:00 AM on 9th Jan 2019 the platform went down again. A meeting was called immediately, and I was handed the task of resolving the platform stability problem.
We took a very unconventional approach, but it worked. None of the product management literature I have studied brings it up or makes use of it, so I can politely say I might be the first one in the world to use it for product management.
Identifying risk and taking the first steps
Our platform was going down for one reason or another, mostly connected to how fast we were growing. The growth, in turn, put pressure on us to deliver more and meet expectations. Our engineering and support teams had to move at a speed that could sustain it.
In Jan 2019 we started an initiative called FMEA to address this.
FMEA stands for Failure Mode and Effect Analysis.
In simple terms, you identify the situations in which a product or a particular event can fail and note down the effect of each failure if it occurs.
How we implemented FMEA
For CallHippo it was clear that the product fails if a customer is not able to make a call. We set up a small team drawn from Product, Tech, and Support and gathered every reason why a person might be unable to make a call through CallHippo.
The team quickly came up with 35+ reasons why calling can stop. Some of them could be solved through minor UI changes, while others involved changes to the product's architecture.
We created a simple Google spreadsheet listing all the reasons and started collecting 3 values for each one: Probability, Severity, and Detection.
|Probability||On a scale of 1-5, how frequently the particular situation happens, with 5 being very frequent and 1 being extremely unlikely|
|Severity||On a scale of 1-5, what effect the incident causes if it occurs, with 1 being no relevant effect and 5 being catastrophic|
|Detection||On a scale of 1-5, how easy the event is to detect, with 1 meaning it is certain to be detected and 5 meaning it goes undetected by the user|
*To read more about FMEA and these values, you can refer to the Wikipedia article.
Once we had assigned a numeric value to each of the three, we got a new value called the Risk Priority Number, or RPN, which is calculated as the product of Probability, Severity, and Detection.
RPN = Probability x Severity x Detection
The higher the RPN, the higher the priority to address it. Once we had the RPN values for all the cases, we started working on the fixes, taking them into development sprints in order of priority.
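To make the scoring concrete, here is a minimal Python sketch of the RPN calculation and ranking. We did this in a Google spreadsheet, not in code, so treat it purely as an illustration. The `FailureMode` class, the `prioritize` helper, and the three sample rows with their scores are hypothetical and are not taken from our actual tracking sheet.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One row of a (hypothetical) FMEA tracking sheet."""
    reason: str
    probability: int  # 1 = extremely unlikely, 5 = very frequent
    severity: int     # 1 = no relevant effect, 5 = catastrophic
    detection: int    # 1 = certain to be detected, 5 = goes undetected

    def __post_init__(self):
        # Reject scores outside the 1-5 scale used in the sheet.
        for name in ("probability", "severity", "detection"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def rpn(self) -> int:
        # RPN = Probability x Severity x Detection
        return self.probability * self.severity * self.detection


def prioritize(modes):
    """Return the failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Made-up example rows, for illustration only.
sheet = [
    FailureMode("Microphone permission denied in browser", 4, 3, 2),
    FailureMode("Telephony provider outage", 2, 5, 4),
    FailureMode("Account out of calling credits", 3, 4, 1),
]

for mode in prioritize(sheet):
    print(f"{mode.rpn:>3}  {mode.reason}")
```

With these made-up scores the provider outage ranks first (RPN 40), even though it is the least frequent of the three, because it is both severe and hard to detect. That is the kind of ordering the RPN is meant to surface.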
Within 3 months of starting the work (at the time of writing this article), we were able to bring the risk down considerably, which helped us improve the stability of the platform.
A side effect of this exercise is that whenever the team finds a reason why a customer may not be able to place a call, they add it to the FMEA tracking sheet, and we meet every month to review the sheet and reduce the risk.
The team comprised the following members:
Vishal from Product
What are the future steps?
During the month, the team enters new reasons as and when they arise, and at the monthly meeting we discuss ways to address them. The future involves keeping this practice going, because new reasons will keep coming up as we move to a different tech architecture or continue improving our product.
What led to the success of this?
In hindsight, it seems very normal and obvious to do this, but the main reason it succeeded was the team and how everyone contributed.