Here is the message sent by the CEO of WorldApp, Inc. concerning last Friday’s Key Survey downtime. (Key Survey is a software program used to create and distribute surveys, as well as collect and analyze responses.)
From: Oleg Matsko
Sent: Monday, May 18, 2015 9:36 AM
Subject: An update on Friday’s disruption – a message from our CEO
Last Friday’s issues have been some of the most severe issues to affect WorldAPP since we launched Key Survey in 2002. As CEO, I take immense pride in serving organizations across the world in fulfilling their requirements and I feel immensely sorry and hurt that we let those customers down. As such, I feel it is only right that we be completely open, honest and transparent about what happened, and what we are doing to make sure it doesn’t happen again.
A few weeks ago we noticed that one of the storage components of our production environment had started to fail. This in itself doesn’t cause an immediate issue: our production environment is built with multiple layers of redundancy, and despite one of the critical elements of this environment not functioning, our applications continued to work as they should, without any impact on availability. It is important, though, that when these issues occur we rectify them as quickly as we can, so that should other components of our environment fail, there isn’t any impact on service.
So for the past few weeks we have been preparing our secondary storage components to take over, allowing us to complete the necessary work on the primary components. Our applications collect a lot of data, in fact the equivalent of 11,000 pages of paper an hour, and this amount of data takes a lot of time to transfer. In an absolute emergency we can complete this transfer in about 12 hours, but as our primary setup was still stable, and the risks of transferring such a huge amount of data in a relatively short amount of time are quite high, we took our time and completed this transfer over a period of a few weeks.
This transfer was completed on Thursday evening. Our secondary storage components went live without issue, and our primary storage components were taken offline to allow the required maintenance to be completed. For a few hours everything worked fine, and then at around 08:00 EDT on Friday morning, without notice, our secondary storage components failed. At the moment, the reason why they failed is still unclear; there doesn’t appear to be an obvious cause. We will work hard with our infrastructure partners to find out why this happened, but the most important thing for us to do on Friday was to get our applications back online.
Key Survey and Form.com are incredibly large and complex applications, and restarting them isn’t a simple operation. The applications are made up of many separate modules, each relating to an area of their functionality, such as reporting, voting or our API. The effort required to restart them is large, so much so that they cannot all be restarted at once. As such, modules were restarted individually, in order of priority. Our main Key Survey and Form.com environments were operational by 15:00 EDT, with all of our reporting modules online by 21:30 EDT and specific instances of our applications for individual customers back online by 00:30 EDT on Saturday morning.
As a result of Friday’s disruption, I have instructed our teams to rebuild our storage infrastructure to include additional layers of redundancy with built-in instant failover capabilities. This is no easy challenge: implementing this infrastructure and migrating all our applications will take about a week, but we should be able to complete this without additional disruption. Once these changes are implemented, we will be able to recover our systems in a matter of minutes. This is in addition to the construction of the remote disaster recovery infrastructure, which is already underway and estimated to be completed early next year.
Unfortunately, until these changes have been completed, our secondary storage components could fail again, and this leaves us in a precarious position. Whilst the probability of such a failure is low, and we have taken all possible precautions to ensure it doesn’t recur, our teams are prepared to restore services as quickly as possible in the event of a second failure. As the amount of data migrated to the new infrastructure increases throughout the week, the time needed to restore services in the event of an issue reduces. This does mean, though, that should a similar issue occur early this week, we could experience an outage similar to Friday’s.
As mentioned, I want to be transparent about the challenges we face, and honest about what could happen while we take steps to improve our services. We will let you know as soon as this new environment is fully functional and we can be sure that such issues will not cause as much disruption as they have. In the meantime, our team are working diligently to monitor and manage our applications to avoid such issues, and are prepared to restore services as quickly as possible in the event of a recurrence of Friday’s troubles. I can also assure you that we will investigate thoroughly what caused these components to fail, but for the time being I want to concentrate all our resources on implementing these changes and improving our service to you.
We will support you as much as we can following this disruption. If there is anything WorldAPP can do to assist you with work you weren’t able to complete last week, such as building surveys, forms or reports, please let your account manager know. We’ll endeavour to accommodate as many requests as we can.
Once again I would like to reiterate my thanks for your patience and understanding, and my genuine sorrow that we have let you down. WorldAPP have been a trusted provider of survey, forms and inspection solutions for over 12 years now, and I hope my explanation of what happened, and assurances of the actions we’re taking to ensure it doesn’t happen again, go some way to rebuilding that trust.
161 Forbes Rd Ste 300, Braintree, MA, 02184, US