Tags » Helpdesk Alerts

Key Survey / WorldApp Update: Message from the CEO

Categories: Midd Blogosphere

Here is the message sent by the CEO of WorldApp, Inc. concerning last Friday’s Key Survey downtime. (Key Survey is a software program used to create and distribute surveys, as well as collect and analyze responses.)

From: Oleg Matsko
Sent: Monday, May 18, 2015 9:36 AM
Subject: An update on Friday’s disruption – a message from our CEO

Last Friday’s issues have been some of the most severe issues to affect WorldAPP since we launched Key Survey in 2002. As CEO, I take immense pride in serving organizations across the world in fulfilling their requirements and I feel immensely sorry and hurt that we let those customers down. As such, I feel it is only right that we be completely open, honest and transparent about what happened, and what we are doing to make sure it doesn’t happen again.

A few weeks ago we noticed that one of the storage components of our production environment had started to fail. This in itself doesn’t cause an immediate issue: our production environment is built with multiple layers of redundancy, and despite one of the critical elements of this environment not functioning, our applications continued to work in the manner they should, without any impact on availability. It is important, though, that when these issues occur, we rectify them as quickly as we can, so that should other components of our environment fail, there isn’t any impact on service.

So for the past few weeks we have been preparing our secondary storage components to take over, allowing us to complete the necessary works on the primary components. Our applications collect a lot of data, in fact the equivalent of 11,000 pages of paper an hour, and this amount of data takes a lot of time to transfer. In an absolute emergency we can complete this transfer in about 12 hours, but as our primary setup was still stable, and the risks of transferring such a huge amount of data in a relatively short amount of time are quite high, we took our time and completed this transfer over a period of a few weeks.

This transfer was completed on Thursday evening, our secondary storage components went live without issue, and our primary storage components were taken offline to allow the required maintenance to be completed. For a few hours, everything worked fine, and then at around 08:00 EDT on Friday morning, without notice, our secondary storage components failed. At the moment, the reason why they failed is still unclear; there doesn’t appear to be an obvious cause. We will work hard with our infrastructure partners to find out why this happened – but the most important thing for us to do on Friday was to get our applications back online.

Key Survey and Form.com are incredibly large and complex applications, and restarting them isn’t a simple operation. The applications are made up of many separate modules, each relating to an area of their functionality, such as reporting, voting or our API. The effort required to restart them is large, so much so that they cannot all be restarted at once. As such, modules were restarted individually, in order of priority. Our main Key Survey and Form.com environments were operational by 15:00 EDT, with all of our reporting modules online by 21:30 EDT and specific instances of our applications for individual customers back online by 00:30 EDT on Saturday morning.

As a result of Friday’s disruption, I have instructed our teams to rebuild our storage infrastructure to include additional layers of redundancy with built-in instant failover capabilities. This is no easy challenge: implementing this infrastructure and migrating all our applications will take about a week, but we should be able to complete this without additional disruption. Once these changes are implemented, we will be able to recover our systems in a matter of minutes. This is in addition to the construction of the remote disaster recovery infrastructure, which is already underway and estimated to be completed early next year.

Unfortunately, until these changes have been completed, our secondary storage components could fail again, and this leaves us in a precarious position. Whilst the probability of such a failure is low, and we have taken all possible precautions to ensure it doesn’t reoccur, our teams are prepared to restore services as quickly as possible in the event of a second failure. As the amount of data that is migrated to the new infrastructure increases throughout the week, the amount of time to restore services in the event of an issue reduces. This does mean, though, that should a similar issue occur early this week, we could experience an outage similar to Friday’s.

As mentioned, I want to be transparent about the challenges we face, and honest about what could happen while we take steps to improve our services. We will let you know as soon as this new environment is fully functional and we can be sure that such issues do not cause as much disruption as they have. In the meantime our team are working diligently to monitor and manage our applications to avoid such issues, and are prepared to restore services as quickly as possible in the event of a reoccurrence of Friday’s troubles. I can also assure you that we will investigate thoroughly what caused these components to fail, but for the time being I want to concentrate all our resources on implementing these changes and improving our service to you.

We will support you as much as we can as a result of this disruption – if there is anything WorldAPP can do to assist you with work you weren’t able to complete last week, such as building surveys, forms or reports, please let your account manager know. We’ll endeavour to accommodate as many requests as we can.

Once again I would like to reiterate my thanks for your patience and understanding, and my genuine sorrow that we have let you down. WorldAPP have been a trusted provider of survey, forms and inspection solutions for over 12 years now, and I hope my explanation of what happened, and assurances of the actions we’re taking to ensure it doesn’t happen again, go some way to rebuilding that trust.

Sincerely,
Oleg Matsko
CEO
WorldAPP, Inc.
161 Forbes Rd Ste 300, Braintree, MA, 02184, US

Key Survey / WorldApp Services Restored

Categories: Midd Blogosphere

As of 8:15 pm today (Fri, 5/15/15), Key Survey functionality has been restored.  WorldApp is conducting a thorough investigation and will be sharing full details with us as soon as they are available.

Key Survey / WorldAPP Service Interruption – Update

Categories: Midd Blogosphere

The login and survey access issues with Key Survey have not yet been resolved.  Here is the latest information received from their support team:

From: WorldAPP Support [mailto:support@worldapp.com]
Sent: Friday, May 15, 2015 1:30 PM
Subject: WorldAPP System Interruptions

Today, WorldAPP services, including Key Survey, Form.com and associated applications, have been subject to a service disruption. Below is a brief overview of what caused the issue and the actions we’re taking to restore services as quickly as possible.

Recently, a CPU on one of the servers that our applications use to access our database started failing. Whilst the failure of one CPU doesn’t cause disruption to our services, it does require maintenance so that should the others fail, our applications aren’t impacted. Yesterday evening, our team migrated services to our disaster recovery environment to enable the required maintenance to take place. This is common practice during periods of maintenance to enable continuation of service and has been regularly implemented without adverse effect.

After a few hours of operating on the disaster recovery environment, for reasons yet unknown, the disaster recovery environment failed. Our team took immediate steps to bring the environment back online and are working very hard on restoring services in order of priority, with the most critical services being the first to be restored. As this process continues, we’ll provide further updates on our community pages here.

As we continue to experience service disruption, our applications will remain unavailable and respondents attempting to complete a survey or form will be directed to an error page. We are incredibly sorry for the frustration that this disruption is causing you, and assure you we’re working as hard as we can to restore full service as quickly as possible.

Yours sincerely,
Teresa Crisci
Director of Client Services

By: WorldAPP, Inc.
161 Forbes Rd Ste 300, Braintree, MA, 02184, US

Key Survey Issues — Login and Survey Access Unavailable

Categories: Midd Blogosphere

We are currently experiencing issues with Key Survey (hosted by WorldApp). Users who try to log on will not be presented with the usual login screen; the page simply does not load. Survey recipients will not be able to access surveys and respond at this time.

WorldApp has been notified of these problems.  Updates will be shared here as soon as they are available.

Remember that go/techalerts can be used for quick access to system up/down information and posts concerning outages.

[As of 9:15 am – WorldApp currently estimates that services will be restored in about 30 minutes. All modules are affected; surveys and reports are also inaccessible.]

Systems Maintenance this Sunday, May 17th

Categories: Midd Blogosphere

During our regular maintenance window this Sunday, May 17th, we have the following activities scheduled:

 

  • One of our two Internet service providers will be performing network maintenance starting at midnight that will impact our circuit. We have sufficient redundancy for Internet circuits and there is no expected downtime.

 

We appreciate your patience as we continuously strive to keep our systems functioning optimally.

 

Regards,

Billy

Billy Sneed

ITS – Central Systems & Network Services

Middlebury College

Systems Maintenance Sunday, May 3rd

Categories: Midd Blogosphere

During our regular maintenance window this Sunday, May 3rd, we have the following activities scheduled:

 

  • Middfiles will be patched to the latest vendor software release
    • Between 7:00am and 8:30am there will be two short service outages, together totaling up to 15 minutes
    • This will include the ORGS directories, user home directories, and all class folders
  • The wireless controllers will be updated
    • Between 6:00am and 7:30am all wireless access points on campus will be rebooted, resulting in up to 15 minutes of wireless connectivity downtime throughout the campus, Bread Loaf and the Snowbowl.

 

We appreciate your patience as we continuously strive to keep our systems functioning optimally.

 

Regards,
Billy

Billy Sneed

ITS – Central Systems & Network Services

Middlebury College

Systems Maintenance, Sunday April 12th

Categories: Midd Blogosphere

During our regular maintenance window this Sunday, April 12th, we have the following activities scheduled:

 

  • Middfiles will be patched to the latest vendor software release
    • Between 7:30am and 9:00am there will be two short service outages, together totaling up to 15 minutes
    • This will include the ORGS directories, user home directories, and all class folders
  • EZProxy will be updated to the latest vendor software release

 

We appreciate your patience as we continuously strive to keep our systems functioning optimally.

 

Regards,

Billy

Billy Sneed

ITS – Central Systems & Network Services