Transparency is key ... Azure South Central US Outage
Transparency is difficult at the best of times. When it comes to post mortems, it can make the difference between a customer staying with you and leaving for another service. With the big Azure outage in early September 2018, let’s look at the post mortems from that event.
Hello to all of my readers. I wanted to reach out because I, like many of you, was heavily impacted as a customer when Azure services went down in the South Central US region (San Antonio, TX). When I got to work that morning, my team was firmly in fire-fighting mode, with many of our services offline or degraded during the outage.
While planning for business continuity is important, reacting with the best information available is the first step in any response. After I logged into the system, I checked Twitter, tech blogs, and news sites to see what was being published about the outage, and what I saw was horrible. Much like the AWS S3 (US East) storage outage of February 28th, 2017, many companies were knocked offline by this outage, including systems at Microsoft that are both internally and externally focused.
One of the keys for any technology team has to be transparency with its customers. As a former Director of IT and a current member of an SRE team, I know the balance between transparency and putting out so much information that you scare your customers is a tightrope we have to walk. Many folks feel too much information will scare users and customers away. On the other side of the spectrum, not enough information makes users and customers leave because they feel the service "is a black box" and they get no information about it.
After having read the Post Mortem from the Azure DevOps team (formerly Visual Studio Team Services) and the preliminary Post Mortem from Azure, I think that transparency has been reached. I have always been proud to be part of the VSTS/Azure DevOps team and of our transparency with internal and external customers. At the same time, I have wanted more transparency from other teams at Microsoft, and now I am seeing it from Azure.
Give both of these post mortems a quick read and decide whether they are transparent enough, or too transparent, for your tastes. Work out with your teams how much transparency to give your customers, and plan for it in your communications, including post mortems. Remember that you want a certain level of transparency from your providers, so think about what your customers want from you.
Lightning Image - Copyright 2007, Mike Switzerland
What Does Your DR Look Like? Or "Holy #$*%! Everything is down!"
This is a topic near and dear to me these days. Having suffered a recent outage at my job with over 9 hours of downtime, this is now a major issue for me to work through. Everyone gives disaster recovery (DR) lip service. They come up with ways to back up data, provide alternative network access as they can afford it, and try to create plans. My feeling, based on what I have dealt with at prior positions, is that no one really invests in DR. I hope to provide a few cautionary tales to help you convince your management to make the investment.
Disaster recovery is insurance. All the investment made in DR is insurance against downtime. At the same time, everyone keeps saying "it will never happen to me." I can point to plenty of cases where it does happen, and the outcomes can be brutal for a business. Downtime can lead to lost business opportunities, a change in customer perception that reduces their business with you, the loss of customers entirely, or the complete collapse and closure of the business. To offset these outcomes, businesses invest in disaster recovery to mitigate the impact of downtime.
When looking at DR, the first thing to determine is which systems are critical: which systems would hurt the business most if you lost them. For a manufacturing company, it could be the control systems for its machinery. For a datacenter, it could be the power and networking systems that keep the hosted systems online. For a healthcare company, it could be every system involved in patient care. The IT team needs to sit with the business and management teams to determine which systems are critical, along with all of the infrastructure that supports them.
Now that the critical systems are identified and their infrastructure is determined, a full risk assessment of those systems and infrastructure needs to be completed. Are there devices that are single points of failure? Can servers be connected to the network over diverse paths (also known as NIC teaming)? Can the software be set up with clustering technologies so that more than one server is running and kept in sync? Which equipment is the oldest and has the highest probability of failure? Working through the risk assessment with knowledgeable team members from both the IT and business sides will help find the answers quickly.
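To make that assessment concrete, here is a minimal sketch of how a team might flag single points of failure and aging equipment from an inventory list. The inventory entries, system names, and age threshold are all hypothetical; a real assessment would pull this data from a CMDB or asset database.

```python
from collections import defaultdict

# Hypothetical inventory: (system, component role, component id, year deployed).
inventory = [
    ("order-db", "server", "db01",  2012),
    ("order-db", "server", "db02",  2012),
    ("order-db", "switch", "sw01",  2009),
    ("erp",      "server", "erp01", 2015),
    ("erp",      "switch", "sw01",  2009),
]

AGE_THRESHOLD = 2010  # assumed cutoff: older gear is flagged as elevated risk

# Group components by (system, role); a role backed by exactly one
# component is a single point of failure for that system.
by_role = defaultdict(set)
for system, role, component, _year in inventory:
    by_role[(system, role)].add(component)

for (system, role), components in sorted(by_role.items()):
    if len(components) == 1:
        print(f"SPOF: {system} relies on a single {role}: {next(iter(components))}")

# Flag aging equipment once per component.
for component, year in sorted({(c, y) for _, _, c, y in inventory}):
    if year < AGE_THRESHOLD:
        print(f"AGING: {component} deployed in {year}")
```

Even a simple tabulation like this makes the follow-up conversation with the business teams much easier, because the risks are listed instead of remembered.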
Now that the risks are identified, the professionals need to step in and make plans to mitigate them. That planning can include duplicate systems, cluster creation, backup and recovery techniques, additional networking equipment and lines, and warm/cold spare hardware, to name a few. Each of these plans needs to be fully thought out, including the costs of creation and of ongoing maintenance.
Part of maintaining backup systems is using them, a largely overlooked step in DR planning. Both business and IT teams need to role-play disasters to ensure the policies, procedures, and systems will work. These tests interrupt normal business operations, but they should be done on a regular basis to ensure all systems are go for a real disaster. After each test, the affected teams should get together and review the event to improve policies, procedures, or systems for the future.
I know that what I have said so far is something everyone else has said to their management to push for better DR planning and testing. I have said it myself at times. But going through a large outage that affected my company's business brought it to the forefront for me and got the attention of my company, a company whose business runs 24x7. We lost our primary datacenter, the hosting location for our primary servers and the hub of our network, for approximately 9 hours on a Thursday night, our busiest time of the week. While we had some basic processes and procedures in place, it was thanks to the hard-working teams at my company that we made it through the outage.
During the outage, the primary datacenter lost its primary power at the Automatic Transfer Switch (ATS), the device that lets the facility select either the utility company or its generators as the power source. Not only did the ATS lose power, it literally blew up, blowing out part of the wall behind it. While trying to bring the datacenter's power back online, the facility team also found that a fuse in the transformer was bad, possibly the cause of the whole problem. To replace that fuse, they would have to fail the second power source over from utility to generator so that the utility crew could pull a fuse from the second transformer; the crew did not have a spare on hand, and the alternative was waiting up to 2.5 hours while they fetched one from their warehouse.
While it seemed a simple fix, this would have impacted a part of the datacenter that was still operational and hosting one of their biggest customers. That customer did not want any more change introduced into its hosting systems. As a customer impacted by the continued outage, I pushed the datacenter to start the change with haste. This put the datacenter squarely in the middle between customers.
Eventually this was resolved: the generator was added to the second circuit, allowing the utility to repair the primary circuit. This is where good process and planning helped my team, because we knew which systems had to be started first and in what order to effectively restart our business. Once we got our systems up, the business teams started cleaning up their issues from the outage.
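That restart knowledge is worth writing down as an explicit dependency graph rather than keeping it in people's heads. Here is a minimal sketch, with hypothetical service names and dependencies, of deriving a restart order from declared dependencies:

```python
# A sketch of deriving a safe restart order from declared service
# dependencies, so "what starts first" lives in a runbook instead
# of someone's memory. Names and dependencies are hypothetical.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# service -> the set of services that must be up before it starts
depends_on = {
    "network-core": set(),
    "dns":          {"network-core"},
    "storage":      {"network-core"},
    "database":     {"storage", "dns"},
    "app-server":   {"database"},
    "web-frontend": {"app-server", "dns"},
}

order = list(TopologicalSorter(depends_on).static_order())
print("Restart order:", order)
# One valid order:
# ['network-core', 'dns', 'storage', 'database', 'app-server', 'web-frontend']
```

The same graph can double as documentation during the post-incident review: if a service is missing from it, nobody has thought about when it restarts.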
After the outage, emphasis was placed on all parts of my company to find ways to improve our resilience to outages. This includes alternative network connectivity, secondary datacenters, hardened systems, and improved policies and procedures to reduce the impact on our customers if we have another outage.
I will admit that I wrote this blog entry a while ago but could not finish it until now. It was difficult to read what I had written, because it brought back all of those memories and feelings as if they were happening again. Major service interruptions are difficult for any group. What made this one worse for me was that there was nothing I could do but wait for our hosting provider to fix their facility and services. Since then, they have taken steps to improve their offering to ensure clients like my company do not suffer through something like this again. Improvement can happen for you directly or for your providers and partners.
The key takeaway is that outages will occur. The better your systems and networks are designed, and the more time you invest in both business and IT policies and procedures, the more you can reduce the impact of downtime and keep customers happy during outages. The best outcome IT and business teams can hope for is no customer impact at all while systems are offline or unavailable. No single system can stay 100% available forever, but well-designed systems and networks can offer the "five 9's of availability" (99.999%), or no more than just over 5 minutes of downtime per year.
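The arithmetic behind that 5-minute figure is simple enough to check yourself; a quick sketch:

```python
# Worked arithmetic behind "five 9's": allowed downtime per year
# at a given availability percentage.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for nines, availability in [(3, 99.9), (4, 99.99), (5, 99.999)]:
    downtime = MINUTES_PER_YEAR * (1 - availability / 100)
    print(f"{availability}% ({nines} nines): {downtime:.2f} minutes/year")

# 99.9%   (3 nines): ~525.60 minutes (~8.8 hours) per year
# 99.99%  (4 nines): ~52.56 minutes per year
# 99.999% (5 nines): ~5.26 minutes per year
```

Seen that way, every extra "9" buys a tenfold reduction in allowed downtime, which is why each one costs so much more to engineer than the last.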
What are you doing for your disaster recovery? Is it even a thought for you or your company?
The Good, The Bad, The Ugly … The Wonderful World of Compliance in IT
"Compliance" is often perceived as such a dirty word to IT professionals that it might as well be censored. The mere mention of "compliance" brings about visions of additional paperwork and processes that slow down everyday tasks and project schedules for many IT pros. With newer regulations, be them federal laws such as Sarbanes-Oxley Act of 2002 (SOX or SOX404)[1] or Health Information Portability and Accountability Act of 1996 (HIPAA) [2]; association or vendor regulations like Payment Card Industry Data Security Standard (PCI DSS) [3]; or internal standards created by management, IT teams in both engineering and operations have to work to meet these regulations and standards as a part of their project and daily work. This was a great discussion topic for Denny Cherry and me on his People Talking Tech Podcast. [4]
Want to make an IT team squirm? Schedule a meeting about "compliance" or introduce an auditor or consultant. Let’s be honest: we’re technical people, and we want to make things work as quickly and efficiently as possible. It’s uncomfortable to have someone looking over your shoulder to verify that you are "doing things right," whether you are on the operations team installing and configuring systems or a developer writing code that saves user data to a server or the cloud. But get this: it is actually in our best interest to see it from a different angle. Compliance, standards, and regulations are the IT professional's friend. As with every other aspect of IT, with proper planning and execution, complying with standards and regulations ensures that you have "air cover" for everything you do.
Proper planning for compliance is just like any other IT project: the earlier it is integrated into plans, the easier the execution. This is true for engineering projects run by developers and for integration projects run by operations teams. Compliance can range from small tasks, such as documenting what is built or installed, all the way to deep logging and intricate permissions-management systems. In most situations, there is no "silver bullet" that satisfies a regulation or standard, and any vendor offering one to you or your company should be reviewed carefully. These vendors typically engage non-technical management to sell them solutions that the IT team later has to figure out how to integrate. (If you have good horror stories around this exact situation, feel free to share them in the comments below.) To get ahead of those "snake oil salesmen," be ready to show your management how you currently meet, or will meet, your standards and regulations.
Identify the Requirements
The first step in creating a good compliance plan is understanding what you need to comply with. This can be fairly straightforward for some, such as HIPAA for healthcare, and much more complicated for others, such as combining SOX with PCI DSS, or US state and federal regulations with those of other countries. This step is critical and will require IT professionals to reach out to the business users they support, and possibly to consultants such as lawyers or compliance officers. It may sound simple, but this can be the most difficult step because many regulations are not black-and-white: different people can read the same words and interpret them differently. Documenting your interpretation as it is reviewed will only help you in the future, whether you have to defend it to regulators or revisit it with new staff or consultants.
Create and Socialize the Plan
The second step is creating internal processes, procedures, and standards that address everything you found during discovery. Many companies have unwritten ways of doing things: rules that everyone follows without question and systems they use to track what is done and how they do it. While some companies do a great job of documenting their processes, procedures, and standards, most do not, and getting teams to change this practice can mean a cultural shift.
Where you often see this issue is when smaller companies grow larger. Small companies consider their IT teams agile because they are all generalists, and any member can and does fill all possible roles. As companies and their IT teams grow, specializations occur. The company's business users yearn for those old days when they could call one of the IT team members to get help and the IT team just got it done. The fight to follow processes and procedures while "getting things done" is a constant struggle. Selling the benefits of following processes and procedures internally is one of the toughest things for IT management to do in situations like this.
Document and Organize
The third step is to take all of the documentation and organize it so that it can be reviewed, executed, and tracked. This includes the processes, procedures, and standards themselves, along with evidence of how the team is executing against them. In many cases, regulators can show up without notice and audit the company's compliance; if the information is not in an easy-to-access system, it is useless to both the IT team and the auditors. Again, there is no "silver bullet" here. Every team needs to find the solution that works for them. For some teams, it could simply be a file share with Word and Excel documents; for others, it might take a self-developed or commercially available software package.
All of this seems easy on paper, but there are no quick solutions or answers. IT teams, management, and business users need to take the time to understand what needs to be done and by when in order to meet regulations and standards. Sometimes one group may ask for more than the regulations require; this needs to be tempered with timelines and costs. Lastly, remember that it takes time for people to adopt changes, and anticipate that when creating the project execution plan. By working together with realistic timelines and good communication, a proper compliance plan can be executed.
Plan for Continuous Improvement
Once the compliance plan is pulled together and the users and IT team are following it, the best thing to show most regulators and auditors is a continual improvement process. Regulators, auditors, and compliance officers love to see improvement. In some ways, it is better to create an initial plan and slowly improve it over time than to try to create the "perfect plan" up front. Ongoing improvement is read as a sign of a healthy compliance culture, which makes the improvement process just as important as the initial compliance effort.
IT compliance does not need to be a dirty word. Everyone has his or her stories of the good, the bad, and the ugly of compliance: stories of well-executed plans, of bad or nonexistent plans, of regulators imposing large fines and sanctions. Take the time to prepare and execute the best compliance plan you can with the resources available. Once that plan is in place, create an environment that makes ongoing improvement as painless as possible, so that compliance is something everyone understands and wants rather than an impediment to their work.
How do you feel about IT compliance? Do you have stories to share, whether good or bad? If so, put them into the comments below.
This was cross-posted by Veronica Wei Sopher on the Born to Learn blog at MS Learning. You can check it out specifically at http://borntolearn.mslearn.net/btl/b/weblog/archive/2013/01/28/the-wonderful-world-of-compliance-in-it.aspx, along with other great posts at http://borntolearn.mslearn.net/. Special thanks to Veronica for helping with this posting.
Notes:
1 - More information about Sarbanes-Oxley Act of 2002 (SOX or SOX404)
2 - More information about Health Insurance Portability and Accountability Act of 1996
- http://www.hhs.gov/ocr/privacy/hipaa/understanding/index.html
- http://www.cms.gov/Regulations-and-Guidance/HIPAA-Administrative-Simplification/HIPAAGenInfo/index.html
3 - More information about Payment Card Industry Data Security Standard
4 - Direct link to my appearance on People Talking Tech, January 22nd, 2013