
The 5 key steps to evolve your Infrastructure team in the Cloud Era


ViFX has the benefit of working with many customers who rely on large scale IT infrastructure. The ViFX Managed Operations team has worked alongside many of them as they transition business systems and their dependencies away from on-premise Virtual Infrastructure to either cloud services or cloud-based IaaS.

For large enterprises, both on-premise and cloud-based services will be in place for a long time. Yes, I’ve avoided the cliché description as many large enterprises will never have cloud-like internal infrastructure – in many cases the value is not there. I’ve also omitted containers from this blog – we’ll cover those in a separate blog entry soon.

The goal of any effective operational team boils down to three basic principles:

  • Optimise performance
  • Maximise availability
  • Minimise cost

Cost minimisation aside, the performance and availability objectives can be encapsulated in the eradication of Preventable Service Disruptions (PSDs). If, after a service disruption, Root Cause Analysis shows that the disruption was preventable, then the Ops team is deemed not to have been effective.

So with those fundamental objectives defined, and assuming they are effectively in place, how does an organisation begin to transition services to cloud from an operational perspective, assuming architectural choices have already defined the particular cloud service(s) to consume?

Step 1 - Learn the destination technology

At this point in the evolution of cloud services, most organisations moving to them will not be familiar with them. Before any management tooling, role definition or recruiting can occur, the organisation needs to comprehend the destination technology, its idiosyncrasies, strengths and weaknesses, and how it fits the organisation’s particular use case. In many cases it may be beneficial, or even necessary, to bring in specialist expertise from outside the organisation to map this out.

Step 2 - Design operational management processes, techniques, tools, and automations

Once the destination technology is understood, the operational management techniques, systems, processes, and tooling can be selected and put into place. No large scale public cloud service offers comprehensive out-of-the-box management capability, so third-party tooling is necessary.

In a previous blog post on why people matter when it comes to public cloud, I called out the implications of shrinking infrastructure visibility when managing cloud services, so Application Performance & Cloud Management skills (think New Relic, AppDynamics, Stackify) become necessary. You’ll also need a centralised log aggregation tool (Splunk, ELK stack, Stackify to an extent) to enable correlation of events when troubleshooting.

At scale, organisations will also find that working through Public Cloud GUIs is cumbersome, so advanced scripting is required. Organisations moving to cloud services should adopt an "automate first" policy, in order to start building out a script library that can later be consolidated into tools such as Chef, Puppet, Salt or Ansible. The Ops engineer in the cloud era needs to swap the screwdriver for a git repository!
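
As a rough illustration of what an early entry in that script library might look like, here is a minimal Python sketch that works out which development VMs could be stopped outside business hours. Everything in it is an assumption made for the example: the inventory records, the "environment" tag, the 07:00 to 19:00 weekday window and the function names are all hypothetical, and a real script would pull the inventory from the provider's API or CLI and then act on the result.

```python
from datetime import datetime

# Hypothetical inventory records. A real script would fetch these from the
# cloud provider's API or CLI rather than hard-coding them.
INVENTORY = [
    {"name": "web-prod-01", "state": "running", "tags": {"environment": "prod"}},
    {"name": "build-dev-01", "state": "running", "tags": {"environment": "dev"}},
    {"name": "test-dev-02", "state": "stopped", "tags": {"environment": "dev"}},
    {"name": "mystery-vm", "state": "running", "tags": {}},
]


def is_business_hours(now, start_hour=7, end_hour=19):
    """Return True on weekdays from start_hour (inclusive) to end_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def plan_shutdowns(inventory, now):
    """Return the names of running dev VMs that could be stopped right now."""
    if is_business_hours(now):
        return []
    return [
        vm["name"]
        for vm in inventory
        if vm["state"] == "running" and vm["tags"].get("environment") == "dev"
    ]


if __name__ == "__main__":
    # Dry run only: report what would be stopped, but change nothing.
    for name in plan_shutdowns(INVENTORY, datetime.now()):
        print(f"would stop: {name}")
```

Keeping the decision logic in a plain function, separate from whatever actually stops the VM, makes the script easy to test and to review in a git repository before it is allowed to change anything.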

In an earlier blog we covered off a variety of Azure automations to control sprawl. Organisations will find that without them, bill shock is almost guaranteed. It is not unusual to receive usage statements from Public Cloud providers with tens of thousands of line items – one customer’s monthly usage statement runs to 40,000+ line items! It is necessary to tag all objects in order to begin to interpret the usage statement. Microsoft has made good strides forward in this regard, as it’s now possible to get a reasonable amount of detail through the Power BI plugin; however, in most cases further analysis is required, particularly for service providers.
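
To show why tagging matters for interpreting a usage statement, here is a small Python sketch that rolls line items up by a tag and collects anything untagged into its own bucket. The CSV layout, the "cost_centre" tag name and the figures are invented for the example; real provider exports use different columns and are far larger.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# Hypothetical usage export. Real statements have different column names
# and tens of thousands of rows.
USAGE_CSV = """resource,cost_centre,cost
vm-web-01,retail,120.50
vm-web-02,retail,98.25
sql-db-01,finance,310.00
storage-misc,,45.75
"""


def cost_by_tag(csv_text, tag_column="cost_centre"):
    """Sum the cost column per tag value; rows with an empty tag go to UNTAGGED."""
    totals = defaultdict(Decimal)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[tag_column] or "UNTAGGED"
        totals[key] += Decimal(row["cost"])
    return dict(totals)


if __name__ == "__main__":
    totals = cost_by_tag(USAGE_CSV)
    # Largest spend first, so the biggest contributors are easy to spot.
    for tag, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
        print(f"{tag}: {total}")
```

The size of the UNTAGGED bucket is a useful number in its own right: the larger it is, the less of the statement can be attributed to an owner.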

Step 3 - Define roles and KPIs and recruit to them

As also mentioned in a previous blog, it is necessary to specify the roles required to operationally manage cloud consumption, as the skills, interests and attributes required are quite different from those of a traditional infrastructure engineer. If the projected utilisation of cloud services is low at first, the role can initially be assigned to someone who also holds another role, although beware the dangers of mixed accountabilities, as per the next step.

Step 4 - Partition duties

Leading on from the previous step, the skills, attributes and, most importantly, interests required are in many cases quite different from those of traditional infrastructure engineers. Some will make the change, but many won’t. If someone is accountable for two roles, it is essential that they are given the time, training and opportunity to execute the requirements of each role. It can be a difficult transition, so patience is often required.

Step 5 - Manage closely

The last step, but definitely not the least important, is ongoing management. There are a few factors that make this unique, and if ignored will cause pain:

  • Public Cloud consumption incurs costs immediately – bill shock is real and easy to achieve.
  • There will likely be operational resources new to the technology who are learning and possibly not making the best initial decisions.
  • The technology is evolving at an astonishing rate – the right way to do something today is often obsolete or unnecessarily expensive tomorrow.

Setting clear reporting metrics, staying in frequent contact with the cloud ops team, and keeping a close eye on all aspects of cloud consumption (for example proactive alerting on deployment of expensive resources, and weekly if not daily tracking of costs) are important to manage cloud consumption and prevent bill shock. So too is staying in contact with relevant app owners to confirm that performance is adequate and is therefore being managed correctly.
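
The cost tracking described above can start very simply. The Python sketch below flags any day that exceeds a daily budget or jumps sharply against the previous day. The budget figure, the 1.5x spike ratio, the sample costs and the function name are assumptions chosen for illustration, not recommended values.

```python
def cost_alerts(daily_costs, daily_budget, spike_ratio=1.5):
    """Return alert messages for days over budget or well above the previous day.

    daily_costs is a list of (day, cost) tuples, oldest first.
    """
    alerts = []
    for i, (day, cost) in enumerate(daily_costs):
        if cost > daily_budget:
            alerts.append(
                f"{day}: cost {cost:.2f} exceeds daily budget {daily_budget:.2f}"
            )
        if i > 0:
            previous = daily_costs[i - 1][1]
            # Skip the ratio check when the previous day had no spend.
            if previous > 0 and cost > previous * spike_ratio:
                alerts.append(
                    f"{day}: cost {cost:.2f} is more than "
                    f"{spike_ratio:g}x the previous day"
                )
    return alerts


if __name__ == "__main__":
    # Hypothetical daily totals taken from a usage export.
    sample = [("2016-03-14", 400.0), ("2016-03-15", 420.0), ("2016-03-16", 900.0)]
    for message in cost_alerts(sample, daily_budget=800.0):
        print(message)
```

In practice the messages would be sent to whoever is accountable for cloud spend rather than printed, and the thresholds tuned to the organisation's own consumption pattern.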

This blog highlights the 5 key considerations; it is not intended to be an exhaustive list. ViFX are specialists in both on-premise and cloud-based Managed Services. If any of the topics discussed in this blog resonate, or highlight that some assistance could be required, please feel free to contact us.

Do you agree with these steps? What else is your team focusing on to evolve in the cloud era?

Author: Symon Thurlow

Symon is responsible for the evolution of our managed services for private and public cloud, virtual infrastructure and data protection services.

16 March 2016