11 problems with AWS – part 11

It’s good to finally bring this series to a close. In hindsight, launching an 11-part series was quite a slog… And I caught the plague that has been going around Melbourne, which disrupted my blogging schedule.

We’ve talked previously about cost management in the cloud, but there are other financial considerations that present as problems – or opportunities – depending on the context.

Two seldom-considered issues around cloud adoption are tax and insurance.

Tax

DEATH and TAXES (paul stumpr via flickr)

How do traditional taxing methods apply in AWS and cloud solutions in general? Some areas to consider include:
  • Residency – Could moving to the cloud enlarge your company’s presence in another country enough to create a “permanent establishment” there, making it liable for tax in that country?
  • Source of income – Most jurisdictions tax an entity if the income has been ‘sourced’ in that jurisdiction. A business conducted in the cloud creates practical difficulties in determining where income is sourced.
  • Transnational tax agreements – Nations assign taxing rights to one jurisdiction or another to avoid double taxation of people and companies. These treaties weren’t written with cloud computing in mind, so it can be difficult to work out where, and by whom, tax is collected.
  • Indirect taxes – Indirect sales and consumption taxes may or may not be included on your bill, depending on whether Amazon is required to collect them. If they aren’t, a significant liability might accrue. Best be careful!
  • Changes in in-house infrastructure – In-house infrastructure is a long-term asset that can be written down over time, and these write-downs are sometimes tax deductible. Reducing your in-house IT might affect these long-term write-downs.
I’m an IT guy, so this definitely isn’t tax advice. Get some knowledgeable, and probably expensive, experts to check things out, even if they aren’t much fun to hang out with. KPMG also have some useful resources if you’re really keen to read further (Video, PDF).

Insurance

Another curious new area of finance to consider is cloud insurance. These products have evolved because cloud providers don’t cover the real cost of service outages and data loss. Specifically, cloud insurance works to mitigate two risks:

  • Service Outage – Providers like AWS only provide service credits when an SLA is breached. The real cost of an outage has to be covered by the customer, and that cost depends very much on which business function was running on AWS. Cloud insurance in this instance can offset some of the financial pain and so mitigate some of the risk. Insurance could be sought independently or through the cloud provider itself.
  • Data Breach/Loss – This doesn’t apply only to AWS or the cloud; it could apply to in-house infrastructure too. A company’s information and data are worth many times more than the computing equipment itself. Insurance provides some financial protection against a company’s data risk in the cloud, which can help with timely restoration and rectification after a data loss or breach.

Cloud insurance can be bought through the MSPAlliance and companies like Cloud Insure. Check them out.

**********

Really though, I find all this finance malarkey incredibly boring. Researching this article I felt there were valid points to consider but also I had the feeling there was a fair smattering of FUD – which we would never resort to in IT of course!

What do you guys think? Are these issues real, and these products worth investigating?

11 problems with AWS – part 10

It’s been a slow writing week. Southern-hemisphere winter colds have taken over my family and sleep has been scarce.

Today’s problem is that of “cloud skills”. Cloud is a rapidly growing IT sector. Despite the many opportunities, many IT professionals are unsure how to get skilled in this discipline. Many people also see cloud computing as a continuation of the offshoring/outsourcing paradigm, and therefore a threat! And many companies are themselves unsure how to skill their people for the job.

1..2..3! Pokot and flash – Kenya (Eric Lafforgue via Flickr)

Amazon Web Services is the de facto standard and the dominant market player, but its technology is still quite new, so there’s a shortage of skilled people. AWS only recently announced its Global Certification Program, which is aimed at Solutions Architects, SysOps Administrators (the future of sysadmin?) and Developers. Having completed the AWS certification myself, I can confirm that cloud computing is the ultimate generalist skill set.

A good “cloud person” will have many of these skills:

  • Technical: Duh! The ability to spin up applications quickly on the Internet. Developers will have strong Java, .NET and open source skills, plus an understanding of sysadmin, caching, networking, orchestration, security and virtualisation.
  • Security: Will be able to assess the risk and consequences of using AWS and make specific security recommendations for these deployments (encryption at rest, key management, data obfuscation, user management, audit logging).
  • Service Management: Able to assess the service impact of AWS deployments and integrate with existing Service Management tool sets and practices. Able to read the fine print in vendors’ contracts and call them on their shortcomings.
  • Business: Have skills in one of the following areas: Enterprise Architecture, Business Analysis or Project Management. Be able to speak the language of both IT and business and interpret between these two worlds.
  • Data Integration: Understand the best methods of integrating data between cloud and in-house platforms. A good understanding of SOA principles.
  • Mobile: Understand the key drivers and constraints of mobile development and platforms.
  • Financial and Contract: A company’s legal, commercial and project management teams will need training. IT staff will need to be able to assess and compare the costs of in-house solutions (high fixed cost, low ongoing cost) and AWS-based solutions (low fixed cost, potentially high ongoing cost – see the point about having good project management skills). Be able to contribute to business case development.
  • The Cloud Roadmap: Technical recommendations and alternatives in cloud implementation. Understanding where your enterprise is on the path to cloud maturity.

If the silos of big IT (servers, networking, storage, DBA etc.) were deep and impenetrable, these new AWS skill sets are very wide and possibly shallow. I suspect specialists will never go away, though, despite my own tendency towards all things general.

Of course, you only ever need the skills to do the job at hand. My list hints at where your job, and possibly your career, could evolve. Has anyone any personal experience of this? Did I miss anything?

11 problems with AWS – part 9

Today’s sticky topic is that of SLAs.

As stated in their EC2 and EBS SLA, “AWS will use commercially reasonable efforts to make Amazon EC2 and Amazon EBS each available with a Monthly Uptime Percentage (defined below) of at least 99.95%”. If the SLA is not met, a percentage service credit is given, not a refund. An outage is defined as follows:

Service Elevator (Sam Howzit via Flickr)

  • EC2 outage – all of your instances have no external connectivity, and this is occurring in more than one Availability Zone in a particular Region. There is no per-instance SLA target.
  • EBS outage – all of your attached volumes perform zero read/write IO, with pending IO in the queue.

The SLA was only recently updated to include EBS. A failure in EBS precipitated some of the more infamous AWS outages. It’s not surprising, as many AWS services depend on EBS (Elastic Load Balancing, Relational Database Service, Elastic Beanstalk and others), so when EBS fails they fail too.

AWS makes the following SLA commitments:

[table th=”1″]

AWS service, SLA, Notes

EC2/EBS,99.95%,

S3,99.9%,measured by the error rate percentage

RDS,99.95%,only applies to Multi-AZ RDS instances

Route 53,100%,

[/table]

Regarding the S3 SLA, a service credit is given when uptime drops below 99.9%, but even when uptime drops below 99%, only a 25% credit is given. This is rather interesting, because S3 is actually designed for 99.99% availability.

All of the AWS SLAs apply within a single region, so if you require a better SLA you need to spread your application across multiple Amazon regions, or other providers. Based on common sense and history, this reduces the likelihood of an outage, but Amazon make no actual uptime commitment in this case.
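To see why spreading across regions helps, here is a back-of-the-envelope sketch. It assumes region outages are independent and that any one healthy region can carry all of your traffic. Both assumptions are generous (real outages are often correlated), so read the result as an optimistic upper bound, not anything Amazon commits to.

```python
# Back-of-the-envelope availability for a multi-region deployment.
# ASSUMPTIONS: region outages are independent, and any single healthy
# region can serve all traffic. Neither is guaranteed by AWS.

def combined_availability(region_availability: float, regions: int) -> float:
    """Probability that at least one of `regions` identical regions is up."""
    probability_all_down = (1.0 - region_availability) ** regions
    return 1.0 - probability_all_down


for n in (1, 2, 3):
    print(f"{n} region(s): {combined_availability(0.9995, n):.6%}")
```

With 99.95% per region, two regions come out at roughly 99.99997% under these assumptions, which is exactly the kind of number Amazon won't write into an SLA.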

If you decide you need two or more regions to meet your availability target and you live pretty much anywhere apart from the USA, you’re sort of stuck – especially if you have data sovereignty and latency requirements – because most countries have only one region.

The gold standard in SLAs is the famous five nines (99.999%), which amounts to about 5 minutes of outage each year. This has been around, and achievable, since the 80s. Why can AWS only commit to 99.95% (about 4.4 hours of outage a year)? Well, it’s mainly due to complexity. Getting five nines of reliability out of a single server, or a single program, isn’t as hard because there’s only one part to control. Cloud computing has many moving parts. Faced with this reality, many people are asking, “Do I really need five nines?” Good question. A lot of the time, probably not.
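The arithmetic behind those outage figures is simple enough to sketch. This assumes a 365-day year; the EC2 SLA is actually measured per month, so treat it as an annualised approximation.

```python
# Convert an uptime percentage into the outage time it permits per year.
# ASSUMPTION: a 365-day year. SLAs like the EC2 one are measured monthly,
# so this is only an annualised approximation.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def allowed_downtime_hours(uptime_percent: float) -> float:
    """Hours of outage per year allowed by the given uptime percentage."""
    return HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0


for sla in (99.9, 99.95, 99.99, 99.999):
    hours = allowed_downtime_hours(sla)
    print(f"{sla}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

99.95% works out to roughly 4.4 hours a year, and five nines to just over 5 minutes.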

So accept the SLA limitations of AWS, if you can, and move on because if you’re not happy with them it’s not like you can say anything publicly. As it states in the Customer Agreement, “You will not issue any press release or make any other public communication with respect to this Agreement or your use of the Service Offerings”.

The saving grace is that Amazon uses AWS itself and in the past has been more generous to its customers than required by its own SLAs when there were significant outages.

[stextbox id=”info”]In many presentation slides Amazon quotes the durability of S3 as 99.999999999%. Eleven nines! As they state in their FAQ, “if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years.” A big number. Very impressive. Pretty pointless. S3 may go down, but they’re pretty sure your data is safe: durability is about data loss, not service availability.

The important part is that S3 is designed to survive the loss of data in two data centres. You can purchase reduced redundancy storage (S3 RRS) to save on cost, but this is designed to only survive the loss of one data centre, which is still pretty good.

In any case, I think you, or your application, are more likely to stuff up your data. Actually, with eleven nines it’s more likely that your business, AWS or the world economy will fail first, which shows how pointless it is to plan to eleven nines![/stextbox]
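The FAQ's durability example is easy to reproduce. A quick sketch, assuming "eleven nines" means each object independently has about a 1-in-10^11 chance of being lost in a given year:

```python
# Reproduce the S3 FAQ example from its durability figure.
# ASSUMPTION: "eleven nines" means each object has roughly a 1-in-10^11
# chance of being lost in any given year, independently of other objects.

ANNUAL_LOSS_PROBABILITY = 1e-11  # 1 - 0.99999999999
OBJECTS_STORED = 10_000

expected_losses_per_year = OBJECTS_STORED * ANNUAL_LOSS_PROBABILITY
years_per_lost_object = 1 / expected_losses_per_year

print(f"Expected objects lost per year: {expected_losses_per_year:.1e}")
print(f"Average years between single-object losses: {years_per_lost_object:,.0f}")
```

10,000 objects times 10^-11 gives 10^-7 expected losses a year, or one object every ten million years, matching the FAQ.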

11 problems with AWS – part 8

The saga continues… Today I’d like to talk about the issue of software licensing in AWS.

Software licenses used to restrict the use of the software to hardware owned by the licensee. Some software licenses are still this way and you may find yourself in a situation where the software runs fine in AWS but you are in breach. This is becoming less of a problem though as more software companies allow their software to run on AWS.

Let’s look at Oracle as an example. If you wish to run Oracle RDBMS in AWS you have the following options:

  1. “Bring Your Own License” – (BYOL) – Companies with existing Oracle Database licenses can run Oracle RDBMS on EC2 instances.
  2. Oracle RDS instances (both On-Demand and Reserved) – Requires no pre-existing licenses. Companies pay a simple, hourly rate per RDS instance. The rate depends upon the database edition and instance size.

Sonya (http://www.flickr.com/photos/joshuacraig)

If you choose the BYOL option, you must follow the rules outlined by Oracle. These rules look simple but limit your elasticity and flexibility, and are difficult to track. But at least you get to keep using a software asset you have already bought.

If you choose the second option, RDS, you get elasticity and flexibility but no asset. You are effectively renting Oracle RDBMS. This means the rental cost needs to be charged against a project or to IT (suckers!). It may also “break” the business case for hosting a solution in AWS in the first place, because ongoing costs, once you factor in predicted usage, future growth and short-term spikes in demand, can grow dramatically.
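One way to see how usage patterns can make or break that business case is to find the month in which cumulative rental overtakes owning. This is a minimal sketch of the license component only; every price below is a made-up placeholder, not an Oracle or AWS figure, and it ignores infrastructure, tax treatment and discounts.

```python
# Hypothetical rent-versus-own comparison for the license component only.
# All prices are placeholders, not Oracle or AWS list prices.
from typing import Optional

LICENSE_PURCHASE = 47_500.0   # hypothetical up-front license cost
ANNUAL_SUPPORT_RATE = 0.22    # hypothetical yearly support, as a fraction of purchase
RENTAL_PER_HOUR = 3.00        # hypothetical license-included hourly premium
MAX_MONTHS = 120              # model a ten-year horizon


def breakeven_month(hours_used_per_month: float) -> Optional[int]:
    """First month in which cumulative rental exceeds cumulative ownership cost,
    or None if renting stays cheaper over the whole horizon."""
    monthly_support = LICENSE_PURCHASE * ANNUAL_SUPPORT_RATE / 12
    for month in range(1, MAX_MONTHS + 1):
        own = LICENSE_PURCHASE + monthly_support * month
        rent = RENTAL_PER_HOUR * hours_used_per_month * month
        if rent > own:
            return month
    return None


print("Running 24x7:", breakeven_month(730))  # roughly 730 hours in a month
print("Running 100 h/month:", breakeven_month(100))
```

Under these invented numbers an always-on instance overtakes ownership after about three years, while light usage never does, which is why predicted usage matters so much to the business case.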

In the past, enterprise software has been licensed by many methods, including the number of users, the hardware and even the operating system. Enterprises might also have had multi-year enterprise agreements covering many software products. These existing investments and arrangements act as constraints on your AWS plans.

To confuse things more, other vendors have different arrangements. Microsoft only allows certain products to be brought to AWS, and you must go through a verification process to bring your own license.

If you’re a developer using licensed software all this seems rather silly and peripheral to your job. If you’ve ever been audited for licensing compliance by a major vendor though you know what a potentially expensive and time-consuming pain this is.

The AWS Marketplace, where software companies pimp their wares, is growing rapidly and now includes most of the major software vendors. The software licensing issues will converge over time and be better understood, but for now they are confusing and difficult to manage.

11 problems with AWS – part 7

I’ve been busy studying for the AWS Certified Solutions Architect certification, so it’s been a couple of weeks since my last blog. There are no study guides or sample exams for this, so I’ll put together my notes here. Plonk me in your Twitter, RSS or email subscriptions if you’re interested. And I crashed my scooter… And then Bluehost crashed my server. It’s been eventful.

We’ve got to Part 7 in this series and not mentioned the S word: Security!

Talk to your AWS representative and they will swear until they’re blue in the face that their services are as secure as an on-premises solution (or even more so). They’ll quote security standards such as SAS 70 (since replaced by SSAE 16 and the SOC reports), ISO 27001/27002, PCI-DSS and CSA. And it’s all true… sort of. For the elements of infrastructure AWS look after (physical security and virtualisation layers), I’m confident they do a great job. But the problems with security in AWS have less to do with Amazon and more to do with the nature of “cloud disruption”.

Security products and enterprise security models have grown up around the “rings of trust” model shown below. Security products are typically located at, and built for, the perimeter, and enterprise security is typically like a castle: big outside walls to keep the bad guys out, but once they get in they can do pretty much whatever they like. There are some great papers at The Jericho Forum about this phenomenon and its implications (they call it ‘deperimeterization’! Thanks guys… that’ll make it easy to explain to my boss).

Rings of Trust

With cloud computing you have multiple sets of these “rings” possibly in different clouds, and you may not control all the “rings”! For example, you don’t have full control of the network “ring” in AWS. Security practices around protection of data at rest, data in transit, global identity management and ubiquitous data classification must be considered for all cloud-based platforms. How much will this cost to implement?

The Information Security industry is playing catch-up. There are interesting developments in Security-as-a-Service (eg. Imperva), federated identity management, dedicated VPNs etc., but other problems are hard to solve. How do you manage DLP, log management (hint: data transfer costs can kill you) and encryption (urgh key management!) in a cloud environment without limiting or removing the advantages of using cloud services in the first place?

There is an inherent security problem in the AWS operating model too. To use infrastructure efficiently, customer workloads share physical servers, so the hypervisor boundary that separates AWS customers must be secured. Hypervisor exploits do exist! There’s also the problem of the “noisy neighbour” that can impact the availability of your platform. Amazon provides a solution to this: you can purchase dedicated instances that run on hardware dedicated to a single customer, but you’ll pay 30-50% more.

Another school of thought is that security can be “managed” contractually but this has problems.

  1. AWS is a retail provider of IT infrastructure that makes money by providing a standard platform. Writing different contracts with different provisions for different customers costs money, to them and ultimately you.
  2. Amazon quite plainly states that they run a shared responsibility when it comes to security. Companies can and do install insecure software and configure API access insecurely.
  3. You cannot outsource your reputation. If there is a security breach, your name will be in the newspapers – and even worse, the blogosphere – not your provider’s.

That’s my brief-ish brain dump on security issues relating to AWS. Your thoughts? What have I missed? Or overstated? Or understated?

11 problems with AWS – part 6

Today’s issue may also be described as a great opportunity. Deploying applications to AWS requires that developers lift their game. I’m a systems guy from way back, so I’m sceptical about the ability of development teams to change. I’ve seen a lot of dodgy code. The problem is that AWS forces change on you as a developer, and there are many development groups who won’t be able to adapt.


Geek Graffiti (Pablo Barrera via Flickr)

A few years ago, Platform-as-a-Service providers such as Heroku and Engine Yard started appearing that concerned themselves only with application platforms. The entire infrastructure stack, including hardware, operating systems, middleware and databases, was managed by the cloud provider.

In response, Amazon created Elastic Beanstalk, which manages the deployment, capacity provisioning, load balancing, auto-scaling and monitoring of your application. It behaves as a “PaaS-like” service while still leaving you with control of the underlying infrastructure. It’s also free: you pay only for the underlying AWS resources your application uses.

Amazon have done a lot of work to minimise the cost of transitioning applications from internal IT to AWS, by providing integration with popular IDEs such as Eclipse and support for common languages such as Java, PHP and Node.js.

Despite all this, there are some gotchas. Your development teams need to overcome these or the benefits of AWS and the cloud will not be realised. In no particular order:

  1. Amazon Auto Scaling allows application infrastructures to scale up (for days when marketing has done their job) and down (for night time). Applications must therefore be stateless and handle the fluid creation and destruction of application servers; previously, an in-house deployment would have been pretty static. Developers will now need to “get” what load balancing and session persistence mean. There is little point developing for AWS if you don’t use Auto Scaling.
  2. Amazon integrates CloudFront and its caching service (ElastiCache) for better performance of static objects and of applications with high read requirements. Content Delivery Networks (CDNs) such as CloudFront distribute static objects closer to end users wherever they are globally. Internal IT infrastructure teams have typically managed application caching, but this now needs to be considered and understood up front by the development team.
  3. Amazon provide a Relational Database Service (RDS). Effectively Amazon becomes your DBA. Any application performance issues relating to the database tier can typically be hard to analyse. Development teams will need to be cognizant of database performance issues.
  4. Amazon provide a bunch of other native services such as Amazon Fulfilment Web Service (FWS), AWS Identity and Access Management (IAM), Amazon CloudSearch, Amazon Simple Workflow Service (SWF), Amazon Simple Queue Service (SQS), Amazon Simple Notification Service (SNS) and Amazon Simple Email Service (SES) that developers need to understand in order to exploit them. This also represents a kind of vendor lock-in that makes extraction from AWS difficult.
  5. Amazon provide a detailed API for creating and managing Amazon instances. The developer can now manipulate these objects as part of the deployment cycle or even as part of the application. SDKs for most major languages are available that allow deep integration with native Amazon services such as DynamoDB, EC2 and S3 directly. APIs provide great flexibility but your developers need to understand how to exploit them.
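To make the first point concrete, here is a minimal sketch of a stateless web tier. Session data lives in a shared store rather than in any one server's memory, so Auto Scaling can add or remove servers without users losing their sessions. The class names are hypothetical, and the in-memory store is only a stand-in for an external service such as ElastiCache or DynamoDB.

```python
import itertools
from typing import Dict


class SharedSessionStore:
    """Stand-in for an external session store shared by every app server."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, int]] = {}

    def load(self, session_id: str) -> Dict[str, int]:
        # Return a copy so servers can't change shared state without saving.
        return dict(self._data.get(session_id, {}))

    def save(self, session_id: str, session: Dict[str, int]) -> None:
        self._data[session_id] = dict(session)


class AppServer:
    """Keeps no per-user state in memory; every request reads and writes the store."""

    def __init__(self, name: str, store: SharedSessionStore) -> None:
        self.name = name
        self.store = store

    def add_to_cart(self, session_id: str) -> int:
        session = self.store.load(session_id)
        session["cart_items"] = session.get("cart_items", 0) + 1
        self.store.save(session_id, session)
        return session["cart_items"]


store = SharedSessionStore()
servers = [AppServer("web-1", store), AppServer("web-2", store)]
load_balancer = itertools.cycle(servers)  # round robin, no sticky sessions

for _ in range(3):
    server = next(load_balancer)
    print(server.name, "cart items:", server.add_to_cart("user-42"))

# Simulate Auto Scaling terminating web-2 at night: the session survives.
servers.pop()
print("web-1 after scale-in:", servers[0].add_to_cart("user-42"))
```

Because each request can land on a different server, a design that kept the cart in server memory would lose it when web-2 is terminated; with the shared store, the user's cart simply carries on.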

11 problems with AWS – Part 5

As catch-up, I’m writing two blogs this week. This series will end one day, and then I’ll be able to write some positive stuff about AWS too. I am actually a big fan. Today’s topic is about web services integration.

A good definition of a web service is a software function provided at a network address, designed to support interoperable machine-to-machine interaction over a network. There is a range of different approaches, including SOAP (XML over HTTP) and RESTful interfaces, which commonly exchange JSON or XML. Web services are used everywhere and for everything these days.

Applications can be constructed by mashing together multiple web services that can be hosted anywhere. Web services are used to manage all layers of a system: the infrastructure (a great little article on AWS APIs), supporting software (eg. Heroku) and the application itself (eg. Salesforce). These are all examples of APIs represented through a web service.

Over time, as platforms and applications are constructed from many disparate web services hosted both internally and externally, your organisation’s architecture (including application, security and network) becomes increasingly entangled. I’ve tried to represent this in the diagram below.

Web Service Sprawl

This confusion was already a problem when an application was predominantly in-house, but is an order of magnitude more complex when mashed with the cloud. Microsoft Azure allows integration via an outbound relay connection. Amazon allows integration via a network-level VPN. All web services can be connected via web proxies, and there are many ways of implementing these.

One way of limiting this confusion is to implement an integration bus somewhere in the stack. This limits the number of point-to-point web service connections and therefore drives standardisation on a common set of security and network integration points.
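A rough way to see why a bus helps is to count connections. This sketch assumes the worst case, where every service needs to talk to every other service; real environments sit somewhere between the two figures.

```python
# Count integration links: a full mesh of point-to-point web services versus
# every service connecting once to a central integration bus.
# ASSUMPTION: worst case, every service talks to every other service.

def point_to_point_links(services: int) -> int:
    """Links needed when each pair of services is wired directly."""
    return services * (services - 1) // 2


def bus_links(services: int) -> int:
    """Links needed when each service connects only to the bus."""
    return services


for n in (5, 10, 20):
    print(f"{n} services: {point_to_point_links(n)} point-to-point links "
          f"vs {bus_links(n)} bus links")
```

The point-to-point count grows roughly with the square of the number of services, while the bus grows linearly, and each bus link is a place to apply the same security and network controls.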

My question today is: How would you go about limiting and controlling this confusion? Thoughts?

11 problems with AWS – Part 4

It’s been a few weeks since my last update, with a holiday in Fiji, a sick child and a house renovation. Today’s issue is “Cost Management” (which I can relate to with those renovations!)

Most IT departments have at one stage considered implementing a charge-back model for their services. Some mad people have actually implemented one. The idea is that an IT department charges back costs to business units who then become magically accountable for their spending.

kevin dooley (via Flickr)

But tell business folks you’re going to bill them and they push back. Business managers don’t like bills, especially for things that used to be free. Charge-back can also be complicated and expensive to implement.

Enter the Cloud! AWS and other cloud services let business units spin up their own instances, possibly on someone’s credit card. They have to pay. Charge-back is built into the cloud.

The poorly executed project that needed a mountain of infrastructure to run will appear against that project and business unit as a nice dollar amount.
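Charge-back by tag can be sketched in a few lines. The records and the `BusinessUnit` tag key below are hypothetical; in practice the line items might come from an AWS billing export with cost allocation tags enabled. Note the untagged line: costs nobody owns are where the "surprise" bills tend to hide.

```python
# Sketch of a charge-back report: sum billing line items by a business-unit tag.
# The records and the "BusinessUnit" tag key are hypothetical examples.
from collections import defaultdict

line_items = [
    {"service": "EC2", "cost": 1240.50, "tags": {"BusinessUnit": "Marketing"}},
    {"service": "S3", "cost": 85.20, "tags": {"BusinessUnit": "Marketing"}},
    {"service": "EC2", "cost": 3310.00, "tags": {"BusinessUnit": "Finance"}},
    {"service": "RDS", "cost": 960.75, "tags": {}},  # untagged: nobody owns it yet
]


def chargeback(items, tag_key="BusinessUnit"):
    """Total cost per tag value; items missing the tag are grouped as UNTAGGED."""
    totals = defaultdict(float)
    for item in items:
        owner = item["tags"].get(tag_key, "UNTAGGED")
        totals[owner] += item["cost"]
    return dict(totals)


for unit, total in sorted(chargeback(line_items).items()):
    print(f"{unit:<10} ${total:,.2f}")
```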

When there is a live problem with performance, people under pressure will “throw hardware at the problem” and spin up more EC2 instances. The bill will arrive and it will be a “surprise”.

If no management process is put in place, business units will spin up cloud instances and never shut them down.

And if business units decide to shift these costs out of their own budgets and into IT, where the costs have usually resided, IT will have to measure and record them.

Spending accountability processes that have been built up over time will also need adjustment to deal with this nimble nature of cloud spending.

IT and business units will have to go through the exercise of turning off IT components not in use. This can take the form of shutting down servers at night, or turning off DR services when not in an actual disaster or test.
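A quick way to size the night-time saving is to count instance-hours. This sketch assumes compute is billed per running hour and that stopped (EBS-backed) instances incur no instance charge, although their attached EBS volumes are still billed.

```python
# Fraction of instance-hours avoided by running on a schedule instead of 24x7.
# ASSUMPTION: compute is billed per running hour; EBS storage is not modelled.

HOURS_PER_WEEK = 24 * 7  # 168 hours


def schedule_savings(hours_per_day: float, days_per_week: int) -> float:
    """Share of compute hours saved compared with running around the clock."""
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


print(f"12 h x 5 days: {schedule_savings(12, 5):.0%} fewer instance-hours")
print(f"8 h x 5 days (dev/test): {schedule_savings(8, 5):.0%} fewer instance-hours")
```

Running only 12 hours on weekdays avoids about 64% of the instance-hours of an always-on server.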

How do you compare this resultant AWS cost against internally hosted services? How do you keep track of it when both internal IT and cloud costs continually drop each year?
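A deliberately simple way to model this is to let each option's annual cost fall at its own rate and compare the totals. Every figure below is invented for illustration, and the model ignores up-front capital costs, depreciation and the time value of money.

```python
# Hypothetical multi-year comparison where both options get cheaper each year.
# All figures are invented placeholders; plug in your own quotes and estimates.

def cumulative_cost(first_year: float, annual_decline: float, years: int) -> float:
    """Total cost over `years` if the annual cost falls by `annual_decline` each year."""
    return sum(first_year * (1 - annual_decline) ** year for year in range(years))


in_house = cumulative_cost(first_year=100_000, annual_decline=0.05, years=3)
cloud = cumulative_cost(first_year=110_000, annual_decline=0.15, years=3)
print(f"In-house over 3 years: ${in_house:,.0f}")
print(f"Cloud over 3 years:    ${cloud:,.0f}")
```

With these made-up inputs the cloud option starts dearer but ends up slightly cheaper over three years; change the decline rates and the answer flips, so state those assumptions explicitly in the business case.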

In some ways today’s “problem” will become tomorrow’s benefit as the cloud forces more cost transparency and accountability on the business. In the meantime, it’s going to be a headache.

I’m really keen to understand, and for people to share, how they measure AWS (or similar) costs against an internally hosted platform. When you’re at the business case development stage, how do you model the costs for both options and make a decision? Please share.

11 problems with AWS – Part 3

The topic of part 3 in this blog series is vendor lock-in.

Roosevelt Elk, Redwood National Park (stevedunleavy via Flickr)

One of the promises of cloud computing is the ability to move computing in and out of the cloud as business demand requires.

That is, when you need some computing power you dial up some cycles and use it, and when you are finished you return those cycles.

Cloud computing should make it easy to mix and match between AWS and on-premises infrastructure and to let data flow between them as needed. That is, when you move to AWS, you should be increasing your choices, not decreasing them.

But if you decided you’d had enough of AWS and it was time to give Rackspace a go, would it be easy to move on?

The migration process would be to build a replica infrastructure at your new provider, install the applications, test it and then migrate your data. (In this era of big data, this could be a big problem! If you put a ton of data up in the cloud, the time and expense of migrating it out can be prohibitive.)
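It is worth doing the sums on data migration before you upload anything. This sketch estimates how long a network copy would take; the 70% link efficiency is an assumption, and data transfer charges, which add to the expense, are not modelled. Physically shipping disks (for example via Amazon Import/Export) is the usual workaround for large volumes.

```python
# Estimate how long it takes to copy data out over a network link.
# ASSUMPTIONS: decimal terabytes, and only a fraction (`efficiency`) of the
# link's nominal bandwidth is sustained. Transfer charges are not modelled.

SECONDS_PER_DAY = 86_400


def transfer_days(terabytes: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Days needed to move `terabytes` over a `link_mbps` megabit-per-second link."""
    bits = terabytes * 1e12 * 8
    effective_bits_per_second = link_mbps * 1e6 * efficiency
    return bits / effective_bits_per_second / SECONDS_PER_DAY


for tb in (1, 10, 100):
    print(f"{tb:>3} TB over 100 Mbit/s: {transfer_days(tb, 100):6.1f} days")
```

At an assumed 70% of a 100 Mbit/s link, 100 TB takes more than four months to move.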

If you use higher-value services such as Amazon’s Simple Workflow Service or DynamoDB, it gets more difficult to migrate, as you have to find an equivalent service. The more your organisation uses unique AWS services, the harder it is to move on. For example, if you choose AWS Elastic Load Balancing instead of an open load balancing option such as HAProxy, or a commercial offering like F5 or Citrix NetScaler, that part of your application cannot “burst” between clouds without re-architecting or maintaining multiple infrastructure architectures.

The conundrum is, as it has always been with cool new tech trends, that many high-value services increase your productivity at the same time as they increase your lock-in.

Some points you should consider regarding vendor lock-in for cloud providers in general are:

  • Has your provider pledged support to emerging industry standards, such as the Cloud Data Management Interface (CDMI)? As far as I could tell Amazon hasn’t signed on yet.
  • Have they provided suitable data migration tools? (Eg. Amazon Import/Export)
  • Have you read the fine print in the provider’s policies regarding data management? http://aws.amazon.com/serviceterms/
  • Does the provider have open and well-supported APIs?
  • Does your provider support the same types of heterogeneous environments you do in-house? Amazon do in many respects, but also provide their own unique services.
  • Is their support for orchestration and templates suitable? The last thing you want is to have to redesign thousands of workflows or templates. Standards such as TOSCA are emerging.
  • Can your provider’s platform be customised to suit your business? Integration of existing user identity and approval processes is a good example.
  • Can web services hosted in AWS be integrated with your business? How will this work?

What have I missed? Is this a problem for your business? Is lock-in a showstopper for you?

11 problems with AWS – Part 2

Today’s blog focuses on the problem of latency when hosting applications or services within a cloud provider’s network like AWS.

Have you ever noticed the awkward delays on international phone calls? It’s mostly network latency. This is the time it takes for your voice to travel through the air to the phone, be converted to an electrical signal, transmitted to the other phone, converted to sound and transmitted through the air to the other person’s ear.

[stextbox id=”info”]
As a curious aside, there is a “natural” latency in face-to-face communications. When you talk to someone two metres away it takes about 6 milliseconds for the sound to arrive. A local phone call can actually have less latency than face-to-face communications because the sound travels such a short distance through the air to the handset and then mostly travels as an electrical signal at close to the speed of light.
[/stextbox]
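The aside's face-to-face figure checks out. A quick sketch, assuming sound travels at roughly 343 m/s in air at room temperature:

```python
# Acoustic delay for a given distance through air.
# ASSUMPTION: speed of sound of about 343 m/s (air at roughly 20 degrees C).

SPEED_OF_SOUND_M_PER_S = 343.0


def acoustic_delay_ms(distance_m: float) -> float:
    """Milliseconds for sound to cover `distance_m` metres of air."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000


print(f"2 m face to face: {acoustic_delay_ms(2):.1f} ms")
print(f"5 cm to a handset: {acoustic_delay_ms(0.05):.2f} ms")
```

Two metres of air costs a little under 6 ms, while the few centimetres to a handset cost a fraction of a millisecond.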

Network latency in computing is primarily caused by the limits of the speed of light: packets can only travel so fast. Other network effects, such as the number of routers and network device utilisation, also contribute.

As distributed applications evolve and essentially become an amalgamation of geographically dispersed services, the limits imposed by network latency will become more apparent.

One can imagine in the not-too-distant future an IT manager yelling at a subordinate to “throw bandwidth” or “new EC2 instances” at a poorly performing application and not understanding the real latency problem.

[stextbox id=”info” caption=”Latency and Bandwidth”]
If we imagine a jetliner making a trip from Singapore to Los Angeles, latency is analogous to the time the flight takes. Once a network path is established, latency is essentially a fixed constraint. Bandwidth can be thought of as the number of passengers on board. If you add “passengers”, more “packets” arrive per jetliner, but the flight still takes the same amount of time.

Latency is typically measured as round-trip latency: the time a packet takes to go from source to destination and back again. Round-trip latency excludes the time a destination system spends processing a packet. Typically you need to compare network latency and application response times to work out whether the problem is your network or your application.
[/stextbox]
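The jetliner analogy can be put into numbers. In this simple model, total fetch time is one round trip plus the time to push the payload through the link (handshakes, TCP slow start and server time are ignored), so more bandwidth shrinks only the second term. The 14,000 km path length and the light-in-fibre speed of about 200,000 km/s are rough assumptions, and real routes are longer than the straight-line distance.

```python
# Jetliner analogy in numbers: latency is the flight time, bandwidth is the
# number of passengers. Simplified model; ignores handshakes, TCP slow start
# and server processing time.

FIBRE_KM_PER_MS = 200.0  # light in optical fibre travels roughly 200,000 km/s


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time imposed by the speed of light in fibre."""
    return 2 * distance_km / FIBRE_KM_PER_MS


def fetch_time_ms(payload_kb: float, rtt_ms: float, bandwidth_mbps: float) -> float:
    """One round trip plus serialisation time (1 Mbit/s moves 1 kilobit per ms)."""
    serialisation_ms = payload_kb * 8 / bandwidth_mbps
    return rtt_ms + serialisation_ms


rtt = min_round_trip_ms(14_000)  # assumed Singapore to Los Angeles path length
for mbps in (10, 100, 1000):
    print(f"100 kB at {mbps:>4} Mbit/s: {fetch_time_ms(100, rtt, mbps):.1f} ms")
```

Going from 10 to 1,000 Mbit/s cuts the example from about 220 ms to about 141 ms, but nothing gets below the 140 ms round trip: adding passengers doesn't make the plane fly faster.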

One way around latency is to make the network path a shorter distance. Place latency-sensitive parts of a system closer together. Consider also which network providers you use as they will have different network architectures and therefore different latencies to different locations. And also look at caching strategies (eg. CDNs, web optimisation) to essentially pre-load data at a location. Finally look at reducing the network-chattiness of your applications if possible.
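The chattiness point is easiest to see with a little arithmetic. In this sketch each sequential request pays a full round trip, while a batched call pays it once; the 2 ms server time and the round-trip figures are illustrative assumptions.

```python
# Why chatty applications suffer over distance: each sequential request pays
# a full round trip, while one batched request pays it once.

def sequential_ms(requests: int, rtt_ms: float, server_ms: float = 2.0) -> float:
    """Total time when each request waits for the previous response."""
    return requests * (rtt_ms + server_ms)


def batched_ms(requests: int, rtt_ms: float, server_ms: float = 2.0) -> float:
    """Total time when all requests travel together in a single round trip."""
    return rtt_ms + requests * server_ms


for label, rtt in (("same data centre", 1.0), ("across the Pacific", 170.0)):
    print(f"{label}: 50 sequential calls {sequential_ms(50, rtt):,.0f} ms, "
          f"one batched call {batched_ms(50, rtt):,.0f} ms")
```

Inside one data centre the difference barely matters, but across an ocean fifty sequential calls take several seconds while a single batched call stays well under half a second.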

Latency-sensitive components of your system need to be considered up-front in the planning phase. Once you’ve built your distributed application and locked in your provider agreements with AWS and other cloud providers, your latency constraints are locked in. And once set-up, latency also needs to be constantly monitored for change and its impact on your environment.
