[150522] in North American Network Operators' Group


Re: Reliable Cloud host ?

daemon@ATHENA.MIT.EDU (Kevin Day)
Sun Feb 26 18:27:36 2012

From: Kevin Day <toasty@dragondata.com>
In-Reply-To: <1fd3e972-a2e2-4ecc-b2d1-03fe47a828ff@zimbra.network1.net>
Date: Sun, 26 Feb 2012 17:26:31 -0600
To: Randy Carpenter <rcarpen@network1.net>
Cc: Nanog <nanog@nanog.org>
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org


On Feb 26, 2012, at 4:56 PM, Randy Carpenter wrote:
> We have been using Rackspace Cloud Servers. We just realized that they
> have absolutely no redundancy or failover after experiencing an outage
> that lasted more than 6 hours yesterday. I am appalled that they would
> offer something called "cloud" without having any failover at all.
>
> Basic requirements:
>
> 1. Full redundancy with instant failover to other hypervisor hosts
> upon hardware failure (I thought this was a given!)

This is actually a much harder problem to solve than it sounds, and it
gets progressively harder depending on what you mean by "failover".

At the very least, having two physical hosts capable of running your VM
requires that your VM be stored on some kind of SAN (usually iSCSI
based) storage system. Otherwise, the second host has no way of
accessing your VM's data if the first were to die. This makes things an
order of magnitude (or more) more expensive.

But then all you've really done is move your single point of failure to
the SAN. Small SANs aren't economical, so you end up having tons of
customers on one SAN. If it dies, tons of VMs are suddenly down. So you
now need a redundant SAN capable of live-mirroring everyone's data.
These aren't cheap either, and they add a lot of complexity. (How do you
handle failover if it died mid-write? Who has the most recent data after
a total blackout? And so on.)
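
To make the "who has the most recent data" question concrete, here is a
minimal sketch of one possible rule. It assumes each mirror records a
sequence number for its last fully committed write, plus a flag saying
whether a write was in flight when it went down. The names
(ReplicaState, pick_authoritative) and the rule itself are hypothetical
illustrations, not how any particular SAN product works; real arrays
rely on journals, checksums and vendor-specific resync logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ReplicaState:
    """Hypothetical per-mirror metadata read back after a blackout."""
    name: str
    last_committed_seq: int  # last write the mirror fully acknowledged
    write_in_flight: bool    # True if it went down partway through a write


def pick_authoritative(replicas: List[ReplicaState]) -> Optional[ReplicaState]:
    """Return the mirror to resync from, or None if a human must decide.

    Assumed rule: ignore any mirror that may hold a half-finished write,
    then prefer the highest committed sequence number among the rest.
    """
    clean = [r for r in replicas if not r.write_in_flight]
    if not clean:
        return None  # every copy may be torn; no safe automatic choice
    return max(clean, key=lambda r: r.last_committed_seq)


san_a = ReplicaState("san-a", last_committed_seq=1041, write_in_flight=True)
san_b = ReplicaState("san-b", last_committed_seq=1040, write_in_flight=False)
winner = pick_authoritative([san_a, san_b])
```

In this example the rule picks san-b even though san-a had seen a newer
write, so one acknowledged write is thrown away. That trade-off is
exactly the kind of complexity the mirroring adds.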

And this is really just saying "If hardware fails, I want my VM to
reboot on another host." If you're defining high availability to mean
"even if a physical host fails, I don't want a second of downtime, my VM
can't reboot", you want something like VMware's Fault Tolerance (FT)
feature for ESXi, where your VM is actually running on two hosts at
once, in lock-step with each other, so if one fails the other takes over
transparently. Licenses for this are ridiculously expensive, and it
requires some reasonably complex networking and storage systems.

And I still haven't touched on having to make sure both physical hosts
capable of running your VM are on totally independent switches, power,
etc., that the SAN has multiple interfaces so it's not all going through
one switch, and so on.

I also haven't run into anyone deploying a high-availability/redundant
system who hasn't accidentally ended up in a split-brain scenario
(network isolation causes the backup node to think it's live while the
primary is still running). Carefully synchronizing things to prevent
this is hard and fragile.
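
One common way to guard against split-brain is a majority (quorum)
rule: a node may only run the VM if it can reach more than half of the
cluster's votes, its own included. The sketch below is a minimal
illustration under that assumption. The function name and the idea of a
third "witness" vote are mine for illustration, not a description of
any specific product.

```python
def may_serve(votes_reachable: int, votes_total: int) -> bool:
    """True only if this node sees a strict majority of all votes.

    votes_reachable counts the node's own vote plus every peer or
    witness it can currently talk to.
    """
    return votes_reachable > votes_total // 2


# Two nodes, no witness: after a network partition each side sees only
# itself (1 of 2 votes), so neither has a majority.
two_node_primary = may_serve(votes_reachable=1, votes_total=2)
two_node_backup = may_serve(votes_reachable=1, votes_total=2)

# Two nodes plus a witness: the primary can still reach the witness
# (2 of 3 votes), while the isolated backup sees only itself (1 of 3).
primary_with_witness = may_serve(votes_reachable=2, votes_total=3)
isolated_backup = may_serve(votes_reachable=1, votes_total=3)
```

With only two votes the rule avoids split-brain by taking the VM down on
both sides, which is safe but defeats the point. A third vote fixes
that, but it is one more component that has to live on independent
power and switching.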

I'm not saying you can't have this feature, but it's not typical in
"reasonably priced" cloud services, and it's nearly unheard of for it to
be used automatically. Just moving your virtual machine from local
storage to iSCSI-backed storage drastically increases disk latency and
caps the whole physical host's disk throughput at 1 Gbps (low-priced VM
providers haven't deployed 10GE adapters much yet). Any provider who
automatically provisions a virtual machine this way will get complaints
that their servers are slow, which is true compared to someone selling
VMs that use local storage. The "running your VM on two hosts at once"
system has such a performance penalty, and costs so much in licensing,
that you really need to NEED it for it not to be a ridiculous waste of
resources.
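
To put rough numbers on that 1 Gbps cap, here is a back-of-the-envelope
sketch. It assumes the link is shared evenly among the VMs doing disk
I/O at the same moment and ignores iSCSI/TCP overhead, so real figures
would be somewhat lower. The VM count of 20 is a made-up example.

```python
def shared_link_mb_per_s(link_gbps: float, active_vms: int) -> float:
    """Raw storage bandwidth per VM, in MB/s, on a shared link.

    Uses decimal units: 1 Gbps = 1000 Mbit/s = 125 MB/s.
    """
    link_mb_per_s = link_gbps * 1000 / 8
    return link_mb_per_s / active_vms


whole_host = shared_link_mb_per_s(1.0, active_vms=1)     # 125.0 MB/s
per_vm_1g = shared_link_mb_per_s(1.0, active_vms=20)     # 6.25 MB/s
per_vm_10g = shared_link_mb_per_s(10.0, active_vms=20)   # 62.5 MB/s
```

So the entire host tops out at about 125 MB/s of storage traffic, and
under these assumptions 20 busy VMs would each see only a few MB/s.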

Amazon comes sorta close to this, in that their storage is
mostly-totally separate from the hosts running your code. But they have
had failures knock out access to your storage, so it's still not where I
think you're saying you want to be.

The moral of the story is that just because it's "in the cloud", it
doesn't gain higher reliability unless you're specifically taking steps
to ensure it. Most people solve this by taking things that are already
distributable (like DNS) and setting up multiple DNS servers in
different places - that's where all this "cloud stuff" really shines.
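
The reason this works for something like DNS is that the redundancy
lives in the client: if one server doesn't answer, it just asks the
next one, and no server has to fail over to another. A minimal sketch of
that pattern follows, using stand-in lookup functions instead of real
network calls. All names here (resolve_with_failover, site_a_down,
site_b_up) are hypothetical, and 192.0.2.10 is a documentation-only
address.

```python
from typing import Callable, Dict, Optional

Lookup = Callable[[str], str]


def resolve_with_failover(name: str, servers: Dict[str, Lookup]) -> Optional[str]:
    """Ask each independent server in turn; return the first answer."""
    for _site, lookup in servers.items():
        try:
            return lookup(name)
        except OSError:
            continue  # this site is unreachable, try the next one
    return None  # every site failed


def site_a_down(name: str) -> str:
    # Stand-in for a server whose whole data center is offline.
    raise TimeoutError("no response from site A")


def site_b_up(name: str) -> str:
    # Stand-in for a healthy server at a different provider.
    return "192.0.2.10"


answer = resolve_with_failover(
    "www.example.com", {"site-a": site_a_down, "site-b": site_b_up}
)
```

Nothing in this setup needs shared storage or synchronized state
between the two sites, which is why it is cheap compared to making a
single VM survive a host failure.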


(Please, no stories about how you were able to make a redundant virtual
machine run using 5-year-old servers in your basement. I'm talking about
something that's supportable at provider scale and isn't adding more
single points of failure.)

-- Kevin


