Ask HN: Most cost-effective method of achieving physical redundancy
23 points by dabeeeenster on Oct 24, 2008 | 19 comments
Hi,

We have a client currently being hosted on a single server in a data centre in the UK. They are asking about getting a level of physical redundancy built into the hosting infrastructure. If a plane lands on the data center, they want to be able to continue operating their website. They are not overly concerned about the service dropping out for a minute or two, but they want to avoid the hours spent offline whilst we restored from backups to another data center and redirected the DNS.

I've done some investigation and found two relatively cheap solutions:

1. Get an additional failover server hosted in another data centre. Run some level of rsync and MySQL replication to the failover server. Then use a service like http://dynect.com/ to do a rudimentary form of failover via DNS. This will not work for everyone (mainly people behind badly configured DNS servers that cache records for a long time). The rsync/MySQL replication is also open to potential problems. This seems like one of the cheapest solutions.

2. Use a provider with a SAN-based virtualisation setup with a backup data centre. I know Hostway offer this, but it is not cheap! In the event of a failover we just fire up our VMware instance in the failover data center and pick up on the shared SAN.

Neither of these is ideal: 1 is a bit too flaky for my liking, and 2 is a bit too expensive for my client. Is there a neater solution that I am missing?

The application is Java/Struts2/Spring/Hibernate running Tomcat/Apache/MySQL on Linux.

Thanks!



Interesting question :)

1. The most cost effective in terms of immediate capital outlay is active-passive - which is pretty much #1 that you described. This is generally also less effort... and is generally easy to test.

2. The most cost effective long term strategy is active-active as you get ongoing use of both sites. Particularly if the client is happy for reduced performance in a DR scenario... Even if this isn't the case, at least you get a performance boost from the DR site.

This is usually more effort as you need things like replication to work/perform... And you also need to test a whole new range of failure scenarios.

This approach does have some other advantages though. For example, you can do rolling deploys very easily (upgrade DR first, bring down Prod, upgrade Prod).

3. A common hybrid is to have two active sites on the Tomcat/Apache end and active-passive for the MySQL. Depending on the DB load this can be a best of both worlds scenario.

4. Some other solutions use a coherent cache solution - e.g. Tangosol, which works with Hibernate/MySQL. As long as the latency between the two sites is low, then this should work. Tangosol is a non-free product, now owned by Oracle afaik, but I don't think it's prohibitively expensive... I've seen this used a lot, and it's probably a very simple and elegant solution, but I personally don't like adding another moving part in. A lot of people I've met swear by it though, particularly Hibernate users.


This response is exactly why I posted on HN! Love this site.

In terms of active-active, I guess I need two physically disparate load balancers that heartbeat with each other? I'm not really a network engineer, and I don't understand what happens to the packets when one data center goes down, in terms of routing around it.

Thanks for your time.


> In terms of active-active, I guess I need two physically disparate load balancers that heart beat with each other?

Yeah - as brk indicates in his post, you can use something like haproxy to do this. Afaik, you can get Apache to do this too...

One advantage might be that Apache can be configured to do "sticky sessions" - which you'll need if you're not sync'ing sessions between sites (which in this scenario doesn't seem necessary). Never used haproxy, but it might do this too.
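
For anyone unsure what "sticky" means in practice, here is a minimal Python sketch of the idea, not how Apache or haproxy actually implement it: hash the session ID so the same user always lands on the same site, which is what lets you skip syncing sessions between sites. The site names and the `pick_site` helper are made up for illustration.

```python
import hashlib

# Hypothetical backend names; substitute the two real sites.
SITES = ["site-a", "site-b"]


def pick_site(session_id: str, sites=SITES) -> str:
    """Map a session ID to one site, always the same site for the same ID."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as a stable integer.
    index = int.from_bytes(digest[:8], "big") % len(sites)
    return sites[index]
```

Real load balancers usually pin a user with a cookie rather than a hash, but the effect is the same: one user, one site, for the life of the session.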

In terms of your active-active setup, you can have two "legs" that are load balanced at the front, or you can have load balancing between each tier. If you're (a) not very experienced or (b) you don't have lots of bandwidth between sites, I'd go with the former and just have load balancing at the front end.


Agreed, this is the type of information that is tough to find elsewhere, since it tends to be somewhat specialized.

So for everyone posting -- thanks!


As an aside.. I've been involved in doing the sums on DR-sites in the past and it quite often doesn't add up. So I'd recommend doing a cost-benefit analysis with the client.

You've also got to look at all the intermediate threats. For example, a DR site may be on the same network as the Prod site... Or the client site might be on a single network. So you can spend a lot of money eliminating one single point of failure when there are dozens of others around (network failures are much more common than a site getting obliterated).


The problem is that my client's client (we are a subcontractor in this instance) is demanding it. I've explained that, in real terms, they will get little benefit from the large capital expenditure, but it's become a contractual requirement.

My current hosting provider are excellent, and we have a > 99.99% uptime with them on this application according to pingdom (and this includes scheduled downtime!) but you know how clients are sometimes!


Get your own ASN ;)

You are on the right path with your approach #1. Set up a failover server someplace and keep it sync'd at an interval that is the right balance between too-often and too-stale.

You could opt to get a small dedicated machine from a larger-scale true multi-homed hosting company. Run haproxy on this "small" machine, which serves primarily to redirect traffic to whichever server is currently deemed to be the "active" one. Basically this is a low-end DIY Akamai kind of solution. Your cost would be somewhere between #1 and #2. You would eliminate the DNS lag and the issues you can't control (other servers and clients caching stale DNS data that you can't refresh), and you would get most of the benefits of an availability service like Akamai, without the $15K USD monthly price tag.

There are other benefits to the haproxy solution as well: you can take either machine offline for maintenance with zero (theoretical ;) ) downtime, and if you ever get a traffic surge you can load balance between the two sites (which isn't a bad idea to do all the time if your syncing is up-to-date enough. Send every 10th connection to the failover site, just to make sure it's always working as expected).
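
The "every 10th connection" idea can be sketched in a few lines of Python. This is only an illustration of the routing rule, under the assumption that connections are counted in one place; the class name and site labels are hypothetical, and in haproxy itself this would more likely be expressed with server weights than with code.

```python
class TenthToFailover:
    """Route every Nth connection to the failover site, the rest to the primary."""

    def __init__(self, primary: str, failover: str, every: int = 10):
        if every < 1:
            raise ValueError("every must be at least 1")
        self.primary = primary
        self.failover = failover
        self.every = every
        self._seen = 0  # number of connections routed so far

    def route(self) -> str:
        """Return the site that should receive the next connection."""
        self._seen += 1
        if self._seen % self.every == 0:
            return self.failover
        return self.primary
```

With the default of 10, connections 10, 20, 30 and so on go to the failover site, so a broken failover shows up as a steady trickle of errors rather than as a surprise on the day you need it.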


You still have a single point of failure with the haproxy though. i.e. if the Data Centre with the proxy is taken out (same probability as the prod site), then the service would stop.

Even if you install multiple haproxies you're either back to the DNS scenario, or a more expensive solution as you were suggesting.

If you are with a large-scale multihomed hosting company, then you might be able to manage with an IP address takeover between sites (i.e. the passive haproxy takes over the IP of the master in the event of heartbeat failure). I've not seen this as an offering for the most part though (at smaller outfits anyway).
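
A rough Python sketch of the heartbeat logic on the passive node, assuming the master sends a heartbeat at regular intervals and that the hosting company gives you some way to move the IP. `take_over_ip` is a hypothetical callback standing in for that mechanism, and the timeout value is an assumption.

```python
class HeartbeatMonitor:
    """Runs on the passive node and decides when to take over the shared IP."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = None  # time of the last heartbeat from the master
        self.active = False         # True once this node has taken over

    def record_heartbeat(self, now: float) -> None:
        """Call whenever a heartbeat arrives from the master."""
        self.last_heartbeat = now

    def should_take_over(self, now: float) -> bool:
        # Never take over before the master has been heard from at least once,
        # and never take over twice.
        if self.active or self.last_heartbeat is None:
            return False
        return now - self.last_heartbeat > self.timeout_seconds

    def tick(self, now: float, take_over_ip) -> None:
        """Call periodically; invokes take_over_ip() once the master goes quiet."""
        if self.should_take_over(now):
            take_over_ip()
            self.active = True
```

The sketch deliberately ignores the hard part, which is a network split where both nodes are alive but cannot see each other and both claim the IP.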


> You still have a single point of failure with the haproxy though. i.e. if the Data Centre with the proxy is taken out (same probability as the prod site), then the service would stop.

That's why I said a true multi-homed datacenter. Reduces the chances of a single-point failure taking everything out. It is going to be somewhere in cost and resiliency between relying on DNS and "doing it right".

In all reality (and in my own experience), data-center outages due to various "natural/random disasters" are far less common than outages due to just basic employee stupidity. While I was building up my org/website, I would spend more time worrying about resiliency against stupidity vs. plane crashes.


Well I guess my logic is that if you have each site either multi-homed, or at least on different networks (a given anyway), then you can have a LB at each site with the same net result - this is basically your vanilla active-active scenario.

In either case - Unless you go for a high-end solution, the catastrophic-scenario-DNS-juggling doesn't really go away.

As for the relative risk, I agree. The problem is that the risk approaches companies use tend to make this kind of thing compulsory for them.

All the raw calcs I've seen rarely justify this kind of expense when you weigh the cost-benefit against, say, a 1 day turnaround to configure a new site (two sites usually cost a lot more than double, as there are all the other factors, e.g. the additional/routine testing that's introduced).

... but then companies will factor in less tangible risks, like reputational damage, etc, which can be a bit of black magic... Of course, if you're a bank, or a hospital this might be right, but I suspect in most cases it's off-base.


What happens if the 'small' server fails?


Depends on how risk-averse you are. You can generally get an SLA that would allow you to have a hot backup of that server, or you take the risk that it won't fail until you can afford the "better" solution.

Doing "high availability" coupled with "low budget" is ALWAYS going to involve some compromises.

If you keep your DNS TTLs low, the worst-case scenario if the small server fails is that you're doing exactly what you would have done anyway: making manual or automated DNS changes.
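
To put a number on "low TTL", here is a back-of-the-envelope Python helper. It assumes the DNS change is published the moment the outage is detected and that resolvers honour the TTL; the function name and the example figures are illustrative, not measurements.

```python
def worst_case_dns_failover_seconds(
    check_interval: float, failures_before_switch: int, ttl: float
) -> float:
    """Upper bound on how long a well-behaved client keeps hitting the dead server.

    Detection takes at most check_interval * failures_before_switch seconds,
    and a resolver that cached the old record just before the switch keeps
    it for up to ttl seconds.
    """
    if check_interval < 0 or failures_before_switch < 1 or ttl < 0:
        raise ValueError("inputs must be non-negative, with at least one check")
    detection = check_interval * failures_before_switch
    return detection + ttl
```

For example, checking every 60 seconds, switching after 3 failed checks, with a 300 second TTL gives a bound of 480 seconds. Resolvers that ignore the TTL fall outside this bound entirely.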


When I investigated a low cost solution like this before, the issue was big companies like AOL not honoring TTLs of less than 24 hours. Is this still the case?


If the client is keen to keep costs down... and is amenable to it... then you could give him an alternative URL to use in the case of a catastrophic failure (i.e. your plane scenario).

Then it's up to you if you use two haproxies for active-active or just maintain two active-passive sites.


#1 is the cheap way. As someone else said, you can use BGP instead of DNS for failover to reduce downtime, but then it's no longer cheap.

Personally, this is my preferred solution. Keep as much as possible in MySQL cluster and have the rest built on a central dev server and pushed out when there are changes.

SAN is not cheap, and it's pretty easy to screw up the whole SAN. It's still a single point of failure. (I ran prgmr.com on a SAN for the first few years, and I switched away from SAN not because of cost, but because of reliability. With a SAN, it's really easy for the new guy to accidentally trash everything.)

So yeah, I'd do #1. If you can use an active-active VPS, that's best; if you want to save money, run a smaller active-active site with EC2 images standing by (remember to test weekly). The problem with 'cold' backups is that they are usually broken. Active-active means they are up all the time.


A quick question: When replicating MySQL across data centers, is there an easy or perhaps built-in way to encrypt the replication data streams to prevent prying eyes from potentially intercepting the data? If not, might that be a problem if one is replicating credit card numbers, etc.?


In the past, I've used stunnel for encrypting server-to-server communications. It's very easy to set up.


SSL is built into MySQL replication. If it's enabled on both ends, you can switch it on and get end-to-end encryption.

http://dev.mysql.com/doc/refman/5.1/en/replication-solutions...
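
As a rough idea of what "switching it on" looks like on the replica side, the Python helper below builds a `CHANGE MASTER TO` statement using the `MASTER_SSL` options from the MySQL manual. It only produces the SQL text and does not connect to anything; the helper name and the certificate paths are placeholders, and the master still needs SSL enabled and a replication user allowed to use it.

```python
def build_ssl_replication_sql(ca_path: str, cert_path: str, key_path: str) -> str:
    """Return the SQL that tells a replica to use SSL when talking to its master."""

    def quote(path: str) -> str:
        # Keep the sketch simple: refuse paths that would need SQL escaping.
        if "'" in path or "\\" in path:
            raise ValueError("path contains characters that need escaping")
        return "'" + path + "'"

    options = [
        "MASTER_SSL = 1",
        "MASTER_SSL_CA = " + quote(ca_path),
        "MASTER_SSL_CERT = " + quote(cert_path),
        "MASTER_SSL_KEY = " + quote(key_path),
    ]
    return "CHANGE MASTER TO " + ", ".join(options) + ";"
```

The statement is normally run on the replica with replication stopped, then replication is started again.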


Why not just auto-copy everything to AWS and run uptime checks on your host? Then, if a check fails, auto-forward the URL to the AWS location.

This way, if it hits the fan, the only thing users will notice is up to 5 mins of downtime between the checks, and then a slower site after the switch to AWS.

But then again, I don't really know hardware, so I could be wrong. But from what I understand, it should be possible.
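
The check-then-forward idea could look roughly like this in Python. It is a sketch under assumptions: the URLs are placeholders, a plain HTTP GET counts as the uptime check, and how the URL is actually forwarded (DNS update, redirect, proxy change) is left out because it depends on the provider.

```python
import http.client
import urllib.request


def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts and HTTP error statuses.
        return False


def choose_target(primary_url: str, backup_url: str, probe=is_up) -> str:
    """Pick the primary while it passes the check, otherwise the backup copy."""
    return primary_url if probe(primary_url) else backup_url
```

Run `choose_target` from a scheduler every 5 minutes, ideally from a machine outside both sites, and act on the result. In practice you would probably want a few consecutive failed checks before switching, so that a single dropped request does not send everyone to the slower copy.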



