Ask HN: Most cost-effective method of achieving physical redundancy

jwilliams · on Oct 24, 2008

Interesting question :)

1. The most cost effective in terms of immediate capital outlay is active-passive - which is pretty much #1 that you described. This is generally also less effort... and is generally easy to test.

2. The most cost effective long term strategy is active-active as you get ongoing use of both sites. Particularly if the client is happy for reduced performance in a DR scenario... Even if this isn't the case, at least you get a performance boost from the DR site.

This is usually more effort as you need things like replication to work/perform... And you also need to test a whole new range of failure scenarios.

This approach does have some other advantages though. For example, you can do rolling deploys very easily (upgrade DR first, bring down Prod, upgrade Prod).

3. A common hybrid is to have two active sites on the Tomcat/Apache end and active-passive for the MySQL. Depending on the DB load this can be a best of both worlds scenario.

4. Some other solutions use a coherent cache solution - e.g. Tangosol, which works with Hibernate/MySQL. As long as the latency between the two sites is low, then this should work. Tangosol is a non-free product, now owned by Oracle afaik, but I don't think it's prohibitively expensive... I've seen this used a lot, and it's probably a very simple and elegant solution, but I personally don't like adding another moving part in. A lot of people I've met swear by it though, particularly Hibernate users.

dabeeeenster · on Oct 24, 2008

This response is exactly why I posted on HN! Love this site.

In terms of active-active, I guess I need two physically disparate load balancers that heart beat with each other? I'm not really a network engineer, and I don't understand what happens to the packets when one data center goes down, in terms of routing around it.

Thanks for your time.

jwilliams · on Oct 24, 2008

> In terms of active-active, I guess I need two physically disparate load balancers that heart beat with each other?

Yeah - as brk indicates in his post, you can use something like haproxy to do this. Afaik, you can get Apache to do this too...

One advantage might be that Apache can be configured to do "sticky sessions" - which you'll need if you're not sync'ing sessions between sites (which in this scenario doesn't seem necessary). Never used haproxy, but it might do this too.
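A rough Python illustration of the "sticky sessions" idea above: the same client always lands on the same site, so sessions don't need to be synced between sites. This is not how Apache or haproxy implement it (real balancers usually track this with a cookie or a similar marker). The sketch swaps in a simple hash of a client key, and the `sticky_backend` helper and site names are made up for the example.

```python
import hashlib
from typing import Sequence


def sticky_backend(client_key: str, backends: Sequence[str]) -> str:
    """Map a client key (e.g. a session id) to one backend, deterministically."""
    if not backends:
        raise ValueError("need at least one backend")
    # Hash the key so clients spread roughly evenly across backends.
    digest = hashlib.sha256(client_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# Hypothetical site names, purely for illustration.
sites = ["site-a", "site-b"]
first = sticky_backend("session-1234", sites)
second = sticky_backend("session-1234", sites)
print(first == second)  # the same key always maps to the same site
```

One caveat with a plain hash like this: if a site drops out of the list, some clients get remapped and lose their session, which is the trade-off you accept by not sync'ing sessions.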

In terms of your active-active setup, you can have two "legs" that are load balanced at the front, or you can have load balancing between each tier. If you're (a) not very experienced or (b) you don't have lots of bandwidth between sites, I'd go with the former and just have load balancing at the front end.

bprater · on Oct 24, 2008

Agreed, this is the type of information that is tough to find elsewhere, since it tends to be somewhat specialized.

So for everyone posting -- thanks!

jwilliams · on Oct 24, 2008

As an aside.. I've been involved in doing the sums on DR-sites in the past and it quite often doesn't add up. So I'd recommend doing a cost-benefit analysis with the client.

You've also got to look at all the intermediate threats. For example, a DR site may be on the same network as the Prod site... Or the client site might be on a single network. So you can spend a lot of money eliminating one single point of failure when there are dozens of others around (network failures are much more common than a site getting obliterated).

dabeeeenster · on Oct 24, 2008

The problem is that my client's client (we are a subcontractor in this instance) is demanding it. I've explained that, in real terms, they will get little benefit from the large capital expenditure, but it's become a contractual requirement.

My current hosting provider are excellent, and we have a > 99.99% uptime with them on this application according to pingdom (and this includes scheduled downtime!) but you know how clients are sometimes!
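For a sense of scale, here is the arithmetic behind an uptime figure like 99.99%, as a small Python sketch. The function name is made up for the example, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Minutes of downtime per period that still meet the given availability."""
    return period_minutes * (1 - availability_pct / 100)


for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% -> {allowed_downtime_minutes(pct):.1f} min/year")
```

At 99.99% that works out to roughly 52.6 minutes of downtime a year, scheduled maintenance included.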

brk · on Oct 24, 2008

Get your own ASN ;)

You are on the right path with your approach #1. Set up a failover server someplace and keep it sync'd at an interval that is the right balance between too-often and too-stale.

You could opt to get a small dedicated machine from a larger-scale true multi-homed hosting company. Run haproxy on this "small" machine, which serves primarily to redirect traffic to whichever server is currently deemed to be the "active". Basically this is a low-end DIY Akamai kind of solution. Your cost would be somewhere between #1 and #2. You would eliminate the DNS-lag and the issues you can't control (other servers and clients caching stale DNS data that you can't refresh), and you would get most of the benefits of an availability service like Akamai, without the $15K USD monthly price tag.

There are other benefits to the haproxy solution as well: you can take either machine offline for maintenance with zero (theoretical ;) ) downtime, and if you ever get a traffic surge you can load balance between the two sites (which isn't a bad idea to do all the time if your syncing is up-to-date enough: send every 10th connection to the failover site, just to make sure it's always working as expected).
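The "every 10th connection" idea can be sketched in a few lines of Python. In haproxy itself this kind of split would normally be expressed with server weights rather than hand-written code, so treat this purely as an illustration of the routing logic. The class and site names are hypothetical.

```python
from itertools import count


class NthConnectionRouter:
    """Send every Nth connection to the failover site, the rest to primary."""

    def __init__(self, primary: str, failover: str, every: int = 10):
        if every < 1:
            raise ValueError("every must be >= 1")
        self.primary = primary
        self.failover = failover
        self.every = every
        self._counter = count(1)  # connection numbers start at 1

    def pick(self) -> str:
        n = next(self._counter)
        # Connections 10, 20, 30, ... exercise the failover site.
        return self.failover if n % self.every == 0 else self.primary


router = NthConnectionRouter("prod", "dr", every=10)
picks = [router.pick() for _ in range(20)]
print(picks.count("dr"))  # 2 of the first 20 connections go to the failover site
```

The point of the trickle is less about load and more about continuously proving that the failover site still works.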

jwilliams · on Oct 24, 2008

You still have a single point of failure with the haproxy though. i.e. if the Data Centre with the proxy is taken out (same probability as the prod site), then the service would stop.

Even if you install multiple haproxies you're either back to the DNS scenario, or a more expensive solution as you were suggesting.

If you are with a large-scale multihomed hosting company, then you might be able to manage with an IP address takeover between sites (i.e. the passive haproxy takes over the IP of the master in the event of a heartbeat failure). I've not seen this as an offering for the most part though (at smaller outfits anyway).
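A minimal Python sketch of the heartbeat side of that takeover, assuming the passive node simply tracks when it last heard from the master. The actual IP claim depends entirely on the hosting provider and network, so it is left as a caller-supplied function. All names here are hypothetical.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Tracks the master's heartbeats from the passive node's point of view."""

    def __init__(self, timeout_s: float = 5.0,
                 now: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self._now = now
        self._last_seen = now()

    def beat(self) -> None:
        """Call whenever a heartbeat from the master arrives."""
        self._last_seen = self._now()

    def master_is_dead(self) -> bool:
        return self._now() - self._last_seen > self.timeout_s


def maybe_take_over(monitor: HeartbeatMonitor,
                    claim_ip: Callable[[], None]) -> bool:
    """Claim the shared IP if the master has missed its heartbeat window."""
    if monitor.master_is_dead():
        claim_ip()  # provider-specific; an assumption, not a real API
        return True
    return False
```

A real deployment would also need to guard against both nodes claiming the IP at once when only the link between them has failed, which this sketch does not attempt.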

brk · on Oct 24, 2008

> You still have a single point of failure with the haproxy though. i.e. if the Data Centre with the proxy is taken out (same probability as the prod site), then the service would stop.

That's why I said a true multi-homed datacenter. Reduces the chances of a single-point failure taking everything out. It is going to be somewhere in cost and resiliency between relying on DNS and "doing it right".

In reality (and in my own experience), data-center outages due to various "natural/random disasters" are far less common than outages due to just basic employee stupidity. While I was building up my org/website, I would spend more time worrying about resiliency against stupidity vs. plane crashes.

jwilliams · on Oct 24, 2008

Well I guess my logic is that if you have each site either multi-homed, or at least on different networks (a given anyway), then you can have a LB at each site with the same net result - this is basically your vanilla active-active scenario.

In either case, unless you go for a high-end solution, the catastrophic-scenario DNS juggling doesn't really go away.

As for the relative risk, I agree. The problem is that the risk approaches companies use tend to make this kind of thing compulsory for them.

The raw calcs I've seen rarely justify this kind of expense when you weigh the cost-benefit against, say, a 1 day turnaround to configure a new site (two sites is usually a lot more than double the cost, as there are all the other factors, e.g. the additional/routine testing that's introduced).

... but then companies will factor in less tangible risks, like reputational damage, etc, which can be a bit of black magic... Of course, if you're a bank, or a hospital this might be right, but I suspect in most cases it's off-base.

dabeeeenster · on Oct 24, 2008

What happens if the 'small' server fails?

brk · on Oct 24, 2008

Depends on how risk-averse you are. You can generally get an SLA that would allow you to have a hot backup of that server, or you take the risk that it won't fail until you can afford the "better" solution.

Doing "high availability" coupled with "low budget" is ALWAYS going to involve some compromises.

If you keep your DNS TTLs low, the worst-case scenario if the small server fails is that you're doing exactly what you would have done anyway: making manual or automated DNS changes.

mcargian · on Oct 24, 2008

When I investigated a low cost solution like this before, the issue was big companies like AOL not honoring TTLs of less than 24 hours. Is this still the case?

jwilliams · on Oct 24, 2008

If the client is keen to keep costs down... and is amenable to it... then you could give him an alternative URL to use in the case of a catastrophic failure (i.e. your plane scenario).

Then it's up to you if you use two haproxies for active-active or just maintain two active-passive sites.

lsc · on Oct 24, 2008

#1 is the cheap way. As someone else said, you can use BGP instead of DNS for failover to reduce downtime, but then it's no longer cheap.

Personally, this is my preferred solution. Keep as much as possible in MySQL cluster and have the rest built on a central dev server and pushed out when there are changes.

A SAN is not cheap, and it's pretty easy to screw up the whole SAN. It's still a single point of failure. (I ran prgmr.com on a SAN for the first few years, and I switched away from SAN not because of cost, but because of reliability. With a SAN, it's really easy for the new guy to accidentally trash everything.)

So yeah, I'd do #1. If you can use an active-active VPS, that's best. If you want to save money, run a smaller active-active site with EC2 images standing by (remember to test weekly). The problem with 'cold' backups is that they are usually broken. Active-active means they are up all the time.

smoody · on Oct 24, 2008

A quick question: When replicating MySQL across data centers, is there an easy or perhaps built-in way to encrypt the replication data streams to prevent prying eyes from potentially intercepting the data? If not, might that be a problem if one is replicating credit card numbers, etc.?

nickh · on Oct 24, 2008

In the past, I've used stunnel for encrypting server-to-server communications. It's very easy to set up.

jbyers · on Oct 25, 2008

SSL is built into MySQL replication. If it's enabled on both ends, you can switch it on and get end-to-end encryption.

http://dev.mysql.com/doc/refman/5.1/en/replication-solutions...

vaksel · on Oct 24, 2008

Why not just auto-copy everything to AWS and run uptime checks against your host? Then if a check fails, auto-forward the URL to the AWS location.

This way, if it hits the fan, the only problem users will notice is the 5 mins of downtime between checks, and then a slower site after the switch to AWS.

But then again I don't really know hardware, so I could be wrong. From what I understand, though, it should be possible.
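A minimal Python sketch of that check-and-forward loop, under some loud assumptions: the URLs are placeholders, the probe is a plain HTTP GET, and the actual "forwarding" (DNS change, proxy switch, whatever the setup uses) is out of scope. It only decides which target should be live.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers connection failures, timeouts, and HTTP error statuses.
        return False


def pick_target(probe: Callable[[str], bool],
                primary_url: str, backup_url: str) -> str:
    """Serve from the primary while it passes the check, else the backup."""
    return primary_url if probe(primary_url) else backup_url


def worst_case_detection_s(check_interval_s: float,
                           probe_timeout_s: float) -> float:
    """Longest time a failure can go unnoticed: it happens just after a
    successful check, and the next probe takes its full timeout to fail."""
    return check_interval_s + probe_timeout_s


# Hypothetical usage; both URLs are placeholders:
# target = pick_target(http_probe, "http://primary.example.com/",
#                      "http://backup.example.com/")
print(worst_case_detection_s(300, 5))  # 5-minute checks, 5-second timeout
```

With 5-minute checks the outage can last a little over 5 minutes before the switch even starts, plus however long the forwarding itself takes to reach users.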