Ask HN: Would you load balance traffic without a CDN?
I have an average traffic of 5 Gbps to the service. Currently I'm using managed Kubernetes in GCE and paying $20,000 a month for traffic. I'm planning to move to another provider, but the question is how to handle load balancing. My idea is to create several DNS A records and let the client choose a random IP, but this is incompatible with Cloudflare, which uses round-robin under the hood; after testing, it seems that Cloudflare sends all requests to the first record. The second option is to create several subdomains and choose randomly among them. Managed load balancers would all cost a ton, so that's not the way. Is there a better solution? Ideally I'd like to have one IP without any client-side load balancing.

5 Gbps could be a single box, if you're OK with everything that entails. Otherwise, if you're serving all the traffic from a single IP, you need to do some sort of load balancing. HAProxy + CARP + cold failover is operationally simple, but you lose sessions when your HAProxy box needs maintenance. ECMP works if your hosting allows for it; you'll still lose sessions during changes, though maybe pfsync could help with that. Or do something with proxygen or some other load balancer that came out of Facebook. (I worked there, but not on their load balancers.)

More details on what you're planning to do with 5 Gbps would help you get better advice. I'm assuming HTTPS, because Cloudflare. Is it mostly static content, mostly dynamic, or mostly proxying? Does it burn a lot of CPU (so you need many boxes anyway), or not? Are you likely to attract DDoS attacks, so that you need extra inbound bandwidth to accept and drop abuse? Do you expect to provide users with an SLA, and what service level do you want to provide? Etc.
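To make the ECMP option concrete: the idea is a single public service IP, with each flow hashed to one of several boxes by a multipath route. A minimal Linux sketch, assuming placeholder addresses (203.0.113.10 as the service IP, 10.0.0.x as the servers); in practice the multipath route usually lives on an upstream router or is driven by BGP announcements from each server, which is why it depends on what your hosting allows:

    # hash each flow towards the service IP across two next hops
    ip route add 203.0.113.10/32 \
        nexthop via 10.0.0.11 weight 1 \
        nexthop via 10.0.0.12 weight 1

    # hash on the 5-tuple so an established TCP session sticks to one backend
    sysctl -w net.ipv4.fib_multipath_hash_policy=1

Adding or removing a next hop rehashes some flows, which is the "you'll still lose sessions during changes" caveat.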
Thank you for the answer! It's currently operating at 5 Gbps, but it's expected to scale significantly based on traffic trends. It appears to be more cost-effective to rent multiple 5 Gbps boxes than a single 20 Gbps box. The protocol used is protobuf over HTTPS (not gRPC), without channels, and it is stateless. The application proxies requests to underlying services and can handle 10 Gbps of requests on 8 cores. While the likelihood of a DDoS attack seems low, it is not entirely immune. There is no strict SLA, but the service aims for 99.5% availability.

Your 99.5% target simplifies failover. As long as you don't do it very often, you can just drop sessions; it's not ideal, but it's within your target. Just schedule maintenance carefully (and take the failures when they happen).

If you can, the least-equipment option would be to run ECMP from your application servers; no need for a load balancer at all. If you can serve 10 Gbps on 8 cores, a 16-core server (which is a small server) might do all you need; if so, you could do CARP/VRRP/IP takeover and have everything run through one box at a time.

Or, HAProxy will work fine in front. If you want to make it easier for HAProxy, don't terminate TLS there; run it in tcp mode and let the application servers do TLS, though then HAProxy can't split requests. If that works for you, you could do even better with something that can do Direct Server Return, so your load balancer doesn't need to process outbound traffic, but I don't have experience running that kind of load balancer (only being behind one), so no software recommendations.

For HAProxy, especially tcp-mode HAProxy, you want a power-of-two number of CPU cores that's no more than the number of NIC queues, with hyper-threading disabled. If you do it right, all the traffic stays on the same CPU, although it's tricky to align the backend connections.
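A minimal sketch of that tcp-mode, TLS-passthrough layout (server names and addresses are placeholders):

    # haproxy.cfg sketch: TLS passes through untouched and is
    # terminated on the application servers
    defaults
        mode tcp
        timeout connect 5s
        timeout client  60s
        timeout server  60s

    frontend https_in
        bind :443
        default_backend app_servers

    backend app_servers
        balance roundrobin
        server app1 10.0.0.11:443 check
        server app2 10.0.0.12:443 check

Since HAProxy only sees opaque TCP here, it can't inspect or split individual requests, which is the trade-off mentioned above.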
Currently, how are your clients set up? What are their www and root records pointed to? For load balancing, all you need to do is CNAME your customers to your firewall/load balancer, so you aren't using A records for this. For example, in Azure, if you spin up a Traffic Manager, you get a CNAME like "mytrafficmanager.trafficmanager.com", and the CNAME for www.mysite.com would point to it. However, in this case you'd also want your customers to point to something like customer.mysite.com, so that if you move from GCP/Azure to something else, you control that record and can migrate them during a failover, an incident, or for any other reason.

Edit: And have customer.mysite.com point to "mytrafficmanager.trafficmanager.com".

Any cloud-based load balancer will cost a ton of money, because I have 0.5 PB of traffic per month and it's only increasing.

True, spinning up nginx and setting it up yourself is the cheapest option I've come across, with the best enterprise support. It's also available to deploy in Azure instead of using their tools. However, some people like being cloud-native.

For something with any expectation of fairly high uptime, I think the more important aspect here is failover (whether due to actual failures or to maintenance), not load balancing specifically, although at this level of traffic it's definitely helpful to be able to spread the load over numerous internal resources rather than relying purely on vertical scaling of the backends. If you're able and willing to manage the LB systems yourself, I'd generally put two balancers (running HAProxy) in front of the application servers and do IP failover between them.

If your service uses internal IP connectivity to, e.g., a database or Redis cluster or what have you, I'd generally configure each balancer as the primary for either public or private traffic and as a backup for the other. If your service doesn't use a clustered database or anything like it, you can obviously omit the config that balances private traffic, and it's possible this aspect is already "taken care of" for you by k8s.

The exact method of IP failover depends on who is hosting your machines and what their network is configured to allow (e.g., Linode previously supported VRRP, but now uses BGP instead). AFAIK none of the "standard" IP failover techniques will work in GCP etc.; whether that's due to legitimate technical limitations or business/profit-driven decisions is left as an exercise for the reader.
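As a concrete example of the VRRP flavor of IP takeover between two balancers, a minimal keepalived sketch; the interface name, router ID, and floating IP are placeholders, and the second box runs the same config with state BACKUP and a lower priority:

    # /etc/keepalived/keepalived.conf on the primary balancer
    vrrp_instance public_vip {
        state MASTER             # BACKUP on the second balancer
        interface eth0           # placeholder interface name
        virtual_router_id 51
        priority 150             # e.g. 100 on the backup
        advert_int 1
        virtual_ipaddress {
            198.51.100.10/24     # the floating service IP
        }
    }

When the primary stops advertising (failure or maintenance), the backup claims the floating IP within a few advert intervals; in-flight sessions drop, which a 99.5% target can tolerate. As noted, this only works where the provider's network honors VRRP and gratuitous ARP.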
Also consider that if you're using managed k8s, the provider of that service almost certainly has a managed load balancer service. I'm not sure what you mean by "all managed load balancers will cost a ton"; Linode's balancer, as an example, has a $10 monthly flat fee.

You mention "5 Gbps to the service": does that mean the data is mostly inbound, or did you mean requests result in outbound traffic averaging 5 Gbps? It's not a coincidence that GCP's load balancer will charge you, by my calculation of their pricing, $18 per hour at a sustained 5 Gbps; that's the entire business model of the "big cloud" providers. Have you looked at any providers besides GCP/AWS/Azure for hosting this service? Some guesses about your setup from your post and reply, plus some quick calculations, suggest this would be orders of magnitude cheaper to host on Linode or a similar "not AWS-alike" provider.

Thank you for the answer! Yep, I'm moving to OVH, because there you pay for the channel, not for consumed traffic. OVH has a load balancer, but for 0.5 PB per month it would cost 15K. The data is mostly outbound. HAProxy will help as long as the channel isn't saturated.

EDIT: I've checked Linode's pricing; their LB has a flat fee because it includes only 2 TB of traffic :)

Sounds like you need to treat egress cost as a high-priority factor when comparing providers: by my calculation, 500,000 GB on Linode would be about $2.5K, even ignoring any included transfer allowance.

Like I said: Linode will charge you about $2.5K for half a petabyte of data.

Check Hetzner instead of OVH.
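For reference, the per-GB rates implied by the figures in this thread (exact tiers vary by provider and region):

    500,000 GB x $0.04/GB   ~= $20,000/month  (rate implied by the GCP traffic bill)
    500,000 GB x $0.005/GB  ~=  $2,500/month  (rate implied by the Linode estimate)
    5 Gbps sustained ~= 2,250 GB/hour; at ~$0.008/GB of load balancer
    data processing, that's the ~$18/hour GCP figure quoted above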