RE: [GUIDE] Optimise your RPC server for better performance

in #witness-category, 8 years ago

Upvoting for the effort of making things better, but I can't agree with the tips. The name lookup time, while part of the benchmark, is crazy long and irrelevant. The Nginx config is a kind of generic use-case optimization. We are using it for reverse proxies, so sendfile on is irrelevant.
Network-level optimizations are what the devices in front of your RPC node are for.

time_namelookup:    0.004
time_connect:       0.004
time_appconnect:    0.087
time_pretransfer:   0.087
time_redirect:      0.000
time_starttransfer: 0.088
time_total:         0.088

That's my endpoint under load, but queried from a VIP zone. No TCP tweaking on the host. 128 GB RAM with RAID0 NVMe.
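For anyone comparing numbers like the ones above: curl's `time_*` values are cumulative timestamps from the start of the request, so it helps to split them into per-phase durations before deciding where the time goes. A minimal sketch, assuming the output is in the `name: value` layout shown above; the function names `parse_curl_timings` and `phase_durations` are made up for this example.

```python
def parse_curl_timings(text):
    """Parse 'name: value' lines (as printed by curl -w) into a dict of floats."""
    timings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'name: value' line
        try:
            timings[name.strip()] = float(value)
        except ValueError:
            continue  # value was not a number, skip the line
    return timings


def phase_durations(t):
    """Convert curl's cumulative timestamps into per-phase durations in seconds."""
    # time_appconnect is 0 for plain HTTP, so clamp the TLS phase at zero.
    tls = max(t["time_appconnect"] - t["time_connect"], 0.0)
    phases = {
        "dns": t["time_namelookup"],
        "tcp_connect": t["time_connect"] - t["time_namelookup"],
        "tls_handshake": tls,
        "server_wait": t["time_starttransfer"] - t["time_pretransfer"],
        "transfer": t["time_total"] - t["time_starttransfer"],
    }
    # Round to milliseconds to hide floating-point noise.
    return {name: round(value, 3) for name, value in phases.items()}
```

Applied to the figures above, almost all of the 88 ms is the TLS handshake; the name lookup and the server's own response time are a few milliseconds each.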

BTW, yeah, I know that my node's performance for the general public is currently not as good as it used to be, but I'm still trying to serve some high-frequency requests to service providers (despite multiple notices of deprecation / making that endpoint obsolete).
I will switch the endpoint sometime in May to what I currently have under test. Same hardware, new software, you will see the difference :-)

The name lookup part is strange, yes, but it stayed between 400 ms and 600 ms regardless until I made the optimisations.

I found that one of the most common reasons for RPCs being slow was too many connections. I saw some IPs making 100s of connections regardless of the nginx rate limiting. I also saw an ungodly number of TIME_WAIT (waiting to close) connections that were not being cleaned up.
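To check whether a node has the same problem, one option is to count sockets per TCP state and per remote IP. A rough sketch, assuming a Linux host, IPv4 only, and a little-endian machine (that is how `/proc/net/tcp` encodes addresses there); the function names are invented for this example.

```python
import socket
import struct
from collections import Counter

# State codes used in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def _rows(proc_net_tcp_text):
    """Yield the whitespace-split fields of each socket row, skipping the header."""
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 4:
            yield fields


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state name."""
    counts = Counter()
    for fields in _rows(proc_net_tcp_text):
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts


def connections_per_remote_ip(proc_net_tcp_text):
    """Count non-listening sockets per remote IPv4 address."""
    counts = Counter()
    for fields in _rows(proc_net_tcp_text):
        if fields[3].upper() == "0A":
            continue  # listening sockets have no remote peer
        hex_ip = fields[2].split(":")[0]
        # Assumption: little-endian host, so the hex word is byte-reversed.
        ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
        counts[ip] += 1
    return counts

# Usage on a Linux box (not run here):
#   text = open("/proc/net/tcp").read()
#   print(count_tcp_states(text).most_common())
#   print(connections_per_remote_ip(text).most_common(10))
```

A large TIME_WAIT count next to a small ESTABLISHED count, or a handful of IPs holding hundreds of sockets each, would match the picture described above.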

I copied the sendfile on part from our Privex config. I was a little confused as to why it was there too, but I left it in as it didn't seem to be hurting anything.

If you take a look at the graphs, you can see the insane number of open connections that my RPC node and the minnowsupport one were suffering from. This is partly due to asshats using get_block over HTTP rather than websockets, thus opening 100s of connections (each of which Linux keeps in TIME_WAIT for about 60 seconds by default after closing... which is why time_wait optimisation is needed). This does slow down public RPC nodes, because the network scheduler has to deal with 2000 connections even though fewer than 300 of them are actually active.
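On the client side, the cheap fix is to reuse one connection for a run of get_block calls instead of opening a new one per block. A minimal sketch using HTTP keep-alive from the Python standard library. Assumptions: the endpoint accepts JSON-RPC 2.0 over POST, the method is named `condenser_api.get_block` (older nodes may expect plain `get_block`), and the host name in the usage comment is a placeholder.

```python
import json


def build_get_block_request(block_num, request_id):
    """Build one JSON-RPC 2.0 request body for a block (method name is an assumption)."""
    return {
        "jsonrpc": "2.0",
        "method": "condenser_api.get_block",
        "params": [block_num],
        "id": request_id,
    }


def fetch_blocks(conn, block_nums, path="/"):
    """Fetch several blocks over ONE connection.

    `conn` is an http.client.HTTPSConnection (or anything with the same
    request()/getresponse() methods). HTTP/1.1 keeps the socket open between
    requests as long as each response is read in full before the next request.
    """
    results = []
    for request_id, block_num in enumerate(block_nums):
        body = json.dumps(build_get_block_request(block_num, request_id))
        conn.request("POST", path, body=body,
                     headers={"Content-Type": "application/json"})
        response = conn.getresponse()
        results.append(json.loads(response.read()))  # drain the response fully
    return results

# Usage (not run here; the host is a placeholder):
#   import http.client
#   conn = http.client.HTTPSConnection("rpc.example.com", timeout=10)
#   try:
#       blocks = fetch_blocks(conn, range(1000, 1100))
#   finally:
#       conn.close()
```

With this pattern a client fetching 100 blocks leaves one socket to close rather than 100, which is exactly the TIME_WAIT pile-up described above.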

Yeah... I was considering disabling get_block entirely and using a separate, smaller and much more robust instance for that (pre-jussi times), but there were also troubles with vops. I'm planning improvements for June; there's no point in wasting time on temporary solutions.

"This is partly due to asshats using get_block over HTTP rather than websockets"

@someguy123 As a witness you have a strong knowledge base about how things work behind the scenes, so perhaps you have some preferences or advice aimed at API users?

I mean, people using the API do not worry about performance, but perhaps your hints (such as the one quoted above) could make a difference? I'm thinking of a simple list of Dos and Don'ts.

Agreed. For people using an API, what would be helpful?

Upvoted myself because those comments from voting bots are super annoying. Nobody is scrolling below that trash.

So what's the tl;dr?

hardware firewall in front to deal with network level hassle, nginx with ssl termination, jussi for caching, then specialized nodes; appbase+rocksdb, enterprise nvmes and 640kB should be enough for everybody ;-)
Soon in your blockchain (June, after my vacation)

Thanks. Unfortunately unable to upvote at this moment.