
[dev] New networking code based on "proactor" pattern


Guest faramir118


faramir118, please remember that the scatter/gather API is supported only on the Vista/Windows 7 operating systems - no need to bother with these specific cases at all. It is better to use the old-fashioned way of dealing with data buffer merging and mining.

P.S. I've checked your last commit with the scatter/gather functionality discussed above - no signs of the packet-drop issue on the client. Very promising :)

Link to comment
Share on other sites

faramir118, I'll check your code tomorrow for memory leaks so the community can start testing your patch :) Thank you very much and keep up the awesome work :)

P.S. No memory leaks detected so far - very nice :) Everyone is welcome to test this patch - we want to know whether ping/CPU usage went down and about any issues you might experience.

P.P.S. I've edited the topic title to reflect what this patch actually does.

Link to comment
Share on other sites

P.S. No memory leaks detected so far - very nice :) Everyone is welcome to test this patch - we want to know whether ping/CPU usage went down and about any issues you might experience.

I pushed a commit to implement double-buffered sending and removed the message queue, as per your suggestion on github.

  • outgoing packet buffers are reused
    (some small overhead when buffers must be enlarged)
  • for sending packets, cut the number of memcpy calls in half.
  • 'enlarged' the critical sections in SendPacket and handle_write_stream
    (we are dealing with shared variables now)
  • removed an unnecessary lock in ProcessIncoming

Overall, much less CPU time is spent on buffers. The resize overhead could probably be removed by picking a good starting size for the buffers.
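
Roughly, the double-buffering scheme works like this (just an illustrative sketch with made-up names, not the committed code):

#include <ace/Message_Block.h>
#include <ace/Thread_Mutex.h>
#include <ace/Guard_T.h>
#include <cstddef>
#include <utility>

class DoubleBufferedSender
{
public:
    DoubleBufferedSender()
        : m_busyBuffer(new ACE_Message_Block(32 * 1024)),    // illustrative starting size
          m_pendingBuffer(new ACE_Message_Block(32 * 1024)),
          m_writeInProgress(false) {}

    ~DoubleBufferedSender()
    {
        m_busyBuffer->release();
        m_pendingBuffer->release();
    }

    void SendPacket(const char* data, size_t len)
    {
        ACE_Guard<ACE_Thread_Mutex> guard(m_lock);
        m_pendingBuffer->copy(data, len);       // single memcpy into the outgoing buffer
                                                // (real code grows the block when it runs out of space)
        if (!m_writeInProgress)
            StartWrite();
    }

    // Called from the proactor when the asynchronous write completes.
    void handle_write_stream()
    {
        ACE_Guard<ACE_Thread_Mutex> guard(m_lock);
        m_busyBuffer->reset();                  // reuse the block instead of reallocating
        std::swap(m_busyBuffer, m_pendingBuffer);
        m_writeInProgress = (m_busyBuffer->length() > 0);
        if (m_writeInProgress)
            StartWrite();
    }

private:
    void StartWrite()
    {
        // Real code would issue ACE_Asynch_Write_Stream::write() on m_busyBuffer here.
        m_writeInProgress = true;
    }

    ACE_Thread_Mutex   m_lock;
    ACE_Message_Block* m_busyBuffer;
    ACE_Message_Block* m_pendingBuffer;
    bool               m_writeInProgress;
};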

Link to comment
Share on other sites

Thank you, faramir118. Yes, you can set the starting buffer size to 64 KB - this is our default value in the config.

I'll test your changes tomorrow since currently my brain hurts so badly I can't think at all :) Once you've done that, is there any possibility we can make use of the "Network.Threads" config value to spawn several worker threads for processing async completion notifications?

P.S. A small piece of advice: in AsyncSocketHandler::handle_write_stream() you should place a check to see whether ALL the data from m_busyBuffer was written by the last write() call. If not, you should keep sending the pending data until it has all gone out. Otherwise we might lose data =/
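
Something along these lines might work (just a sketch - I'm assuming the writer is the session's ACE_Asynch_Write_Stream and that ACE advances the block's rd_ptr by bytes_transferred on completion):

#include <ace/Asynch_IO.h>
#include <ace/Message_Block.h>

// Returns true if a follow-up write was issued for the unwritten remainder.
bool ResubmitIfShortWrite(ACE_Asynch_Write_Stream& writer,
                          const ACE_Asynch_Write_Stream::Result& result)
{
    ACE_Message_Block& mb = result.message_block();

    if (result.success() && mb.length() > 0)
    {
        // Assuming ACE advanced rd_ptr by bytes_transferred, the unsent
        // remainder starts at rd_ptr - just resubmit it.
        writer.write(mb, mb.length());
        return true;
    }

    return false;   // either an error or the whole of m_busyBuffer went out
}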

Cheers :)

Link to comment
Share on other sites

Multiple threads are possible - definitely something that has been on my list of things to do.

It is accomplished by creating multiple ACE_Proactors.
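
Roughly like this (illustrative sketch, not the final code): each worker owns its own ACE_Proactor and runs its event loop on a dedicated thread, and new sockets get assigned round-robin based on Network.Threads.

#include <ace/Proactor.h>
#include <ace/Task.h>

class NetworkWorker : public ACE_Task_Base
{
public:
    ACE_Proactor& proactor() { return m_proactor; }

    int start() { return this->activate(); }          // spawns one thread running svc()

    void stop()
    {
        m_proactor.proactor_end_event_loop();         // make svc() return
        this->wait();                                 // join the worker thread
    }

    virtual int svc()
    {
        m_proactor.proactor_run_event_loop();         // dispatch completions until stopped
        return 0;
    }

private:
    ACE_Proactor m_proactor;
};

// Usage idea: create Network.Threads workers, start() each one, and hand every
// new connection the proactor of the next worker (round-robin).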

edit:

For testing, should we ask for system-wide CPU utilization?

There will probably be more CPU use in kernel space than before; I'm just not sure how much.

Link to comment
Share on other sites

I've found some docs regarding ACE_Proactor and ACE_Reactor pattern use on *nix and Windows systems: http://stevehuston.wordpress.com/tag/ace-proactor-overlapped-asynchronous/:

So, the Proactor situation on UNIX platforms is generally not too good for demanding applications. Again, Proactor on Windows is very good, and recommended for high-performance, highly scalable networked applications. On Linux, stick to ACE_Reactor using the ACE_Dev_Poll_Reactor implementation; on other systems, stick with ACE_Reactor and ACE_Select_Reactor or ACE_TP_Reactor depending on your need for multithreaded dispatching.

The information might be outdated since there have been new versions of the OSes as well as of ACE which might have fixed those issues. If not, then there is a solution called TProactor http://www.terabit.com.au/solutions.php with details about its internals: http://www.terabit.com.au/docs/Comparison.htm. Also, their license allows us to use it without any issues as long as we carry their license file :D

Looks like not all *nix OSes are good for true async IO :)

P.S. Yes, faramir118, we should ask for kernel CPU utilization numbers too, since async IO will push more work down to the kernel level.

P.P.S. We can rename your "Network.Async" config option to "Network.OldEngine" and set the default to "0". Create a factory method which builds the proper networking engine based on either the Reactor or the Proactor pattern (rough sketch below), so users will be able to fall back to the current networking code in case any issues are spotted. The reason to keep trying this approach rather than stop developing the patch is:

in many cases add-on Proactor improves performance up to 2 times (compared with Reactor);
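
Something along these lines (made-up class names, just to show the shape of the factory idea):

// Hypothetical common interface for both engines.
class INetworkEngine
{
public:
    virtual ~INetworkEngine() {}
    virtual int  Start(unsigned short port) = 0;
    virtual void Stop() = 0;
};

class ReactorEngine : public INetworkEngine        // wraps the current ACE_Reactor-based code
{
public:
    int  Start(unsigned short /*port*/) { return 0; }
    void Stop() {}
};

class ProactorEngine : public INetworkEngine       // wraps the new ACE_Proactor-based code
{
public:
    int  Start(unsigned short /*port*/) { return 0; }
    void Stop() {}
};

// Caller owns the returned engine; oldEngine comes from Network.OldEngine.
INetworkEngine* CreateNetworkEngine(bool oldEngine)
{
    if (oldEngine)
        return new ReactorEngine();                // Network.OldEngine = 1: fall back
    return new ProactorEngine();                   // default ("0"): use the proactor engine
}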

Link to comment
Share on other sites

Many changes - I restructured all of the socket code. I will work on multiple proactor threads next.

TProactor sounds interesting. If it works well, we can probably do away with the ACE_Reactor-based sockets totally.

However, it was written for a very different version of ACE, and updating it would take time.

I'd still like to hear from *nix users about performance with the default ACE implementations.

Link to comment
Share on other sites

faramir118, your code changes are great and I don't see any potential drawbacks currently. Looking forward to seeing the multi-threaded proactor code.

Yes, TProactor looks very nice, but incorporating it into mangos might be a tough job. Also, if someone gets poor performance or issues with TProactor, they will not be able to fall back to the old code - mangos will become unusable =/

Anyway, cheers to you for your awesome work :)

Link to comment
Share on other sites

P.S. No memory leaks detected so far - very nice :) Everyone is welcome to test this patch - we want to know whether ping/CPU usage went down and about any issues you might experience.

Are you sure?

[attached screenshot "pacram.png": memory usage graph]

It ate 24 GB of RAM within a few seconds (5-10 seconds) after mangos started up completely, and then it started swapping...

If I use Network.OldEngine = 1

(28661 | 140737354053392) ACE_POSIX_AIOCB_Proactor::Max Number of AIOs=1024
(28661 | 140737354053392) ACE_POSIX_AIOCB_Proactor::Max Number of AIOs=1024
(28661 | 140737354053392) ACE_POSIX_AIOCB_Proactor::Max Number of AIOs=1024
(28661 | 140737354053392) ACE_POSIX_AIOCB_Proactor::Max Number of AIOs=1024
(28661 | 140737354053392) ACE_POSIX_AIOCB_Proactor::Max Number of AIOs=1024
Failed to complete asynchronous write to 10.0.0.9

http://pastebin.com/mSAK27pa

http://pastebin.com/dRPdezcn

Link to comment
Share on other sites

Thank you, Undergarun.

It must be something weird in the message block resize algorithm if your mangos process ate all your RAM - this cannot be detected as a memory leak at all. Another frustrating error is "Max Number of AIOs=1024". The default value should be no lower than 4096 AIOs... But the ACE code thinks otherwise:

#define ACE_AIO_DEFAULT_SIZE 1024

Undergarun, what is the value of "Network.Threads" config option?

P.S. faramir118, you should test for EAGAIN or EWOULDBLOCK (whatever ACE uses to report the operation overflow condition) when a send or receive operation wasn't successful, and reschedule it. Or even create a per-proactor queue of async requests to work around the max async IO request problem(?). If every ACE_Proactor object allows us 1024 async IO requests on *nix, then better load balancing might help us.
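
Something like this, for example (sketch only - I'm assuming ACE leaves errno set by the underlying aio call when read() returns -1):

#include <ace/Asynch_IO.h>
#include <ace/Message_Block.h>
#include <cerrno>
#include <cstddef>

enum StartResult { STARTED, RETRY_LATER, FAILED };

// Try to queue an asynchronous read; tell the caller whether to retry later.
StartResult TryStartRead(ACE_Asynch_Read_Stream& reader, ACE_Message_Block& block, size_t len)
{
    if (reader.read(block, len) == 0)
        return STARTED;                       // accepted by the proactor

    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return RETRY_LATER;                   // aio list full: queue the request, retry from a completion callback

    return FAILED;                            // real error: drop the session
}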

Cheers :)

Link to comment
Share on other sites

By default in ACE, as Ambal pointed out, the maximum POSIX aio list size is 1024 entries, which equates to 512 clients (since each client can have a pending read and a pending write at the same time). This limit can be changed, although certain platforms won't cooperate.

Asynchronous read/write calls fail immediately (with EAGAIN) if there aren't enough aio slots available. While it is a recoverable error, this isn't handled currently.

Possible things that can be done:

  • Write some runtime checks that avoid aio on platforms that have system-wide limits or low per-process limits
  • Fall back on some synchronous mechanism
  • Fall back on some queueing mechanism
  • Use a session queue to prevent exceeding aio slot limits

Thoughts?

Link to comment
Share on other sites

Hi, faramir118.

I'd stick with the "session queue to prevent exceeding aio slot limits" option. All you need to do is:

1) push all async read/write requests into one queue

2) count the number of active requests

3) on read/write completion callbacks, lower the active request count

4) execute pending tasks while the active count stays below ACE_AIO_DEFAULT_SIZE (or whatever value you pick).

This solution will work across all platforms. You just need to find out whether that ACE_AIO_DEFAULT_SIZE limit applies to one ACE_Proactor object or to the whole process. In the first case the implementation will be simple; in the second it will require some more synchronisation. ACE_Atomic_Op<> will fit just fine for handling the active request count.
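
A rough sketch of the idea (illustrative types, not a ready patch): every async request goes through a gate that tracks how many are in flight and defers the rest until completions free up slots.

#include <ace/Atomic_Op.h>
#include <ace/Thread_Mutex.h>
#include <ace/Guard_T.h>
#include <deque>

struct PendingOp { /* socket, read/write flag, buffer - omitted */ };

class AioThrottle
{
public:
    explicit AioThrottle(long maxActive) : m_maxActive(maxActive), m_active(0) {}

    // Returns true if the caller may submit the operation right now.
    bool TryAcquire(const PendingOp& op)
    {
        if (++m_active <= m_maxActive)
            return true;
        --m_active;                                   // over the limit: defer it
        ACE_Guard<ACE_Thread_Mutex> guard(m_queueLock);
        m_pending.push_back(op);
        return false;
    }

    // Called from read/write completion callbacks (step 3 above).
    void OnComplete()
    {
        --m_active;
        ACE_Guard<ACE_Thread_Mutex> guard(m_queueLock);
        if (!m_pending.empty())
        {
            PendingOp next = m_pending.front();
            m_pending.pop_front();
            guard.release();
            if (TryAcquire(next))
                Submit(next);                         // actually issue the deferred read/write
        }
    }

private:
    void Submit(const PendingOp& /*op*/) { /* issue the ACE_Asynch_* call here */ }

    const long m_maxActive;                           // e.g. ACE_AIO_DEFAULT_SIZE
    ACE_Atomic_Op<ACE_Thread_Mutex, long> m_active;   // active request count
    ACE_Thread_Mutex m_queueLock;
    std::deque<PendingOp> m_pending;
};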

P.S. You will definitely need to account for a client's queued requests and remove them from the queue when disconnecting that client.

Cheers :)

Link to comment
Share on other sites

Does EAGAIN also have something to do with thread creation?

I mean, I experienced the same 1024 limitation with the 'ThreadManager.spawn()' function.

So if all thread slots are used for communication, we would not be able to spawn any others for our specific tasks.

Just confirm that I am wrong and it's not related to threads.

Link to comment
Share on other sites

Pushed the queue mechanism. The general idea is there, but I will make a couple of changes soon.

I want to use a priority queue, so that I can prioritize write ops.

Deterministically speaking, write is a much more reliable operation - read ops never time out, so if the internal ACE aiocb list is full of reads we may deadlock until a client wakes us up.
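
For example (sketch with illustrative types): when deferred operations are waiting for free aio slots, pop the writes first.

#include <queue>
#include <vector>

enum OpType { OP_READ = 0, OP_WRITE = 1 };

struct PendingOp
{
    OpType type;
    // socket, buffer, etc. omitted
};

struct WritesFirst
{
    bool operator()(const PendingOp& a, const PendingOp& b) const
    {
        return a.type < b.type;    // OP_WRITE compares "greater", so it is popped first
    }
};

typedef std::priority_queue<PendingOp, std::vector<PendingOp>, WritesFirst> OpQueue;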

PS: I hope the code and comments make sense... an [asynchronous (de)multiplexer] layered on [asynchronous event demultiplexing] will be a nightmare for whoever wants to read this code sometime in the future.

Link to comment
Share on other sites

Decided to try a little test:

750 clients, idle at the character screen (my C# client, so not even CMSG_PING)

I think it shows the weakness of select-based servers, and that for large servers there is some performance gain to be found.

I'll work on having the clients send data (ping or some invalid opcode with data). This should give a more realistic comparison, since we will get to test completion callbacks vs. fdset scanning (and it will also ignore the 'overhead' of the mangos core ;))

Download results here.

PS: I found some crashes where outstanding async operations dereference already-deleted sockets, so I will have to work on that as well.

PPS: I should probably have explained the images...

  • For the CPU graphs, look at the graph and not the label.
    The label is the % at the moment I took the picture, which could have been during a context switch, etc.
  • There is a CPU spike when the clients start connecting, and you can use the slope of the private bytes graph to judge the connection rate
  • The select/debug graph is off a bit; the clients were running slowly due to the attached debugger,
    so the actual connection rate is probably higher
  • If there are any questions, just ask!
Link to comment
Share on other sites

I've been following your changes for a while now.

I don't know how well known this is, but both mangos (I only verified realmd on Windows) and TC suffer from fd/handler limit issues. Once the fd limit is reached, the servers become unconnectable, and this condition persists even if the number of open connections drops back below the fd limit.

On Unix (where most likely ACE_Dev_Poll_Reactor is in use) the ACE-triggered epoll code goes into an endless loop and never recovers unless the app is killed. Derex wrote a patch overriding the error handling and suspending the reactor to prevent this loop (https://github.com/derex/TrinityCore/commit/5f50d8c20f47bcf73ce6fce09e2e98bc8fe1ccbe).

On Windows no solution has been found so far, as the ACE_TP_Reactor used there doesn't trigger error handling the way Unix/ACE_Dev_Poll_Reactor does.

Long story short, you might want to verify if your proactor implementation suffers from the same problem.

Shooting something like this at the servers might reveal it:

#!/usr/bin/perl
use IO::Socket;
my @sockarr;

for(my $i = 0; $i <= 5000; $i++)
{
   if ($sockarr[$i] = IO::Socket::INET->new(Proto=>"tcp",PeerAddr=>"127.0.0.1",PeerPort=>"3724"))
   {
       print "connection: ${i} successful\n";
   } else {
       print "connection: ${i} failed\n";
   }
}

print "end";
sleep 10000;

On Windows mangos uses FD_SETSIZE 4096 from what I've seen, so this script might need to be started multiple times, as Perl has a handle limit below 4096.

Link to comment
Share on other sites
