
Polling Made Efficient

하늘을닮은호수M 2007. 11. 23. 17:46

Source: http://developers.sun.com/solaris/articles/polling_efficient.html

select(3C) and poll(2) are two legacy C interfaces often used to multiplex input and output over a set of file descriptors. As of the Solaris 7 Operating Environment (OE), the new /dev/poll interface -- poll(7d) -- provides a much more scalable means of polling large numbers of file descriptors for activity.

Several means exist to scan multiple file descriptors for available data. Here is a test program that shows how much better the Solaris /dev/poll mechanism scales compared to the traditional poll(2) approach. This new mechanism supports regular files, [pseudo-]terminal devices, STREAMS-based files, FIFOs, pipes, and sockets, just as the legacy poll(2) does. While poll(2) is far preferable to select(3C) for a multitude of reasons (some of which are described in the select(3C) man page), it still has a number of serious performance problems as the set of file descriptors being scanned grows large. Most of these problems are solved by using /dev/poll (poll(7d)):

  1. The entire array of pollfd elements has to be passed into and out of the kernel on each call to poll(2). For a server polling thousands of connections with activity on only a small subset, this is prohibitively expensive.

  2. The kernel must redetermine the status of every file descriptor in that array each time. As of the Solaris 7 OE, status from previous poll(2) calls is cached on a per-LWP basis, which can reduce this overhead if the array does not change significantly. However, the same logic that improves performance for largely static pollfd arrays can actually make matters worse for a very dynamic array.

  3. After each call to poll(2), one's user-level code must search the entire pollfd array for returned events.

The /dev/poll pseudo-device solves all of the above problems by maintaining the full set of file descriptors being polled inside the kernel. File descriptors are added to or removed from the polled set by writing to the pseudo-device file descriptor, and polling is done by issuing an ioctl(DP_POLL) on that descriptor. To reduce unnecessary copying of data back to user space, and to reduce the amount of time one's code spends searching for outstanding events, the maximum number of events an ioctl(DP_POLL) call can return is specified when the call is made, and the call returns a dense array of at most that many events (as opposed to poll(2)'s events, which are scattered among all of the file descriptors monitored).
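
In code, the sequence looks roughly like this (a minimal sketch against the poll(7d) interface; error handling is trimmed, and watch_fds is just an illustrative name):

    #include <sys/devpoll.h>   /* struct dvpoll, DP_POLL */
    #include <sys/ioctl.h>
    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MAXEVENTS 8

    /* Monitor nfds descriptors in fds[] for readability via /dev/poll. */
    int
    watch_fds(const int *fds, int nfds)
    {
        struct pollfd add[1], ready[MAXEVENTS];
        struct dvpoll dvp;
        int dpfd, i, n;

        dpfd = open("/dev/poll", O_RDWR);

        /* Register interest by writing pollfd entries to the device.
         * (Writing an entry with events = POLLREMOVE removes it.) */
        for (i = 0; i < nfds; i++) {
            add[0].fd = fds[i];
            add[0].events = POLLIN;
            add[0].revents = 0;
            (void) write(dpfd, add, sizeof (add));
        }

        /* Ask for at most MAXEVENTS events; block until one arrives. */
        dvp.dp_fds = ready;
        dvp.dp_nfds = MAXEVENTS;
        dvp.dp_timeout = -1;
        n = ioctl(dpfd, DP_POLL, &dvp);

        /* ready[0..n-1] is dense: every entry has an event pending. */
        for (i = 0; i < n; i++)
            (void) printf("fd %d: revents 0x%x\n", ready[i].fd, ready[i].revents);

        (void) close(dpfd);
        return (n);
    }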

While this new interface radically reduces the fundamental overhead of polling large numbers of file descriptors, it does not solve many of the traditional problems associated with polling, such as trading off latency for throughput, maintaining state information for each descriptor being monitored, removing and re-adding descriptors while work is being done on behalf of available data, and so on.

    To keep the scope of this article manageable, I present a simple test program that creates a set of pipes and writes data to random entries, using poll or /dev/poll to determine which pipe to read from. This really shows the dramatically reduced overhead of /dev/poll. This test case is clearly artificial, as it doesn't actually need to do any polling, given it knows what index has been written to, and it doesn't do any work scheduling other than handling a read inline. However, it does make clear the cost associated with polling different numbers of file descriptors. Although real-world performance improvements would not likely be as large, given the costs of scheduling worker threads, doing real work, and adding and removing descriptors, they can be quite significant. Here is devpollperf.c.
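
To see what the poll(2) variant of such a loop measures, here is a minimal, self-contained sketch (an illustration only, not the actual devpollperf.c; NFDS and NITERS are assumed constants chosen to match the runs below). Each iteration wakes exactly one random pipe, yet poll(2) must scan all of the descriptors in the kernel, and the user code must then scan the whole array again:

    #include <poll.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NFDS   100      /* number of pipes being polled */
    #define NITERS 104858   /* iterations, as in the runs below */

    int
    main(void)
    {
        static int pipes[NFDS][2];
        static struct pollfd pfds[NFDS];
        char c = 'x';
        int i, iter;

        for (i = 0; i < NFDS; i++) {
            if (pipe(pipes[i]) < 0) {
                perror("pipe");
                return (1);
            }
            pfds[i].fd = pipes[i][0];      /* watch the read end */
            pfds[i].events = POLLIN;
        }

        for (iter = 0; iter < NITERS; iter++) {
            /* Write one byte to a random pipe... */
            (void) write(pipes[rand() % NFDS][1], &c, 1);

            /* ...then pay the cost of scanning all NFDS descriptors,
             * once in the kernel and once again in user code. */
            (void) poll(pfds, NFDS, -1);
            for (i = 0; i < NFDS; i++) {
                if (pfds[i].revents & POLLIN) {
                    (void) read(pfds[i].fd, &c, 1);
                    break;
                }
            }
        }
        return (0);
    }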

To run the test program with thousands of file descriptors, one needs to raise the number of file descriptors available. One can use limit(1) to do that manually, though only root can raise the hard [-h/-H] descriptor limit. Alternatively, one could compile the following program, install the binary somewhere on the system that allows setuid programs (typically a local file system, like /usr/local/bin), and make it owned by root and setuid. Note that it sets the effective uid/gid back to the original after bumping the file descriptor limit, so the process it spawns no longer has root privileges: bumpfdrun.c
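
A minimal sketch of what such a wrapper might look like (an illustration of the approach just described, not the original bumpfdrun.c; the 65535 limit is an assumed value matching the transcript below):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/resource.h>

    int
    main(int argc, char *argv[])
    {
        struct rlimit rl;

        if (argc < 2) {
            (void) fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
            return (1);
        }

        /* Raising the hard limit requires root, hence the setuid bit. */
        rl.rlim_cur = rl.rlim_max = 65535;
        if (setrlimit(RLIMIT_NOFILE, &rl) < 0) {
            perror("setrlimit");
            return (1);
        }

        /* Drop root privileges before running the command. */
        if (setgid(getgid()) < 0 || setuid(getuid()) < 0) {
            perror("setuid");
            return (1);
        }

        (void) execvp(argv[1], &argv[1]);
        perror("execvp");
        return (1);
    }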

bash-2.03$ ulimit -n
256
bash-2.03$ ulimit -H -n
1024
bash-2.03$ ls -l /usr/local/bin/bumpfdrun
-rwsr-sr-x 1 root other 7828 May 10 15:01 /usr/local/bin/bumpfdrun
bash-2.03$ /usr/local/bin/bumpfdrun bash
bash-2.03$ ulimit -n
65535

Here are some runs of the simple test program with 1 ... 4000 descriptors, using no polling at all, poll(2), and then ioctl() [/dev/poll]:

bash-2.03$ a.out 1
Opened 1 file descriptor(s)
Starting write / [poll|ioctl] / read loop:
104858 write()/read()s on 1 fd(s) took : 796 msec.
bash-2.03$ a.out 1 poll
Opened 1 file descriptor(s)
Starting write / [poll|ioctl] / read loop:
104858 write()/poll()/read()s on 1 fd(s) took : 1391 msec.
bash-2.03$ a.out 1 devpoll
Opened 1 file descriptor(s)
Starting write / [poll|ioctl] / read loop:
104858 write()/ioctl(DP_POLL)/read()s on 1 fd(s) took : 1523 msec.
bash-2.03$ a.out 100
Opened 100 file descriptor(s)
Starting write / [poll|ioctl] / read loop:
104858 write()/read()s on 100 fd(s) took : 820 msec.
bash-2.03$ a.out 100 poll
Opened 100 file descriptor(s)
Starting write / [poll|ioctl] / read loop:
104858 write()/poll()/read()s on 100 fd(s) took : 3494 msec.
bash-2.03$ a.out 100 devpoll
Opened 100 file descriptor(s)
Starting write / [poll|ioctl] / read loop:
104858 write()/ioctl(DP_POLL)/read()s on 100 fd(s) took : 1863 msec.
bash-2.03$ a.out 1000
Opened 1000 file descriptor(s)
Starting write / [poll|ioctl] / read loop:
104858 write()/read()s on 1000 fd(s) took : 858 msec.
bash-2.03$ a.out 1000 poll
Opened 1000 file descriptor(s)
Starting write / [poll|ioctl] / read loop:
104858 write()/poll()/read()s on 1000 fd(s) took : 15060 msec.
bash-2.03$ a.out 4000
Opened 4000 file descriptor(s)
Starting write / [poll|ioctl] / read loop:
104858 write()/read()s on 4000 fd(s) took : 1029 msec.
bash-2.03$ a.out 4000 poll
Opened 4000 file descriptor(s)
Starting write / [poll|ioctl] / read loop:
104858 write()/poll()/read()s on 4000 fd(s) took : 34254 msec.
bash-2.03$ a.out 4000 devpoll
Opened 4000 file descriptor(s)
Starting write / [poll|ioctl] / read loop:
104858 write()/ioctl(DP_POLL)/read()s on 4000 fd(s) took : 2179 msec.

When scanning for activity on 4000 file descriptors, poll(2) is over 15 times more costly than /dev/poll. Looking at user and system CPU utilization, one can see that some of that time is spent in user code searching the sparse array for returned events, while /dev/poll consumes hardly any user CPU at all in this artificial test case, since the first event in the returned array always contains data:

rx7 42 => time a.out 10000 poll
Opened 10000 file descriptor(s)
Starting write / [poll|ioctl] / read loop:
104858 write()/poll()/read()s on 10000 fd(s) took : 147926 msec.
17.7u 130.2s 2:29 95% 0+0k 0+0io 0pf+0w

rx7 43 => time a.out 10000 devpoll
Opened 10000 file descriptor(s)
Starting write / [poll|ioctl] / read loop:
104858 write()/ioctl(DP_POLL)/read()s on 10000 fd(s) took : 2573 msec.
0.5u 2.5s 0:03 92% 0+0k 0+0io 0pf+0w

Even more impressive, /dev/poll is over 50 times faster here, and the poll(2) version spends more time just searching for returned events (17.7 seconds of user CPU) than the entire /dev/poll run takes, kernel time included.

Figure 1 shows the vast improvement in scaling that /dev/poll has compared to poll(2). Note that the "Old poll(2)" curve is actually a steep linear increase in cost; it appears exponential only because the descriptor axis is not linear.

Figure 1: Polling Scalability

Figure 2 shows just the user CPU time needed to search for descriptors with data available. While not nearly as compelling a difference as the total run time, it too is significant.

Figure 2: Poll User CPU Time

Finally, Figure 3 shows how much poll(2) performance improved with the Solaris 7 kernel poll cache (the pre-cache data is actually from a Solaris 2.5.1 machine). Note that shuffling 50 percent of the file descriptors in the array between calls to poll(2) actually ends up scaling significantly worse (400 percent worse at 1000 descriptors) on Solaris 7 and later, so it is critical to keep the poll array as static as possible if one must use poll(2): to remove an element, just set the fd to -1 rather than compressing the array.
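
In other words, deactivate entries in place rather than shifting the array: poll(2) ignores any entry whose fd is negative. A minimal sketch of that idiom (poll_array_remove is just an illustrative name):

    #include <poll.h>

    /* Deactivate one slot without disturbing the rest of the array, so
     * the kernel's per-LWP poll cache still matches on the next call.
     * poll(2) ignores entries with a negative fd and sets revents to 0. */
    void
    poll_array_remove(struct pollfd *pfds, int slot)
    {
        pfds[slot].fd = -1;
        pfds[slot].events = 0;
        pfds[slot].revents = 0;
    }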

Figure 3: poll(2) Improvement

To sum up: on Solaris 7 and later, /dev/poll provides significant scalability advantages over poll(2) and should be used in most cases where large numbers of descriptors need to be monitored. If poll(2) must be used, make certain to keep the pollfd array as static as possible, to at least avoid the much steeper growth in polling cost that dynamic arrays incur as the number of descriptors monitored grows.

Shridhar Acharya has written an article where he builds a simple UDP server using this interface: "Using the devpoll (/dev/poll) Interface".

May 2002
