<HTML><HEAD><TITLE>State Threads for Internet Applications</TITLE></HEAD><BODY BGCOLOR=#FFFFFF><H2>State Threads for Internet Applications</H2><H3>Introduction</H3><P>State Threads is an application library which provides a foundation for writing fast and highly scalable Internet Applications on UNIX-like platforms. It combines the simplicity of the multithreaded programming paradigm, in which one thread supports each simultaneous connection, with the performance and scalability of an event-driven state machine architecture.</P><H3>1. Definitions</H3><P><A NAME="IA"><H4>1.1 Internet Applications</H4></A><P>An <I>Internet Application</I> (IA) is either a server or client network application that accepts connections from clients and may or may not connect to servers. In an IA the arrival or departure of network data often controls processing (that is, an IA is a <I>data-driven</I> application). For each connection, an IA does some finite amount of work involving data exchange with its peer, where its peer may be either a client or a server. The typical transaction steps of an IA are to accept a connection, read a request, do some finite and predictable amount of work to process the request, then write a response to the peer that sent the request. One example of an IA is a Web server; the most general example of an IA is a proxy server, because it both accepts connections from clients and connects to other servers.</P><P>We assume that the performance of an IA is constrained by available CPU cycles rather than network bandwidth or disk I/O (that is, the CPU is a bottleneck resource).<P><A NAME="PS"><H4>1.2 Performance and Scalability</H4></A><P>The <I>performance</I> of an IA is usually evaluated as its throughput measured in transactions per second or bytes per second (one can be converted to the other, given the average transaction size). 
There are several benchmarks that can be used to measure throughput of Web serving applications for specific workloads (such as <A HREF="http://www.spec.org/osg/web96/">SPECweb96</A>, <A HREF="http://www.mindcraft.com/webstone/">WebStone</A>, <A HREF="http://www.zdnet.com/zdbop/webbench/">WebBench</A>). Although there is no common definition for <I>scalability</I>, in general it expresses the ability of an application to sustain its performance when some external condition changes. For IAs this external condition is either the number of clients (also known as "users," "simultaneous connections," or "load generators") or the underlying hardware system size (number of CPUs, memory size, and so on). Thus there are two types of scalability: <I>load scalability</I> and <I>system scalability</I>, respectively.<P>The figure below shows how the throughput of an idealized IA changes with the increasing number of clients (solid blue line). Initially the throughput grows linearly (the slope represents the maximal throughput that one client can provide). Within this initial range, the IA is underutilized and CPUs are partially idle. Further increase in the number of clients leads to system saturation, and the throughput gradually stops growing as all CPUs become fully utilized. After that point, the throughput stays flat because there are no more CPU cycles available. In the real world, however, each simultaneous connection consumes some computational and memory resources, even when idle, and this overhead grows with the number of clients. Therefore, the throughput of a real-world IA starts dropping after some point (dashed blue line in the figure below). 
The rate at which the throughput drops depends, among other things, on application design.<P>We say that an application has good <I>load scalability</I> if it can sustain its throughput over a wide range of loads. Interestingly, the <A HREF="http://www.spec.org/osg/web99/">SPECweb99</A> benchmark somewhat reflects a Web server's load scalability because it measures the number of clients (load generators) given a mandatory minimal throughput per client (that is, it measures the server's <I>capacity</I>). This is unlike <A HREF="http://www.spec.org/osg/web96/">SPECweb96</A> and other benchmarks that use the throughput as their main metric (see the figure below).<P><CENTER><IMG SRC="fig.gif" ALT="Figure: Throughput vs. Number of clients"></CENTER><P><I>System scalability</I> is the ability of an application to sustain its performance per hardware unit (such as a CPU) with the increasing number of these units. In other words, good system scalability means that doubling the number of processors will roughly double the application's throughput (dashed green line). We assume here that the underlying operating system also scales well. Good system scalability allows you to initially run an application on the smallest system possible, while retaining the ability to move that application to a larger system if necessary, without excessive effort or expense. That is, an application need not be rewritten or even undergo a major porting effort when changing system size.<P>Although scalability and performance are more important in the case of server IAs, they should also be considered for some client applications (such as benchmark load generators).<P><A NAME="CONC"><H4>1.3 Concurrency</H4></A><P>Concurrency reflects the parallelism in a system. 
The two types are <I>virtual</I> concurrency and <I>real</I> concurrency.<UL><LI>Virtual (or apparent) concurrency is the number of simultaneous connections that a system supports.<BR><BR><LI>Real concurrency is the number of hardware devices, including CPUs, network cards, and disks, that actually allow a system to perform tasks in parallel.</UL><P>An IA must provide virtual concurrency in order to serve many users simultaneously. To achieve maximum performance and scalability in doing so, the number of programming entities that an IA creates to be scheduled by the OS kernel should be kept close to (within an order of magnitude of) the real concurrency found on the system. These programming entities scheduled by the kernel are known as <I>kernel execution vehicles</I>. Examples of kernel execution vehicles include Solaris lightweight processes and IRIX kernel threads. In other words, the number of kernel execution vehicles should be dictated by the system size and not by the number of simultaneous connections.<P><H3>2. Existing Architectures</H3><P>There are a few different architectures that are commonly used by IAs. These include the <I>Multi-Process</I>, <I>Multi-Threaded</I>, and <I>Event-Driven State Machine</I> architectures.<P><A NAME="MP"><H4>2.1 Multi-Process Architecture</H4></A><P>In the Multi-Process (MP) architecture, an individual process is dedicated to each simultaneous connection. A process performs all of a transaction's initialization steps and services a connection completely before moving on to service a new connection.<P>User sessions in IAs are relatively independent; therefore, no synchronization between processes handling different connections is necessary. Because each process has its own private address space, this architecture is very robust. If a process serving one of the connections crashes, the other sessions will not be affected. 
However, to serve many concurrent connections, an equal number of processes must be employed. Because processes are kernel entities (and are in fact the heaviest ones), the number of kernel entities will be at least as large as the number of concurrent sessions. On most systems, good performance will not be achieved when more than a few hundred processes are created because of the high context-switching overhead. In other words, MP applications have poor load scalability.<P>On the other hand, MP applications have very good system scalability, because no resources are shared among different processes and there is no synchronization overhead.<P>The Apache Web Server 1.x (<A HREF=#refs1>[Reference 1]</A>) uses the MP architecture on UNIX systems.<P><A NAME="MT"><H4>2.2 Multi-Threaded Architecture</H4></A><P>In the Multi-Threaded (MT) architecture, multiple independent threads of control are employed within a single shared address space. Like a process in the MP architecture, each thread performs all of a transaction's initialization steps and services a connection completely before moving on to service a new connection.<P>Many modern UNIX operating systems implement a <I>many-to-few</I> model when mapping user-level threads to kernel entities. In this model, an arbitrarily large number of user-level threads is multiplexed onto a lesser number of kernel execution vehicles. Kernel execution vehicles are also known as <I>virtual processors</I>. Whenever a user-level thread makes a blocking system call, the kernel execution vehicle it is using will become blocked in the kernel. If there are no other non-blocked kernel execution vehicles and there are other runnable user-level threads, a new kernel execution vehicle will be created automatically. This prevents the application from blocking when it can continue to make useful forward progress.<P>Because IAs are by nature network I/O driven, all concurrent sessions block on network I/O at various points. 
As a result, the number of virtual processors created in the kernel grows close to the number of user-level threads (or simultaneous connections). When this occurs, the many-to-few model effectively degenerates to a <I>one-to-one</I> model. Again, as in the MP architecture, the number of kernel execution vehicles is dictated by the number of simultaneous connections rather than by the number of CPUs. This reduces an application's load scalability. However, because kernel threads (lightweight processes) use fewer resources and are more lightweight than traditional UNIX processes, an MT application should scale better with load than an MP application.<P>Unfortunately, the virtual processors sharing the same address space in the MT architecture also hurt an application's system scalability because of contention among the threads on various locks. Even if an application itself is carefully optimized to avoid lock contention around its own global data (a non-trivial task), there are still standard library functions and system calls that use common resources hidden from the application. For example, on many platforms thread safety of memory allocation routines (<TT>malloc(3)</TT>, <TT>free(3)</TT>, and so on) is achieved by using a single global lock. Another example is the per-process file descriptor table. This common resource table is shared by all kernel execution vehicles within the same process and must be protected when one modifies it via certain system calls (such as <TT>open(2)</TT>, <TT>close(2)</TT>, and so on). In addition, maintaining cache coherence among CPUs on multiprocessor systems hurts performance when different threads running on different CPUs modify data items on the same cache line.<P>In order to improve load scalability, some applications employ a different type of MT architecture: they create one or more threads <I>per task</I> rather than one thread <I>per connection</I>. 
For example, one small group of threads may be responsible for accepting client connections, another for request processing, and yet another for serving responses. The main advantage of this architecture is that it eliminates the tight coupling between the number of threads and the number of simultaneous connections. However, in this architecture, different task-specific thread groups must share common work queues that must be protected by mutual exclusion locks (a typical producer-consumer problem). This adds synchronization overhead that causes an application to perform badly on multiprocessor systems. In other words, in this architecture, the application's system scalability is sacrificed for the sake of load scalability.<P>Of course, the usual nightmares of threaded programming, including data corruption, deadlocks, and race conditions, also make the MT architecture (in any form) difficult to use.<P><A NAME="EDSM"><H4>2.3 Event-Driven State Machine Architecture</H4></A><P>In the Event-Driven State Machine (EDSM) architecture, a single process is employed to concurrently process multiple connections. The basics of this architecture are described in Comer and Stevens <A HREF=#refs2>[Reference 2]</A>. The EDSM architecture performs one basic data-driven step associated with a particular connection at a time, thus multiplexing many concurrent connections. The process operates as a state machine that receives an event and then reacts to it.<P>In the idle state the EDSM calls <TT>select(2)</TT> or <TT>poll(2)</TT> to wait for network I/O events.