"(not cached)" at 0.550,4.733 ljust"Read data" at 0.550,4.921 ljust"Reply data" at 1.675,4.421 ljust"Read request" at 1.675,4.921 ljust"lease" at 1.675,5.233 ljust"Reply non-caching" at 1.675,5.421 ljust"Reply" at 3.737,5.733 ljust"Write" at 3.175,5.983 ljust"Reply" at 3.737,6.171 ljust"Write" at 3.175,6.421 ljust"Eviction Notice" at 3.175,6.796 ljust"Get read lease" at 1.675,7.046 ljust"Read syscall" at 0.550,6.983 ljust"being cached" at 4.675,7.171 ljust"Delayed writes" at 4.675,7.358 ljust"lease" at 3.175,7.233 ljust"Reply write caching" at 3.175,7.421 ljust"Get write lease" at 3.175,7.983 ljust"Write syscall" at 4.675,7.983 ljust"with same modrev" at 1.675,8.358 ljust"Lease" at 0.550,8.171 ljust"Renewed" at 0.550,8.358 ljust"Reply" at 1.675,8.608 ljust"Get Lease Request" at 1.675,8.983 ljust"Read syscall" at 0.550,8.733 ljust"from cache" at 0.550,9.108 ljust"Read syscall" at 0.550,9.296 ljust"Reply " at 1.675,9.671 ljust"plus lease" at 2.050,9.983 ljust"Read Request" at 1.675,10.108 ljust.ps.ft.PE.sp.)zA write-caching lease is not used in the Stanford V Distributed System [Gray89],since synchronous writing is always used. A side effect of this changeis that the five to ten second lease duration recommended by Gray was foundto be insufficient to achieve good performance for the write-caching lease.Experimentation showed that thirty seconds was about optimal for cases wherethe client and server are connected to the same local area network, sothirty seconds is the default lease duration for NQNFS.A maximum of twice that value is permitted, since Gray showed that for somenetwork topologies, a larger lease duration functions better.Although there is an explicit get_lease RPC defined for the protocol,most lease requests are piggybacked onto the other RPCs to minimize theadditional overhead introduced by leasing..sh 2 "Rationale".ppLeasing was chosen over hard server state information for the followingreasons:.ip 1.The server must maintain state information about all currentclient leases.Since at most one lease is allocated for each RPC and the leases expireafter their lease term,the upper bound on the number of current leases is the product of thelease term and the server RPC rate.In practice, it has been observed that less than 10% of RPCs request new leasesand since most leases have a term of thirty seconds, the following rule ofthumb should estimate the number of server lease records:.sp.nf Number of Server Lease Records \(eq 0.1 * 30 * RPC rate.fi.spSince each lease record occupies 64 bytes of server memory, storing the leaserecords should not be a serious problem.If a server has exhausted lease storage, it can simply wait a few secondsfor a lease to expire and free up a record.On the other hand, a Sprite-like server must store records for all filescurrently open by all clients, which can require significant storage fora large, heavily loaded server.In [Mogul93], it is proposed that a mechanism vaguely similar to paging could beused to deal with this for Spritely NFS, but thisappears to introduce a fair amount of complexity and may limit theusefulness of open records for storing other state information, suchas file locks..ip 2.After a server crashes it must recover lease records forthe current outstanding leases, which actually implies that if it waitsuntil all leases have expired, there is no state to recover.The server must wait for the maximum lease duration of one minute, and it must serveall outstanding write requests resulting from terminated write-cachingleases before issuing new 
.sh 1 "Limitations of the NQNFS Protocol"
.pp
There is a serious risk when leasing is used for delayed write
caching.
If the server is simply too busy to service a lease renewal before a
write-caching lease terminates, the client will not be able to push the
write data to the server before the lease has terminated, resulting in
inconsistency.
Note that the danger of inconsistency occurs when the server assumes that
a write-caching lease has terminated before the client has
had the opportunity to write the data back to the server.
In an effort to avoid this problem, the NQNFS server does not assume that
a write-caching lease has terminated until three conditions are met:
.sp
.(l
1 - clock time > (expiry time + clock skew)
2 - there is at least one server daemon (nfsd) waiting for an RPC request
3 - no write RPCs received for the leased file within write_slack after the corrected expiry time
.)l
.lp
The first condition ensures that the lease has expired on the client.
The clock_skew, by default three seconds, must be
set to a value larger than the maximum time-of-day clock error that is
likely to occur during the maximum lease duration.
The second condition attempts to ensure that the client
is not waiting for replies to any writes that are still queued for service
by an nfsd.
The third condition tries to guarantee that the client has
transmitted all write requests to the server, since write_slack is set to
several times the client's timeout retransmit interval.
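.pp
The three conditions can be summarized by the following sketch
(a minimal illustration with hypothetical names and constants, not code
from the NQNFS server):
.sp
.nf
#include <time.h>

#define CLOCK_SKEW	3	/* seconds; > worst-case clock error */
#define WRITE_SLACK	12	/* several client retransmit intervals */

struct lease {
	time_t	expiry;		/* lease expiry time */
	time_t	last_write;	/* last write RPC for the leased file */
};

/* Return non-zero only when the server may safely assume that a
 * write-caching lease has terminated. */
int
lease_terminated(struct lease *lp, int nfsd_waiting, time_t now)
{
	time_t corrected = lp->expiry + CLOCK_SKEW;

	return (now > corrected &&		/* 1: expired on client */
	    nfsd_waiting &&			/* 2: no writes queued */
	    now >= corrected + WRITE_SLACK &&	/* 3: write_slack passed */
	    lp->last_write < corrected);	/*    with no write RPCs */
}
.fi
.sp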
.pp
There are also certain file system semantics that are problematic for both
NFS and NQNFS, due to the lack of state information maintained by the
server.
If a file is unlinked on one client while open on another, it will
be removed from the file server, resulting in failed file accesses on the
client that has the file open.
If the file system on the server is out of space or the client user's disk
quota has been exceeded, a delayed write can fail long after the write
system call was successfully completed.
With NFS this error will be detected by the close system call, since
the delayed writes are pushed upon close.
With NQNFS, however, the delayed write
RPC may not occur until after the close system call, possibly even after
the process has exited.
Therefore,
if a process must check for write errors,
a system call such as \fIfsync\fR must be used, as the example below shows.
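.pp
The following fragment (an illustration only, assuming a file on an NQNFS
mount) shows how \fIfsync\fR forces the delayed writes to the server so
that an out-of-space or over-quota error is reported to the process
instead of being lost after it exits:
.sp
.nf
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	char buf[8192] = "example data";
	int fd = open("datafile", O_WRONLY | O_CREAT, 0644);

	if (fd < 0 || write(fd, buf, sizeof(buf)) < 0)
		return 1;
	/* fsync() pushes the delayed writes now; a failure such as
	 * ENOSPC or EDQUOT is returned here, where close() alone
	 * might miss it under NQNFS. */
	if (fsync(fd) < 0)
		perror("fsync");
	return close(fd);
}
.fi
.sp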
.pp
Another problem occurs when a process on one client is
running an executable file
and a process on another client starts to write to the file.
The read lease on
the first client is terminated by the server, but the client has no
recourse but to terminate the process, since the process is already in
progress on the old executable.
.pp
The NQNFS protocol does not support file locking, since a file lock would
have to involve hard state information that is recovered after a crash.
.sh 1 "Other NQNFS Protocol Features"
.pp
NQNFS also includes a variety of minor modifications to the NFS protocol,
in an attempt to address various limitations.
The protocol uses 64bit file sizes and offsets in order to handle large
files.
TCP transport may be used as an alternative to UDP
for cases where UDP does not perform well.
Transport mechanisms
such as TCP also permit the use of much larger read/write data sizes,
which might improve performance in certain environments.
.pp
The NQNFS protocol replaces the Readdir RPC with a Readdir_and_Lookup
RPC that returns the file handle and attributes for each file in the
directory as well as the name and file id number.
This additional information may then be loaded into the lookup and
file-attribute caches on the client.
Thus, for cases such as "ls -l", the \fIstat\fR system calls can be
performed locally without doing any lookup or getattr RPCs.
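.pp
Conceptually, each directory entry in a Readdir_and_Lookup reply carries
the extra fields shown below (a hypothetical C rendering for illustration
only; the struct names, sizes, and layout are assumptions, and the actual
wire format is defined in [Macklem93]):
.sp
.nf
struct fattr {			/* abbreviated file attributes */
	unsigned int	mode;
	unsigned int	uid, gid;
	unsigned long	size;
};

/* Sketch of the information returned per directory entry: the
 * name and file id a Readdir reply already carries, plus the
 * file handle and attributes that would otherwise require
 * separate lookup and getattr RPCs. */
struct readdir_lookup_entry {
	char		name[256];	/* file name, as in Readdir */
	unsigned long	fileid;		/* file id number */
	unsigned char	fh[32];		/* file handle (Lookup result) */
	struct fattr	attributes;	/* attributes (Getattr result) */
};
.fi
.sp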
.pp
Another additional RPC is the Access RPC that checks for file
accessibility against the server.
This is necessary since in some cases the
client user ID is mapped to a different user on the server, so doing the
access check locally on the client using file attributes and client
credentials is not correct.
One case where this becomes necessary is when the NQNFS mount point is
using Kerberos authentication, where the Kerberos authentication ticket is
translated to credentials on the server that are mapped to the client side
user id.
For further details on the protocol, see [Macklem93].
.sh 1 "Performance"
.pp
In order to evaluate the effectiveness of the NQNFS protocol,
a benchmark was used that was
designed to typify
real work on the client workstation.
Benchmarks, such as Laddis [Wittle93], that perform server load
characterization are not appropriate for this work, since it is primarily
client caching efficiency that needs to be evaluated.
Since these tests measure overall client system performance and
not just the performance of the file system,
each sequence of runs was performed on identical hardware and operating
system in order to factor out the system
components affecting performance other than the file system protocol.
.pp
The machines used for all the benchmarks are members of the
DECstation\(tm\(dg
family of workstations using the MIPS\(tm\(sc RISC architecture.
The operating system running on these systems was a pre-release version of
4.4BSD Unix\(tm\(dd.
For all benchmarks, the file server was a DECstation 2100 (10 MIPS) with
8Mbytes of memory and a local RZ23 SCSI disk (27msec average access time).
The clients range in speed from DECstation 2100s
to a DECstation 5000/25, and always run with six block I/O daemons
and a 4Mbyte buffer cache, except for the test runs where the
buffer cache size was the independent variable.
In all cases /tmp is mounted on the local SCSI disk\**, all machines were
attached to the same uncongested Ethernet, and ran in single user mode
during the benchmarks.
.(f
\**Testing using the 4.4BSD MFS [McKusick90] resulted in slightly degraded
performance, probably because the machines only had 16Mbytes of memory, and
so paging increased.
.)f
Unless noted otherwise, test runs used UDP RPC transport
and the results given are the average values of four runs.
.pp
The benchmark used is the Modified Andrew Benchmark (MAB) [Ousterhout90],
which is a slightly modified version of the benchmark used to characterize
performance of the Andrew ITC file system [Howard88].
The MAB was set up with the executable binaries in the remote mounted file
system and the final load step was commented out, due to a linkage problem
during testing under 4.4BSD.
Therefore, these results are not directly comparable to other reported MAB
results.
The MAB is made up of five distinct phases:
.sp
.ip "1." 10
Make five directories (no significant cost)
.ip "2." 10
Copy a file system subtree to a working directory
.ip "3." 10
Get file attributes (stat) of all the working files
.ip "4." 10
Search for strings (grep) in the files
.ip "5." 10
Compile a library of C sources and archive them
.lp
Of the five phases, the fifth is by far the largest and is the one affected
most by client caching mechanisms.
The results for phase #1 are invariant over all
the caching mechanisms.
.sh 2 "Buffer Cache Size Tests"
.pp
The first experiment was done to see what effect changing the size of the
buffer cache would have on client performance.
A single DECstation 5000/25
was used to do a series of runs of MAB with different buffer cache sizes
for four variations of the file system protocol.
The four variations are
as follows:
.ip "Case 1:" 10
NFS - The NFS protocol as implemented in 4.4BSD
.ip "Case 2:" 10
Leases - The NQNFS protocol using leases for cache consistency
.ip "Case 3:" 10
Leases, Rdirlookup - The NQNFS protocol using leases for cache consistency
and with the readdir RPC replaced by Readdir_and_Lookup
.ip "Case 4:" 10
Leases, Attrib leases, Rdirlookup - The NQNFS protocol using leases for
cache consistency, with the readdir
RPC replaced by Readdir_and_Lookup,
and requiring a valid lease not only for file-data access, but also for
file-attribute access
.lp
As can be seen in figure 1, the buffer cache achieves about optimal
performance for the range of two to ten megabytes in size.
At eleven
megabytes in size, the system pages heavily and the runs did not
complete in a reasonable time.
Even at 64Kbytes, the buffer cache improves
performance over no buffer cache by a significant margin of 136-148 seconds
versus 239 seconds.
This may be due, in part, to the fact that the Compile Phase of the MAB
uses a rather small working set of file data.
All variants of NQNFS achieve about
the same performance, running around 30% faster than NFS, with a slightly
larger difference for large buffer cache sizes.
Based on these results, all remaining tests were run with the buffer cache
size set to 4Mbytes.
Although I do not know what causes the local peak in the curves between 0.5
and 2 megabytes,
there is some indication that contention for buffer cache blocks, between
the update process
(which pushes delayed writes to the server every thirty seconds) and the
I/O system calls, may be involved.
.(z
.sp
[Figure 1: MAB run time versus buffer cache size for the four protocol
variations.]
.sp
.)z