GURU SESSIONS
NFS Deployment for High Performance
Tom Talpey, Network Appliance, Inc.
transition from dafs to NFS/RDMA focus
performance improvements for NFS over ethernet
WAFL - internal log based system, raid back end
multi-protocol - dafs, http, ftp, ssh, NFS, cifs
#1 nas storage vendor
claim: NFS works for mission critical db deployments
claim: if done right, NFS delivers the performance of a local file system
tradition: NFS slow due to host cpu (most hosts have cycles to spare, actually)
ethernet slow compared to SANs (ethernet actually catching up)
NFS speed -
file caching behavior
wire efficiency - wire I/O
single point mount parallelism - limited capacity, lock contention
multi-NIC scalability - no built in trunking
throughput - iops and MB/s
latency (response time)
per io cpu cost in relation to local fs cost - transferring from network -> cache -> cache -> user
wire speed and network performance
tunings
network
fastest wire
quality NIC - hw checksumming, LSO
1 GbE - latency lower, throughput higher
tune routing paths - reduce queueing and hops
tcp NFS outperforms udp by 10% - buffering, flow control, reliability
enable ethernet jumbo frames
reduces read/write packet counts
requires support at both ends
requires support in switches
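A rough back-of-envelope sketch of why jumbo frames reduce read/write packet counts. It assumes a 32 KB NFS transfer over TCP/IPv4 with 40 bytes of IP and TCP header per packet, and it ignores RPC/NFS headers, TCP options and ACKs. The function name and the 1500/9000 MTU values are illustrative assumptions, not figures from the session.

```python
import math

IP_TCP_HEADERS = 40  # assumed: 20-byte IPv4 + 20-byte TCP header, no options


def packets_per_transfer(payload_bytes, mtu):
    """Approximate number of TCP segments needed to carry payload_bytes."""
    mss = mtu - IP_TCP_HEADERS  # usable payload per packet
    return math.ceil(payload_bytes / mss)


standard = packets_per_transfer(32 * 1024, 1500)  # standard ethernet MTU
jumbo = packets_per_transfer(32 * 1024, 9000)     # common jumbo frame MTU
print(f"32 KB transfer: {standard} packets at MTU 1500, {jumbo} at MTU 9000")
```

Fewer packets per transfer means fewer interrupts and less per-packet work on both client and server, which is where the win comes from.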
client - most wins are here, NFS overhead is client limited
check mount options
rsize/wsize - use 32 KB, 4 KB minimum
attribute caching
timeouts, noac, nocto (adds throughput to an NFS mountpoint when created files are not shared)
forcedirectio - for databases, etc (solaris option)
actimeo=0 != noac - noac also disables write caching (prefer actimeo so client will cache writes)
llock can greatly improve non-shared environments, with care
NFS readahead - server and client both tunable
reduces latency, amount configurable
(sun and netapp coauthored a paper about this, including how to turn it off on solaris)
number of client "biods"
linux: rpc slot table
number of readaheads
check socket options
system default socket buffers - at least 64k, 256k not too much
NFS specific socket buffers (tunable)
send/receive highwaters (1/4 size of socket buffer)
send/receive buffer sizes
TCP Large Windows (LW)
check driver specific tunings
optimize for low latency
jumbo frames
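For the "check socket options" item above, a minimal sketch of how one might inspect the system default socket buffer sizes from user space. It only reports the defaults an ordinary TCP socket gets. The kernel NFS client's own buffers are set through OS-specific tunables that are not shown here, and the helper names are hypothetical.

```python
import socket

MIN_BUFFER = 64 * 1024  # the "at least 64k" floor from the notes


def default_tcp_buffers():
    """Return the (send, receive) buffer sizes of a fresh TCP socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        snd = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
        rcv = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return snd, rcv


def meets_minimum(size, minimum=MIN_BUFFER):
    """True if a buffer size satisfies the suggested floor."""
    return size >= minimum


snd, rcv = default_tcp_buffers()
for name, size in (("send", snd), ("receive", rcv)):
    status = "ok" if meets_minimum(size) else "below 64k, consider raising"
    print(f"default {name} buffer: {size} bytes ({status})")
```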
server
use an appliance
use vendor's support
volume/spindle tuning - density of data and capacity does matter, number of spindles (drives) is very important
optimize for throughput
file and volume placement, distribution - split it up, not one great volume
server-specific options
"no access time" updates -
avoids writing to the disk every time a file is read just to update metadata (posix violation)
snapshots, backups, etc - substantial traffic - can be tuned - appropriate files at the appropriate times
war stories - caching - weak cache consistency
symptom - application runs 50x slower on NFS vs Local
problem
thread processing write completions sometimes completed writes out-of-order
NFS client spoofed by unexpected mtime in post-op attributes
NFS client cache invalidated because wcc processing believed another client had written the file (solaris 9 removed wcc)
resolution
revert to v2 caching semantics
user - faster performance
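The weak cache consistency war story can be illustrated with a toy model. This is a simplified sketch, not any vendor's client code. It assumes each write reply carries the file's mtime before and after the write, and that the client invalidates its cache whenever the pre-op mtime does not match what it has cached. All class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WriteReply:
    pre_mtime: int   # file mtime on the server just before this write
    post_mtime: int  # file mtime on the server just after this write


class ClientCache:
    """Toy model of a client's cached view of one file."""

    def __init__(self, mtime):
        self.mtime = mtime
        self.invalidations = 0

    def process(self, reply):
        # wcc check: if the pre-op mtime is not what we cached, assume
        # another client changed the file and drop the cached data.
        if reply.pre_mtime != self.mtime:
            self.invalidations += 1
        self.mtime = reply.post_mtime


def count_invalidations(replies, initial_mtime=100):
    cache = ClientCache(initial_mtime)
    for reply in replies:
        cache.process(reply)
    return cache.invalidations


# Three back-to-back writes from one client; the server applies them in order.
replies = [WriteReply(100, 101), WriteReply(101, 102), WriteReply(102, 103)]

in_order = count_invalidations(replies)
out_of_order = count_invalidations([replies[1], replies[0], replies[2]])
print(f"invalidations in order: {in_order}, out of order: {out_of_order}")
```

In this model no other client ever touches the file, yet processing the same replies out of order triggers repeated invalidations, which matches the symptom of an application running far slower on NFS than on local disk.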
file locks
commercial applications use different locking techniques - no locking, small internal byte range locking, lock 0 to end of file, lock 0 to infinity
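The locking styles listed above look like this with POSIX byte-range locks. This is a small sketch using Python's fcntl module on a local temporary file (POSIX systems only). The helper names are hypothetical, and it does not show how any particular NFS client or application handles these requests.

```python
import fcntl
import tempfile


def try_lock(f, length, start=0):
    """Try a non-blocking exclusive byte-range lock; length 0 means 'to infinity'."""
    try:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB, length, start)
        return True
    except OSError:
        return False


def unlock(f, length, start=0):
    """Release the byte range previously locked with try_lock."""
    fcntl.lockf(f, fcntl.LOCK_UN, length, start)


with tempfile.TemporaryFile() as f:
    f.write(b"x" * 8192)
    f.flush()
    size = f.tell()

    small_range = try_lock(f, 512, start=4096)  # small internal byte range
    unlock(f, 512, start=4096)
    to_eof = try_lock(f, size)                  # lock 0 to current end of file
    unlock(f, size)
    to_infinity = try_lock(f, 0)                # lock 0 to infinity (length 0)
    unlock(f, 0)

print(small_range, to_eof, to_infinity)
```

The difference between "0 to end of file" and "0 to infinity" matters when the file grows: a length of 0 also covers bytes appended later, while an explicit length does not.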
cache control features
default NFS behavior usually wrong for databases
most NFS clients have no "control"
overzealous prefetch
db on cheesy local disk
NFS needed even if performance ok - backup, access
some NFS clients artificially limit operation size
limit of 8kb per write on some mount options
linux breaks all I/O into page size chunks
if page size < rsize/wsize, I/O requests may be split on the wire
if page size > rsize/wsize, operations will be split and serialized
user view - no idea about wire level transfers, only sees NFS as slow compared to local
ethereal can spot this
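A back-of-envelope sketch of the operation-size effect described above. It models the worst case only, assuming the client never coalesces pages, so each wire operation carries at most the smaller of page size and rsize/wsize. Real clients may do better. The function names and the example sizes are illustrative assumptions.

```python
import math


def worst_case_ops(io_bytes, page_size, transfer_size):
    """Wire operations for one application I/O if pages are never coalesced."""
    chunk = min(page_size, transfer_size)
    return math.ceil(io_bytes / chunk)


def ideal_ops(io_bytes, transfer_size):
    """Wire operations if every request filled rsize/wsize completely."""
    return math.ceil(io_bytes / transfer_size)


io = 64 * 1024  # one 64 KB application write

# page size < wsize: 4 KB pages, 32 KB wsize
print(worst_case_ops(io, 4096, 32 * 1024), "vs ideal", ideal_ops(io, 32 * 1024))

# page size > wsize: 64 KB pages, 8 KB wsize (the 8kb-per-write limit case)
print(worst_case_ops(io, 65536, 8 * 1024), "vs ideal", ideal_ops(io, 8 * 1024))
```

The application issues one write in both cases. Only a wire capture shows how many operations it turned into.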
Little's Law
throughput is proportional to concurrency and inversely proportional to latency (throughput = concurrency / latency)
to increase throughput, increase concurrency
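A small worked example of Little's Law as it applies here. It assumes latency stays constant as concurrency rises, which only holds until some resource saturates. The 1 ms latency and 32 KB operation size are made-up illustrative numbers.

```python
def throughput(concurrency, latency_s):
    """Little's Law rearranged: operations per second = concurrency / latency."""
    return concurrency / latency_s


def mb_per_second(concurrency, latency_s, io_bytes):
    """Convert operations per second into MB/s for a fixed operation size."""
    return throughput(concurrency, latency_s) * io_bytes / 1e6


for outstanding in (1, 4, 16):
    iops = throughput(outstanding, latency_s=0.001)
    mbps = mb_per_second(outstanding, 0.001, io_bytes=32 * 1024)
    print(f"{outstanding:2d} outstanding I/Os: {iops:8.0f} ops/s, {mbps:7.1f} MB/s")
```

With a single outstanding request, the round-trip latency alone caps throughput no matter how fast the wire is. Keeping more requests in flight is what raises it.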
writers block readers
symptom
throughput on single mount point is poor
no identifiable resource bottleneck
user workload extremely slow compared to local fs
debug
emulate user workload, study results
throughput with only reads is very high
adding a single writer kills throughput
discover writers block readers needlessly
fix - vendors simply removed R/W lock when performing direct I/O
applications have issues too (like databases)
some commercial apps are "two brained"
use "raw" interface for local storage
use filesystem interface for NFS storage
different code paths have major differences
async I/O
concurrency
settings - how much I/O? how many threads?
expensive? check settings
level of code optimization
not an nfs problem, but a solution inhibitor
Sun and NetApp coauthored a paper about Oracle database performance with NAS
http://www.netapp.com/tech_library/ftp/3322.pdf
NFS Performance Considerations
NFS Implementation
up to date patch levels
nfs clients - not all equal
nfs servers
Network Configuration
topology - vlan, gigabit
protocol config
NFS Configuration
SIO - command line utility (similar to dt or iozone)
NFS Futures and RDMA
recent project - binding of NFS v2, v3, v4 atop an RDMA transport (rpc layer implementation) such as InfiniBand or iWARP (going through standardization)
behaved like local FS except for metadata traffic
significant performance optimization
an enabler for NAS in the high end
databases, cluster computing, etc
scalable cluster/distributed filesystem
benefits - reduced client overhead, data copy avoidance, user space i/o (os bypass), reduced latency