Guru: NFS Deployment for High Performance

Date: Tuesday, June 29, 2004
Topic: Guru: NFS Deployment for High Performance

GURU SESSIONS

NFS Deployment for High Performance
Tom Talpey, Network Appliance, Inc.

transition from dafs to NFS/RDMA focus
performance improvements for NFS over ethernet

WAFL - internal log based system, raid back end

multi-protocol - dafs, http, ftp, ssh, NFS, cifs
    #1 nas storage vendor
    claim: NFS works for mission critical db deployments
    claim: if done right, NFS delivers performance of local file system

tradition: NFS slow due to
    host cpu (most hosts have cycles to spare, actually)
    ethernet slow compared to sans (ethernet actually catching up)
NFS speed -
    file caching behavior
    wire efficiency - wire I/O
    single point mount parallelism - limited capacity, lock contention
    multi-NIC scalability - no built in trunking
    throughput iops and mb/s
    latency (response time)
    per io cpu cost in relation to local fs cost - transferring from network - cache - cache - user
    wire speed and network performance

tunings
network
    fastest wire
        quality NIC - hw checksumming, LSO
        1 GbE latency lower, throughput higher
        tune routing paths - reduce queueing and hops
        tcp NFS outperforms udp by 10% - buffering, flow control, reliability
    enable ethernet jumbo frames
        reduces read/write packet counts
        requires support at both ends
        requires support in switches
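
A rough sketch of why jumbo frames reduce read/write packet counts. It assumes 40 bytes of IPv4 + TCP headers per frame with no options and ignores RPC/NFS headers, so the numbers are illustrative only; `frames_needed` is a hypothetical helper, not part of any NFS tool.

```python
import math

IP_TCP_HEADER_BYTES = 40  # assumption: 20-byte IPv4 + 20-byte TCP headers, no options


def frames_needed(payload_bytes: int, mtu: int) -> int:
    """Return how many TCP segments are needed to carry payload_bytes at this MTU."""
    segment_payload = mtu - IP_TCP_HEADER_BYTES
    return math.ceil(payload_bytes / segment_payload)


read_size = 32 * 1024                     # one 32 KB NFS read, RPC headers ignored
standard = frames_needed(read_size, 1500)  # standard ethernet MTU
jumbo = frames_needed(read_size, 9000)     # common jumbo frame MTU
print(f"MTU 1500: {standard} frames, MTU 9000: {jumbo} frames")
```

Fewer frames per operation means fewer interrupts and less per-packet work at both ends, which is where the saving is expected to come from.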
client - most wins, NFS overhead client limited
    check mount options
        rsize/wsize - use 32 kb, 4kb min
        attribute caching
            timeouts, noac, nocto (nocto adds throughput on an NFS mountpoint when the files created there are not shared)
            forcedirectio for databases, etc (solaris option)
            actimeo=0 != noac: noac also disables write caching (prefer actimeo=0 so client will still cache writes)
            llock can greatly improve non-shared environments, with care
    NFS readahead - server and client both tunable - reduces latency,
        amount configurable (sun and netapp coauthored paper about this - how to turn it off on solaris)
    number of client "biods" - linux: rpc slot table
        number of readaheads
    check socket options
        system default socket buffers - at least 64k, 256k not too much
        NFS specific socket buffers (tunable)
        send/receive highwaters (1/4 size of socket buffer)
        send/receive buffer sizes
        TCP Large Windows (LW)
    check driver specific tunings
        optimize for low latency
        jumbo frames
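
The "check mount options" step can be sketched as a small script. This is a minimal illustration, assuming the option string is in the comma-separated form found in /proc/mounts or /etc/fstab; `parse_mount_options` and `review_nfs_options` are hypothetical helpers, and the thresholds are simply the rules of thumb from the notes above.

```python
def parse_mount_options(option_string: str) -> dict:
    """Split a comma-separated mount option string into a dict."""
    options = {}
    for item in option_string.split(","):
        key, _, value = item.strip().partition("=")
        if key:
            # Flags without a value (e.g. "noac") are stored as True.
            options[key] = value if value else True
    return options


def review_nfs_options(option_string: str) -> list:
    """Return warnings based on the rules of thumb in these notes."""
    opts = parse_mount_options(option_string)
    warnings = []
    for name in ("rsize", "wsize"):
        size = int(opts.get(name, 0))
        if size < 4096:
            warnings.append(f"{name} is below the 4 KB minimum or not set")
        elif size < 32768:
            warnings.append(f"{name}={size} is below the suggested 32 KB")
    if opts.get("proto") == "udp" or "udp" in opts:
        warnings.append("udp transport; the notes favor tcp")
    if "noac" in opts:
        warnings.append("noac also disables write caching; consider actimeo=0")
    return warnings


suspect = review_nfs_options("rw,vers=3,proto=udp,rsize=8192,wsize=32768,noac")
clean = review_nfs_options("rw,vers=3,proto=tcp,rsize=32768,wsize=32768,actimeo=0")
for line in suspect:
    print(line)
```

Which options a given client accepts varies by OS and release, so the option names here should be checked against the client's own mount documentation.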
server
    use an appliance
    use vendor's support
    volume/spindle tuning - density of data and capacity does matter, number of spindles (drives) is very important
        optimize for throughput
        file and volume placement, distribution - split it up, not one great volume
    server-specific options
        "no access time" updates - avoids writing to the disk every time a file is read just to update metadata; a posix violation
        snapshots, backups, etc - substantial traffic - can be tuned - appropriate files at the appropriate times

war stories - caching - weak cache consistency
symptom - application runs 50x slower on NFS vs Local
problem
    thread processing write completions
    sometimes completed writes out-of-order
    NFS client spoofed by unexpected mtime in post-op attributes
    NFS client cache invalidated because wcc processing believed another client had written the file (solaris 9 removed wcc)
resolution
    revert to v2 caching semantics
user - faster performance

file locks
    commercial applications use different locking techniques - no locking, small internal byte range locking, lock 0 to end of file, lock 0 to infinity
cache control features
    default NFS behavior usually wrong for databases
    most NFS clients have no "control"
overzealous prefetch
    db on cheesy local disk
    NFS needed even if performance ok - backup, access
some NFS clients artificially limit operation size
    limit of 8kb per write on some mount options
    linux breaks all I/O into page size chunks
        if page size < rsize/wsize, I/O requests may be split on the wire
        if page size > rsize/wsize, operations will be split and serialized
    user view - no idea about wire level transfers, only sees NFS as slow compared to local
        ethereal can spot this
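
A small sketch of how an operation size limit multiplies the number of wire operations for one application write. The 64 KB application write and the three limits are illustrative assumptions; `wire_ops` is a hypothetical helper.

```python
import math


def wire_ops(io_bytes: int, transfer_limit: int) -> int:
    """Return how many NFS operations one application I/O is split into."""
    return math.ceil(io_bytes / transfer_limit)


app_write = 64 * 1024  # assumption: the application issues a single 64 KB write
cases = (
    ("wsize 32 KB", 32 * 1024),
    ("8 KB write limit", 8 * 1024),
    ("4 KB page chunks", 4 * 1024),
)
for label, limit in cases:
    print(f"{label}: {wire_ops(app_write, limit)} operations")
```

If the split operations are also serialized, each one pays a full round trip, which is the kind of pattern a packet trace would be expected to show.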
Little's Law
    throughput is proportional to concurrency and inversely proportional to latency (concurrency = throughput x latency)
    to increase throughput, increase concurrency
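
A worked example of Little's Law in the form concurrency = throughput x latency, rearranged to throughput = concurrency / latency. The 1 ms latency and 32 KB operation size are assumed values for illustration, and the model assumes latency stays constant as concurrency rises, which a real server will not sustain indefinitely.

```python
def throughput_ops_per_sec(concurrency: int, latency_sec: float) -> float:
    """Little's Law rearranged: throughput = concurrency / latency."""
    return concurrency / latency_sec


latency = 0.001      # assumption: 1 ms per NFS operation
op_size = 32 * 1024  # assumption: 32 KB per operation

for outstanding in (1, 4, 16):
    iops = throughput_ops_per_sec(outstanding, latency)
    mb_per_sec = iops * op_size / 1e6
    print(f"{outstanding:2d} outstanding ops: {iops:.0f} ops/s, {mb_per_sec:.1f} MB/s")
```

With one operation in flight, latency alone caps the mount at roughly a thousand operations per second under these assumptions; more client threads, readaheads, or RPC slots raise the ceiling.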
writers block readers
    symptom
        throughput on single mount point is poor
        no identifiable resource bottleneck
        user workload extremely slow compared to local fs
    debug
        emulate user workload, study results
        throughput with only reads is very high
        adding a single writer kills throughput
        discover writers block readers needlessly
    fix - vendors simply removed R/W lock when performing direct I/O
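
The "emulate user workload" debug step could look something like the sketch below: measure read throughput with readers only, then again with one writer added. This is not the tool used in the session; the block size, file size, thread counts, and the `run_workload` helper are all assumptions, and it uses a temporary local file so it runs anywhere. To say anything about NFS, the file path would have to sit on the mount under test, and Python threading overhead and client caching will blur the numbers.

```python
import os
import tempfile
import threading
import time

BLOCK = 8192       # assumption: 8 KB I/O size
FILE_BLOCKS = 256  # assumption: 2 MB test file


def run_workload(path, readers, writers, ops_per_thread=2000):
    """Return aggregate read ops/sec for the given reader/writer mix."""
    fd = os.open(path, os.O_RDWR)
    stop = threading.Event()

    def reader():
        for i in range(ops_per_thread):
            os.pread(fd, BLOCK, (i % FILE_BLOCKS) * BLOCK)

    def writer():
        i = 0
        while not stop.is_set():
            os.pwrite(fd, b"w" * BLOCK, (i % FILE_BLOCKS) * BLOCK)
            i += 1

    read_threads = [threading.Thread(target=reader) for _ in range(readers)]
    write_threads = [threading.Thread(target=writer) for _ in range(writers)]
    start = time.perf_counter()
    for t in write_threads + read_threads:
        t.start()
    for t in read_threads:
        t.join()
    elapsed = time.perf_counter() - start
    stop.set()  # readers are done; tell the writers to finish
    for t in write_threads:
        t.join()
    os.close(fd)
    return readers * ops_per_thread / elapsed


with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * BLOCK * FILE_BLOCKS)
    test_path = f.name

reads_only = run_workload(test_path, readers=4, writers=0)
with_writer = run_workload(test_path, readers=4, writers=1)
os.unlink(test_path)
print(f"reads only: {reads_only:.0f} ops/s, with one writer: {with_writer:.0f} ops/s")
```

A large drop between the two figures with no saturated resource would be consistent with the symptom described above, though it would not by itself prove a reader/writer lock is the cause.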
applications have issues too (like databases)
    some commercial apps are "two brained"
        use "raw" interface for local storage
        use filesystem interface for NFS storage
        different code paths have major differences
            async I/O
            concurrency settings - how much I/O? how many threads? expensive? check settings
            level of code optimization
        not an nfs problem, but solution inhibitor

SUN and NETAPP coauthored paper about ORACLE Database Performance with NAS
    http://www.netapp.com/tech_library/ftp/3322.pdf

NFS Performance Considerations
NFS Implementation
    up to date patch levels
    nfs clients - not all equal
    nfs servers
Network Configuration
    topology - vlan, gigabit
    protocol config
NFS Configuration

SIO - command line utility (similar to dt or iozone)

NFS Futures and RDMA
    recent project - binding of NFS v2, v3, v4 atop an RDMA transport (rpc layer implementation) such as InfiniBand or iWARP (going through standardization) -- behaved like local FS except for metadata traffic
    significant performance optimization
    an enabler for NAS in the high end
        databases, cluster computing, etc
        scalable cluster/distributed filesystem
    benefits - reduced client overhead, data copy avoidance, user space i/o (os bypass), reduced latency