Network Storage

Kyle Creyts kyle.creyts at gmail.com
Sun Apr 15 06:43:43 UTC 2012


Storage capable of keeping up with 10G/20G packet capture doesn't have to
be extremely expensive...

We build this with a commodity host, multiple 10G NICs, and multiple
SAS HBAs, each attached to a JBOD enclosure of at least 36 4 TB 7.2k
commodity SATA 3 disks. In our configuration, this delivers 58 TB per
JBOD enclosure. Properly tuned, and with a little commodity SSD cache,
it delivers synchronous sequential reads and writes over 2.5 GB/sec
(and incredible random speeds which I can't recall off the top of my
head), all for under $25k.

It could yield less or much more, depending on your redundancy/striping
choices. Run out of room? Fill another JBOD shelf for ~$18k.

You could opt for lower parity than we did, or fewer stripes. Either
would stretch the space out by quite a bit (at least 20 TB). I didn't
want to be constantly swapping out failed drives, however.
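
For anyone doing the math at home, here's a rough sketch of how the
parity/stripe choices trade capacity. (Back-of-the-envelope only; the
group sizes below are illustrative assumptions, not our exact layout.)

    # Rough usable-capacity math for a 36-bay shelf of 4 TB drives.
    DRIVES, DRIVE_TB = 36, 4.0

    def usable_tb(group_size, parity_per_group, spares=0):
        """Usable TB when the shelf is carved into parity groups."""
        groups = (DRIVES - spares) // group_size
        return groups * (group_size - parity_per_group) * DRIVE_TB

    print(usable_tb(6, 2))   # six 6-drive double-parity groups: 96.0
    print(usable_tb(9, 1))   # wider single-parity groups: 128.0
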
On Apr 13, 2012 1:46 AM, "Jimmy Hess" <mysidia at gmail.com> wrote:

> On Thu, Apr 12, 2012 at 4:18 PM, Ian McDonald <iam at st-andrews.ac.uk>
> wrote:
> > You'll need to build an array that'll random read/write upwards of
> > 200MB/s if you want to get a semi-reliable capture to disk. That
> > means SSD if you're very rich, or many spindles
>
> Hey, saving packet captures to file is ~98% asynchronous write, ~2%
> read, and ~95% sequential activity. And you might think about applying
> some variant of header compression to the packets during capture, to
> trade a little CPU and increased RAM requirements for storage
> efficiency.
>
> The pcap format, which saves raw packet bits directly to disk, is not
> necessarily among the most I/O- or space-efficient on-disk storage
> formats you could pick.
>
>
> Random writes should only occur if you are saving your captures to a
> fragmented file system, which is not recommended; avoiding
> fragmentation is important. Random reads aren't involved for archiving
> data, only for analyzing it.
>
> Do you make random reads into your saved capture files? More likely
> you are doing a sequential scan, even during analysis; random reads
> imply you have already indexed a dataset and are seeking a smaller
> number of specific records, to collect information about them.
>
> Read requirements are totally dependent on your analysis workload,
> e.g. table scan vs. index search. Depending on what the analysis is,
> it may even make sense to make extra filtered copies of the data,
> using more disk space, in order to avoid a random access pattern.
>
> If you are building a database of analysis results from the raw
> data, you can use a separate random-IO-optimized disk subsystem for
> the stats database.
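>
> A minimal sketch of that split, assuming Python's stdlib sqlite3 and
> hypothetical mount points (/capture on the big sequential array, /ssd
> on a small random-IO device):
>
>     import sqlite3
>
>     # Derived data lives on the random-IO device, not the capture
>     # array, so analysis writes never touch the capture spindles.
>     db = sqlite3.connect('/ssd/flowstats.db')   # hypothetical path
>     db.execute("""CREATE TABLE IF NOT EXISTS flows
>                   (src TEXT, dst TEXT, packets INT, bytes INT)""")
>
>     def record_flow(src, dst, packets, nbytes):
>         # Small synchronous random writes land on the SSD.
>         db.execute("INSERT INTO flows VALUES (?, ?, ?, ?)",
>                    (src, dst, packets, nbytes))
>
>     # ... scan /capture/*.pcap sequentially, calling record_flow(),
>     # then db.commit() once per chunk.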
>
>
> If you really need approximately 200 MB/s with some random read
> performance for analysis, you should probably be looking at building
> a RAID50 with several 4-drive sets and 1 GB+ of writeback cache.
>
> RAID10 makes more sense in situations where writes are not
> sequential, when external storage is actually shared with multiple
> applications, or when there is a requirement for a disk drive failure
> to be truly transparent; but there is a huge capacity sacrifice in
> choosing mirroring over parity.
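>
> To put rough numbers on that sacrifice (simple arithmetic; the drive
> count and size here are chosen purely for illustration):
>
>     drives, size_tb = 16, 4.0   # illustrative shelf
>
>     # RAID50: four 4-drive RAID5 sets, one parity drive per set.
>     raid50_tb = (drives // 4) * (4 - 1) * size_tb   # 48.0 TB usable
>
>     # RAID10: everything mirrored; half the raw capacity survives.
>     raid10_tb = (drives // 2) * size_tb             # 32.0 TB usable
>
>     print(raid50_tb, raid10_tb)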
>
>
> There is a time vs. cost tradeoff with regard to the analysis of the
> data.
>
> When your analysis tools start reading data, the reads increase disk
> access times and therefore reduce write performance, so the reads
> should be throttled. The more capable the disk subsystem has to be to
> serve both workloads at once, the higher the cost.
>
>
> Performing your analysis ahead of time via pre-caching, or at least
> indexing newly captured data in small chunks on a continuous basis,
> may be useful to minimize the amount of searching of the raw dataset
> later. A small SSD or separate mirrored drive pair for that function
> would avoid adding load to the "raw capture storage" disk system, if
> your analysis requirements are amenable to that pattern.
>
> Modern OSes cache some recent filesystem data in RAM. So if the
> server capturing data has sufficient RAM, analyzing data while it's
> still hot in the page cache, and saving that analysis in an efficient
> index for later use, can be useful.
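>
> A sketch of that pattern on Linux, in Python (os.posix_fadvise is a
> real stdlib call on Unix; the indexing step is left as a stub):
>
>     import os
>
>     def analyze_hot_then_drop(path):
>         """Index a fresh capture chunk, then evict it from cache."""
>         fd = os.open(path, os.O_RDONLY)
>         try:
>             size = os.fstat(fd).st_size
>             while True:
>                 block = os.read(fd, 1 << 20)  # likely served from RAM
>                 if not block:
>                     break
>                 # ... feed `block` to your indexer here ...
>             # Don't let cold capture data crowd hot data out of cache.
>             os.posix_fadvise(fd, 0, size, os.POSIX_FADV_DONTNEED)
>         finally:
>             os.close(fd)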
>
> > (preferably 15k's) in a stripe/RAID10 if you're building from your
> > scrap pile. Bear in mind that write cache won't help you, as the IO
> > isn't going to be bursty, rather a continuous stream.
>
> Not really... A good read cache is more important for the analysis,
> but efficient write cache on your array, plus the OS page cache, is
> still highly beneficial, especially because it can ensure that your
> RAID subsystem is performing full-stripe writes for maximal
> efficiency of sequential write activity; it can also delay the media
> write until the optimal moment based on platter position, and
> sequence the read/write requests.
>
> As long as the storage system behind the cache can, on average, drain
> the cache faster than you can fill it with data a sufficient amount
> of the time, the write cache serves an important function.
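>
> The sizing question reduces to simple arithmetic; a toy example with
> made-up rates:
>
>     # The cache only has to absorb the *excess* during a burst, as
>     # long as the array drains faster than the long-term average.
>     avg_mb_s   = 800.0    # average capture rate
>     array_mb_s = 900.0    # sustained write rate of the spindles
>     burst_mb_s = 1250.0   # 10 Gb/s line-rate burst...
>     burst_secs = 30.0     # ...lasting 30 seconds
>
>     assert avg_mb_s < array_mb_s, "otherwise no cache can save you"
>     need_mb = (burst_mb_s - array_mb_s) * burst_secs
>     print(need_mb)   # 10500.0 MB of cache to ride out the burst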
>
>
> Your I/O may be a continuous stream, but there are most certainly
> variations and spikes in the rate of packets and the performance of
> mechanical disk drives.
>
>
> > Aligning your partitions with the physical disk geometry can
> > produce surprising speedups, as can stripe block size changes, but
> > that's generally empirical, and depends on your workload.
>
>
> For RAID systems, partitions should absolutely be aligned if the OS
> install defaults don't align them correctly; on a modern OS, the
> defaults are normally OK. Having an unaligned or improperly aligned
> partition is just a misconfiguration; a track crossing on every other
> sector read is an easy way of doubling the cost of small I/Os.
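>
> A quick way to see the cost (a toy model; the stripe-unit and I/O
> sizes are made up): count how many small reads straddle a stripe-unit
> boundary for a given partition offset:
>
>     STRIPE_UNIT = 64 * 1024   # 64 KiB stripe unit
>     IO_SIZE     = 4 * 1024    # 4 KiB random reads
>
>     def crossing_fraction(offset_bytes, n_ios=100000):
>         """Fraction of reads that span two stripe units."""
>         crossings = 0
>         for i in range(n_ios):
>             start = (offset_bytes + i * IO_SIZE) % STRIPE_UNIT
>             if start + IO_SIZE > STRIPE_UNIT:
>                 crossings += 1
>         return crossings / n_ios
>
>     print(crossing_fraction(0))     # aligned partition: 0.0
>     print(crossing_fraction(512))   # misaligned: extra device I/Os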
>
> You won't notice it with this particular use case: when you are
> writing a 100 MB chunk asynchronously, you won't notice a 63 kB
> difference; it's well under 0.1% of your transfer size. Alignment is
> primarily a concern during analysis or database searching, which may
> involve small random reads and small synchronous random writes.
>
> In other words, you will probably get away with just ignoring
> partition alignment and filesystem block size, so there are other
> aspects of the configuration to be more concerned about (YMMV).
>
> --
> -JH
>
>


