Network Storage

Jimmy Hess mysidia at gmail.com
Fri Apr 13 05:45:57 UTC 2012


On Thu, Apr 12, 2012 at 4:18 PM, Ian McDonald <iam at st-andrews.ac.uk> wrote:
> You'll need to build an array that'll random read/write upwards of 200MB/s if you
> want to get a semi-reliable capture to disk. That means SSD if you're very rich, or many spindles

Hey.  Saving packet captures to a file is roughly 98% asynchronous write, 2%
read, and about 95% sequential activity.  You might also think about applying
some variant of header compression to the packets during capture, trading a
little CPU and some extra RAM for storage efficiency.
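
To make the idea concrete, here is a minimal Python sketch (not from any real
tool): it assumes some capture_source yielding (timestamp, raw-frame) pairs,
truncates each packet to a 96-byte snap length, and gzips the standard pcap
record stream on its way to disk.  The generator name and the snap length are
placeholders.

    import gzip
    import struct

    # Standard pcap global header: magic, v2.4, tz 0, sigfigs 0,
    # snaplen 65535, linktype 1 (Ethernet).
    PCAP_GLOBAL_HDR = struct.pack("<IHHiIII", 0xa1b2c3d4, 2, 4, 0, 0, 65535, 1)
    SNAPLEN = 96   # keep L2/L3/L4 headers, drop most payload (assumption)

    def save_compressed(capture_source, path="capture.pcap.gz"):
        # capture_source: hypothetical iterator of (timestamp, bytes) pairs.
        with gzip.open(path, "wb", compresslevel=1) as out:   # cheap, fast level
            out.write(PCAP_GLOBAL_HDR)
            for ts, pkt in capture_source:
                snap = pkt[:SNAPLEN]
                sec, usec = int(ts), int((ts - int(ts)) * 1e6)
                out.write(struct.pack("<IIII", sec, usec, len(snap), len(pkt)))
                out.write(snap)

gzip level 1 is deliberately cheap; the point is the shape of the tradeoff,
not the particular codec.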

The PCAP format, writing raw packet header bits directly to disk, is not
necessarily among the most I/O- or space-efficient on-disk storage formats
you could pick.


Random writes should only occur if you are saving your captures to a
fragmented filesystem, which is not recommended; avoiding fragmentation is
important.  Random reads aren't involved in archiving the data, only in
analyzing it.

Do you make random reads into your saved capture files?  More likely you're
doing a sequential scan, even during analysis; random reads imply you have
already indexed the dataset and are seeking a small number of specific
records to collect information about them.

Read requirements depend entirely on your analysis workload, e.g. a table
scan vs. an index search.  Depending on what the analysis is, it may even
make sense to keep extra filtered copies of the data, using more disk space,
in order to avoid a random access pattern.

If you are building a database of analysis results from the raw data, you
can use a separate, random-I/O-optimized disk subsystem for the stats
database.
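
A rough sketch of that split, with hypothetical paths: scan the raw capture
sequentially on the capture array, and send the small random writes to a
stats database living on a separate device.  The per-source packet count and
the assumption of untagged Ethernet + IPv4 frames are just for illustration.

    import sqlite3
    import struct
    from collections import Counter

    # Hypothetical layout: raw capture on the big sequential array,
    # stats database on a separate small random-I/O-friendly device.
    CAPTURE = "/capture/array/trace-0001.pcap"
    STATS_DB = "/ssd/stats/per-source.db"

    def summarize(capture=CAPTURE, db_path=STATS_DB):
        counts = Counter()
        with open(capture, "rb") as f:
            f.read(24)                        # skip the pcap global header
            while True:
                hdr = f.read(16)              # per-record header
                if len(hdr) < 16:
                    break
                sec, usec, incl, orig = struct.unpack("<IIII", hdr)
                pkt = f.read(incl)
                if len(pkt) >= 30:            # untagged Ethernet + IPv4 assumed
                    src = ".".join(str(b) for b in pkt[26:30])
                    counts[src] += 1
        db = sqlite3.connect(db_path)
        db.execute("CREATE TABLE IF NOT EXISTS pkts (src TEXT PRIMARY KEY, n INTEGER)")
        db.executemany("INSERT OR REPLACE INTO pkts VALUES (?, ?)", counts.items())
        db.commit()

The point is the access pattern: one long sequential read of the raw file,
with only small writes landing on the database device.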


If you really need approximately 200 MB/s with some random read performance
for analysis, you should probably be looking at building a RAID50 from
several 4-drive sets, with 1 GB+ of writeback cache.
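
A quick back-of-envelope suggests why that layout has headroom over a
200 MB/s target.  The ~100 MB/s sustained sequential figure per spindle is an
assumption; measure your own drives.

    # RAID50 example layout: three 4-drive RAID5 legs striped together.
    per_spindle_mb_s = 100       # assumed sustained sequential rate per drive
    legs, drives_per_leg = 3, 4
    data_spindles = legs * (drives_per_leg - 1)   # one drive of parity per leg
    print(data_spindles * per_spindle_mb_s)       # ~900 MB/s best-case sequential ceiling

Real sustained throughput will be lower, but full-stripe sequential writes
should still clear 200 MB/s comfortably; the writeback cache is what keeps
the writes full-stripe.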

RAID10 makes more sense where the writes are not sequential, where external
storage is actually shared among multiple applications, or where a disk drive
failure must be truly transparent, but there is a huge capacity sacrifice in
choosing mirroring over parity.


There is a time-vs-cost tradeoff with regard to the analysis of the data.

When your analysis tools start reading data, those reads add seek activity
and increase disk access times, which reduces write performance; so either
the reads get throttled, or you buy a disk subsystem with enough extra
capability to absorb them, and the more capability, the higher the cost.


Performing your analysis ahead of time, or at least indexing newly captured
data in small chunks on a continuous basis, may be useful to minimize how
much of the raw dataset has to be searched later.  A small SSD or a separate
mirrored drive pair for that function would avoid adding load to the "raw
capture storage" disk system, if your analysis requirements are amenable to
that pattern.

Modern OSes cache recent filesystem data in RAM.  So if the server capturing
the data has sufficient RAM, analyzing data while it's still hot in the page
cache, and saving that analysis in an efficient index for later use, can be
useful.
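
A sketch of that pattern (the index format and stride are placeholders):
right after a capture chunk is closed, while it is likely still in the page
cache, walk it once and write a compact timestamp-to-offset index onto the
analysis disk, so later lookups don't have to rescan the raw file.

    import struct

    def index_chunk(chunk_path, index_path):
        # Walk a freshly closed pcap chunk (probably still in page cache) and
        # record (timestamp, file offset) for every Nth packet record.
        STRIDE = 1000                         # index granularity (assumption)
        entries = []
        with open(chunk_path, "rb") as f:
            f.read(24)                        # skip the pcap global header
            n = 0
            while True:
                offset = f.tell()
                hdr = f.read(16)
                if len(hdr) < 16:
                    break
                sec, usec, incl, _orig = struct.unpack("<IIII", hdr)
                if n % STRIDE == 0:
                    entries.append((sec + usec / 1e6, offset))
                f.seek(incl, 1)               # skip the packet body
                n += 1
        with open(index_path, "wb") as out:   # small index on the analysis disk
            for ts, off in entries:
                out.write(struct.pack("<dQ", ts, off))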

> (preferably 15k's) in a stripe/ raid10 if you're building from your scrap pile. Bear in mind that write
> cache won't help you, as the io isn't going to be bursty, rather a continuous stream.

Not really...  A good read cache is more important for the analysis, but an
efficient write cache on your array, plus the OS page cache, is still highly
beneficial, especially because it can ensure that your RAID subsystem
performs full-stripe writes for maximally efficient sequential writing, can
delay the media write until the optimal moment based on platter position,
and can reorder the read/write requests.

As long as the storage system behind the cache can, on average, drain the
cache faster than you can fill it with data a sufficient amount of the time,
the write cache serves an important function.


Your I/O may be a continuous stream, but there are most certainly variations
and spikes both in the packet rate and in the performance of mechanical disk
drives.


> Aligning your partitions with the physical disk geometry can produce surprising speedups, as can
> stripe block size changes, but that's generally empirical, and depends on your workload.


For RAID systems, partitions should absolutely be aligned if the OS install
defaults don't align them correctly; on a modern OS the defaults are normally
fine.  An unaligned or improperly aligned partition is simply a
misconfiguration; a stripe or track crossing on every other sector read is an
easy way to double the cost of small I/Os.
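
One way to sanity-check this on Linux (a sketch; "sda" is a placeholder
device, and the 2048-sector / 1 MiB boundary is the usual modern default) is
to read each partition's start sector from sysfs:

    import glob

    # Check whether each partition of the assumed device starts on a 1 MiB boundary.
    for path in sorted(glob.glob("/sys/block/sda/sda*/start")):
        with open(path) as f:
            start = int(f.read())
        aligned = (start % 2048 == 0)   # 2048 x 512-byte sectors = 1 MiB
        print(path, start, "aligned" if aligned else "NOT aligned")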

You won't notice it with this particular use case when you are writing large
blocks: if you're writing a 100 MB chunk asynchronously, you won't notice a
63 kB offset, which is well under 0.1% of the transfer size.  Alignment is
primarily a concern during analysis or database searching, which may involve
small random reads and small synchronous random writes.

In other words, you will probably get away with just ignoring partition
alignment and filesystem block size for the capture itself; there are other
aspects of the configuration to be more concerned about (YMMV).

--
-JH
