Network Storage
Kyle Creyts
kyle.creyts at gmail.com
Sun Apr 15 06:43:43 UTC 2012
Storage capable of keeping up with 10G/20G packet capture doesn't have to
be extremely expensive...
We build this with a commodity host, multiple 10G NICs, and multiple SAS
HBAs, each attached to a JBOD enclosure of at least 36 commodity 4 TB 7.2k
SATA3 disks. In our configuration, this delivers 58 TB of usable space per
JBOD enclosure. Properly tuned, and with a little commodity SSD cache, it
delivers synchronous sequential reads and writes at over 2.5 GB/sec (and
incredible random speeds, which I can't recall off the top of my head), all
for under $25k.
It could yield less or much more, depending on your redundancy/striping
choices. Run out of room? Fill another JBOD shelf for ~$18k.
You could opt for lower parity than we did, or fewer stripes. Either one
would stretch the space out by quite a bit (at least 20 TB). I didn't want
to be constantly swapping drives out, however.
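For the curious, the capacity tradeoff works out roughly like this. The
group sizes and parity counts below are illustrative, not our exact layout,
and this counts raw TB before filesystem overhead:

```python
# Hypothetical capacity math for a 36 x 4 TB JBOD shelf under
# different parity/striping layouts (illustrative numbers only).

def usable_tb(disks, disk_tb, group_size, parity_per_group):
    """Usable capacity when disks are split into parity groups."""
    groups = disks // group_size
    data_disks = groups * (group_size - parity_per_group)
    return data_disks * disk_tb

layouts = {
    "6 groups of 6, double parity": (6, 2),
    "4 groups of 9, double parity": (9, 2),
    "4 groups of 9, single parity": (9, 1),
}
for name, (group, parity) in layouts.items():
    print(f"{name}: {usable_tb(36, 4, group, parity)} TB")
# 6 groups of 6, double parity: 96 TB
# 4 groups of 9, double parity: 112 TB
# 4 groups of 9, single parity: 128 TB
```

Dropping from double to single parity, or widening the groups, buys tens of
TB per shelf, which is the tradeoff mentioned above.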
On Apr 13, 2012 1:46 AM, "Jimmy Hess" <mysidia at gmail.com> wrote:
> On Thu, Apr 12, 2012 at 4:18 PM, Ian McDonald <iam at st-andrews.ac.uk>
> wrote:
> > You'll need to build an array that'll random read/write upwards of
> 200MB/s if you
> > want to get a semi-reliable capture to disk. That means SSD if you're
> very rich, or many spindles
>
> Hey, saving packet captures to a file is roughly 98% asynchronous write,
> 2% read, and ~95% sequential activity. And maybe think about applying
> some variant of header compression to the packets during capture, to
> trade a little CPU and increased RAM requirements for storage
> efficiency.
>
> The pcap format, which saves raw packet bits directly to disk, is not
> necessarily among the most I/O- or space-efficient on-disk storage
> formats you could pick.
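(Aside: the header-only idea is easy to prototype. Here's a rough Python
sketch that caps each pcap record at a fixed snap length; the 64-byte cap,
the pure-struct parsing, and the toy input are my assumptions, not anyone's
production tool.)

```python
import struct, io

# Sketch: shrink an existing pcap by truncating each record to the first
# SNAP bytes (enough for L2-L4 headers), a crude form of header-only
# storage. Field layout follows the classic libpcap file format.

SNAP = 64  # keep 64 bytes per packet: typically covers Eth+IP+TCP

def truncate_pcap(src, dst, snap=SNAP):
    hdr = src.read(24)                        # pcap global header
    magic = struct.unpack("<I", hdr[:4])[0]
    assert magic == 0xA1B2C3D4, "little-endian pcap expected"
    # rewrite the snaplen field (bytes 16..20) to the new cap
    dst.write(hdr[:16] + struct.pack("<I", snap) + hdr[20:])
    while True:
        rec = src.read(16)                    # per-record header
        if len(rec) < 16:
            break
        ts_sec, ts_usec, incl, orig = struct.unpack("<IIII", rec)
        data = src.read(incl)
        keep = data[:snap]
        dst.write(struct.pack("<IIII", ts_sec, ts_usec, len(keep), orig))
        dst.write(keep)

# toy demo: one 1500-byte "packet" shrinks to 64 bytes on disk
src = io.BytesIO(
    struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
    + struct.pack("<IIII", 0, 0, 1500, 1500) + b"\x00" * 1500
)
dst = io.BytesIO()
truncate_pcap(src, dst)
print(len(dst.getvalue()))  # 24 + 16 + 64 = 104
```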
>
>
> Random writes should only occur if you are saving your captures to a
> fragmented file system, which is not recommended; avoiding
> fragmentation is important. Random reads aren't involved for
> archiving data, only for analyzing it.
>
> Do you make random reads into your saved capture files? More likely
> you're doing a sequential scan, even during analysis; random reads
> imply you have already indexed a dataset and are seeking a smaller
> number of specific records, to collect information about them.
>
> Read requirements are totally dependent on your analysis workload,
> e.g. Table scan vs Index search. Depending on what the analysis is,
> it may make sense to even make extra filtered copies of the data,
> using more disk space, in order to avoid a random access pattern.
>
> If you are building a database of analysis results from raw data, you
> can use a separate random-IO-optimized disk subsystem for the
> stats database.
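A minimal sketch of that split, with SQLite standing in for the stats
store (the in-memory database below is a stand-in for a file on a separate
SSD; the schema and rows are made up for illustration):

```python
import sqlite3

# The raw capture is only ever scanned sequentially; analysis results
# land in a small indexed database on a separate random-IO device.
conn = sqlite3.connect(":memory:")  # e.g. a path on the SSD in practice
conn.execute(
    "CREATE TABLE flows (ts INTEGER, src TEXT, dst TEXT, bytes INTEGER)")

# pretend these rows came from one sequential scan of a capture file
scan_results = [
    (1334200000, "10.0.0.1", "10.0.0.2", 1500),
    (1334200001, "10.0.0.1", "10.0.0.3", 400),
]
conn.executemany("INSERT INTO flows VALUES (?, ?, ?, ?)", scan_results)
# later random-access lookups hit this index, not the raw pcap
conn.execute("CREATE INDEX flows_src ON flows (src)")

total, = conn.execute(
    "SELECT SUM(bytes) FROM flows WHERE src = '10.0.0.1'").fetchone()
print(total)  # 1900
```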
>
>
> If you really need approximately 200 MB/s with some random read
> performance for analysis, you should probably be looking at building
> a RAID50 with several 4-drive sets and 1 GB+ of write-back cache.
>
> RAID10 makes more sense in situations where write requirements are not
> sequential, when external storage is actually shared with multiple
> applications, or when there is a requirement for a disk drive failure
> to be truly transparent, but there is a huge capacity sacrifice in
> choosing mirroring over parity.
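Back-of-envelope numbers for a hypothetical 12-drive shelf of 4 TB disks
make that capacity sacrifice concrete (drive counts and set size are
examples, not a recommendation):

```python
# Usable capacity: mirroring (RAID10) vs striped parity (RAID50).

def raid10_usable(drives, size_tb):
    return (drives // 2) * size_tb          # every drive is mirrored

def raid50_usable(drives, size_tb, set_size=4):
    sets = drives // set_size
    return sets * (set_size - 1) * size_tb  # one parity drive per set

print(raid10_usable(12, 4))  # 24 TB
print(raid50_usable(12, 4))  # 36 TB: 50% more space than mirroring
```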
>
>
> There is a Time vs Cost tradeoff with regards to the analysis of the data.
>
> When your 'analysis tools' start reading data, the reads increase
> the disk access time and therefore reduce write performance, so
> the reads should be throttled; the more read headroom you want the
> disk subsystem to have, the higher the cost.
>
>
> Performing your analysis ahead of time via pre-computation, or at
> least indexing newly captured data in small chunks on a continuous
> basis, may be useful to minimize the amount of searching of the raw
> dataset later. A small SSD, or a separate mirrored drive pair for
> that function, would avoid adding load to the "raw capture storage"
> disk system, if your analysis requirements are amenable to that
> pattern.
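That continuous-indexing idea can be sketched in a few lines. The fixed
8-byte record header below is a simplified stand-in for real pcap parsing,
and the 60-second bucket size is an arbitrary choice:

```python
import struct, io

BUCKET = 60  # seconds of capture per index bucket

def build_index(f):
    """One sequential pass: record the byte offset of the first
    record in each time bucket, so later lookups can seek directly
    instead of rescanning the raw data."""
    index = {}
    offset = 0
    while True:
        hdr = f.read(8)
        if len(hdr) < 8:
            break
        ts, length = struct.unpack("<II", hdr)   # timestamp, payload len
        index.setdefault(ts // BUCKET, offset)
        f.seek(length, 1)                        # skip payload, stay sequential
        offset += 8 + length
    return index

# toy file: three records captured at t=0s, 30s, and 90s
buf = io.BytesIO()
for ts, n in [(0, 100), (30, 50), (90, 10)]:
    buf.write(struct.pack("<II", ts, n) + b"\x00" * n)
buf.seek(0)
idx = build_index(buf)
print(idx)  # {0: 0, 1: 166}
```

The index itself is tiny and lives on the SSD or mirrored pair, away from
the capture spindles.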
>
> Modern OSes cache some recent filesystem data in RAM. So if the
> server capturing data has sufficient RAM, analyzing data while
> it's still hot in the page cache, and saving that analysis in an
> efficient index for later use, can be useful.
>
> > (preferably 15k's) in a stripe/RAID10 if you're building from your
> > scrap pile. Bear in mind that write cache won't help you, as the IO
> > isn't going to be bursty, rather a continuous stream.
>
> Not really... A good read cache is more important for the analysis,
> but an efficient write cache on your array, plus the OS page cache, is
> still highly beneficial: it can ensure that your RAID subsystem is
> performing full-stripe writes, for maximal efficiency of sequential
> write activity, and it can delay the media write until the optimal
> moment based on platter position and sequence the read/write
> requests.
>
> As long as the storage system behind the cache can, on average, drain
> the cache faster than you can fill it with data a sufficient amount of
> the time, the write cache serves an important function.
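A toy model makes the point: bursty arrivals, constant drain to disk. The
rates below are invented round numbers, not measurements:

```python
# Write-back cache absorbing capture-rate spikes. As long as the mean
# arrival rate stays below the drain rate, occupancy returns to zero.

arrivals = [150, 300, 150, 50, 400, 100, 50, 0]  # MB written per tick
DRAIN = 200                                      # MB/tick the array sustains

buffered = 0   # MB currently sitting in the cache
peak = 0       # worst-case occupancy, sizes the cache you need
for mb in arrivals:
    buffered = max(0, buffered + mb - DRAIN)
    peak = max(peak, buffered)
print(peak, buffered)  # 200 0: spikes absorbed, cache fully drained
```

The cache only has to ride out the spikes; here a 200 MB cache suffices
even though individual ticks arrive at twice the drain rate.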
>
>
> Your I/O may be a continuous stream, but there are most certainly
> variations and spikes in the rate of packets and the performance of
> mechanical disk drives.
>
>
> > Aligning your partitions with the physical disk geometry can produce
> > surprising speedups, as can stripe block size changes, but that's
> > generally empirical, and depends on your workload.
>
>
> For RAID systems, partitions should absolutely be aligned if the OS
> install defaults don't align them correctly; on a modern OS, the
> defaults are normally OK. Having an unaligned or improperly aligned
> partition is just a misconfiguration; a stripe crossing for every
> other sector read is an easy way of doubling the cost of small I/Os.
>
> You won't notice with this particular use case: when you are writing
> large blocks asynchronously, say a 100 MB chunk, you won't notice a
> 63 kB difference; it's less than 0.1% of your transfer size. This is
> primarily a concern during analysis or database searching, which may
> involve small random reads and small synchronous random writes.
>
> In other words, you will probably get away with just ignoring
> partition alignment and filesystem block size, so there are other
> aspects of the configuration to be more concerned about (YMMV).
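If you do want to check alignment, it's a one-liner once you know the
chunk size and stripe width; the numbers below are hypothetical, read
yours from the controller or mdadm:

```python
# Is a partition's starting sector aligned to the RAID full stripe?

SECTOR = 512           # bytes per logical sector
chunk_kb = 64          # per-disk chunk size (hypothetical)
data_disks = 4         # data-bearing disks per stripe (hypothetical)
full_stripe = chunk_kb * 1024 * data_disks  # 256 KiB full stripe

def is_aligned(start_sector):
    return (start_sector * SECTOR) % full_stripe == 0

print(is_aligned(63))    # False: legacy DOS offset, misaligned
print(is_aligned(2048))  # True: modern 1 MiB default, aligned
```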
>
> --
> -JH
>
>
More information about the NANOG mailing list