Had an idea - looking for a math buff to tell me if it's possible with today's technology.

Heath Jones hj1980 at gmail.com
Thu May 19 00:03:46 UTC 2011


My point here is that it IS possible to transfer just a hash and a counter
value and effectively regenerate identical data at the remote end.
The limit you will hit is the sheer number of hash values that must be
generated and compared with current processing power.

I'm proposing iterating through generated data until you reach the actual
data. It's not even a storage issue: once you have incremented past a
candidate you don't need to store old data or hash values - just the
counter. No massive hash tables.

It's a CPU issue.
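
A minimal sketch of what I mean, assuming Python and hashlib's MD5; the
function names and the tiny 2-byte payload are illustrative choices so the
loops actually terminate, not part of the proposal:

    import hashlib

    def md5(data: bytes) -> bytes:
        return hashlib.md5(data).digest()

    def sender(original: bytes) -> tuple[bytes, int]:
        # Count how many candidates collide with the original's hash
        # before the enumeration reaches the original itself.
        target = md5(original)
        counter = 0
        n = 0
        while True:
            candidate = n.to_bytes(len(original), "big")
            if md5(candidate) == target:
                if candidate == original:
                    return target, counter
                counter += 1      # a colliding, but wrong, payload
            n += 1

    def receiver(target: bytes, counter: int, length: int) -> bytes:
        # Replay the same enumeration; return the (counter+1)-th
        # payload whose hash matches the target.
        seen = 0
        n = 0
        while True:
            candidate = n.to_bytes(length, "big")
            if md5(candidate) == target:
                if seen == counter:
                    return candidate
                seen += 1
            n += 1

    original = b"hi"
    h, c = sender(original)
    assert receiver(h, c, len(original)) == original

At 2 bytes this finishes instantly; the discussion below is about what
happens to that loop at real data sizes.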

On 19 May 2011 00:42, <Valdis.Kletnieks at vt.edu> wrote:

> On Thu, 19 May 2011 00:26:26 BST, Heath Jones said:
> > I wonder if this is possible:
> >
> > - Take a hash of the original file. Keep a counter.
> > - Generate data in some sequential method on sender side (for example,
> >   simply starting at 0 and iterating until you generate the same as the
> >   original data)
> > - Each time you iterate, take the hash of the generated data. If it
> >   matches the hash of the original file, increment counter.
> > - Send the hash and the counter value to recipient.
> > - Recipient performs same sequential generation method, stopping when
> >   counter reached.
>
> MD5 is a 128 bit hash.
>
> 2^128 is 340,282,366,920,938,463,463,374,607,431,768,211,456 - you're
> welcome to iterate that many times to find a duplicate. You may get lucky
> and get a hit in the first trillion or so attempts - but you may get
> unlucky and not get a hit until the *last* few trillion attempts. On
> average you'll have to iterate about half that huge number before you get
> a hit.
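
To put that number in perspective, here is a quick back-of-the-envelope
check; the hash rate is an assumed round figure (10 billion MD5 operations
per second), not a benchmark of any real hardware:

    expected_tries = 2**127                 # about half of 2^128
    hash_rate = 10**10                      # assumed hashes per second
    seconds_per_year = 60 * 60 * 24 * 365
    years = expected_tries / (hash_rate * seconds_per_year)
    print(f"{years:.1e} years")             # roughly 5.4e+20 years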
>
> And it's lossy - if you hash all possible 4K blocks with MD5, you'll find
> that each of those 2^128 hash values is shared by about 2^32640 blocks
> (2^32768 possible 4K blocks divided among 2^128 hashes) - and there is no
> indication in the hash of *which* of those colliding 4K blocks you have
> on this iteration. (The only reason that companies can do block-level
> de-duplication by saving a hash as an index to one copy shared by all
> blocks with the same hash value is because you have a *very small*
> fraction of the possibilities covered, so if you saved a 4K block of data
> from somebody's system32 folder under a given MD5 hash, it's *far* more
> likely that another block with that same hash is from another copy of
> another identical system32 folder, than it is an actual accidental
> collision.)
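
The pigeonhole effect is easy to see at toy scale by truncating MD5 to its
first byte, so 1024 distinct inputs must share only 256 possible tags (the
truncation is purely illustrative):

    import hashlib
    from collections import defaultdict

    buckets = defaultdict(list)
    for n in range(1024):
        data = n.to_bytes(2, "big")
        tag = hashlib.md5(data).digest()[0]  # keep only the first byte
        buckets[tag].append(n)

    # Each tag is hit about 4 times on average, and the tag alone cannot
    # say which of its colliding inputs you started from.
    sizes = [len(v) for v in buckets.values()]
    print(min(sizes), max(sizes), sum(sizes) / len(sizes))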
>
> Protip: A good hash function is by definition one-way - given the data,
> it's easy to generate the hash - but reversing it to find the "pre-image"
> (the data that *generated* the hash) is massively difficult.
>
>


