[livesupport-dev] Using hashes of files for keys - Airtime Development Discussions on Sourcefabric Forum

import

As I mentioned during Summercamp, you can use hashes as keys for sound
files and use these to determine if you have the same file on different
servers. Someone then mentioned that you can have collisions - that is,
two files with the same hash. Yes, this is theoretically possible, but
will it happen within our lifetime? Here are some statistics. You can
choose how statistically improbable you want to make a collision by
choosing your hash function. Note that the number of songs in existence
is much lower than these numbers. Smile

Chance of Collision (2 messages with the same hash):

MD5: 1 in 18,446,744,073,709,551,616 chance of collision

SHA-1: 1 in 1,208,925,819,614,629,174,706,176 chance of collision

SHA-256: 1 in 340,282,366,920,938,463,463,374,607,431,768,211,456 chance
of collision

SHA-384: 1 in
6,277,101,735,386,680,763,835,789,423,207,666,416,102,355,444,464,034,512,896
chance of collision

SHA-512: 1 in
115,792,089,237,316,195,423,570,985,008,687,907,853,269,984,665,640,564,
039,457,584,007,913,129,639,936 chance of collision

Note that a lot of file-sharing systems use this method. There are also
more advanced hash systems built specifically for music, which can
identify a song regardless of its encoding. In other words, if you had
the same song as an MP3 and as an OGG (and at different bit rates), they
would generate the same hash. I am not as familiar with these though.

What do you think?

- Paul

------------------------------------------
Posted to Phorum via PhorumMail

import

Paul,

> As I mentioned during Summercamp, you can use hashes as keys for sound
> files and use these to determine if you have the same file on different
> servers. Someone then mentioned that you can have collisions - that is,
> two files with the same hash. Yes, this is theoretically possible, but

you're right on the issue of not having collisions with hashes. but,
keep in mind that we have other requirmenets for the IDs, for example,
they should be numerically representable in the database. this will
allow fast search on and access to them. none of the hashes are small
enough numbers (up to 64 bits) that would fit into a numeric column in
the database.

right now we're using a globally unique ID pattern, according to the
Leach UUID/GUID draft:
http://www.opengroup.org/dce/info/draft-leach-uuids-guids-01.txt . But
as this method generates a 128 bit ID, we're using some MD5-related
magic to cut it to 63bits.

Akos

------------------------------------------
Posted to Phorum via PhorumMail

Howdy, Stranger!

Categories

Poll

Top Posters