[livesupport-dev] Using hashes of files for keys
  • As I mentioned during Summercamp, you can use hashes as keys for sound
    files and use these to determine if you have the same file on different
    servers. Someone then mentioned that you can have collisions - that is,
    two files with the same hash. Yes, this is theoretically possible, but
    will it happen within our lifetime? Here are some statistics. You can
    choose how statistically improbable you want to make a collision by
    choosing your hash function. Note that the number of songs in existence
    is much lower than these numbers. Smile

    Chance of Collision (2 messages with the same hash):

    MD5: 1 in 18,446,744,073,709,551,616 chance of collision

    SHA-1: 1 in 1,208,925,819,614,629,174,706,176 chance of collision

    SHA-256: 1 in 340,282,366,920,938,463,463,374,607,431,768,211,456 chance
    of collision

    SHA-384: 1 in
    6,277,101,735,386,680,763,835,789,423,207,666,416,102,355,444,464,034,512,896
    chance of collision

    SHA-512: 1 in
    115,792,089,237,316,195,423,570,985,008,687,907,853,269,984,665,640,564,
    039,457,584,007,913,129,639,936 chance of collision

    Note that a lot of file-sharing systems use this method. There are also
    more advanced hash systems built specifically for music, which can
    identify a song regardless of its encoding. In other words, if you had
    the same song as an MP3 and as an OGG (and at different bit rates), they
    would generate the same hash. I am not as familiar with these though.

    What do you think?

    - Paul

    ------------------------------------------
    Posted to Phorum via PhorumMail
  • 1 Comment sorted by
  • Paul,

    > As I mentioned during Summercamp, you can use hashes as keys for sound
    > files and use these to determine if you have the same file on different
    > servers. Someone then mentioned that you can have collisions - that is,
    > two files with the same hash. Yes, this is theoretically possible, but

    you're right on the issue of not having collisions with hashes. but,
    keep in mind that we have other requirmenets for the IDs, for example,
    they should be numerically representable in the database. this will
    allow fast search on and access to them. none of the hashes are small
    enough numbers (up to 64 bits) that would fit into a numeric column in
    the database.

    right now we're using a globally unique ID pattern, according to the
    Leach UUID/GUID draft:
    http://www.opengroup.org/dce/info/draft-leach-uuids-guids-01.txt . But
    as this method generates a 128 bit ID, we're using some MD5-related
    magic to cut it to 63bits.


    Akos

    ------------------------------------------
    Posted to Phorum via PhorumMail