Zsh Mailing List Archive

Re: find duplicate files



Apologies if my 70-line Python script post added to that confusion.
I thought it better to include working code rather than merely a list
of theoretical optimizations to consider in a pure Zsh implementation.
In my experience lately, Python is more often available than either
Zsh or a C compiler.  So, there is also that, I suppose.

In timings I just did, my Python was only 1.25x slower than 'duff' (in
MAX=-1/hash-everything mode).  Most of the time goes to the cryptographic
hashing, for which the Python just calls fast C implementations, so that
is not so surprising.  Volodymyr's pipeline took about 20x longer and
Bart's pure Zsh took 10x longer.  jdupes was 20x faster than the Python
but missed 87% of my dups.  [ I have not tried to track down why;
conceivably wonky pathnames make it abort early.  That seems the most
likely culprit. ]  The Unix duff seems better designed with its -0
option, FWIW.

One other optimization (not easily exhibited without a low-level language)
is to skip cryptographic hashing entirely: use a very fast hash and do the
equivalent of a "cmp" only on files whose fast hashes match (or run a slow
hash only after a fast-hash match).  That is the jdupes approach.  I think
going parallel adds more value than that, though.
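
To make the fast-hash-then-compare idea concrete, here is a rough Python
sketch (Python 3.8+, illustration only: fast_hash and duplicates are names
invented for this example, and CRC32 merely stands in for whatever fast
hash a real tool would pick).  It assumes the candidate paths already all
share one file size:

    import zlib
    import filecmp
    from collections import defaultdict

    def fast_hash(path, bufsize=1 << 20):
        # Whole-file CRC32: cheap, but collision-prone, so never the
        # final word on whether two files are equal.
        crc = 0
        with open(path, "rb") as f:
            while chunk := f.read(bufsize):
                crc = zlib.crc32(chunk, crc)
        return crc

    def duplicates(paths):
        # Group same-sized candidates by the fast hash, then confirm each
        # apparent match with a full byte-for-byte compare (the "cmp" step).
        by_hash = defaultdict(list)
        for p in paths:
            by_hash[fast_hash(p)].append(p)
        for group in by_hash.values():
            for i, a in enumerate(group):
                for b in group[i + 1:]:
                    if filecmp.cmp(a, b, shallow=False):
                        yield a, b

A real tool would also avoid re-comparing a file already known to be a
duplicate within its group, but the shape of the optimization is the same.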

At least in server/laptop/desktop CPUs from around Intel's Nehalem onward,
a single core has been unable to saturate DIMM bandwidth.  I usually get
2x-8x more DIMM bandwidth going multi-core.  So, even an infinitely fast
hash does not mean parallelism could not speed things up by a big factor
for RAM-resident or very-fast-IO-backed sets of (equally sized) files.  At
those speeds, Python's cPickle bandwidth would likely be poor enough
compared to hashing/cmp-ing that you'd need a C-like implementation to max
out your performance.
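
A minimal sketch of the multi-core hashing idea (again illustrative only;
sha1_of, hash_in_parallel, and the default of 8 jobs are all invented for
this example, not lifted from my posted script):

    import hashlib
    from collections import defaultdict
    from multiprocessing import Pool

    def sha1_of(path, bufsize=1 << 20):
        # Hash one file; each pool worker runs these in a separate process,
        # so several cores (and memory channels) stay busy at once.
        h = hashlib.sha1()
        with open(path, "rb") as f:
            while chunk := f.read(bufsize):
                h.update(chunk)
        return path, h.hexdigest()

    def hash_in_parallel(paths, jobs=8):
        # digest -> [paths] for one size class, hashed across `jobs`
        # worker processes.
        by_digest = defaultdict(list)
        with Pool(jobs) as pool:
            for path, digest in pool.imap_unordered(sha1_of, paths):
                by_digest[digest].append(path)
        return by_digest

Any digest in the returned dict that maps to more than one path marks a
duplicate set (modulo trusting the hash); keeping each worker on its own
core is what buys the extra DIMM bandwidth.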

This may *seem* to be drifting further off topic, but it actually does
speak to Ray's original question.  Hardware has evolved enough over the
last 50 years that a tuned implementation from yester-decade had concerns
different from a tuned implementation today.  RAM is giant compared to
(many) file sets, IO is fast compared to RAM, and CPU cores are abundant.
Of the 5 solutions discussed (pure Zsh, Volodymyr's, Python, duff, jdupes),
only my Python one used parallelism to any good effect.  Then there is
variation in which optimization applies to which file sets; so, a good
tool would probably provide all of them.

So, even though there may be dozens to hundreds of such tools, there may
well *still* be room for some new tuned implementation in a fast language
that takes all the optimizations I've mentioned into account instead of
just a subset, since any single optimization might, depending upon
deployment context, make the interesting procedure "several to many" times
faster.  [ Maybe some Rust person already did it.  They seem to have a lot
of energy. ;-) ]  I agree that Emanuel Berg's original question was
probably just asked without knowing of any such tool, though, or out of
curiosity.  Anyway, enough about optimal approaches.  It was probably
always too large a topic.

Cheers

