Zsh Mailing List Archive Messages sorted by: Reverse Date, Date, Thread, Author

Re: find duplicate files

X-seq: zsh-users 23914
From: Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx>
To: Charles Blake <charlechaud@xxxxxxxxx>
Subject: Re: find duplicate files
Date: Mon, 8 Apr 2019 10:14:20 -0700
Cc: Zsh Users <zsh-users@xxxxxxx>
In-reply-to: <CAKiz1a90DsdjXmrE1wEviN5R9hP=225tVfQPFTJkC73f9pQ74A@mail.gmail.com>

On Mon, Apr 8, 2019 at 4:18 AM Charles Blake <charlechaud@xxxxxxxxx> wrote:
>
> >I find that a LOT more understandable than the python code.
>
> Understandability is, of course, somewhat subjective (e.g. some might say
> every 15th field is unclear relative to a named label)

Yes, the lack of multi-dimensional data structures is a limitation of
the shell implementation.

I could have done it this way:

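zmodload -F zsh/stat b:zstat   # provides the zstat builtin
names=( **/*(.l+0) )           # plain files, recursively
zstat -tA stats $names         # -t prefixes each element with its field name
sizes=( ${(M)stats:#size *} )  # keep only the "size N" elements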

I chose the other way so the name and size would be directly connected
in the stats array rather than relying on the implicit ordering of two
separate arrays (to one of your later points, bad things happen with
the above if a file is removed between generating the list of names
and collecting the file stats).

> >unless you're NOT going to consider linked files as duplicates you
> >might as well just compare sizes.  (It would be faster to get inodes
>
> What may have been underappreciated is that handling hard-link identity also
> lets you skip recomputing hashes over a hard link cluster

Yes, this could be used to reduce the number of names passed to
"cksum" or the equivalent.

> Almost everything you say needs a "probably/maybe"
> qualifier.  I don't think you disagree.  I'm just elaborating a little
> for passers by.

Absolutely.  The flip side of this is that shells and utilities are
generally optimized for the average case, not for the extremes.

