Zsh Mailing List Archive
Re: find duplicate files
- X-seq: zsh-users 23914
- From: Bart Schaefer <schaefer@xxxxxxxxxxxxxxxx>
- To: Charles Blake <charlechaud@xxxxxxxxx>
- Subject: Re: find duplicate files
- Date: Mon, 8 Apr 2019 10:14:20 -0700
- Cc: Zsh Users <zsh-users@xxxxxxx>
- In-reply-to: <CAKiz1a90DsdjXmrE1wEviN5R9hP=225tVfQPFTJkC73f9pQ74A@mail.gmail.com>
- List-help: <mailto:zsh-users-help@zsh.org>
- List-id: Zsh Users List <zsh-users.zsh.org>
- List-post: <mailto:zsh-users@zsh.org>
- List-unsubscribe: <mailto:zsh-users-unsubscribe@zsh.org>
- Mailing-list: contact zsh-users-help@xxxxxxx; run by ezmlm
- References: <86v9zrbsic.fsf@zoho.eu> <20190406130242.GA29292@trot> <86tvfb9ore.fsf@zoho.eu> <caaf6f13-2b09-4a91-6c24-491e8954a30a@gmail.com> <CAKiz1a_yfNUY87FX15H3=M5P46icqeH7m=1MM7JWazqVqz4bYw@mail.gmail.com> <CAH+w=7YKTPfQdkA3c-FUKbZmJnpZuBzBQdaCpSxoeY=SQaJtMw@mail.gmail.com> <CAKiz1a90DsdjXmrE1wEviN5R9hP=225tVfQPFTJkC73f9pQ74A@mail.gmail.com>
On Mon, Apr 8, 2019 at 4:18 AM Charles Blake <charlechaud@xxxxxxxxx> wrote:
>
> >I find that a LOT more understandable than the python code.
>
> Understandability is, of course, somewhat subjective (e.g. some might say
> every 15th field is unclear relative to a named label)
Yes, lack of multi-dimensional data structures is a limitation on the
shell implementation.
I could have done it this way:
  names=( **/*(.l+0) )
  zstat -tA stats $names
  sizes=( ${(M)stats:#size *} )
I chose the other way so that the name and size would be directly
connected in the stats array rather than relying on implicit ordering.
(To one of your later points, bad things happen with the above if a
file is removed between generating the list of names and collecting
the file stats.)
> >unless you're NOT going to consider linked files as duplicates you
> >might as well just compare sizes. (It would be faster to get inodes
>
> What may have been underappreciated is that handling hard-link identity
> also lets you skip recomputing hashes over a hard link cluster
Yes, this could be used to reduce the number of names passed to
"cksum" or the equivalent.
> Almost everything you say needs a "probably/maybe"
> qualifier. I don't think you disagree. I'm just elaborating a little
> for passers by.
Absolutely. The flip side of this is that shells and utilities are
generally optimized for the average case, not for the extremes.