Finding duplicate FLAC albums in your collection

AVForums

Help Support AVForums:

This site may earn a commission from merchant affiliate links, including eBay, Amazon, and others.

Now Playing

AVForums Super Veteran
Joined
May 25, 2010
Messages
1,208
Reaction score
15
Location
South Africa
For a variety of reasons I've found that over the years as my tagging, filing and backup conventions have changed I've sometimes ended up duplicating albums in my collection, principally because of changes to folder names and locations.  I had a little time to do some deduping and thought I'd share my methods, 1) so it's there for me to reference when I need to, and 2) so others could benefit.

What I'm sharing basically culminates in a table listing of folders containing the same number of FLAC files with identical audio content to one or more of the other listed folders.  This then leaves you to manually delete the FLAC content in the duplicated album should you wish to (for obvious reasons I'm not automating this part, feel free to do so yourself).

To get the job done I've used some simple terminal commands (Linux based, so Windows users will have to find the Windows equivalent) coupled with a little SQLite scripting.

Here goes...

1. Firstly, from the root folder of your music collection generate a text file produced from listing files, their path and FLAC md5sum using bash terminal:

Code:
find -type f -name \*.flac -print0 | xargs -0 -n1 metaflac --with-filename --show-md5sum >  md5sums.txt

2. Add the necessary delimiters to be able to import the file into a SQLite table

Code:
echo __path:__md5sig:__filename:__dirpath > import.csv && sed ':a;N;$!ba;s/\n/::\n/g' md5sums.txt >> import.csv

3. Now create a database in SQLite, add a table "audio" having text fields __path, __md5sig, __dirpath and __filename and import the csv into the audio table

Code:
DROP TABLE IF EXISTS audio;
DROP TABLE IF EXISTS __dirpath_content_concat__md5sig;
DROP TABLE IF EXISTS __dirpaths_with_same_content;
DROP TABLE IF EXISTS __dirpaths_with_FLACs_to_kill;
CREATE TABLE audio (
    __path                     TEXT UNIQUE,
    __md5sig                   TEXT,
    __filename                 TEXT,
    __dirpath                  TEXT
);


4. Run the necessary SQL scripts:
Code:
-- derive filename from the full file path
 UPDATE audio
    SET __filename = [replace](__path, rtrim(__path, [replace](__path, '/', '') ), '');

Code:
-- derive __dirpath from the full file path
 UPDATE audio
    SET __dirpath = substr(__path, 1, length(__path) - length([replace](__path, rtrim(__path, [replace](__path, '/', '') ), '') ) );

Code:
-- create table in which to store concatenated __md5sig for all __dirnames in duplicates table
 DROP TABLE IF EXISTS __dirpath_content_concat__md5sig;
 CREATE TABLE __dirpath_content_concat__md5sig (
     __dirpath      TEXT,
     concat__md5sig TEXT
 );

Code:
-- populate table with __dirpath and concatenated __md5sig of all files associated with __dirpath (note order by __md5sig to ensure concatenated __md5sig is consistently generated irrespective of physical record sequence).
 INSERT INTO __dirpath_content_concat__md5sig (
                                                  __dirpath,
                                                  concat__md5sig
                                              )
                                              SELECT __dirpath,
                                                     group_concat(__md5sig, '|') 
                                                FROM audio
                                               GROUP BY __dirpath
                                               ORDER BY __md5sig;

Code:
 --now write the duplicate records into a separate table listing all __dirname's that have identical FLAC contents
 DROP TABLE IF EXISTS __dirpaths_with_same_content;
 CREATE TABLE __dirpaths_with_same_content (
     __dirpath      TEXT,
     concat__md5sig TEXT
 );

Code:
INSERT INTO __dirpaths_with_same_content (
                                              __dirpath,
                                              concat__md5sig
                                          )
                                          SELECT __dirpath,
                                                 concat__md5sig 
                                            FROM __dirpath_content_concat__md5sig
                                           WHERE concat__md5sig IN (
                                                     SELECT concat__md5sig
                                                       FROM __dirpath_content_concat__md5sig
                                                      GROUP BY concat__md5sig
                                                     HAVING count( * ) > 1
                                                 )
                                           ORDER BY concat__md5sig, __dirpath;

Code:
-- populate table listing directories in which FLAC files should be deleted as they're duplicates
DROP TABLE IF EXISTS __dirpaths_with_FLACs_to_kill;
CREATE TABLE __dirpaths_with_FLACs_to_kill (
    __dirpath      TEXT,
    concat__md5sig TEXT
);

INSERT INTO __dirpaths_with_FLACs_to_kill (
                                              __dirpath,
                                              concat__md5sig
                                          )
                                          SELECT __dirpath,
                                                 concat__md5sig
                                            FROM __dirpaths_with_same_content
                                           WHERE rowid NOT IN (
                                                     SELECT min(rowid) 
                                                       FROM __dirpaths_with_same_content
                                                      GROUP BY concat__md5sig
                                                 );


The table __dirpaths_with_same_content will list all folders in your collection whose FLAC audio content is identical to one or more folders in your collection.  The file is sorted so that folders with identical FLAC audio content are grouped together so as to ease navigating your file system.

The table __dirpaths_with_FLACs_to_kill represents a subset of __dirpaths_with_same_content. Deleting the FLAC content in the corresponding folders will eliminate the duplication. It goes without saying in choosing duplicates to add to the killfile no regard has been had to tagging etc?use contents of __dirpaths_with_FLACs_to_kill at your own risk.

IMPORTANT NOTE: Please do not post questions / concerns regarding FLAC files having different filenames, being housed in different folder names or having different file sizes and thus having incorrect results.  FLAC files produced by a FLAC specification compliant FLAC encoder have an internal md5sum of the audio stream - this md5sum remains the same regardless of the level of compression applied and is independent of any metadata present in the FLAC header.
 

Latest posts

Top