Find duplicate files in Java


Find duplicate files in specified directory trees


janosgyerik/java-dupfinder


Find duplicate files in specified directory trees.

Create the JAR including dependencies using Maven:

mvn clean compile assembly:single 

This will create target/dupfinder-jar-with-dependencies.jar, an executable JAR.

To find duplicate files in the current directory and all sub-directories:

java -jar $JAR

To see the available options, use the -h or --help flag:

java -jar $JAR --help

To find duplicate files in multiple directory trees, only considering filenames with extension .avi, descending to at most 2 sub-directory levels:

java -jar $JAR --ext avi --maxdepth 2 path/to/dir path/to/other/dir 

Or using the helper script:

./run.sh --ext avi --maxdepth 2 path/to/dir path/to/other/dir 



BiruLyu / 609. Find Duplicate File in System(#).java


import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Solution {
    public List<List<String>> findDuplicate(String[] paths) {
        // Map each file's content to the list of full paths that share it
        Map<String, List<String>> map = new HashMap<>();
        for (String path : paths) {
            String[] pathArr = path.split(" ");
            String dir = pathArr[0] + '/';
            for (int i = 1; i < pathArr.length; i++) {
                int start = pathArr[i].indexOf('(');
                String fileName = dir + pathArr[i].substring(0, start);
                String content = pathArr[i].substring(start + 1, pathArr[i].length() - 1);
                map.putIfAbsent(content, new ArrayList<>());
                map.get(content).add(fileName);
            }
        }
        // Keep only the content groups that contain more than one file
        List<List<String>> res = new ArrayList<>();
        for (String key : map.keySet()) {
            if (map.get(key).size() > 1) {
                res.add(map.get(key));
            }
        }
        return res;
    }
}
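For reference, here is how findDuplicate behaves on a small input in the LeetCode 609 format, where each entry is a directory followed by its files, with each file's content in parentheses (the paths below are illustrative):

String[] paths = {
    "root/a 1.txt(abcd) 2.txt(efgh)",
    "root/c 3.txt(abcd)",
    "root 4.txt(efgh)"
};
List<List<String>> dups = new Solution().findDuplicate(paths);
// Two groups of duplicates (group order may vary):
// [root/a/1.txt, root/c/3.txt] and [root/a/2.txt, root/4.txt]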


import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class Solution {
    public List<List<String>> findDuplicate(String[] paths) {
        HashMap<String, Integer> map = new HashMap<>();         // content -> occurrence count
        HashMap<String, String> mapPath = new HashMap<>();      // content -> first file path seen
        HashMap<String, Integer> resultIndex = new HashMap<>(); // content -> row index in result
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < paths.length; i++) {
            String[] parts = paths[i].split(" ");
            String dir = parts[0];
            for (int j = 1; j < parts.length; j++) {
                String file = parts[j];
                // Separate content and filename
                int s_index = file.indexOf("(");
                String fname = file.substring(0, s_index);
                String content = file.substring(s_index + 1, file.length() - 1);
                if (map.containsKey(content)) {
                    // Duplicate file found
                    if (map.get(content) == 1) {
                        // Second occurrence: open a new result row holding both files
                        int r_index = result.size();
                        ArrayList<String> newRow = new ArrayList<>();
                        newRow.add(mapPath.get(content)); // add the first file
                        newRow.add(dir + "/" + fname);
                        result.add(newRow);
                        resultIndex.put(content, r_index);
                        map.put(content, map.get(content) + 1);
                    } else {
                        result.get(resultIndex.get(content)).add(dir + "/" + fname);
                    }
                } else {
                    map.put(content, 1);
                    mapPath.put(content, dir + "/" + fname);
                }
            }
        } // end for
        return result;
    }
}


Follow-up questions:
1. Imagine you are given a real file system; how would you search for files, DFS or BFS?
In general, BFS uses more memory than DFS. However, BFS can take advantage of the locality of files inside directories, so it will probably be faster.
2. If the file content is very large (GB level), how would you modify your solution?
In a real-life solution we would not hash the entire file content, since that is not practical. Instead, we would first group all files by size; files with different sizes are guaranteed to be different. We would then hash a small part of the files with equal sizes (using MD5, for example), and only if those hashes match would we compare the files byte by byte.
3. If you can only read the file 1 KB at a time, how would you modify your solution?
This does not change the solution: we can build the hash from the 1 KB chunks, and then read the entire file if a full byte-by-byte comparison is required.
4. What is the time complexity of the modified solution? What are its most time- and memory-consuming parts, and how would you optimize them?
Time complexity is O(n^2 * k), since in the worst case we might need to compare every file to all the others, where k is the file size.
5. How do you make sure the duplicate files you find are not false positives?
We apply several filters in sequence, as sketched below: file size, then hash, then a byte-by-byte comparison.
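A minimal sketch of that filtering pipeline, assuming java.nio file access; the class and method names and the prefix size are illustrative, not from the original answers:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateFilter {
    // Filter 1: group candidate files by size; files of different sizes cannot be duplicates
    static Map<Long, List<Path>> groupBySize(List<Path> files) throws IOException {
        Map<Long, List<Path>> bySize = new HashMap<>();
        for (Path f : files) {
            bySize.computeIfAbsent(Files.size(f), k -> new ArrayList<>()).add(f);
        }
        return bySize;
    }

    // Filter 2: hash only a small prefix of each same-sized file (MD5, per the answer above);
    // a full byte-by-byte comparison is still needed afterwards to rule out false positives
    static String prefixHash(Path file, int prefixBytes) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] buf = new byte[(int) Math.min(prefixBytes, raf.length())];
            raf.readFully(buf);
            return new BigInteger(1, md.digest(buf)).toString(16);
        }
    }
}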


import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public static List<List<String>> findDuplicate(String[] paths) {
    Map<String, List<String>> map = new HashMap<>();
    for (String path : paths) {
        String[] tokens = path.split(" ");
        for (int i = 1; i < tokens.length; i++) {
            // tokens[0] is the directory; each remaining token is "name(content)"
            String file = tokens[i].substring(0, tokens[i].indexOf('('));
            String content = tokens[i].substring(tokens[i].indexOf('(') + 1, tokens[i].indexOf(')'));
            map.putIfAbsent(content, new ArrayList<>());
            map.get(content).add(tokens[0] + "/" + file);
        }
    }
    // Keep only content groups with more than one file
    return map.values().stream().filter(e -> e.size() > 1).collect(Collectors.toList());
}


Recursively find all duplicate files in a directory (Java)

Sure, this could be implemented with a few lines of basic Linux commands. However, writing such code in Java requires understanding several topics: hash tables, recursion, lists, the file system, and more.

That is why I love this problem. Not only does it exercise many of Java's fundamentals, it also demands real efficiency in both time and space complexity.

Duplicate detection using hashing

The first problem to consider is: how do I detect duplicate files? Should I only compare file names? What about file sizes? Maybe both? Even both together are not enough, and it is easy to come up with a counterexample. Take a file fileA.txt containing the word "hello" and a file fileB.txt containing the word "world": the two files have the same size (and could even be given the same name in different directories), yet they are not identical. That is why my approach reads each file's bytes and saves a unique hash id for each file.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

private static MessageDigest messageDigest;

static {
    try {
        messageDigest = MessageDigest.getInstance("SHA-512");
    } catch (NoSuchAlgorithmException e) {
        throw new RuntimeException("cannot initialize SHA-512 hash function", e);
    }
}

In the code above we initialize SHA-512, a well-known secure hash function. We will use it to create a unique id for each file in the file system.

Duplicate file retrieval using a hash table

Our second problem is how to store the file hash ids so they can be retrieved efficiently later. One of the best structures for efficient retrieval is, of course, the hash table, which, implemented properly, offers O(1) lookup time. We will store the unique hash ids as keys and, for every key, a linked list of all the duplicate file paths associated with it. Such hash ids are very large numbers, which is why we will also use Java's BigInteger class.

Finally, we traverse all sub-directories and files recursively, so that for each directory we visit all of its files. The implementation is as follows:

public static void findDuplicatedFiles(Map<String, List<String>> lists, File directory) {
    for (File child : directory.listFiles()) {
        if (child.isDirectory()) {
            findDuplicatedFiles(lists, child);
        } else {
            try {
                // Read the whole file and hash its bytes with the SHA-512 digest
                FileInputStream fileInput = new FileInputStream(child);
                byte[] fileData = new byte[(int) child.length()];
                fileInput.read(fileData);
                fileInput.close();
                String uniqueFileHash = new BigInteger(1, messageDigest.digest(fileData)).toString(16);
                // Files with the same hash end up in the same list
                List<String> list = lists.get(uniqueFileHash);
                if (list == null) {
                    list = new LinkedList<String>();
                    lists.put(uniqueFileHash, list);
                }
                list.add(child.getAbsolutePath());
            } catch (IOException e) {
                throw new RuntimeException("cannot read file " + child.getAbsolutePath(), e);
            }
        }
    }
}

All that's left is to run the method above and print the hash table's values wherever the associated linked list holds duplicates, that is, more than one path:

Map<String, List<String>> lists = new HashMap<String, List<String>>();
FindDuplicates.findDuplicatedFiles(lists, dir); // dir is the root directory to scan
for (List<String> list : lists.values()) {
    if (list.size() > 1) {
        System.out.println("\n");
        for (String file : list) {
            System.out.println(file);
        }
    }
}
System.out.println("\n");

The source code can be found in the download link below:



A Java application with a GUI (Swing) to find all the duplicate files in a given directory and all its sub-directories, using the SHA-512 hash function.


maanavshah/duplicate-file-finder



This is a simple app that scans all the files in a given directory and lists the duplicate files according to their hash values. It finds files with duplicate content very quickly and provides the option to delete the duplicates.

  • List all duplicates in a directory and its sub-directories
  • Two modes (Quick Finder and Memory Saver):
    • Quick Finder: quickly finds the duplicates using a message digest function
    • Memory Saver: finds duplicates while limiting the size of the buffer used by the same message digest function (see the sketch below)
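A minimal sketch of the Memory Saver idea, not taken from the repository: stream each file through the digest with a small fixed buffer instead of loading it whole into memory (the class and method names and the buffer size are illustrative):

import java.io.FileInputStream;
import java.io.InputStream;
import java.math.BigInteger;
import java.security.MessageDigest;

public class BufferedHash {
    // Hash a file with a fixed 8 KB buffer so memory use stays constant
    static String hashWithSmallBuffer(String path) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-512");
        byte[] buffer = new byte[8192];
        try (InputStream in = new FileInputStream(path)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n); // feed the digest chunk by chunk
            }
        }
        return new BigInteger(1, md.digest()).toString(16);
    }
}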

Bug reports and pull requests are welcome on GitHub at https://github.com/maanavshah/duplicate-file-finder. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.

The content of this repository is licensed under the MIT LICENSE.
