A high-performance Python utility designed to detect and report duplicate files within complex directory structures. It employs a two-stage filtering process (size grouping followed by MD5 hashing) to keep results accurate while minimizing CPU and I/O usage.
- Two-Stage Algorithm:
  - Fast Filter: Groups files by byte size first (O(N) complexity), instantly eliminating unique files from further processing.
  - Precise Filter: Generates MD5 hashes (digital fingerprints) only for potential duplicates, ensuring zero false positives.
- Recursive Scanning: Deeply scans all sub-directories and nested folders using `os.walk`.
- Memory Efficient: Uses chunked reading (8 KB buffers) to safely hash large files (e.g., 4 GB movies) without consuming excessive RAM; see the hashing sketch below.
- Robust Error Handling: Gracefully handles permission errors, missing directories, and system files.
- Cross-Platform: Works on Linux, Windows, and macOS.
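As a minimal sketch, this is how the chunked hashing behind the "Memory Efficient" feature can be implemented. The helper name `hash_file` is illustrative, not necessarily the name used in the script:

```python
import hashlib

def hash_file(path, chunk_size=8192):
    """Return a file's MD5 hex digest, reading it in 8 KB chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so even a 4 GB file never
        # has to sit in memory all at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```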
No external dependencies are required. This script runs on standard Python 3.
```bash
# Clone the repository
git clone https://github.com/Ade20boss/file_twin.git

# Navigate to the directory
cd file_twin
```
- Open the script `deduplicator.py`.
- Modify the function call at the bottom to point to the directory you want to scan:

  ```python
  # At the bottom of deduplicator.py
  print(find_duplicate("/path/to/your/folder"))
  ```

- Run the script:

  ```bash
  python deduplicator.py
  ```
```
Scanning /home/user/Downloads...
___________________
[DUPLICATE SET] Hash: 5d41402abc4b2a76b9719d911017c592 (Size: 1048576 bytes)
-> /home/user/Downloads/image.png
-> /home/user/Downloads/backup/image_copy.png
___________________
[DUPLICATE SET] Hash: a1b2c3d4e5f60718293a4b5c6d7e8f90 (Size: 512 bytes)
-> /home/user/Downloads/notes.txt
-> /home/user/Documents/notes_final.txt
```
This tool avoids the common performance pitfall of hashing every single file, which is slow. Instead, it uses a logic funnel (sketched in code after this list):
- Validation: It first verifies the directory exists and is accessible.
- Grouping (The "Lazy" Check): It walks the tree and groups files into buckets based on their exact byte size.
- Logic: If File A is 100 bytes and File B is 100 bytes, they might be duplicates. If File C is 101 bytes, it is definitely unique and is ignored immediately.
- Hashing (The "Deep" Check): It calculates the MD5 hash only for the buckets that contain more than one file.
- Reporting: It compares the hashes within each bucket; files whose hashes match are reported together as a duplicate set.
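A condensed sketch of that funnel, reusing the `hash_file` helper from above (the repository's actual `find_duplicate` may be structured differently):

```python
import os
from collections import defaultdict

def find_duplicate(root):
    """Group files by size, then confirm true duplicates by MD5 hash."""
    # Validation: fail fast if the target directory does not exist.
    if not os.path.isdir(root):
        raise FileNotFoundError(f"Directory not found: {root}")

    # Grouping: bucket every file path by its exact byte size.
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue  # unreadable or vanished file; skip it

    # Hashing: only buckets holding two or more files are worth hashing.
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size means the file cannot be a duplicate
        for path in paths:
            try:
                by_hash[(hash_file(path), size)].append(path)
            except OSError:
                continue

    # Reporting: keep only hash groups that actually contain duplicates.
    return {key: group for key, group in by_hash.items() if len(group) > 1}
```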
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.