Ideas for enhancement #34


Open

itinerant-fox opened this issue Dec 16, 2020 · 7 comments

@itinerant-fox

  1. Support for taking input from multiple input files.
    A wordlist can be spread across multiple files. Currently I merge them and then pass the result to duplicut.
    It would need to work something like: duplicut -p 1.txt 2.txt 3.txt 4.txt -o output.txt

  2. Progress bar. Need not be accurate. Can be a guesstimate.

Thanks for the software 👍

@nil0x42
Owner

nil0x42 commented Dec 17, 2020

Hi! Thank you for your suggestions :)


1. Support for taking input from multiple input files.

I did not implement multiple input files because it would not be faster than doing:
cat 1.txt 2.txt 3.txt > all.txt && duplicut -i all.txt -o output.txt
I prefer focusing on features that do things better than existing tools. But of course, if someone is willing to make a PR implementing it, I would be happy to merge it!


2. Progress bar.

All the needed source for a progress bar is already implemented (status.c & uinput.c).
The current progress-tracking implementation works well, but the UX might feel a little old-fashioned: each time you press a key during execution, a line reporting current progress and ETA is printed, much like john-the-ripper's progress tracking.

So implementing a progress bar would actually be very easy, as it's only a matter of display. I'll add it to my TODO for sure 👍
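For readers unfamiliar with that UX, here is a rough, generic shell sketch of the press-a-key-for-status pattern (an illustration only, not duplicut's actual code, which is C):

#!/bin/bash
# A long-running loop that prints a status line whenever a key is pressed
total=500
for ((count = 1; count <= total; count++)); do
    sleep 0.01                            # stand-in for real work
    # Non-blocking check for a keypress (tiny read timeout, needs bash >= 4)
    if read -r -s -n 1 -t 0.001 _; then
        printf 'progress: %d/%d (%d%%)\n' "$count" "$total" $((count * 100 / total))
    fi
done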

@sectroyer

I have a few dictionaries and I would like to remove duplicates across them. It would therefore be good to have an option that removes from one dict the words that appear in another dict. I want to keep those dicts separate, but make sure I'm not testing the same passwords when using them in sequence (which is not always the case) :)

@yanncam

yanncam commented Mar 22, 2022

Good morning all,

First of all, thank you for duplicut, the tool is particularly powerful!

However, I encountered the same problem as @sectroyer and @itinerant-fox: I needed to use duplicut individually on each of my wordlists, then to deduplicate the wordlists against each other.

Duplicut is only designed to process a single file, so I designed a wrapper (in bash) that automates the process for N wordlists while relying on duplicut.

This wrapper generates a single temporary file concatenating all the wordlists with delimiters (which requires extra disk space), then, after deduplication, re-splits this single file to regenerate the initial wordlists, now individually deduplicated, along with some optimization statistics.
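For illustration, the core of that delimiter trick can be sketched in a few lines of bash (hypothetical marker and file names; the real wrapper linked below adds statistics, safety checks and size-based ordering):

#!/bin/bash
# Sketch of the delimiter approach, NOT the real multiduplicut:
# 1) concatenate all wordlists, preceding each with a unique boundary line,
# 2) dedupe once with duplicut (it keeps the first occurrence and preserves order),
# 3) split the result back on the boundary lines.
MARKER='#--BOUNDARY--#'   # assumed absent from every wordlist, and short
                          # enough to survive duplicut's line-length filtering

: > merged.txt
for f in "$@"; do
    printf '%s%s\n' "$MARKER" "$f" >> merged.txt
    cat "$f" >> merged.txt
done

duplicut merged.txt -o deduped.txt

# Every boundary line switches the current output file
awk -v m="$MARKER" '
    index($0, m) == 1 { out = substr($0, length(m) + 1) ".dedup"; next }
    { print > out }
' deduped.txt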

You will find the wrapper here: https://github.com/yanncam/multiduplicut

Hope it can help others!

Thanks again for this great tool :)!

@nil0x42
Owner

nil0x42 commented Mar 22, 2022

The delimiters idea is interesting; it's probably the easiest way to implement this inside duplicut without rewriting a large part of the codebase.
I'll consider implementing multi-file support when I have time, so no ETA for now (always busy).
Anyway, your script is very nice, and I think it will help many people.

@nil0x42
Owner

nil0x42 commented May 12, 2025

Hi! I'm considering implementing this. However, I'm not sure about the preferred behavior:

If I dedupe bigfile.txt, smallfile.txt, mediumfile.txt, which file should duplicut 'favor'? Say string1 appears in all three files.

There are 2 possibilities:
1 - Respect CLI order -- keep string1 in bigfile.txt, removing it from smallfile.txt & mediumfile.txt
2 - Prioritize removal on bigger files -- keep string1 in smallfile.txt, removing it from mediumfile.txt & bigfile.txt

I tend to prefer option 2, but I'd like to hear your opinions @itinerant-fox, @sectroyer, @yanncam

@yanncam

yanncam commented May 12, 2025

Hello @nil0x42,

For my part, when I created the "multiduplicut" wrapper, I based the ordering on file size:

  • I leave all the words in the smallest file (in bytes) untouched;
  • then I deduplicate every other dictionary against the previous ones (always processed in ascending size order).

Generally, a small wordlist contains very specific or very common words, so for a cryptanalysis approach such a wordlist is highly relevant, and it's best to leave it intact to make the best use of cracking time.

The larger a wordlist is, the longer it takes to run through.

This is why I opted for an implicit ordering based on file size.

But customizing this order with an ordered list of files as a command-line argument could be interesting too :)!
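Incidentally, under option 1 above (respect CLI order), this size-based priority could be emulated from the shell simply by passing the files smallest-first. A rough sketch, assuming the hypothetical multi-file syntax from the original request (not an existing duplicut feature) and filenames without whitespace:

# Hypothetical: with CLI-order semantics, passing files smallest-first
# would reproduce the size-based priority (option 2).
mapfile -t files < <(ls -Sr wordlists/*.txt)   # -S sorts by size, -r puts smallest first
duplicut "${files[@]}" -o output.txt           # proposed multi-file invocation, not a real flag yet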

@sectroyer

I use the following script:

#!/bin/bash

echo -e "\nDuplicut file extension\n"

if [ -z "$3" ] || [ ! -f "$1" ] || [ ! -f "$2" ]
then
	echo "Usage: $0 <input_file_to_clean.txt> <file_to_remove> <output_file.txt>"
	exit 1
fi

# Scratch directory; adjust to a disk with enough free space
rm /mnt/data/trim/*.txt 2> /dev/null


echo "Counting lines..."
# +1 because tail -n "+N" starts printing AT line N
# NOTE: assumes "$2" contains no internal duplicates; otherwise the deduped
# prefix shrinks and this offset cuts valid lines out of the result
number_of_lines_to_skip=$(( $(wc -l < "$2") + 1 ))


echo "Number of lines to skip: $number_of_lines_to_skip"

# Put the words to remove FIRST: duplicut keeps the first occurrence,
# so any word of "$1" already present in "$2" is dropped from the tail part
echo "Copying input 1..."
cat "$2" > /mnt/data/trim/input.txt

echo "Copying input 2..."
cat "$1" >> /mnt/data/trim/input.txt

echo "Removing duplicates..."
duplicut /mnt/data/trim/input.txt -p -o /mnt/data/trim/output.txt

# Strip the leading "$2" block, keeping only the cleaned "$1" content
echo "Trimming output..."
tail -n "+$number_of_lines_to_skip" /mnt/data/trim/output.txt > "$3"

echo ""
