The xtool dedup parameter is not a one-size-fits-all hammer. Use it for synthetic data or logs. For natural-language corpora, use fuzzy dedup instead (MinHash with a similarity threshold of 0.8–0.9).
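To make the fuzzy option concrete, here is a minimal sketch of MinHash-based near-duplicate filtering in plain Python. It is not xtool's implementation. The function names, the 3-word shingle size, the 128-hash signature length, and the default threshold of 0.85 (picked from the 0.8–0.9 range above) are all assumptions made for illustration.

```python
import hashlib
import re

NUM_HASHES = 128   # signature length (assumed; more hashes = lower estimate variance)
SHINGLE_WORDS = 3  # words per shingle (assumed)


def shingles(text, k=SHINGLE_WORDS):
    """Return the set of lowercase k-word shingles in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash(text, num_hashes=NUM_HASHES):
    """Build a MinHash signature: the minimum hash of any shingle, per seeded hash."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def similarity(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def fuzzy_dedup(docs, threshold=0.85):
    """Keep a document only if it is below the threshold against every kept document."""
    kept = []  # list of (document, signature) pairs
    for doc in docs:
        sig = minhash(doc)
        if all(similarity(sig, other) < threshold for _, other in kept):
            kept.append((doc, sig))
    return [doc for doc, _ in kept]
```

Calling `fuzzy_dedup(list_of_texts)` returns the texts with near-duplicates dropped, keeping the first occurrence of each. The pairwise comparison here is quadratic in the number of documents, so it only suits small collections; for large corpora the signatures are usually bucketed with locality-sensitive hashing rather than compared all-against-all.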
When enabled, the deduplication feature typically creates temporary files during the encoding process to track and manage duplicate streams.