ANN: A #reuse analyzer script

Larry Kollar

Hey all,
I've posted an Awk script on GitHub that identifies potential reuse candidates; it can also find inconsistencies.

It works on plain text, using fuzzy matching to find exact and near matches, and generates a report. You can change the match threshold on the command line. It grinds through a smallish book (2200 block elements) in under 5 minutes on my home desktop (late-2013 iMac), or 1-1/2 minutes if compiled with awka. Bigger books (700-800 pages) can take two hours or more.
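
For anyone curious what threshold-based fuzzy matching looks like in general, here is a minimal Python sketch. It is not the Awk script's actual algorithm: it swaps in Python's standard difflib similarity ratio, and the function name, the 0.85 default threshold, and the sample sentences are all made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_reuse_candidates(blocks, threshold=0.85):
    """Return (i, j, score) for every pair of blocks scoring >= threshold.

    Hypothetical helper: the scoring here is difflib's ratio, which is
    an assumption and not necessarily what the Awk script uses.
    """
    matches = []
    # Compare every block against every later block (pairwise).
    for (i, a), (j, b) in combinations(enumerate(blocks), 2):
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            matches.append((i, j, round(score, 2)))
    return matches


# Made-up sample input: one block/paragraph per list entry.
blocks = [
    "Press the power button to turn on the device.",
    "Press the power button to turn off the device.",
    "Contact support if the problem persists.",
    "Press the power button to turn on the device.",
]

report = find_reuse_candidates(blocks)
for i, j, score in report:
    kind = "exact" if score == 1.0 else "near"
    print(f"{kind} match ({score}): block {i + 1} <-> block {j + 1}")
```

Raising the threshold toward 1.0 narrows the report to exact duplicates; lowering it surfaces looser near-matches, which is where inconsistencies tend to show up.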

The README discusses how to prep your doc set(s): basically, use DITA-OT to output Markdown, then use Pandoc to strip markup and put each block/paragraph on its own line.
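
To show the input shape that prep step aims for (one block per line), here is a small Python sketch. It only illustrates the end result on already-plain text; in the real workflow Pandoc does this job, and the README is the authority on the actual commands. The function name and sample text are hypothetical.

```python
import re


def one_block_per_line(text):
    """Collapse each blank-line-separated paragraph onto a single line.

    Hypothetical helper for illustration only; it assumes the markup
    has already been stripped and paragraphs are separated by blank lines.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Join wrapped lines and squeeze runs of whitespace within a block.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


# Made-up sample: two paragraphs, the first hard-wrapped.
sample = "First line\nwrapped here.\n\nSecond block.\n"
for line in one_block_per_line(sample):
    print(line)
```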

I've already used this on a couple of my work documents, to good effect. Let me know if it helps you out!