For the plain text, you might also consider running tidy with a configuration that disables line wrapping, then using some simple XSLT to strip all the element tags. It may leave you with a fair amount of white space, but it should leave each element on its own line.
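If XSLT is not handy, roughly the same result (one block per line, tags gone) can be sketched in Python with the standard library. This is a swapped-in technique, not the tidy + XSLT route itself. The `BLOCK_TAGS` set and the function name `blocks_to_lines` are my own assumptions for illustration; matching is by element name only, so specializations would need to be added to the set.

```python
import re
import xml.etree.ElementTree as ET

# Assumed list of block-level DITA element names; extend for your content.
BLOCK_TAGS = {"title", "shortdesc", "p", "li", "note", "dt", "dd", "entry"}

def blocks_to_lines(dita_xml: str) -> list[str]:
    """Return the text of each innermost block element as one line."""
    root = ET.fromstring(dita_xml)
    lines = []
    for elem in root.iter():
        if elem.tag not in BLOCK_TAGS:
            continue
        # Skip a block that contains other blocks (e.g. <li><p>..</p></li>)
        # so the nested text is not emitted twice. Limitation: any text that
        # sits directly in the outer block is dropped.
        if any(d is not elem and d.tag in BLOCK_TAGS for d in elem.iter()):
            continue
        # Join inline content, then collapse all white space to single spaces.
        text = re.sub(r"\s+", " ", "".join(elem.itertext())).strip()
        if text:
            lines.append(text)
    return lines
```

Printing `"\n".join(blocks_to_lines(source))` for each topic should give a file the awk script can read line by line.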
On March 6, 2020 9:43:57 AM PST, lkollar@... wrote:
I got bored this week and cobbled up a reuse analyzer. It's about 100 lines of awk, and I found the fuzzy matching code on the web.
It turns out the more difficult part is getting plain text out of DITA, with each block element on one line, so the script can properly compare blocks. I tried Robert Anderson's Morse example (mentioned earlier this week), turning off the Morse code mapping part. I ended up with the entire topic on one line, often with missing spaces. That ain't working.
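To show why one block per line matters, here is a minimal sketch of the comparison step. It is not the awk script described above; it swaps in Python's `difflib.SequenceMatcher` for the fuzzy matching, and the function name `find_reuse_candidates` and the 0.85 threshold are assumptions picked for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

def find_reuse_candidates(blocks, threshold=0.85):
    """Return (i, j, ratio) for every pair of blocks that look alike.

    `blocks` is a list of strings, one block element per entry.
    The threshold is an assumed cutoff; tune it against real content.
    """
    hits = []
    # Compare every pair once; this is O(n^2), fine for a few thousand blocks.
    for (i, a), (j, b) in combinations(enumerate(blocks), 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            hits.append((i, j, round(ratio, 2)))
    return hits
```

If a whole topic arrives as a single line, every comparison is topic against topic, and near-duplicate paragraphs inside otherwise different topics never score high enough to show up.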
My current scheme is to export Markdown, then use `pandoc -t plain --wrap=none` to strip most of the markup. The awk script fixes the rest as it ingests the text files. This works OK, but it works best on documents that have had NO reuse treatment. As one might expect, the Markdown export resolves all existing keys and conrefs and outputs them with the rest of the content. That means a bunch of false positives.
If the Markdown plugin allows customization, I could make a variant that just throws away the reused content for me. But maybe I'm missing a better way to do it.
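One possible way to throw away the reused content, sketched here without touching the Markdown plugin, is to drop any element that carries a reuse attribute from the DITA source before extracting text. This is only an assumption about what "reused" should mean: the `REUSE_ATTRS` tuple and the function name `strip_reused` are hypothetical, and removing every `keyref` element may be too aggressive for links.

```python
import xml.etree.ElementTree as ET

# Assumed markers of already-reused content; trim this list as needed.
REUSE_ATTRS = ("conref", "conkeyref", "keyref")

def strip_reused(root):
    """Remove child elements that pull in content by reference, in place."""
    for parent in list(root.iter()):
        for child in list(parent):
            if not any(attr in child.attrib for attr in REUSE_ATTRS):
                continue
            # ElementTree stores the text that follows an element in its
            # .tail, so move that text elsewhere before removing the element.
            idx = list(parent).index(child)
            tail = child.tail or ""
            if idx > 0:
                prev = parent[idx - 1]
                prev.tail = (prev.tail or "") + tail
            else:
                parent.text = (parent.text or "") + tail
            parent.remove(child)
    return root
```

The root element itself is never removed, so a topic that is a conref in its entirety would need separate handling.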