
Plain text output #text

Larry Kollar
 

I got bored this week and cobbled up a reuse analyzer. It's about 100 lines of awk, and I found the fuzzy matching code on the web.

It turns out the more difficult part is getting plain text out of DITA, with each block element on one line, so the script can properly compare blocks. I tried Robert Anderson's Morse example (mentioned earlier this week), turning off the Morse code mapping part. I ended up with the entire topic on one line, often with missing spaces. That ain't working.

My current scheme is to export Markdown, then use `pandoc -t plain --wrap=none` to strip most of the markup. The awk script fixes the rest as it ingests the text files. This works OK, but is best for documents that have been given NO reuse treatment. As one might expect, the Markdown export resolves all existing keys and conrefs, and outputs them with the rest of the content. That means a bunch of false positives.
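
The ingest side boils down to something like the following sketch (placeholder names, not the actual script): read the pandoc plain-text output, squash leftover whitespace, and stash one entry per block element for the comparison pass.

```awk
# Rough sketch of the ingest pass (placeholder names, not the real script):
# pandoc has already put each block on its own line; normalize whitespace
# and keep one blocks[] entry per block element.
{
    gsub(/[ \t]+/, " ")          # collapse runs of spaces and tabs
    sub(/^ /, ""); sub(/ $/, "") # trim leading/trailing space
    if (length($0) == 0) next    # skip blank lines between blocks
    blocks[++nblocks] = $0
}
```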

If the Markdown plugin allows customization, I could make a variant that just throws away the reused content for me. But maybe I'm missing a better way to do it.

Chris Papademetrious
 

Hey cool, this has been something I've wanted to do too!

What fuzzy matching code are you using? I've been looking primarily at "longest common substring" algorithms.

 - Chris

Mica Semrick
 

For the plain text, you may also consider using tidy with a configuration that doesn't wrap lines, then some simple XSLT to strip all the element tags. It may leave you with a fair amount of whitespace, but it should leave each element on its own line.


Larry Kollar
 

I'm using the Levenshtein distance algorithm. I started with the C code from https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#C (second example) -- translating it to awk was pretty straightforward. If you prefer Python or Perl, you can use one of the pre-built libraries.
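
For anyone curious, the iterative (two-row) version comes out to something like this in awk. This is a sketch in the spirit of that Wikibooks example, not the exact translation, and min3() is just a small helper:

```awk
# Sketch of an iterative, two-row Levenshtein distance in awk.
# prev[] and curr[] hold two consecutive rows of the edit-distance matrix.
function levenshtein(s, t,    m, n, i, j, cost, prev, curr) {
    m = length(s); n = length(t)
    if (m == 0) return n
    if (n == 0) return m
    for (j = 0; j <= n; j++) prev[j] = j       # row 0: distance from empty string
    for (i = 1; i <= m; i++) {
        curr[0] = i
        for (j = 1; j <= n; j++) {
            cost = (substr(s, i, 1) == substr(t, j, 1)) ? 0 : 1
            curr[j] = min3(curr[j-1] + 1,      # insertion
                           prev[j] + 1,        # deletion
                           prev[j-1] + cost)   # substitution (or match)
        }
        for (j = 0; j <= n; j++) prev[j] = curr[j]
    }
    return prev[n]
}

function min3(a, b, c) {
    if (a <= b && a <= c) return a
    return (b <= c) ? b : c
}

# BEGIN { print levenshtein("kitten", "sitting") }   # prints 3
```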

There's a recursive awk implementation at http://awk.freeshell.org/ja/LevenshteinEditDistance but I found it to be significantly slower.

Looping through an array of strings ends up requiring (b^2-b) comparisons if you throw out attempts to compare a block to itself. Then it occurred to me that if I delete each block after I've compared it to all the others, I don't get duplicate hits (i.e., comparing A to B and then B to A again). That cuts it to (b^2-b)/2 comparisons, which is a significant savings when you're talking about a smallish book containing 2400 block elements. Even with that optimization, 2400 block elements means nearly 2.9 million calls to the matching algorithm. I haven't had cause to utter the phrase "computationally expensive" in a looong time. :)
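
In loop form, that optimization amounts to something like this (sketch only: close_enough() is a stand-in for whatever Levenshtein threshold you settle on, and blocks[]/nblocks are assumed to hold one block element per entry):

```awk
# Sketch of the comparison pass. Starting the inner loop at i+1 is the
# same idea as deleting each block after it's been compared to all the
# others: A-vs-B is checked once, B-vs-A never, for (b^2-b)/2 comparisons.
END {
    for (i = 1; i < nblocks; i++)
        for (j = i + 1; j <= nblocks; j++)
            if (close_enough(blocks[i], blocks[j]))
                printf("possible reuse: block %d <-> block %d\n", i, j)
}
```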

Next, I'm going to look at rewriting it in C... but past efforts suggest it won't be that much faster.