Re: Anyone Using Elasticsearch to Index DITA Content?


This sounds totally cool.  At my current gig (Turbonomic) we're supporting an export of data to Elasticsearch.  Basically it's an export of Kafka "documents", which are JSON objects.  You can read the JSON into Elasticsearch, and then do lots with it, including interesting analysis and visualizations.  Elasticsearch seems to be loose about the details of the JSON you send it, so there seems to be leeway in what you do.

I would think that LwDITA would be easier to translate into JSON...  In fact, isn't there some thinking going on about making a JSON implementation of LwDITA? 

Our product does supply-chain analysis of entities to find the best provider of resources to each consumer, and to give advantage to consumers that in turn provide more value (resources) to the overall system.  It's designed to manage a network, but that's a matter of naming the entities and resources.  You could hijack the base model and overlay it on other domains (much like specialization works)...  To do that for a body of DITA you would need to convert the DITA to JSON.  Long-winded...  I've been thinking about doing this for a while.  Sadly, work in the software salt mines doesn't leave the time to get to it.  But some thoughts...

JSON that drives analysis probably should replicate the structure of the document rather than the full document.  If you want the analysis to map back to the actual content, have the JSON reference the content by element ID.

You probably don't need to replicate the full structure.  Depending on the analysis you want to perform, you can get away with dipping into the structure at different depths. 
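
Here is a minimal Python sketch of what I mean by structure-only JSON with ID references.  The sample task topic, the function names (outline, topic_to_doc), the max_depth parameter, and the JSON field names are all made up for illustration; none of it is a DITA or Elasticsearch convention.  It keeps element names and IDs, drops the text, and stops descending at whatever depth you ask for.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample topic, just enough structure to show the idea.
SAMPLE_TOPIC = """<task id="install_agent">
  <title>Install the agent</title>
  <taskbody>
    <steps id="install_steps">
      <step id="s1"><cmd>Download the installer.</cmd></step>
      <step id="s2"><cmd>Run the installer.</cmd></step>
    </steps>
  </taskbody>
</task>"""


def outline(elem, max_depth, depth=0):
    """Return a dict describing elem's structure, without its text."""
    node = {"element": elem.tag}
    # Keep the ID (when present) so analysis results can point back to content.
    if elem.get("id"):
        node["id"] = elem.get("id")
    # Only descend while we are above the requested depth.
    if depth < max_depth:
        children = [outline(child, max_depth, depth + 1) for child in elem]
        if children:
            node["children"] = children
    return node


def topic_to_doc(xml_text, max_depth=2):
    """Parse one topic and return its structural outline as a dict."""
    root = ET.fromstring(xml_text)
    return outline(root, max_depth)


if __name__ == "__main__":
    print(json.dumps(topic_to_doc(SAMPLE_TOPIC, max_depth=3), indent=2))
```

Changing max_depth is the "dipping in at different depths" part: a shallow outline for map-level stats, a deeper one if you care about steps or paragraphs.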

Things you could do include:
  • Stats similar to the DITAMap Metrics Report, but maybe more powerful.
  • Change tracking and management -- A super-DIFF that can list where things have changed.  Mapping back to the content you could render change bars or some such.  So then you could have change tracking without adding cruft to the source.  (I might even use that...)
  • Pattern discovery -- This is what I want to play with.  For example, you could merge a search index with your structural representation in the JSON, and then look for elements with lexical similarities...  Elements that use the same words.  From there you could assemble a map of topics that answers just a specific question...  Personalized content.  You would have to start with analyzing the patterns that arise -- something Elasticsearch should be good at.
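
A rough Python sketch of the last two ideas, assuming you already have the text of each element keyed by its ID (for example, pulled from two versions of the same map).  The function names and the sample data are hypothetical.  Change detection here is just a hash comparison per ID, and "lexical similarity" is plain word-set overlap (Jaccard); a real search engine would do something much smarter, so treat this as a stand-in.

```python
import hashlib
import re


def fingerprint(text):
    """Hash the text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def changed_ids(old, new):
    """Compare two {element_id: text} dicts and list what differs."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(
            i for i in shared if fingerprint(old[i]) != fingerprint(new[i])
        ),
    }


def words(text):
    """Return the set of lowercase word tokens in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-set overlap between two strings, from 0.0 to 1.0."""
    wa, wb = words(a), words(b)
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


if __name__ == "__main__":
    old = {"s1": "Download the installer.", "s2": "Run the installer."}
    new = {"s1": "Download  the installer.", "s2": "Run the installer as root.", "s3": "Reboot."}
    print(changed_ids(old, new))
    print(jaccard(old["s1"], old["s2"]))
```

Because the result is a list of IDs, you could map it back to the source to render change bars without touching the source itself.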

It never occurred to me to try this with Elasticsearch or similar...  It sounds like gangs of fun!

I have done some DITA to JSON, but only to turn a topic into an array for a walk-through tour.  This would be bigger...  I think the first step is to design what you want ES to analyze, and then decide how to get that out of the DITA.  I would start humbly, and try to grow on that.
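
As a humble starting point, something like the Python sketch below might do.  It is a guess at a design, not a tested one: the index name, the field names, and the helper bulk_body are all hypothetical.  The mapping uses ordinary Elasticsearch field types (keyword, integer, text), and the body follows the newline-delimited layout that the bulk API takes, one action line and then one source line per document.  Check both against the version you run.

```python
import json

# Hypothetical mapping: one Elasticsearch document per DITA element.
MAPPING = {
    "mappings": {
        "properties": {
            "topic_id": {"type": "keyword"},    # exact-match ID of the topic
            "element_id": {"type": "keyword"},  # exact-match ID of the element
            "element": {"type": "keyword"},     # element name, e.g. "step"
            "depth": {"type": "integer"},       # nesting depth within the topic
            "text": {"type": "text"},           # analyzed for full-text search
        }
    }
}


def bulk_body(index_name, records):
    """Build a newline-delimited bulk request body from element records."""
    lines = []
    for rec in records:
        # Combine topic and element IDs so each document ID is unique.
        doc_id = f'{rec["topic_id"]}/{rec["element_id"]}'
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(rec))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    records = [
        {"topic_id": "install_agent", "element_id": "s1", "element": "step",
         "depth": 3, "text": "Download the installer."},
        {"topic_id": "install_agent", "element_id": "s2", "element": "step",
         "depth": 3, "text": "Run the installer."},
    ]
    print(json.dumps(MAPPING, indent=2))
    print(bulk_body("dita-structure", records), end="")
```

Starting with a flat record per element keeps the first round simple; you can always add fields once you see which patterns are worth chasing.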

