Anyone Using Elasticsearch to Index DITA Content?


ekimber@contrext.com
 

In my current job role I'm contributing to an Elasticsearch-based application that is not in any way XML or DITA related. I am new to Elasticsearch, but basically it uses the Apache Lucene full-text engine to index JSON data and then provides a bunch of useful search and retrieval services on top of that. It seems very cool.

However, except for the Elasticsearch Logstash tool's generic XML-to-JSON feature (designed primarily for indexing Windows XML logs I believe), there doesn't seem to be a dedicated documentation XML-to-Elasticsearch mechanism (at least I didn't find one in my initial brief searching).

Since the input to Elasticsearch is JSON, there's no technical barrier to generating JSON from DITA content, though there would be some art to it.
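To make the "DITA to JSON" idea concrete, here is a minimal Python sketch that flattens one topic into a single JSON object. The field names (`id`, `type`, `title`, `shortdesc`, `body_text`) and the sample topic are assumptions made up for illustration, not an established mapping; the "art" is exactly in deciding what these fields should be.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample topic, used only for illustration.
SAMPLE_TOPIC = """<topic id="install">
  <title>Installing the widget</title>
  <shortdesc>How to install the widget.</shortdesc>
  <body><p>Unpack the archive and run <cmdname>setup</cmdname>.</p></body>
</topic>"""


def text_of(element):
    """Return the whitespace-normalized text of an element and its descendants."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def topic_to_document(xml_text):
    """Flatten one DITA topic into a dict that can be serialized as JSON."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "type": root.tag,
        "title": text_of(root.find("title")),
        "shortdesc": text_of(root.find("shortdesc")),
        "body_text": text_of(root.find("body")),
    }


if __name__ == "__main__":
    print(json.dumps(topic_to_document(SAMPLE_TOPIC), indent=2))
```

A flat document like this throws away the element structure, which is probably fine for plain full-text search but not for structure-aware queries.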

In the DITA world we'd typically choose something like MarkLogic or BaseX to do full-text searching on our content, but Elasticsearch is widely used (and is open-source so it's free).

So I'm curious if anyone in the DITA community is using Elasticsearch to index their DITA content and if so, can you talk about it?

Cheers,

E.
--
Eliot Kimber
http://contrext.com


Tony Chung
 

Following this conversation. I've seen the ELK stack gaining popularity in dev teams over the past few years, and it would be interesting to see how DITA analytics could benefit from this open source tool stack.

-Tony


despopoulos_chriss
 

This sounds totally cool.  At my current gig (Turbonomic) we're supporting an export to ElasticSearch data...  Basically an export to Kafka "documents"...  JSON objects.  You can read the JSON into ElasticSearch, and then do lots with it, including interesting analysis and visualizations.  This approach seems to be loose with the details of the JSON you send it.  So there seems to be leeway in what you do.

I would think that LwDITA would be easier to translate into JSON...  In fact, isn't there some thinking going on about making a JSON implementation of LwDITA? 

Our product does supply-chain analysis of entities to find the best provider of resources to each consumer, and to give advantage to consumers that in turn provide more value (resources) to the overall system.  It's designed to manage a network, but that's a matter of naming the entities and resources.  You could hijack the base model and overlay it on other domains (much like specialization works)...  To do that for a body of DITA you would need to convert the DITA to JSON.  Long-winded...  I've been thinking about doing this for a while.  Sadly, work in the software salt mines doesn't leave the time to get to it.  But some thoughts...

JSON to drive analysis probably should not try to replicate the full document, so much as replicate the structure.  If you want the analysis to map back to the actual content, use references to IDs. 

You probably don't need to replicate the full structure.  Depending on the analysis you want to perform, you can get away with dipping into the structure at different depths. 
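A rough Python sketch of that idea: keep only element names and IDs, and stop at whatever depth the analysis needs. The `outline` function, its `max_depth` parameter, and the sample topic are all hypothetical names chosen for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample topic, used only for illustration.
SAMPLE = (
    '<topic id="t1"><title>T</title><body>'
    '<section id="s1"><p>text</p></section><section id="s2"/>'
    "</body></topic>"
)


def outline(element, max_depth, depth=0):
    """Describe the structure of an element without copying its content.

    Each node records only the tag name and the id (if any), so results
    can be mapped back to the source by ID. Descent stops at max_depth.
    """
    node = {"tag": element.tag, "id": element.get("id")}
    if depth < max_depth:
        children = [outline(child, max_depth, depth + 1) for child in element]
        if children:
            node["children"] = children
    return node


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(outline(root, max_depth=2))
```

With `max_depth=1` you only see that the topic has a title and a body; with `max_depth=2` the sections and their IDs show up as well.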

Things you could do include:
  • Stats similar to the DITAMap Metrics Report, but maybe more powerful.
  • Change tracking and management -- A super-DIFF that can list where things have changed.  Mapping back to the content you could render change bars or some such.  So then you could have change tracking without adding cruft to the source.  (I might even use that...)
  • Pattern discovery -- This is what I want to play with.  For example, you could merge a search index with your structural representation in the JSON, and then look for elements with lexical similarities...  Elements that use the same words.  From there you could assemble a map of topics that answers just a specific question...  Personalized content.  You would have to start with analyzing the patterns that arise -- something ElasticSearch should be good at.
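As a toy illustration of the "elements that use the same words" idea, the Python sketch below scores pairs of elements by word overlap (Jaccard similarity). This is a stand-in for whatever a search engine would actually do, not a description of Elasticsearch; the function names, the threshold, and the sample texts are assumptions for this example.

```python
import re
from itertools import combinations


def words(text):
    """Return the set of lowercase word tokens in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(text_a, text_b):
    """Word-overlap score: shared words divided by all distinct words."""
    words_a, words_b = words(text_a), words(text_b)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def similar_pairs(elements, threshold):
    """Return (id_a, id_b, score) for element pairs scoring at or above threshold.

    `elements` maps an element ID to that element's text.
    """
    pairs = []
    for (id_a, text_a), (id_b, text_b) in combinations(sorted(elements.items()), 2):
        score = jaccard(text_a, text_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


if __name__ == "__main__":
    sample = {
        "p1": "Restart the server",
        "p2": "restart the server now",
        "p3": "Delete the file",
    }
    print(similar_pairs(sample, threshold=0.5))
```

Because the scores are keyed by element ID, any interesting pair maps straight back to the source content.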

It never occurred to me to try this with ElasticSearch or similar...  It sounds like gangs of fun!

I have done some DITA to JSON, but only to turn a topic into an array for a walk-through tour.  This would be bigger...  I think the first step is to design what you want ES to analyze, and then decide how to get that out of the DITA.  I would start humbly, and try to grow on that.

cud


Toshihiko Makita
 

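Last year I developed a pilot project for DITA full-text search using AWS Elasticsearch, before the conflict between AWS and elastic.co.
The search is integrated into the DITA to HTML (or .php) publishing result. These are the things I did:

* Use "curl" (or "awscurl") to create the index in AWS Elasticsearch.
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
* Convert the DITA map & topics into JSON and execute the "bulk" operation.
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
* Develop a PHP program that accepts a search request from the client browser and returns the search result from Elasticsearch as JSON.
* The JSON search results are edited by JavaScript and displayed in the browser.
* By clicking a search result, a user can reach the target Web page.

It was a very exciting experience because I had to learn AWS operations and develop PHP and JavaScript (TypeScript) programs, which I had never done before.
Unfortunately it is still a pilot project. However, it will be integrated into the user's Web publishing system in the future.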
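For readers who have not used the "bulk" operation: the request body is newline-delimited JSON, with an action line followed by the document itself for each item. The Python sketch below only builds such a body; the index name `dita-topics` and the document fields are assumptions for illustration and are not taken from the pilot project.

```python
import json


def bulk_body(index_name, documents):
    """Build a newline-delimited body for the Elasticsearch bulk operation.

    Each document becomes two lines: an "index" action naming the target
    index and document ID, then the document source. Every line, including
    the last one, ends with a newline.
    """
    lines = []
    for doc in documents:
        action = {"index": {"_index": index_name, "_id": doc["id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Hypothetical documents standing in for converted DITA topics.
    topics = [
        {"id": "install", "title": "Installing", "text": "Unpack and run setup."},
        {"id": "remove", "title": "Removing", "text": "Run the uninstaller."},
    ]
    print(bulk_body("dita-topics", topics), end="")
```

The resulting text would then be sent to the `_bulk` endpoint, for example with curl using `--data-binary` and a `Content-Type: application/x-ndjson` header.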

-- 
/*--------------------------------------------------
 Toshihiko Makita
 Development Group. Antenna House, Inc. Ina Branch
 Web site:
 http://www.antenna.co.jp/
 http://www.antennahouse.com/
 --------------------------------------------------*/


ekimber@contrext.com
 

That sounds very interesting. I'll take a look at that docs-bulk.html link.

Cheers,

E.

--
Eliot Kimber
http://contrext.com


On 6/10/21, 10:44 AM, "Toshihiko Makita" <main@dita-users.groups.io on behalf of tmakita@antenna.co.jp> wrote:

Last year I developed a pilot project for DITA full-text search using AWS Elasticsearch, before the conflict between AWS and elastic.co.
The search is integrated into the DITA to HTML (or .php) publishing result. These are the things I did:

* Use "curl" (or "awscurl") to create the index in AWS Elasticsearch.
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
* Convert the DITA map & topics into JSON and execute the "bulk" operation.
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
* Develop a PHP program that accepts a search request from the client browser and returns the search result from Elasticsearch as JSON.
* The JSON search results are edited by JavaScript and displayed in the browser.
* By clicking a search result, a user can reach the target Web page.

It was a very exciting experience because I had to learn AWS operations and develop PHP and JavaScript (TypeScript) programs, which I had never done before.
Unfortunately it is still a pilot project. However, it will be integrated into the user's Web publishing system in the future.



ekimber@contrext.com
 

I have created the GitHub project https://github.com/dita-community/org.dita-community.elastic as a place to capture my experimentation with using Elasticsearch to store and query DITA content.

It’s super minimal at the moment. I’m mostly using it to drive my learning of Elasticsearch by giving me a data set I understand.

At the moment there’s just one very small XSLT transform, intended to be run against the output of the built-in Open Toolkit normalization transform (transtype “dita”). For each input file the transform generates a JSON file in which each element is represented by a separate JSON “document” (in Elasticsearch terms). Each document captures the XML and DITA details, the parentage of each non-root element, and the element’s full text, including the text of any subelements. This lets you quickly find both leaf and ancestor elements that contain given text. (Writing this just now, I think I also need to capture the child elements of each non-leaf element.)
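For anyone who wants to picture the one-document-per-element shape without reading XSLT, here is a small Python sketch of the same general idea. It is not the transform in the repository: the field names, the generated `e0`, `e1`, ... keys, and the sample topic are all made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with DITA class attributes, as normalized output would have.
SAMPLE = """<topic id="t1" class="- topic/topic ">
  <title class="- topic/title ">Hi</title>
  <body class="- topic/body ">
    <p class="- topic/p ">One <b class="+ topic/ph hi-d/b ">two</b></p>
  </body>
</topic>"""


def element_documents(root):
    """Return one dict per element, in document order.

    Each dict records the element's tag, DITA class, id, its parent and
    ancestor keys, and its full text including the text of descendants.
    """
    docs = []

    def visit(element, parent_key, ancestors):
        key = f"e{len(docs)}"  # generated key, assumed unique per input file
        docs.append({
            "key": key,
            "tag": element.tag,
            "dita_class": element.get("class"),
            "xml_id": element.get("id"),
            "parent": parent_key,
            "ancestors": ancestors,
            "text": " ".join("".join(element.itertext()).split()),
        })
        for child in element:
            visit(child, key, ancestors + [key])

    visit(root, None, [])
    return docs


if __name__ == "__main__":
    for doc in element_documents(ET.fromstring(SAMPLE)):
        print(doc)
```

Because every ancestor carries the text of its descendants, a search for "two" matches the `b` element and also the `p`, `body`, and `topic` elements that contain it.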

 

This is all very much “as is” and I make no guarantee I’ll do anything more with it after next week but it’s there and maybe it will be useful.

Cheers,

E.
--
Eliot Kimber
http://contrext.com