Running Saxon Directly With a Grammar Pool: dita-community/grammarpool-parser-factory


Eliot Kimber
 

 
tl;dr: Enables efficient processing of large numbers of DITA documents using standalone Saxon.
 
At ServiceNow I implemented an XSLT migration tool that operates over our 40,000 topics in one go. (We're migrating our content to use keys for all references from topics to other things, as Scott Hudson and I discussed at the 2022 ConVEx conference.)
 
It was taking hours to process the topics and I knew that all the time was due to parsing the DTDs for every topic and map. 
 
This transform is run directly with Saxon rather than as an Open Toolkit plug-in. (In this case there's no value in using Open Toolkit: it would impose overhead without adding anything for the migration I'm doing, since I already have XSLT code for resolving maps and doing DITA-specific things in XSLT--see the DITA utilities project in the DITA Community organization on GitHub. It's also a one-time migration, so it won't be used again once our migration is complete. I also tried using Open Toolkit just to run my transform with Saxon directly, but Open Toolkit as of 3.7.3 isn't configured to do that for documents that need to be parsed against their DTDs, and enabling it would have required too much change to Open Toolkit.)
 
Open Toolkit is configured to use a Xerces grammar pool when parsing, which is why it can be so fast.
 
But when you run Saxon directly it does not have a grammar pool configured. I asked the Saxon team if there was a way to configure one from the command line and there was not. Saxon expects the XML parsers used to be separately configured and out of the box uses the default parser configuration provided by Xerces.
 
My solution was to implement a very simple Java library that provides a Xerces XML parser factory configured with a grammar pool. This parser factory can then be specified on the Saxon command line.
 
Using this parser factory I got at least a 60x speedup, so that a process that took nearly three hours(!) to process about 8500 topics now takes three minutes.
 
 
See the project README for usage details, but the short form is:
 
1. Put the grammarpool-parser-factory-0.9.9-jar-with-dependencies.jar on the Java classpath used to run Saxon
2. Replace the Saxon -catalog command-line option with the -Dxmlcatalogs Java system property
3. Use the Saxon -x parameter to specify the parser factory class org.ditacommunity.xml.grammarpool.GrammarPoolUsingSAXParserFactory 
 
Everything else for your Saxon command line is the same.
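Putting the three steps together, a complete command line looks something like this (the jar names, catalog path, stylesheet, and input/output files here are placeholders for illustration):

```shell
# Hypothetical example; on Windows use ";" instead of ":" in the classpath.
java -cp "saxon-he.jar:grammarpool-parser-factory-0.9.9-jar-with-dependencies.jar" \
  -Dxmlcatalogs=/path/to/dita/catalog-dita.xml \
  net.sf.saxon.Transform \
  -x:org.ditacommunity.xml.grammarpool.GrammarPoolUsingSAXParserFactory \
  -s:main.ditamap \
  -xsl:my-migration.xsl \
  -o:out/main.ditamap
```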
 
I developed this specifically for use with DITA content because DITA content tends to use large DTDs and to involve large numbers of files that need to be processed, meaning that the DTD parsing time cost often swamps the XSLT time cost, as I discovered with my migration script.
 
But the parser factory would be useful for any other XML processing that requires DTD or XSD parsing.
 
Cheers,
 
E.
----
Eliot Kimber


Mark Giffin
 

Cool! So could I use this plus DITA utilities to do something like validate a whole repo of DITA files quickly without having to use the DITA OT?

Mark Giffin
http://markgiffin.com/

On 11/27/2022 9:43 AM, Eliot Kimber wrote:

Eliot Kimber
 

Correct—you could use it for that.

 

In my keyification migration script, the XSLT uses the resolve-map XSLT from DITA utilities to resolve the input map into a single document, then gathers up all the submap documents and directly referenced topics. From those directly referenced topics it then determines the set of content-referenced (conref) topics not already referenced from the map, giving the total set of topics ultimately involved in the map.

 

With Saxon’s collection() extensions that let you process all the files in a directory tree from within an XSLT transform, you could easily process and report on all the topics or maps in some directory structure.
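For example, a quick repo-wide check could be run straight from the command line with Saxon's XQuery processor (paths here are placeholders; the ?select=...;recurse=yes query parameters are Saxon's directory-collection syntax):

```shell
# Count every .dita topic under a repository, parsing each one with the
# grammar-pool-backed parser so the DTDs are only parsed once:
java -cp "saxon-he.jar:grammarpool-parser-factory-0.9.9-jar-with-dependencies.jar" \
  -Dxmlcatalogs=/path/to/dita/catalog-dita.xml \
  net.sf.saxon.Query \
  -x:org.ditacommunity.xml.grammarpool.GrammarPoolUsingSAXParserFactory \
  -qs:'count(collection("file:///path/to/repo?select=*.dita;recurse=yes"))'
```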

 

The use of the grammar pool doesn't change what you could always do with Saxon; it just makes it much, much faster than it would otherwise be with out-of-the-box Saxon when applied to DITA documents (because of our giant DTDs and XSDs).

 

The reason Open Toolkit doesn't work for what I'm doing with the keyification migration is that Open Toolkit always does an initial parse of the input maps and topics to generate a set of intermediate topics that are fully normalized, and then applies the final-form processing to that. That's a lot of overhead when all you want to do is modify some parts of the input maps and topics, i.e., an identity transform.

 

Instead of doing an identity XSLT transform you could instead use Saxon’s XQuery update support to do the document modification in a possibly simpler way, although it requires Saxon EE I think (which you have with Oxygen if you have an Oxygen license), at least to do update in place.
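As a rough sketch (Saxon-EE only; keyify.xqy here is a hypothetical update script, not anything from my project), running an XQuery Update script that rewrites a document in place looks roughly like:

```shell
# -update:on enables XQuery Update and writes modified documents back
# to the files they were loaded from:
java -cp "saxon-ee.jar" net.sf.saxon.Query \
  -update:on \
  -q:keyify.xqy \
  -s:topics/some-topic.dita
```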

Or, depending on what you need to do, using XQuery update as an Oxygen refactor can be both maximally easy to implement and very fast to apply. In this keyification case, the business logic involved was complex enough that it didn’t make sense to implement as an Oxygen refactor. I also needed to be able to run it as a Jenkins job [I think I could in fact run a refactor from the command line using Oxygen’s separately-licensed command line processing support, which we use for other things, but I didn’t go that route].

 

I used XSLT for my keyification transform mostly because that's how I started it, and to make the code more accessible to my team, who are more familiar with XSLT than XQuery. One advantage of XQuery Update over XSLT for this type of task is that an XQuery Update script is 100% explicit about what is being modified, while an XSLT transform is more indirect and thus poses the risk of either unintended modification or missing something that should be modified.

 

Cheers,

 

E.

 

_____________________________________________

Eliot Kimber

Sr Staff Content Engineer

O: 512 554 9368

M: 512 554 9368

servicenow.com


 

From: main@dita-users.groups.io <main@dita-users.groups.io> on behalf of Mark Giffin via groups.io <mark@...>
Date: Sunday, November 27, 2022 at 5:16 PM
To: main@dita-users.groups.io <main@dita-users.groups.io>
Subject: Re: [dita-users] Running Saxon Directly With a Grammar Pool: dita-community/grammarpool-parser-factory
