DITA and Security


despopoulos_chriss
 

I don't know if this is even a reasonable question, but...  Has anybody looked at ways to sneak security violations into DITA?  For example, sneaking in JavaScript as CDATA into a topic?  Does DITA just allow CDATA?  Are there other ways to sneak in bad things that might make it through a transform to HTML?  I'm looking for ways to prove that has not happened to my topics.  I think that is becoming important for Docs-As-Code workflows.


ekimber@contrext.com
 

I’m not sure that literal JavaScript in HTML content can pose a security problem without something to run it, which means some JavaScript loaded by the browser, which in turn means something referenced from the generated HTML or otherwise loaded separately from the DITA itself.
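
On the CDATA part of the question: CDATA is XML syntax rather than a DITA feature, so any XML parser accepts it and hands the content on as ordinary character data. The following is only a minimal sketch in Python (standard library, with a made-up sample fragment), showing that the markup inside a CDATA section comes back out escaped when the element is serialized. An XSLT transform normally behaves the same way unless the stylesheet turns escaping off (for example with disable-output-escaping), so that is one thing worth checking in the transform.

```python
import xml.etree.ElementTree as ET

# Hypothetical topic fragment: a script hidden inside a CDATA section.
SAMPLE = "<p><![CDATA[<script>alert(1)</script>]]></p>"


def roundtrip(xml_text):
    """Parse the fragment and return (parsed text, re-serialized markup)."""
    element = ET.fromstring(xml_text)
    # The parser exposes CDATA content as plain text; the serializer
    # escapes '<', '>' and '&' again when writing it back out.
    return element.text, ET.tostring(element, encoding="unicode")


if __name__ == "__main__":
    text, markup = roundtrip(SAMPLE)
    print(text)
    print(markup)
```

The parsed text is the literal string from the CDATA section, and the serialized markup contains `&lt;script&gt;` rather than a live script element.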

 

If you’re processing the DITA XML directly in the browser and your XML processor does DTD resolution and entity expansion, then external entity expansion is always a potential issue, although I’m not sure it should be a danger in a browser (because browsers don’t normally have access to a user’s file system). External entity expansion is usually a vulnerability on servers (e.g., a processor expands an entity pointing to “/etc/passwd” or similar, putting the content of that file into the generated result).
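
As a rough illustration of how one might check for this before processing, here is a sketch in Python using the standard-library expat bindings. It lists every entity declared in a document's internal subset, so external (SYSTEM) entities can be flagged or rejected. The function name is made up for this example, and it assumes the check runs on the raw file before any other parser touches it; it does not look inside external DTD files, since expat does not fetch those by default.

```python
import xml.parsers.expat


def find_entity_declarations(xml_bytes):
    """Return (name, system_id) for each entity declared in the document.

    system_id is None for internal entities and a URI for external ones.
    Only declarations in the document's own internal subset are reported.
    """
    found = []

    def on_entity_decl(name, is_parameter, value, base,
                       system_id, public_id, notation):
        found.append((name, system_id))

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = on_entity_decl
    parser.Parse(xml_bytes, True)
    return found


def has_external_entities(xml_bytes):
    """True if any declared entity points at an external resource."""
    return any(system_id is not None
               for _, system_id in find_entity_declarations(xml_bytes))
```

A topic that declares something like `<!ENTITY xxe SYSTEM "file:///etc/passwd">` in its DOCTYPE would be reported here, while a topic with no internal subset yields an empty list.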

 

Otherwise, I would think any DITA-carried vulnerability would be a function of the processing applied to it in the browser, which is ultimately the responsibility of the implementor of that processing (that is, not blindly processing DITA content, just as you would not blindly process user-supplied values, which is how the typical SQL injection attack works).

 

If you are generating your HTML then you have complete control over what gets generated and can avoid any potential data issues in the HTML. If you’re generating the HTML in the browser from the base DITA then it’s a function of the in-browser processing. But even in that case I would expect the DITA provided to the browser to have gone through some preprocessing, for example to normalize the XML (to remove the need for DTD resolution), resolve conrefs, apply filtering, and otherwise scrub the data for public delivery.

 

Cheers,

 

Eliot

 

--

Eliot Kimber

http://contrext.com

 

 

 

From: <main@dita-users.groups.io> on behalf of "despopoulos_chriss via groups.io" <despopoulos_chriss@...>
Reply-To: <main@dita-users.groups.io>
Date: Wednesday, July 14, 2021 at 9:36 AM
To: <main@dita-users.groups.io>
Subject: [dita-users] DITA and Security

 



Mica Semrick
 

Certainly links to external websites/documents/etc could make it through the publish and into your production content.




Nicholas Mucks
 

Metadata in the prolog does pass through to the HTML, but the HTML plugin would probably need additional templates to transform that into a script tag for JavaScript. You could certainly unintentionally pass internal information out to customers if your writers apply metadata that will appear in the output (like author) or your plugins include processing instructions in the actual output (like some sort of track changes).

Take care,
- Nick



Don Day
 

Hi, Chris. Good question!

Because DITA, as XML, supports general entity resolution in the document instance, it is vulnerable to a denial-of-service class of exploit called Billion Laughs (see the Wikipedia link below). But since DITA uses transclusion instead of entities for content references, this type of exploit seems unlikely, although it might be introduced through manipulated DTD module files, which DO depend on entity definitions and resolution. I would think that other schema formats would be much harder to manipulate into memory bombs like Billion Laughs. Check the references in the article below for possibly related threat vectors in XML processing systems. Your web server application is more likely to be compromised by cross-site scripting attacks sent through form data, as in a DITA transform that generates an HTML input field tied to an unprotected POST or GET handler.

https://en.wikipedia.org/wiki/Billion_laughs_attack
--
Don Day



despopoulos_chriss
 

So, my interest in this is whether one can identify the threat loopholes in a DITA document, and then include an automated scan for any of them in a file before committing it in a Docs-As-Code workflow.  For example, if a given construct opens such a loophole, maybe you can specialize your topics to disallow it, and then simply validate against your specialization.
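
To make the idea of an automated scan concrete, here is a minimal Python sketch of the kind of check that could run before a commit. The lists of suspect elements and URL schemes are illustrative assumptions only, not a vetted catalogue of DITA loopholes; the real list would depend on what a given transform actually passes through.

```python
import xml.etree.ElementTree as ET

# Illustrative lists only; tune them to what your own transform passes through.
SUSPECT_ELEMENTS = {"foreign", "unknown", "object"}
SUSPECT_SCHEMES = ("javascript:", "data:", "vbscript:")


def scan_topic(xml_text):
    """Return a list of human-readable findings for one DITA topic."""
    findings = []
    for element in ET.fromstring(xml_text).iter():
        # Elements that can carry non-DITA or embedded content.
        if element.tag in SUSPECT_ELEMENTS:
            findings.append(f"element <{element.tag}>")
        # Links whose scheme could execute or embed content in a browser.
        href = element.get("href", "").strip().lower()
        if href.startswith(SUSPECT_SCHEMES):
            scheme = href.split(":", 1)[0]
            findings.append(f"href on <{element.tag}> uses {scheme}:")
    return findings
```

A pre-commit hook could call `scan_topic` on each changed file and refuse the commit when the returned list is not empty. Validating against a specialization that disallows these constructs would give a similar guarantee at the grammar level.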

I understand that our situation is a bit unique in that we transform in the browser.  If you transform before delivery, then the danger is in the transformed artifact, and you need to check it before publishing.  In our case, we need to pre-empt that artifact by controlling the DITA and controlling the transform.  We can limit the possibilities since the transform can effectively whitelist the constructs it will pass.  We also use parameters when calling the transform, but again I believe the transform simply ignores unrecognized params, so you can't directly inject mischief that way.  OTOH, we can pass variable values as params, and inject those values into a templated topic file.  But the topic file must declare the placeholders for those values, the transform must recognize the placeholders, and it must have the params already declared in the transform file.  So you have to know all these things.  If you get past that, you might be able to pass malicious code in a param...  But the transform can limit what it produces when it expands the param in a waiting placeholder.  So I need to get my head around this...
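
To think through the param case, here is a small Python sketch of the principle (not our actual transform, which is not shown here). It assumes a hypothetical `{{name}}` placeholder syntax, expands only params that have been declared, and escapes each value so that it can only ever come out as text.

```python
import html
import re

# Hypothetical placeholder syntax for this sketch: {{name}}
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def expand_placeholders(template, params, declared):
    """Fill declared placeholders with escaped param values.

    Placeholders that are undeclared, or that have no value, are left
    untouched; params with no matching placeholder are simply ignored.
    """
    def substitute(match):
        name = match.group(1)
        if name not in declared or name not in params:
            return match.group(0)
        # Escape so the value cannot introduce markup of its own.
        return html.escape(params[name], quote=True)

    return PLACEHOLDER.sub(substitute, template)
```

With this approach, a param value containing a script element ends up as escaped text in the output, and an attacker who guesses a placeholder name still cannot get markup through it.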

The billion laughs attack is pretty funny!  Interestingly enough, MS Edge does not allow a doc declaration in the XML it receives...  Calls that a security violation.  We had to abandon entities a long time ago, in favor of references.


Dan Caprioara
 

Since DITA-OT is a Java process, you can use the Java SecurityManager and policy file infrastructure to limit its access to the system resources (for example you could give it read access only to the folder that contains the topics and maps, or to connect to a specific host to get binary resources). 

Or, perhaps simpler, run DITA-OT in a Docker container and mount only the DITA source folders and the output folder into it. You can also impose network restrictions on the container, so the process inside it cannot connect to other hosts.
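
As a sketch of the container approach, the Python helper below only builds the `docker run` argument list; it does not run anything. The image name, the mount points `/src` and `/out`, and the `html5` format are assumptions for illustration and should be adjusted to the local setup.

```python
def docker_command(source_dir, output_dir, map_file,
                   image="ghcr.io/dita-ot/dita-ot"):
    """Build a `docker run` argument list for an isolated DITA-OT build.

    The image name and the container paths are assumptions for this sketch.
    """
    return [
        "docker", "run", "--rm",
        "--network", "none",             # no connections to other hosts
        "-v", f"{source_dir}:/src:ro",   # DITA sources, mounted read-only
        "-v", f"{output_dir}:/out",      # the only writable host folder
        image,
        "dita", "-i", f"/src/{map_file}", "-f", "html5", "-o", "/out",
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)`. If the build needs to fetch binary resources from a specific host, the `--network none` setting would have to be replaced with a more selective network policy.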

Many regards,
Dan Caprioara

