Copy and Paste Scripts from PDF (from a codeblock) losing formatting when copied to a text editor #PDF

Eric Sirois
 

Hi,

I have a client who is running into an issue where they are trying to copy and paste a script from a PDF into Notepad++ or Ultraedit.  The script is in a codeblock in the source and the PDF does display the content with the proper indentation/spaces as expected.  The issue is that when they copy and paste from the PDF it loses all the formatting except in one case (PDF Architect 6).  Acrobat Reader and PDF Architect 7 don't work as expected.  Has anyone else run into the issue before?  

Eric

David H
 

Hi Eric,

AFAIK, there is no formatting in a PDF file. I would suggest that PDF Architect 6 has inferred the formatting. So, you could suggest a support call to find out why PDF Architect 7 doesn't give the same results as 6?

HTH,
David

Wayne Brissette
 

Eric Sirois wrote on 2020-05-08 12:55:
I have a client who is running into an issue where they are trying to copy and paste a script from a PDF into Notepad++ or Ultraedit.  The script is in a codeblock in the source and the PDF does display the content with the proper indentation/spaces as expected.  The issue is that when they copy and paste from the PDF it loses all the formatting except in one case (PDF Architect 6).  Acrobat Reader and PDF Architect 7 don't work as expected.  Has anyone else run into the issue before?
Seems to me that this is more a function of the OS's clipboard format and what the application captures and hands off to the OS. Are you thinking that maybe this has something to do with the way the OT is handling <codeblock>?

Wayne Brissette
 

David H wrote on 2020-05-08 13:04:
Hi Eric,

AFAIK, there is no formatting in a PDF file. I would suggest that PDF Architect 6 has inferred the formatting. So, you could suggest a support call to find out why PDF Architect 7 doesn't give the same results as 6?
David, I'm assuming that by formatting he means spaces. It would also be interesting to know whether tabs or actual spaces were used in the codeblock, if that is what he is referring to as the 'formatting'. There's actually quite a bit of formatting in a PDF, but I think that's semantics, because I would consider formatting to be anything related to how an object gets rendered, and there's a load of that within a PDF.

-Wayne

Eric Sirois
 

Hi,

Yes, it's tabs and spaces.  Everything is fine in the topic.fo, as the spaces/tabs are there (as expected), and they are displayed properly in the PDF (as expected). For instance, XML snippets with indentation. It's taking it from the PDF to a text editor that is the issue.  And it seems it doesn't really matter which formatter was used to create the PDF (AH, FOP, XyVision).  They all have the same issue.  All the spaces/tabs disappear when pasted into a text editor. If I paste the snippet in oxygen, it seems fine, but I'm assuming oxygen is likely inferring something about the XML markup. When I copy a command line into a text document in oxygen, it does not maintain the spaces/indentation either.

Eric

Radu Coravu
 

Hi Eric,

About this remark:

If I paste the snippet in oxygen, it seems fine, but I'm assuming oxygen is likely inferring something about the XML markup.
If you paste XML content in the Oxygen XML Editor's "Text" editing mode, Oxygen will indeed indent the XML content obtained from the clipboard by default. In the Oxygen Preferences "Editor / Format / XML" page there is an "Indent on paste" checkbox, which is checked by default. You can uncheck it to paste exactly the content obtained by Oxygen from the clipboard, without any indentation.

Regards,
Radu

Radu Coravu
<oXygen/> XML Editor
http://www.oxygenxml.com

Chris Brand
 

Hi Eric

 

We have this issue as well. We normally open PDFs with the browser plugin. Firefox for example ignores "formatting" in codeblocks completely, while Chrome and IE preserve this information. Give it a shot using these browsers.

 

Chris.

 

Patricia Billard
 

The problem is with PDF.  My research on the issue several years ago revealed that the spaces are not honored as actual characters.  Instead, PDF kind of has text like an image, and estimates how much space to put in, etc.  Basically the spacing info is lost when copying and pasting out of PDF.  Here's one of many discussions on the subject:  https://superuser.com/questions/198392/how-to-copy-text-out-of-a-pdf-without-losing-formatting
Maybe things have changed since then, but I don't think so.  One workaround seems to be to open the PDF using Word.  We ended up pointing folks to the HTML version of our content if they wanted to copy/paste code examples.
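
To make the "estimates how much space to put in" point concrete, here is a minimal Python sketch of how a viewer might rebuild indentation from positioned text fragments. Everything in it is an assumption for illustration: the `(x, y, text)` fragments are hand-written rather than read from a real PDF, the font is assumed to be monospaced with a known character width, and `rebuild_lines` is a hypothetical helper, not the algorithm of any particular PDF reader.

```python
from collections import defaultdict

# Hypothetical fragments: (x position, y position, text), in PDF points.
# Note that none of the strings contains a leading space.
FRAGMENTS = [
    (72.0, 700.0, "<codeblock>"),
    (90.0, 688.0, "<indent>"),
    (108.0, 676.0, "<new>"),
    (90.0, 664.0, "</indent>"),
    (72.0, 652.0, "</codeblock>"),
]


def rebuild_lines(fragments, char_width=6.0, left_margin=72.0):
    """Guess the original text lines from positioned fragments.

    Assumes a monospaced font whose glyphs are char_width points wide
    and a block that starts at left_margin.
    """
    rows = defaultdict(list)
    for x, y, text in fragments:
        rows[y].append((x, text))

    lines = []
    # The PDF y axis grows upward, so the top line has the largest y.
    for y in sorted(rows, reverse=True):
        line = ""
        for x, text in sorted(rows[y]):
            column = round((x - left_margin) / char_width)
            # Pad with spaces up to the column implied by the x offset.
            line = line.ljust(column) + text
        lines.append(line)
    return lines


if __name__ == "__main__":
    print("\n".join(rebuild_lines(FRAGMENTS)))
```

A viewer that skips this kind of guess and simply concatenates the fragments would hand the clipboard text with no indentation at all, which matches the symptom described in this thread.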

Patricia Billard
 

A couple of recent things, having to do with Python, which depends on spacing as part of its syntax:
https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/
https://automatetheboringstuff.com/chapter13/
--Trish

Wayne Brissette
 

I was really curious about this myself. Eric pointed me to the DITA 1.3 spec, which he said also had this issue. Specifically, he pointed me to page 526.

On that page in the PDF, there's a codeblock. When you copy that text out of the code box and paste it, he's right: it loses any semblance of the spacing used as formatting.

Adobe Acrobat:

<codeblock>returnType methodName(pList1, pList2) {</codeblock>
where
<parml>
<plentry>
<pt>pList1</pt>
<pd>is the first variable declaration passed to methodName</pd>
</plentry>
<plentry>
<pt>pList2</pt>
<pd>is the second variable declaration passed to methodName</pd>
</plentry>
</parml>

PDF Reader Pro (Mac OS product):

<codeblock>returnType methodName(pList1, pList2) {</codeblock>
where
<parml>
 <plentry>
  <pt>pList1</pt>
  <pd>is the first variable declaration passed to methodName</pd>
 </plentry>
 <plentry>
  <pt>pList2</pt>
  <pd>is the second variable declaration passed to methodName</pd>
 </plentry>
</parml>

Preview (Mac OS built-in PDF viewer from Apple):


<codeblock>returnType methodName(pList1, pList2) {</codeblock>
where
<parml>
 <plentry>
  <pt>pList1</pt>
  <pd>is the first variable declaration passed to methodName</pd>
 </plentry>
 <plentry>
  <pt>pList2</pt>
  <pd>is the second variable declaration passed to methodName</pd>
 </plentry>
</parml>

The PDF was rendered using Antenna House and nothing fancy when it comes to type. However, peeking into the PDF gave me some very unexpected results.

Normally the Ruby library PDF-Reader does a really good job of providing me text. However, for whatever reason, instead of the text, the DITA spec gave me this nonsense:
▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯
▯▯▯▯▯▯▯▯▯▯▯▯▯▯  ▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯▯

A hex editor peering into that page also gave me similar nonsense. So I'm not sure what's going on there; it could be a red herring, and this is a different document than the one you were actually testing.

But again, what is surprising is what happened in my next test. I created a simple codeblock and put the following text in it:

<codeblock>
   <indent>
      <new>
   </indent>
</codeblock>

Acrobat DC lost the formatting. Preview and PDF Reader Pro spit back an odd mix of stuff:

<codeblock>
<indent> <new>
   </indent>
</codeblock>

However, looking at that code with the PDF-Reader library gave me exactly what I put into the oXygen editor before the DITA OT generated the PDF.

While not hugely scientific because of the limited testing, it at least gives me some hints that Acrobat DC, regardless of rendering engine, just gives the clipboard the text without any semblance of the spacing in the original. Other PDF readers, which use other tools for rendering the content, seem to do a better job, but they aren't perfect either.

At the end of the day, I think the only way you would have any consistency would be via HTML output from the DITA OT. While the display qualities of a PDF are great, seems that any hope of maintaining that via copy-and-paste is lost.

-Wayne

Larry Kollar
 

Do a web search for "pretty print <script language>" to see if there's a utility that can fix up the pasted content automatically.
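
For XML snippets specifically, a few lines of Python can do this kind of fix-up. The sketch below is only an illustration: it assumes Python 3.9 or later (for `xml.etree.ElementTree.indent`), it assumes the pasted text is well-formed XML with a single root element, and `reindent_xml` is a hypothetical helper name. The sample input is the flattened `<parml>` block as pasted from Adobe Acrobat earlier in this thread.

```python
import xml.etree.ElementTree as ET

# Flattened XML as it arrived from the clipboard (indentation lost).
PASTED = """<parml>
<plentry>
<pt>pList1</pt>
<pd>is the first variable declaration passed to methodName</pd>
</plentry>
<plentry>
<pt>pList2</pt>
<pd>is the second variable declaration passed to methodName</pd>
</plentry>
</parml>"""


def reindent_xml(text, indent=" "):
    """Return the XML in text re-indented with one indent unit per level."""
    root = ET.fromstring(text)
    # ET.indent rewrites whitespace-only text between elements (Python 3.9+).
    ET.indent(root, space=indent)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(reindent_xml(PASTED))
```

This produces a consistent indentation, not necessarily the one the author originally used, and it does not help with languages such as Python or YAML where the lost whitespace carries meaning.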

I think this happens when the formatter replaces leading spaces with a "start here" command inside the PDF. You could use a utility like mutool to examine the PDF and confirm my guess.
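
As a rough illustration of that guess (not a confirmation of it), the Python sketch below walks a hand-written, heavily simplified content stream in which each line of the codeblock is placed with a `Td` move and shown with `Tj`, so the indentation exists only as an x offset. The sample stream, the regular expression, and the helper name `shown_text_with_offsets` are all assumptions made up for this example; real content streams are usually compressed and use more operators and font encodings than this handles.

```python
import re

# Hand-written sample: indentation is carried by Td offsets, and the
# strings passed to Tj contain no leading spaces.
SAMPLE_STREAM = """
BT
/F1 10 Tf
72 700 Td
(<codeblock>) Tj
18 -12 Td
(<indent>) Tj
18 -12 Td
(<new>) Tj
-18 -12 Td
(</indent>) Tj
-18 -12 Td
(</codeblock>) Tj
ET
"""

# Matches either "tx ty Td" or "(string) Tj"; nothing else is handled.
TOKEN = re.compile(
    r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s+Td|\(([^)]*)\)\s*Tj"
)


def shown_text_with_offsets(stream):
    """Return (x, y, text) for every Tj, tracking relative Td moves."""
    x = y = 0.0
    shown = []
    for match in TOKEN.finditer(stream):
        if match.group(1) is not None:
            # Td moves relative to the start of the current line.
            x += float(match.group(1))
            y += float(match.group(2))
        else:
            shown.append((x, y, match.group(3)))
    return shown


if __name__ == "__main__":
    for x, y, text in shown_text_with_offsets(SAMPLE_STREAM):
        print(f"x={x:6.1f}  y={y:6.1f}  {text!r}")
```

If a real PDF looks like this inside, a reader that copies only the `Tj` strings has no space characters to put on the clipboard, and any indentation it does produce has to be inferred from the offsets.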