
How correct is your Simplified Chinese Index? #DITA-OT #PDF

Toshihiko Makita
 

Hi List,

 

If you are publishing DITA documents that contain <indexterm> elements, you can generate index pages by specifying backmatter/booklists/indexlist. If your publication also includes a Simplified Chinese localization, the index list is generated by sorting the <indexterm> elements with the following sort keys:

 

   pinyin-reading/strokes/radical/GB0 code

 

However, there is a troublesome problem in making index pages. A Hanzi (Chinese character) sometimes has multiple pinyin readings, and only the most frequently used reading is adopted for sorting and grouping <indexterm> elements.

 

For instance:

 

  1. "调速系统" has the reading "tiao2 su4 xi4 tong3" and the meaning is "Speed control system".
  2. "调查结果" has the reading "diao4 cha2 jie2 guo3" and the meaning is "Survey results".

 

Item 1 should be grouped under "T" and item 2 under "D" according to their readings.

Surprisingly, both are grouped under "D" unconditionally, because the representative reading defined in the Unihan database (ftp://unicode.org/Public/13.0.0/ucdxml/ucd.unihan.flat.zip) for "调" is "diao4".
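
The effect can be reproduced with a minimal Python sketch. The one-entry table below is only a stand-in for the per-character representative readings; the reading of 调 (diao4) is taken from this post, while the table name and function name are hypothetical and do not belong to any real collator.

```python
# Hypothetical stand-in for per-character representative readings.
# Only the entry for 调 (diao4) comes from this post.
REPRESENTATIVE_READING = {"调": "diao4"}


def group_by_first_char(term):
    """Return the index group letter from the first character's
    representative reading, ignoring the word it appears in."""
    reading = REPRESENTATIVE_READING[term[0]]
    return reading[0].upper()


for term in ("调速系统", "调查结果"):
    # Both terms start with 调, so both land in the same group.
    print(term, group_by_first_char(term))
```

Because the lookup sees only the single character 调, it cannot tell "tiao2" (as in 调速) from "diao4" (as in 调查).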

 

Here is a sample PDF result generated by the PDF2 plug-in (DITA-OT 3.5.1).




This is a well-known problem and cannot be avoided as long as the index-sorting program uses ICU (http://userguide.icu-project.org/collation) or the Java collator directly. (Both collators may use the pinyin readings defined in the Unihan database.)

We (Antenna House) have been working on this problem and have developed a new dictionary-based index sorting in the I18N Index Library (https://www.antennahouse.com/i18n-index-library). It is still under development, but it already generates correct results for the example above. (Output via the PDF5-ML plug-in: https://github.com/AntennaHouse/pdf5-ml)



The dictionary-based index sorting outputs the following log:

     [xslt] [readKeyFile][DEBUG] Unihan database entry=41377
     [xslt] [readDictionaryFile][DEBUG] Dictionary entry=189082
     [xslt] [readDictionaryFile][DEBUG] User dictionary entry=5
     [xslt] [getKey][DEBUG] Processing indexterm=调速系统
     [xslt] [processHanziKey][DEBUG] Got pinyin from dictionary! word=调速 pinyin=tiao2 su4
     [xslt] [processHanziKey][DEBUG] Got pinyin from dictionary! word=系统 pinyin=xi4 tong3
     [xslt] [getKey][DEBUG] Processing indexterm=调查结果
     [xslt] [processHanziKey][DEBUG] Got pinyin from dictionary! word=调查结果 pinyin=diao4 cha2 jie2 guo3

This shows that the dictionary-based method is useful for generating Simplified Chinese index pages. We hope to refine this library so that it generates index pages automatically and more accurately.
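
The word lookups in the log above can be imitated with a small Python sketch. It assumes a greedy longest-match segmentation, which is only one plausible strategy and is not confirmed to be what the I18N Index Library does. The three dictionary entries mirror the log; the names WORD_DICTIONARY, CHAR_FALLBACK, pinyin_key and group_letter are hypothetical.

```python
# Hypothetical word dictionary; the three entries mirror the log above.
WORD_DICTIONARY = {
    "调速": "tiao2 su4",
    "系统": "xi4 tong3",
    "调查结果": "diao4 cha2 jie2 guo3",
}

# Hypothetical per-character fallback for text not covered by any word.
CHAR_FALLBACK = {"调": "diao4"}


def pinyin_key(term):
    """Build a pinyin sort key by greedy longest-match word lookup."""
    readings = []
    pos = 0
    while pos < len(term):
        # Try the longest candidate starting at pos first.
        for end in range(len(term), pos, -1):
            word = term[pos:end]
            if word in WORD_DICTIONARY:
                readings.append(WORD_DICTIONARY[word])
                pos = end
                break
        else:
            # No dictionary word starts here: use the single character.
            ch = term[pos]
            readings.append(CHAR_FALLBACK.get(ch, ch))
            pos += 1
    return " ".join(readings)


def group_letter(term):
    """Return the index group letter from the dictionary-based key."""
    return pinyin_key(term)[0].upper()


for term in ("调速系统", "调查结果"):
    print(term, pinyin_key(term), group_letter(term))
```

With word-level lookup, the reading of 调 depends on the word that contains it, so the two terms fall into different groups.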

If you are interested in this library, could you provide your Simplified Chinese DITA publication data for evaluation?

  1. The file we need is the merged intermediate file in the temporary directory that DITA-OT generates. It is usually named xxx_MERGED.xml (xxx is the map file name).
  2. If you provide the merged intermediate file, we will extract only the <indexterm> elements and generate index pages with both the usual method and the dictionary-based method.
  3. We will send you both results and an analysis report free of charge.
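
If you would like to check which terms such a file contains before sending it, a rough Python sketch for listing <indexterm> text is shown below. It is not the extraction we perform; the sample XML is a simplified, made-up fragment rather than real merged output, it collects only the text directly inside each <indexterm>, and the name extract_indexterms is hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_indexterms(xml_text):
    """Collect the leading text of every <indexterm> element."""
    root = ET.fromstring(xml_text)
    terms = []
    for elem in root.iter("indexterm"):
        # elem.text is the text before any nested child element.
        text = (elem.text or "").strip()
        if text:
            terms.append(text)
    return terms


# Simplified, made-up fragment used only for illustration.
SAMPLE = """<map><topic><prolog><metadata><keywords>
<indexterm>调速系统</indexterm>
<indexterm>调查结果</indexterm>
</keywords></metadata></prolog></topic></map>"""

print(extract_indexterms(SAMPLE))
```

For a real file, the same function could be given the contents of xxx_MERGED.xml read as UTF-8 text.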

Hope this helps your DITA publishing.

Regards,

-- 
/*----------------------------------------------------------------------- 
 Toshihiko Makita
 Development Group. Antenna House, Inc. Ina Branch
 E-Mail tmakita@...
 Web site:
 http://www.antenna.co.jp/
 http://www.antennahouse.com/
 ------------------------------------------------------------------------*/