Joergs "Document" Page

This page contains a collection of tips and links around document handling, document formats (such as PostScript and PDF), text processing, typesetting, and related topics. Most of this information is centered around Open Source Software. Note that these are links which are of interest for my own work; I do not try to build a comprehensive reference list here ;-)

Typesetting

LyX and LaTeX

For those who are used to today's so-called "Office" software packages, LyX will be a pretty unusual piece of software. You concentrate on the content, just telling the software what is a caption, what is a citation, etc., and all the rest is done for you. There is no such thing as entering two spaces in a row to put words further apart, or hitting the Enter key twice to mark a paragraph.

LyX uses LaTeX to do the actual typesetting. LaTeX is a mature typesetting system that produces high-quality, professional output that is both readable (in the sense of "easily accessible") and appealing to the eye.

I have been using both LyX and LaTeX for many years now for different purposes:

Writing technical documentation
Writing letters, both for my company and private
Generating invoices from an SQLite database
Generating and printing various lists for my seminars and for our theater group
Generating printed exams in 2 different versions and 2 different languages
Printing form letters (mail merge)
... and many others

Templates

Together with KOMA-Script, LyX/LaTeX can make use of letter templates that define the whole layout of a letter.

They should go into the ~/texmf/tex/latex/ tree which is in your home directory. As an example, my private letterhead is in ~/texmf/tex/latex/jha.lco and that of my company is in ~/texmf/tex/latex/qualitycoach.lco

Custom field names

By default, KOMA-Script letters provide fields such as Customer number, Your reference, Invoice number, etc.

To change these names, simply modify the corresponding fields in the LaTeX preamble:

\setkomavar{yourref}[ID]{}
\setkomavar{customer}[VAT No.]{}
\setkomavar{invoice}[Account]{}

Editing on the command line

LyX and LaTeX files are plain-text files. This makes it easy to search-and-replace from the command line.

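Here is an example where an e-mail address is changed in all .lyx files in the current directory and below. Note that the @ character needs to be escaped, and that the dots in the search pattern should be escaped too so that they match literally. The option -i.bak keeps a backup copy of each original file, and -print0 together with xargs -0 keeps file names with spaces intact:

find . -type f -name "*.lyx" -print0 | xargs -0 perl -p -i.bak -e 's/me\@gmail\.com/contact\@example.org/gi;'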
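
Before rewriting files in place, it can help to see which files would be touched. The following is only a minimal sketch and not part of my original workflow; the function name list_matching_lyx is made up for illustration, and the address is the example address used in the command shown before.

```shell
#!/bin/sh
# Hypothetical helper: list all .lyx files below a directory that contain
# the old e-mail address. Nothing is modified.
list_matching_lyx() {
    dir=${1:-.}
    # grep -l prints only the names of files with at least one match;
    # -i mirrors the case-insensitive /i flag of the perl substitution.
    find "$dir" -type f -name '*.lyx' -exec grep -il 'me@gmail\.com' {} +
}
```

Call it as list_matching_lyx . from the directory that holds the documents.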

We can then re-generate the PDF files without opening LyX. The following command will overwrite existing PDF files without warning:

find . -type f -name "*.lyx" -print0 | xargs -0 lyx --export pdf2

pdf2 simply means that the PDF files are to be generated by pdflatex.

Page numbering

By default, the LyX/LaTeX document class "book" uses American page counting. If you want to use the "European" style (that is, counting the front matter in roman numerals until the end of the table of contents, then re-starting with arabic numerals), you can use the commands \frontmatter and \mainmatter directly in the text (insert as "ERT", i.e. LaTeX code).

Note that this is only valid for the document class "book". If you have another document class, or if you need more customisation, you can make use of the command \pagenumbering{style}, where style is one of the following:

arabic: arabic numerals (default)
roman: lowercase roman numerals
Roman: uppercase roman numerals
alph: lowercase letters
Alph: uppercase letters

By using \pagenumbering, the corresponding page counter is automagically set to 1 (at the point where the code appears). To force any other page number, use \setcounter{page}{number}.

(Many thanks to Peter Ehrbar and Roman Merz from ch.comp)

Hyperlinks

LyX/LaTeX can create active ("clickable") hyperlinks in your documents. All recent versions of LyX treat URLs as you would expect. If you would like to replace the default typewriter font for URLs by italics, simply redefine it in the LaTeX preamble:

\def\UrlFont{\itshape}

Caption Settings

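By default, captions to figures and tables are set in the same font as the text. If you want to change this behaviour, use the package caption2 (which superseded the original caption package; note that in current TeX distributions caption2 is itself obsolete and has been replaced by version 3 of the caption package) and add in the LaTeX preamble: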

\usepackage[it,small]{caption2}

This sets the whole caption in a smaller font and the caption title in italics. If you want the complete caption to appear in italics, use the following code in the LaTeX Preamble:

\usepackage{caption2}
\renewcommand\captionfont{\itshape\small}

The spacing above and below the caption is controlled by \abovecaptionskip (default 10 pt) and \belowcaptionskip (default zero). You can use the standard LaTeX commands \setlength and \addtolength to modify these values. As an example,

\addtolength\belowcaptionskip{0.5cm}

will add 5 mm below the caption.

Changing the font

By default, LaTeX will use a serif font (Computer Modern). If you want to use a sans-serif base font with LyX, the font selection under "Layout | Document" on its own does not work; you will need to add \renewcommand{\familydefault}{\sfdefault} in the document preamble.

€

To use the EURO symbol (€) in LyX, insert a TeX box and write \euro{} inside. Mark the entire box and change the language of the marked box to Greek - this is necessary since \euro (without any packages) works in Greek but not otherwise.

(Thanks to Helge Hafting)

Suppress Date

By default, LaTeX will insert the current date (that is, the date the document was last processed) in several document classes, such as "book". To set a fixed date, enter the date you wish (in LaTeX terms, \date{your date}); to suppress the date line completely, use \date{} in the document preamble.

Red edge

Sometimes I need to make sure that a letter stands out from other letters, such as a reminder for a late payment. Adding the following code in the document preamble will result in a nice, red, 1.5 cm wide border that runs along the entire page of a letter:

\usepackage{background}
\backgroundsetup{scale=1,angle=0,hshift=9.2cm,color=red,opacity=0.8,contents=\rule{1.5cm}{\paperheight}}

Note: this may not show up in the DVI preview, but the resulting PDF is fine.

Ignore French list layout

In LaTeX, the layout of lists changes when you switch to the French language. As an example, it will never use the bullet symbol in lists, will not insert spaces between list items, etc. The detailed behavior and its customization are explained in the frenchb documentation.

If this bothers you, you can revert to the "English" behavior by inserting the following in the preamble:

\usepackage[french]{babel}
\frenchbsetup{StandardLayout=true}

Conversion to HTML

I have tried many different HTML converters, but the best "translator" of TeX into HTML and OpenDocument that I am aware of is Eitan M. Gurari's TeX4ht.

We have used this software extensively as part of the HitKeeper project, where the documentation is entirely written and maintained in LyX. TeX4ht is used to automagically generate interactive (!) HTML pages from the LyX documentation.

Duplex Printers

To test if a PostScript printer is actually capable of printing duplex, copy the following lines into a file and send it to the printer:

%!
<</Duplex true>> setpagedevice
clippath stroke showpage
clippath stroke showpage

If there is a small frame on both sides of one sheet of paper, the test is passed. If there are two sheets of paper, each with one frame on one side, duplex is turned off.

Another thing to keep in mind is that the printer needs memory for duplex printing. According to a posting in comp.lang.postscript (1998), an HP LaserJet 5M needs the following for duplex printing:

13 MB of memory if you want to print at 600 dpi, and
6 MB for 300 dpi (6 MB is the factory-installed memory).

Duplex Printing

If you have a PostScript printer that is capable of duplex printing (i.e., on both sides of the same sheet of paper), you may want to enable duplex printing within your PostScript documents. To enable duplex printing of "normal" documents, use:

<</Duplex true>> setpagedevice

in the header (usually the second line of the file). The top of the pages will then be on the same short edge of a DIN A4 sheet - aka "long edge binding" (which should also work with "4-up" documents).

For combined 2-up AND duplex printing, we need "short edge binding", i.e. the top of the (2-up!) pages oriented towards the same long edge of the paper. In this case, use the following code:

<</Duplex true /Tumble true>> setpagedevice

In the section below you can find a simple script that does exactly this job.
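
To illustrate just the insertion step, here is a minimal sketch (it is not the makebook.sh script). The function name add_duplex is made up for illustration, and the sketch assumes that the first line of the input is the %! header line.

```shell
#!/bin/sh
# Hypothetical helper: copy a PostScript file from stdin to stdout and
# insert a setpagedevice line directly after the first line.
# Pass "tumble" as the first argument for short-edge binding.
add_duplex() {
    if [ "${1:-}" = "tumble" ]; then
        line='<</Duplex true /Tumble true>> setpagedevice'
    else
        line='<</Duplex true>> setpagedevice'
    fi
    # Print line 1 unchanged, then the duplex line, then everything else.
    awk -v line="$line" 'NR == 1 { print; print line; next } { print }'
}
```

A typical call would be add_duplex tumble < in.ps > out.ps, after which out.ps can be sent to the printer.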

Booklet Printing

When printing manuals, HOWTOs, READMEs etc., I like to save paper. One way to achieve this is to print "n-up", i.e. n pages per side of a sheet, and duplex where available. Even more elegant, especially for longer documents, is rearranging the pages so that you can fold them into a handy booklet.

The problem lies in the code needed to actually activate duplex printing with short-edge binding on the printer. The script below uses psutils to rearrange the pages "2-up" and recto/verso, so that 4 pages in DIN A4 format fit onto one physical sheet. A sed script then inserts the duplex code into the output file, which is then ready to be sent to a printer.
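
To make the page rearrangement more tangible, the following sketch computes the usual page order for a booklet that is folded as one single stack. This is an illustration under my own assumptions, not code taken from makebook.sh or psutils: the function name booklet_order is made up, and the order that a given tool produces may differ in details.

```shell
#!/bin/sh
# Hypothetical helper: print the page order for a single-stack booklet.
# The page count is padded to a multiple of 4; padding pages are shown
# as "blank". Each sheet carries 4 pages: two on the front, two on the back.
booklet_order() {
    n=$1
    total=$(( (n + 3) / 4 * 4 ))
    i=0
    order=""
    while [ "$i" -lt $(( total / 4 )) ]; do
        # front: last remaining page, first remaining page
        # back:  next page from the start, next page from the end
        for p in $(( total - 2 * i )) $(( 1 + 2 * i )) \
                 $(( 2 + 2 * i )) $(( total - 1 - 2 * i )); do
            if [ "$p" -gt "$n" ]; then p=blank; fi
            order="$order $p"
        done
        i=$(( i + 1 ))
    done
    # strip the leading space before printing
    printf '%s\n' "${order# }"
}
```

For an 8-page document, booklet_order 8 prints 8 1 2 7 6 3 4 5: the first sheet carries pages 8 and 1 on one side and pages 2 and 7 on the other.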

Click here to download the makebook.sh script.

Adding PDF metadata

If you use ps2pdf to convert PostScript files to PDF, you will have noticed that metadata such as the "General Info" fields in the resulting PDF file are usually blank. This can be avoided by adding a simple "preamble" to the PostScript file before you pass it to ps2pdf:

% Document information
[ /Author (Who am I)
  /Title (This is my title)
  /Subject (This is a description of the Subject)
  /Creator (What Software)
  /CreationDate (Spring 1994)
  /ModDate (Converted to PDF in 2002-08)
  /Keywords (key, words, go, here)
  /DOCINFO pdfmark
%% here comes the unchanged PS file
%!PS-Adobe-3.0
...

The resulting PostScript file can still be converted to PDF without problems. However, if you really want to ensure that this PostScript file can also be printed on devices that are not aware of the pdfmark operator (such as printers), you can add a fallback definition to the prologue of the PostScript file. It leaves pdfmark alone where it exists and otherwise defines it as cleartomark, so that the DOCINFO block is silently discarded:

/pdfmark where {pop} {userdict /pdfmark /cleartomark load put} ifelse

In this case, the DOCINFO code above should of course be included in the body of the PostScript code, not at the beginning.

Source: pdfmark Reference Manual (Adobe Technical Note #5150).
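
If you do this regularly, the preamble can be generated by a few lines of shell. This is a minimal sketch with a made-up function name (add_docinfo) that covers only the Title and Author fields; it assumes that neither value contains parentheses or backslashes, which would have to be escaped inside PostScript strings.

```shell
#!/bin/sh
# Hypothetical helper: write a DOCINFO pdfmark block, followed by the
# unchanged PostScript file, to stdout.
# Arguments: title, author, PostScript file.
add_docinfo() {
    title=$1
    author=$2
    psfile=$3
    printf '[ /Title (%s)\n  /Author (%s)\n  /DOCINFO pdfmark\n' "$title" "$author"
    cat "$psfile"
}
```

Usage would be along the lines of add_docinfo "This is my title" "Who am I" in.ps > in-info.ps, followed by ps2pdf in-info.ps out.pdf.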

Changing PDF metadata

When you create PDF from other applications, the metadata are usually either blank or filled with some garbage. To change these data, you can use pdftk (PDF Toolkit) with its command update_info:

pdftk file.pdf update_info /path/to/info.txt output file2.pdf \
&& mv file2.pdf file.pdf

The file /path/to/info.txt is a simple text file with key-value pairs:

InfoKey: Title
InfoValue: Document Title here
InfoKey: Author
InfoValue: John Doe <john.doe@foo.com>
InfoKey: Subject
InfoValue: I like to put some general Document class or the Company name here
InfoKey: Keywords
InfoValue: Keyword, other Keyword, third Keyword
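
When several files need the same kind of metadata, the info file can be generated on the fly. The following is a minimal sketch; the function name write_pdf_info is made up, and only the Title and Author keys are written.

```shell
#!/bin/sh
# Hypothetical helper: print a pdftk info file with Title and Author
# to stdout. Arguments: title, author.
write_pdf_info() {
    cat <<EOF
InfoKey: Title
InfoValue: $1
InfoKey: Author
InfoValue: $2
EOF
}
```

Redirect the output to a file, e.g. write_pdf_info "Document Title here" "John Doe" > info.txt, and pass that file to pdftk with update_info as shown above.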

Adding a watermark to a PDF file

To add a watermark to each and every page of an existing PDF document, simply create a separate file watermark.pdf that contains only the watermark on an otherwise empty page. Then, combine the two using the multibackground command of pdftk:

pdftk input.pdf multibackground watermark.pdf output out.pdf

You can use exactly the same command to overlay two PDF files, e.g. to add comments, annotations or signatures to an existing PDF document. In this case, the file watermark.pdf would contain exactly as many pages as the file input.pdf.

If the watermark is the same on every page, a shorter syntax is available. The background command uses only the first page of watermark.pdf and applies it to every page of input.pdf:

pdftk input.pdf background watermark.pdf output out.pdf

Note that in the previous commands, the watermark is "under" the original document. If you do not like this (e.g. you want a scanned signature to "cover" the document text) or if your document does not have a transparent background, you can use the stamp and multistamp commands in the same way:

pdftk input.pdf stamp watermark.pdf output out.pdf

Splitting a PDF page into smaller pages

Some on-line services create shipping labels in A6 or A7 format, but the "surrounding" paper format is usually A4. To split such a page into "real" A6- or A7-sized pages, I'm using mutool:

mutool poster -x 2 -y 2 infile-A4.pdf outfile-A6.pdf
mutool poster -x 2 -y 4 infile-A4.pdf outfile-A7.pdf

The resulting files contain multiple pages in the smaller format, which can then be printed e.g. with a label printer.

Re-writing PDF to accelerate printing

Recently I found that documents in PDF 1.7 format print rather slowly on my printer, in spite of the fact that these documents contain only text.

To speed up printing, I now rewrite these files as PDF 1.4:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE \
-dQUIET -dBATCH -dDetectDuplicateImages -dNOTRANSPARENCY -dCompressFonts=true \
-sOutputFile=x.pdf file.pdf && mv x.pdf file.pdf

Some PDF tricks

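If you want to remove the cover page from a document you received (e.g. a fax), select only the remaining pages. ImageMagick counts pages from 0, so [1-4] selects pages 2 to 5 and drops the cover page. The quotes prevent the shell from interpreting the brackets:

convert -verbose -page A4 -compress Zip 'multipagefax.tif[1-4]' fax.pdf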

If you have a multi-page PDF file that you would like to OCR, you can split it into individual PNG files and run the OCR on each of them:

pdftoppm -png -r 220 file.pdf prefix
for i in prefix*.png; do tesseract -l fra "$i" "${i%.*}" pdf; done

You can link to a particular page or section in a pdf document in the following way:

<a href="http://www.mydomain.com/myPDF.pdf#page=7">Link text</a>
<a href="http://www.mydomain.com/myPDF.pdf#nameddest=TOC">Link text</a>

A prerequisite is that the document is served via HTTP ... in other words, this technique does not work if the document is merely in the local file structure. In LyX-created PDF files, nameddest is something like section*.3, section.2.2, subsection.5.2.2, table.2.1 ... you get the idea.

Scanning multiple pages

The approach that I have been using successfully since 2004 combines two command-line programs, scanimage (from the SANE package, see the Imaging page on this website) and the image conversion utility convert from the ImageMagick package. First, scan the pages to files:

scanimage --mode Gray --resolution 300 -x 210 -y 297 > page01.pnm

... increase the counter for page02.pnm, page03.pnm, etc. You may also append all scans to a single file, but this will give you trouble if there is a single image in the series that you would like to re-scan. Note that you can change the scan mode per page, which allows you to include color graphics, photos or whatever. When you know that all scans are of the same type, better use the --batch option, which provides a built-in counter. In this case, the output redirection is no longer required:

scanimage --mode Gray --batch=out%03d.pnm --resolution 300 -x 210 -y 297

The switches -x and -y specify that the image area is DIN A4 in size.
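
For the manual, page-by-page variant, a small helper can pick the next free file name so that the counter does not have to be typed by hand. This is a minimal sketch; the function name next_page_name is made up, and it assumes that the scans are stored as pageNN.pnm in the current directory.

```shell
#!/bin/sh
# Hypothetical helper: print the first pageNN.pnm file name that does
# not yet exist in the current directory (page01.pnm, page02.pnm, ...).
next_page_name() {
    n=1
    while [ -e "$(printf 'page%02d.pnm' "$n")" ]; do
        n=$(( n + 1 ))
    done
    printf 'page%02d.pnm\n' "$n"
}
```

A scan would then be started with scanimage --mode Gray --resolution 300 -x 210 -y 297 > "$(next_page_name)".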

scanimage: setting of option failed

If you stumble across the message scanimage: setting of option --br-y failed (Invalid argument), this indicates that you are probably trying to use a scan range that is outside the physical limits of the scanner. Run scanimage -h and look for the indications of the scan range, such as -y 0..296.926mm. I have encountered this problem on HP all-in-one scanners/printers where the y direction is often just below the 297 mm that would be needed to scan a full A4 page. Thus, set y to 296.9 mm ... and do not forget to complain to the manufacturer.
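
The upper limit can also be picked out of the help text automatically. The following is only a sketch: the function name max_scan_height is made up, and it assumes that the help text contains a line of the form -y 0..296.926mm as quoted above. The exact layout of the scanimage -h output may differ between backends.

```shell
#!/bin/sh
# Hypothetical helper: read "scanimage -h" output on stdin and print the
# upper limit (in mm) of the -y range, e.g. 296.926 for "-y 0..296.926mm".
max_scan_height() {
    sed -n 's/^[[:space:]]*-y [0-9.]*[0-9]\.\.\([0-9][0-9.]*\)mm.*/\1/p' | head -n 1
}
```

It would be used as scanimage -h | max_scan_height; the result can then be passed to the -y switch.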

Building the PDF (still without OCR)

When you have finished scanning, combine the files into one:

convert -page A4 -compress Zip page*.pnm allpages.pdf

If you want to remove some shades and specks, try something along this line:

convert -level 5%,95% -page A4 -compress Zip page*.pnm allpages.pdf

Note that file sizes and processing time increase roughly with the square of the resolution: a 150-dpi line art PNM image of an A4 page takes 272 kB, but at 300 dpi the same page occupies 1.1 MB. The resulting PDF files are smaller; count on a reduction by a factor of 2 to 4 if the conversion is invoked without compression. A significant further reduction in file size can be obtained by invoking convert with the option -compress Zip as shown above; with that, a page that occupied 170 kB as uncompressed PDF went down to just under 60 kB.

Even smaller files can be achieved by converting the scanned PDF into PS, then converting it back to PDF. In my environment, the following one-liner usually yields a reduction in file size by about a factor of 15, and the resulting files remain very readable:
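
The quadratic growth is easy to verify with a little arithmetic. The sketch below estimates the size of the pixel data of an A4 line art scan; the function name pbm_bytes is made up, and it assumes 1 bit per pixel with each row padded to whole bytes and ignores the small file header.

```shell
#!/bin/sh
# Hypothetical helper: estimate the pixel data size (in bytes) of an
# A4 page (210 x 297 mm) scanned as line art at the given resolution.
# Assumption: 1 bit per pixel, rows padded to whole bytes, no header.
pbm_bytes() {
    awk -v dpi="$1" 'BEGIN {
        w = int(210 / 25.4 * dpi + 0.5)   # width in pixels
        h = int(297 / 25.4 * dpi + 0.5)   # height in pixels
        row = int((w + 7) / 8)            # bytes per row
        printf "%d\n", row * h
    }'
}
```

pbm_bytes 150 yields 271870 and pbm_bytes 300 yields 1087480, i.e. four times the size for twice the resolution, which matches the 272 kB and 1.1 MB quoted above.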

pdf2ps file.pdf out.ps && ps2pdf -dPDFSETTINGS=/ebook out.ps file.small.pdf && rm out.ps

Note that this will only deal with the images. In other words: if you have OCR'd the file before, this process will destroy any text data.

Keep in mind that the resulting PDF files contain a series of images of the individual pages, i.e. they are not searchable for text. However, these are "real" PDF files.

OCR

For OCR (Optical Character Recognition), we use the same image files that we acquired above and process them with an OCR engine. tesseract is a capable open-source engine that supports multiple dictionaries and can (finally!) create searchable PDF files straight in the OCR process:

tesseract -l eng image01.png image01 pdf

If you have many pages to process, you may want to speed things up using parallel processing:

ls -1 image*.png | parallel tesseract -l eng '{}' '{.}' pdf

Option -l specifies the language to be used. For French, use -l fra; for German, -l deu. There are even dictionaries for scripting languages, etc. The argument image01.png is the input file that we want to OCR, image01 is the prefix for the output file (which will become image01.pdf), and the final keyword pdf indicates that a searchable PDF shall be produced.

Note that tesseract relies on the page size being part of the input (image) file header. This does not seem to be the case with the PNM files created by scanimage and will lead to "unusual", huge page sizes (an A4 page grows to 955x675 mm). The workaround is to specify the resolution and unit explicitly when you convert the PNM files:

convert -density 240x240 -units PixelsPerInch page*.pnm allpages.pdf

See Tesseract issue 150 for details.

Building the PDF (now with OCR)

To combine the PDF files created above, we use the "PDF Toolkit" pdftk:

pdftk in*.pdf cat output outfile.pdf

A problem with the PDF files created by tesseract is that they can be very large. The challenge is to reduce file size without compromising at all on the textual information, and without compromising too much on the image quality. After some experimentation, I found that the following ghostscript command line works well for my purpose:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default \
   -dNOPAUSE -dQUIET -dBATCH -dDetectDuplicateImages -dNOTRANSPARENCY \
   -dCompressFonts=true -sOutputFile=outfile.pdf infile.pdf

The final script

scan2pdf.sh is a shell script that performs the tasks described above with some improvements - such as support for multiple pages, different scanner types with ADF (explicitly: Epson GT-1500 and Fujitsu SP-1120). Since early 2017, it includes the option to enable OCR:

Download the scan2pdf.sh script (updated continuously).

Thanks to Tibor D. from ch.comp. The scanner button functionality was inspired by an article about scanning multiple pages into one PDF file from the Pro-Linux website. Another series of articles on the same site deals with the same topic and further aspects of archival.

File Sizes

The table below shows a quick comparison of the file sizes. The input was the same 2-page letter everywhere, scanned in grayscale.

Key findings:

The additional file size created by the OCR process is minimal (2...4% of the file size)
The files created by scan2pdf.sh are still about 50% bigger than those made with commercial software.
The files created by scan2pdf.sh often have a better image quality with less blurring.

Software	Scanner	Resolution	Size / bytes	Comment
scan2pdf.sh without OCR	Fujitsu SP-1120	225 dpi	657790	Clear text
scan2pdf.sh with OCR	Fujitsu SP-1120	225 dpi	683512	Very clear text
ABBYY FineReader Sprint 6.0, WinXP	Fujitsu SP-1120	300 dpi (225 not possible)	432274	heavy shades, heavy blurring
ABBYY FineReader Sprint 12.0, Win7	Fujitsu SP-1120	300 dpi (225 not possible)	709222	minor shades, clear text
ABBYY FineReader Sprint 12.0, Win7	Epson 1660 Photo	300 dpi (225 not possible)	577245	minor shades, clear text
scan2pdf.sh without OCR	Epson GT-1500	225 dpi	997238	Clear text
scan2pdf.sh with OCR	Epson GT-1500	225 dpi	1019201	Clear text
Epson Scan / ABBYY FineReader Sprint 6.0, WinXP	Epson GT-1500	200 dpi (225 not possible)	433259	Very clear text, slight wash-out

Scanning odd and even pages

If your scanner has ADF but does not support duplex scanning, you will usually scan the odd pages first (1, 3, 5, ...), then flip the pile of paper around and scan the even pages (2, 4, 6, ...). The resulting two PDF files can be combined using pdftk. Because the flipped pile feeds the even pages in reverse order, the second range is given as Bend-1:

pdftk A=myfile.odd.PDF B=myfile.even.PDF shuffle A Bend-1 output myfile.pdf

Source: pdflabs.com

Separating odd and even pages

If you have scanned a pile of documents in duplex mode but actually only want to preserve the odd pages (typical example: bank receipts), you can use pdftk:

pdftk allpages.pdf cat 1-endodd 2 output oddpages.pdf

That single "2" adds only the first even page (= page 2) to the end of that pile. If you don't want this, just omit the "2".

Scanning unusual paper sizes

To scan unusual paper sizes on a scanner such as the Fujitsu SP-1120:

scanimage --mode Color --resolution 220 --paper-size Custom --page-width 160mm --page-height 410mm \
--batch=out%03d.pnm --source Adf-front --autofeed=yes --multifeed-detection Do-not-detect
for i in out*.pnm; do convert -density 220x220 -compress Zip "$i" "${i%.*}.pdf"; done

Bibliographic tools

Biblio to RIS format converter

bib2ris is a small program to convert bibliography files from VCH Biblio 2.2 "Archive" format to the "RIS" format, a tagged ASCII file format.

I used the software VCH Biblio to maintain a database of bibliographic references throughout the first decade of my professional career. However, the MS-DOS version of this program soon became obsolete, as the manufacturer concentrated on the MS-Windows versions. Yet ... all MS-Windows versions of VCH Biblio that I have been supplied with had a number of bugs. One of the more important problems was that the "export" function, at least in the 16-bit WinBib 3.x versions, was broken (it omits patent numbers, bails out on long texts, etc.). As the manufacturer was not able to provide a fixed version within reasonable time, I decided to stay on the safe side, i.e. with the old DOS version. But ... the data format of VCH Biblio is proprietary and not publicly documented, thus my data were doomed to be usable with only this (now obsolete) software.

To prevent loss of my precious data, I had to find a way to convert them to another format and this is why I finally wrote bib2ris ;-). The program does exactly one thing: It takes VCH Biblio 2.2 "archive" files and converts them to the "RIS" format. That's it. Nothing else.

Please do not ask about any other version of VCH Biblio or about other output formats. The answer is "no, it doesn't".

This program was designed for a "one-shot" conversion of a bibliography with 1900+ references and the resulting file imported flawlessly into Reference Manager 10.0 ... in other words, it worked perfectly for me.

I am making this software available under the terms and conditions of the GNU General Public License (GPL). This means that the software is available free of charge, including the source code, and that any future version will also remain free.

Click here to download bib2ris (31 kB). The latest version is 20020906; detailed instructions are included in the README file.
