Student: "How long do you want this report to be?"
Professor: "I would like you to think of this paper much like a lady's dress - long enough to cover the subject, yet short enough to keep it interesting."
This page contains a collection of tips and links around document handling, document formats (such as PostScript and PDF), text processing, typesetting, and related topics. Most of this information is centered around Open Source Software. - Note that these are links which are of interest for my own work; I do not try to build a comprehensive reference list here ;-)
For those who are used to today's so-called "Office" software packages, LyX will be a pretty unusual piece of software. You concentrate on the content, just telling the software what is a caption, what is a citation etc., and all the rest is done for you. There is no such thing as entering two spaces following each other to put words further apart, or hitting the Enter key twice to mark a paragraph.
LyX uses LaTeX to do the actual typesetting. LaTeX is a mature, reliable system for professional typesetting that produces high-quality output that is both readable (in the sense of "easily accessible") and appealing to the eye.
I have been using both LyX and LaTeX for many years now for different purposes:
Together with Koma Script, LyX/LaTeX can make use of letter templates that define the whole layout of a letter.
They should go into the ~/texmf/tex/latex/ tree which is in your home directory. As an example, my private letterhead is in ~/texmf/tex/latex/jha.lco and that of my company is in ~/texmf/tex/latex/qualitycoach.lco
By default, Koma Script letters provide fields such as Customer number, Your reference, Invoice number etc.
To change these names, simply modify the corresponding fields in the LaTeX preamble:
\setkomavar{yourref}[ID]{}
\setkomavar{customer}[VAT No.]{}
\setkomavar{invoice}[Account]{}
LyX and LaTeX files are plain-text files. This makes it easy to search-and-replace from the command line.
Here is an example where an e-mail address is changed in all .lyx files in the current directory.
Note that the @ character needs to be escaped:
find . -type f -name "*.lyx" -print | xargs perl -p -i.bak -e 's/me\@gmail\.com/contact\@example.org/gi;'
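Before replacing text in many files at once, it can be useful to see which files would be touched. The following is a minimal sketch, assuming a POSIX find and grep; the function name list_lyx_matches is made up for this example.

```shell
# list_lyx_matches STRING [DIR]
# Print the .lyx files below DIR (default: current directory) that
# contain STRING as a literal string (-F: no regular expression).
list_lyx_matches() {
    find "${2:-.}" -type f -name '*.lyx' -exec grep -l -F -e "$1" {} + | sort
}

# Possible usage:
#   list_lyx_matches 'me@gmail.com'
```

Because grep is called with -F, the @ and the dots need no escaping here.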
We can then re-generate the PDF files without opening LyX. The following command will overwrite existing PDF without warning:
find . -type f -name "*.lyx" -print | xargs lyx --export pdf2
pdf2 simply means that the PDF files are to be generated by pdflatex.
By default, the LyX/LaTeX document class "book" uses American page counting. If you want to use the "European" style (that is, numbering the front matter in Roman numerals until the end of the table of contents, then restarting with Arabic numerals), you can use the commands \frontmatter and \mainmatter directly in the text (insert as "ERT", i.e. LaTeX code).
Note that this is only valid for the document class `book'. If you have another document class, or if you need more customisation, you can make use of the command \pagenumbering{style}, where style is one of the following:
- arabic: Arabic numerals (default)
- roman: lowercase Roman numerals
- Roman: uppercase Roman numerals
- alph: lowercase letters
- Alph: uppercase letters
By using \pagenumbering, the corresponding page counter is automagically set to 1 (at the point where the code appears). To force any other page number, use \setcounter{page}{number}.
(Many thanks to Peter Ehrbar and Roman Merz from ch.comp)
LyX/LaTeX can create active ("clickable") hyperlinks in your documents. All recent versions of LyX treat URLs as you would expect. If you would like to replace the default typewriter font for URLs by italics, simply redefine it in the LaTeX preamble:
\def\UrlFont{\itshape}
By default, captions to figures and tables are set in the same font as the text. If you want to change this behaviour, use the package caption2 (which supersedes caption) and add in the LaTeX Preamble:
\usepackage[it,small]{caption2}
This sets the whole caption in a smaller font and the caption title in italics. If you want the complete caption to appear in italics, use the following code in the LaTeX Preamble:
\usepackage{caption2}
\renewcommand\captionfont{\itshape\small}
The spacing above and below the caption is controlled by abovecaptionskip (default 10 pt) and belowcaptionskip (default zero). You can use the standard LaTeX commands \setlength and \addtolength to modify these values. As an example,
\addtolength\belowcaptionskip{0.5cm}
will add 5 mm below the caption.
By default, LaTeX uses a serif font (Computer Modern). If you want to use a sans-serif base font with LyX, the font selection under "Layout | Document" on its own does not work; you will need to add
\renewcommand{\familydefault}{\sfdefault}
in the document preamble.
To use the euro symbol (€) in LyX, insert a TeX box and write \euro{} inside. Mark the entire box and change the language of the marked box to Greek. This is necessary since \euro (without any packages) works in Greek but not otherwise.
(Thanks to Helge Hafting)
By default, LaTeX will insert the current date (that is, the date the document was last processed) in several document classes, such as "book". To set a fixed date, just enter the date you wish using this style; to suppress the date line completely, use \date{} in the document preamble.
Sometimes I need to make sure that a letter stands out from other letters, such as a reminder for a late payment. Adding the following code in the document preamble will result in a nice, red, 1.5 cm wide border that runs along the entire page of a letter:
\usepackage{background}
\backgroundsetup{scale=1,angle=0,hshift=9.2cm,color=red,opacity=0.8,contents=\rule{1.5cm}{\paperheight}}
Note: this may not show up in the DVI preview, but the resulting PDF is fine.
In LaTeX, the layout of lists changes when you switch to the French language. As an example, it will never use the bullet symbol in lists, will not insert spaces between lists, etc. The detailed behavior and its customization are explained in the frenchb documentation.
If this bothers you, you can revert to the "English" behavior by inserting the following in the preamble:
\usepackage[french]{babel}
\frenchbsetup{StandardLayout=true}
I have tried many different HTML converters, but the best "translator" of TeX into HTML and OpenDocument that I am aware of is Eitan M. Gurari's TeX4ht.
We have used this software extensively in the frame of the HitKeeper project, where the documentation is entirely written and maintained in LyX. TeX4ht is used to automagically generate interactive (!) HTML pages from the LyX documentation.
To test if a PostScript printer is actually capable of printing duplex, copy the following lines into a file and send it to the printer:
%!
<</Duplex true>> setpagedevice
clippath stroke showpage
clippath stroke showpage
If there is a small frame on both sides of one sheet of paper, the test is passed. If there are two sheets of paper, each having one frame on one side, duplex is turned off.
Another thing to keep in mind is that the printer needs memory for duplex printing. According to a posting in comp.lang.postscript (1998), a HP LaserJet 5M needs for duplex printing:
If you have a PostScript printer that is capable of duplex printing (i.e., on both sides of the same sheet of paper), you may want to enable duplex printing within your PostScript documents. To enable duplex printing of "normal" documents, use:
<</Duplex true>> setpagedevice
in the header (usually the second line of the file). The top of the pages will then be on the same short edge of a DIN A4 sheet - aka "long edge binding" (which should also work with "4-up" documents).
For combined 2-up AND duplex printing, we need "short edge binding", i.e. the top of the (2-up!) pages oriented towards the same long edge of the paper. In this case, use the following code:
<</Duplex true /Tumble true>> setpagedevice
In the section below you can find a simple script that does exactly this job.
When printing manuals, HOWTOs, READMEs etc, I like to save paper. One way to achieve this is to print "n-up", n pages per page and duplex where available. Even more elegant, especially for longer documents, is rearranging the pages so that you can fold them into a handy booklet.
The problem lies in the code needed to actually activate duplex printing with short-edge binding on the printer. The script below rearranges pages "2-up" and recto/verso, so that 4 pages in DIN A4 format fit onto one physical sheet (using psutils), and a sed script inserts the duplex code into the output file, which is then ready for sending to a printer.
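A minimal sketch of such a script, not the original one, could look as follows. It assumes that psbook and psnup from psutils are installed; the function names insert_duplex and make_booklet are made up for this example, and depending on the psutils configuration a paper-size option for psnup may be needed.

```shell
# insert_duplex: read PostScript on stdin and write it to stdout with the
# duplex/tumble setup appended after the first line (the %! header).
insert_duplex() {
    sed '1a\
<</Duplex true /Tumble true>> setpagedevice'
}

# make_booklet IN.ps OUT.ps
# Reorder the pages into booklet order (psbook), put two pages on each
# side of a sheet (psnup -2), then insert the duplex code.
make_booklet() {
    psbook "$1" | psnup -2 | insert_duplex > "$2"
}

# Possible usage (file names are placeholders):
#   make_booklet manual.ps manual.booklet.ps
```

The duplex line ends up as the second line of the output, which is where the PostScript snippets above expect it.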
If you use ps2pdf to convert PostScript files to PDF, you will have noticed that metadata such as the "General Info" fields in the resulting PDF file are usually blank. This can be avoided by adding a simple "preamble" to the PostScript file before you pass it to ps2pdf:
% Document information
[ /Author (Who am I)
  /Title (This is my title)
  /Subject (This is a description of the Subject)
  /Creator (What Software)
  /CreationDate (Spring 1994)
  /ModDate (Converted to PDF in 2002-08)
  /Keywords (key, words, go, here)
  /DOCINFO pdfmark
%% here comes the unchanged PS file
%!PS-Adobe-3.0
...
The resulting PostScript file can still be converted to PDF without problems. However, if you really want to ensure that this PostScript file can also be printed on devices that are not aware of the pdfmark operator - such as printers -, you can "undefine" the pdfmark operator in the prologue of the PostScript file:
/pdfmark where {pop} {userdict /pdfmark /cleartomark load put} ifelse
In this case, the DOCINFO code above should of course be included in the body of the PostScript code, not at the beginning.
Source: pdfmark Reference Manual (Adobe Technical Note #5150).
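Adding the preamble by hand gets tedious for more than a few files. The following is a minimal sketch of a helper that writes a reduced DOCINFO preamble (title and author only) in front of an existing PostScript file; the function name ps_with_docinfo and the file names are made up for this example.

```shell
# ps_with_docinfo TITLE AUTHOR FILE.ps
# Print a DOCINFO pdfmark preamble, then the unchanged PostScript file.
# Limitation of this sketch: parentheses and backslashes in TITLE or
# AUTHOR are not escaped, so keep them out of the values.
ps_with_docinfo() {
    printf '[ /Title (%s)\n  /Author (%s)\n  /DOCINFO pdfmark\n' "$1" "$2"
    cat "$3"
}

# Possible usage:
#   ps_with_docinfo 'This is my title' 'Who am I' in.ps > withinfo.ps
#   ps2pdf withinfo.ps out.pdf
```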
When you create PDF from other applications, the metadata are usually either blank or filled with some garbage.
To change these data, you can use pdftk (PDF Toolkit) with its command update_info:
pdftk file.pdf update_info /path/to/info.txt output file2.pdf \
  && mv file2.pdf file.pdf
The file /path/to/info.txt is a simple text file with key-value pairs:
InfoKey: Title
InfoValue: Document Title here
InfoKey: Author
InfoValue: John Doe <john.doe@foo.com>
InfoKey: Subject
InfoValue: I like to put some general Document class or the Company name here
InfoKey: Keywords
InfoValue: Keyword, other Keyword, third Keyword
To add a watermark to each and every page of an existing PDF document, simply create a separate file watermark.pdf that contains only the watermark on an otherwise empty page. Then, combine the two using the multibackground command of pdftk:
pdftk input.pdf multibackground watermark.pdf output out.pdf
You can use exactly the same command to overlay two PDF, e.g. to add comments, annotations or signatures into an existing PDF document.
In this case, the file watermark.pdf would contain exactly as many pages as the file input.pdf.
If both documents contain just one page (and only then), a shorter syntax is available:
pdftk input.pdf background watermark.pdf output out.pdf
Note that in the previous commands, the watermark is "under" the original document.
If you do not like this (e.g. you want a scanned signature to "cover" the document text) or if your document does not have a transparent background, you can use the stamp and multistamp commands in the same way:
pdftk input.pdf stamp watermark.pdf output out.pdf
Some on-line services create shipping labels in A6 or A7 format, but the "surrounding" paper format is usually A4. To split such a page into "real" A6- or A7-sized pages, I'm using mutool:
mutool poster -x 2 -y 2 infile-A4.pdf outfile-A6.pdf
mutool poster -x 2 -y 4 infile-A4.pdf outfile-A7.pdf
The resulting files contain multiple pages in the smaller format, which can then be printed e.g. with a label printer.
Recently I found that documents in PDF 1.7 format print rather slowly on my printer, in spite of the fact that these documents contain only text.
To speed up printing, I now rewrite these files as PDF 1.4:
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE \
  -dQUIET -dBATCH -dDetectDuplicateImages -dNOTRANSPARENCY -dCompressFonts=true \
  -sOutputFile=x.pdf file.pdf && mv x.pdf file.pdf
If you want to remove the cover page from a document you received (e.g. a fax):
convert -verbose -page A4 -compress Zip multipagefax.tif[1-4] fax.pdf
If you have a multi-page PDF file that you would like to OCR, you can split it into individual PNG files:
pdftoppm -png -r 220 file.pdf prefix
for i in prefix*.png; do tesseract -l fra ${i%.*}.png ${i%.*} pdf; done
You can link to a particular page or section in a pdf document in the following way:
<a href="http://www.mydomain.com/myPDF.pdf#page=7">Link text</a>
<a href="http://www.mydomain.com/myPDF.pdf#nameddest=TOC">Link text</a>
A prerequisite is that the document is served via HTTP; in other words, this technique does not work if the document is merely in the local file structure. In LyX-created PDF files, nameddest is something like section*.3, section.2.2, subsection.5.2.2, table.2.1 ... you get the idea.
If you want to recover and archive documents that were created at a time where "cut-and-paste" really meant "scissors and glue", you can simply scan them to a PDF file. This section explains how. — Note that most of the material that I scan is paperwork, i.e. letters, invoices, reports and the like. Your mileage may vary.
The approach that I have applied successfully since 2004 uses a combination of two command-line programs, scanimage (from the SANE package, see the Imaging page on this website) and the image conversion utility convert from the ImageMagick package. First, scan the pages to files:
scanimage --mode Gray --resolution 300 -x 210 -y 297 > page01.pnm
... increase the counter for page02.pnm, page03.pnm, etc.
You may also append all scans to a single file, but this will give you trouble if you have a single image in the series that you would like to re-scan. Note that you can change the scan mode per page, allowing you to include color graphics, photos or whatever.
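Increasing the counter by hand is easy to get wrong in a long session. The following is a minimal sketch of a helper that picks the next unused file name of the form pageNN.pnm; the function name next_page is made up for this example.

```shell
# next_page DIR
# Print the first file name of the form pageNN.pnm that does not yet
# exist in DIR, starting with page01.pnm.
next_page() {
    n=1
    while [ -e "$1/$(printf 'page%02d.pnm' "$n")" ]; do
        n=$((n + 1))
    done
    printf 'page%02d.pnm\n' "$n"
}

# Possible usage, scanning into the current directory:
#   scanimage --mode Gray --resolution 300 -x 210 -y 297 > "$(next_page .)"
```

To re-scan a single page, delete its file first; the helper then returns that name again because it is the first gap in the series.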
When you know that all scans are of the same type, better use the --batch option, which provides a built-in counter. In this case, the output redirection is not required anymore:
scanimage --mode Gray --batch=out%03d.pnm --resolution 300 -x 210 -y 297
The switches -x and -y specify that the image area is DIN A4 in size.
If you stumble across the message
scanimage: setting of option --br-y failed (Invalid argument)
this indicates that you are probably trying to use a scan range that is outside the physical limits of the scanner. Run scanimage -h and look for the indications of the scan range, such as -y 0..296.926mm. I have encountered this problem on HP all-in-one scanners/printers where the y direction is often just below the 297 mm that would be needed to scan a full A4 page. Thus, set y to 296.9 mm ... and do not forget to complain to the manufacturer.
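The upper limit can also be picked out of the help text automatically. This is a minimal sketch, assuming the range is printed in the form shown above (-y 0..296.926mm); the function name max_scan_height is made up for this example.

```shell
# max_scan_height
# Read the output of "scanimage -h" on stdin and print the upper limit
# (in mm) of the first -y range found, e.g. 296.926.
max_scan_height() {
    sed -n 's/^[[:space:]]*-y 0\.\.\([0-9.]*\)mm.*/\1/p' | head -n 1
}

# Possible usage:
#   scanimage -h | max_scan_height
```

If the backend formats its help text differently, the function prints nothing, so check for an empty result before passing the value on to -y.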
When you have finished scanning, combine the files into one:
convert -page A4 -compress Zip page*.pnm allpages.pdf
If you want to remove some shades and specks, try something along this line:
convert -level 5%,95% -page A4 -compress Zip page*.pnm allpages.pdf
Note that file sizes and processing time increase roughly with the square of the resolution: A 150-dpi line art PNM image of an A4 page takes 272 kB, but at 300 dpi the same page occupies 1.1 MB. The resulting PDF files are smaller; count a factor of 2 to 4 in reduction if the conversion is invoked without compression. A significant reduction in file size can be obtained by invoking convert with the option -compress Zip as shown above; with that, a page that occupied 170 kB as uncompressed PDF went down to just under 60 kB.
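The quadratic growth can be checked with a little arithmetic. This is a rough sketch, assuming 1 bit per pixel for line art, an A4 page of 210 x 297 mm, and ignoring the few bytes of the PNM header; the function name a4_lineart_bytes is made up for this example.

```shell
# a4_lineart_bytes DPI
# Print the raw size in bytes of a 1-bit scan of an A4 page at DPI.
a4_lineart_bytes() {
    dpi=$1
    # pixels = mm * dpi / 25.4, rounded to the nearest integer
    width=$(( (210 * dpi * 10 + 127) / 254 ))
    height=$(( (297 * dpi * 10 + 127) / 254 ))
    # each row is padded to a whole number of bytes (8 pixels per byte)
    echo $(( (width + 7) / 8 * height ))
}

a4_lineart_bytes 150   # prints 271870
a4_lineart_bytes 300   # prints 1087480
```

Doubling the resolution doubles both the width and the height in pixels, so the size grows by a factor of four: about 272 kB at 150 dpi and about 1.1 MB at 300 dpi.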
Even smaller files can be achieved by converting the scanned PDF into PS, then converting it back to PDF. In my environment, the following one-liner usually reduces the file size by a factor of about 15, and the resulting files remain very readable:
pdf2ps file.pdf out.ps && ps2pdf -dPDFSETTINGS=/ebook out.ps file.small.pdf && rm out.ps
Note that this will only deal with the images. In other words: if you have OCRd the file before, this process will destroy any text data.
Keep in mind that the resulting PDF files contain a series of images of the individual pages, i.e. they are not searchable for text. However, these are "real" PDF.
For OCR (Optical Character Recognition), we use the same image files that we acquired above and process them by an OCR engine. tesseract is a capable open-source engine that supports multiple dictionaries and can (finally!) create searchable PDF straight in the OCR process:
tesseract -l eng image01.png image01 pdf
If you have many pages to process, you may want to speed things up using parallel processing:
ls -1 image*.png | parallel tesseract -l eng '{}' '{.}' pdf
Option -l specifies the language to be used. For French, use -l fra, for German, -l deu. There are even dictionaries for scripting languages, etc. Argument image01.png is the input file that we want to OCRise, image01 is the prefix for the output file (which will become image01.pdf) and the final keyword pdf indicates that a searchable PDF shall be produced.
Note that tesseract relies on the page size being part of the input (image) file header. This seems not to be the case with the PNM files created by scanimage and will lead to "unusual", huge page sizes (an A4 page grows to 955x675 mm). The workaround is to specify the resolution and unit explicitly when you convert the PNM to PNG:
convert -density 240x240 -units PixelsPerInch page*.pnm allpages.pdf
See Tesseract issue 150 for details.
To combine the PDF created above, we use the "PDF Toolkit" pdftk:
pdftk in*.pdf cat output outfile.pdf
A problem with the PDF created from tesseract is that they can be very large. The challenge is to reduce file size without compromising at all on the textual information, and without compromising too much on the image quality. After some experimentation, I found that the following ghostscript command line works well for my purpose:
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default \
  -dNOPAUSE -dQUIET -dBATCH -dDetectDuplicateImages -dNOTRANSPARENCY \
  -dCompressFonts=true -sOutputFile=outfile.pdf infile.pdf
scan2pdf.sh is a shell script that performs the tasks described above with some improvements, such as support for multiple pages and for different scanner types with ADF (explicitly: Epson GT-1500 and Fujitsu SP-1120).
Since early 2017, it includes the option to enable OCR:
Download the scan2pdf.sh script (updated continuously).
Thanks to Tibor D. from ch.comp. The scanner button functionality was inspired by an article about scanning multiple pages into one PDF file from the Pro-Linux website. Another series of articles on the same site deals with the same topic and further aspects of archival.
The table below shows a quick comparison of the file sizes. The input was the same 2-page letter everywhere, scanned in grayscale.
Key findings:
- Files created with scan2pdf.sh are still about 50% bigger than those made with commercial software.
- Files created with scan2pdf.sh often have a better image quality with less blurring.

Software | Scanner | Resolution | Size / bytes | Comment |
---|---|---|---|---|
scan2pdf.sh without OCR | Fujitsu SP-1120 | 225 dpi | 657790 | Clear text |
scan2pdf.sh with OCR | Fujitsu SP-1120 | 225 dpi | 683512 | Very clear text |
ABBYY FineReader Sprint 6.0, WinXP | Fujitsu SP-1120 | 300 dpi (225 not possible) | 432274 | heavy shades, heavy blurring |
ABBYY FineReader Sprint 12.0, Win7 | Fujitsu SP-1120 | 300 dpi (225 not possible) | 709222 | minor shades, clear text |
ABBYY FineReader Sprint 12.0, Win7 | Epson 1660 Photo | 300 dpi (225 not possible) | 577245 | minor shades, clear text |
scan2pdf.sh without OCR | Epson GT-1500 | 225 dpi | 997238 | Clear text |
scan2pdf.sh with OCR | Epson GT-1500 | 225 dpi | 1019201 | Clear text |
Epson Scan / ABBYY FineReader Sprint 6.0, WinXP | Epson GT-1500 | 200 dpi (225 not possible) | 433259 | Very clear text, slight wash-out |
If your scanner has ADF but does not support duplex scanning, you will usually scan the odd pages first (1, 3, 5, ...), then flip the pile of paper around and scan the even pages (2, 4, 6, ...). The resulting two PDF files can be combined using pdftk:
pdftk A=myfile.odd.PDF B=myfile.even.PDF shuffle A Bend-1 output myfile.pdf
Source: pdflabs.com
If you have scanned a pile of documents in duplex mode but actually only want to preserve the odd pages (typical example: bank receipts), you can use pdftk:
pdftk allpages.pdf cat 1-endodd 2 output oddpages.pdf
That single "2" adds only the first even page (= page 2) to the end of that pile. If you don't want this, just omit the "2".
To scan unusual paper sizes on a scanner such as the Fujitsu SP-1120:
scanimage --mode Color --resolution 220 --paper-size Custom --page-width 160mm --page-height 410mm \
  --batch=out%03d.pnm --source Adf-front --autofeed=yes --multifeed-detection Do-not-detect
for i in out*.pnm; do convert -density 220x220 -compress Zip $i ${i%.*}.pdf; done
bib2ris is a small program to convert bibliography files from VCH Biblio 2.2 "Archive" format to the "RIS" format, a tagged ASCII file format.
I used the software VCH Biblio to maintain a database of bibliographic references throughout the first decade of my professional career. However, the MS-DOS version of this program soon became obsolete, as the manufacturer concentrated on the MS-Windows versions. Yet ... all MS-Windows versions of VCH Biblio that I have been supplied with had a number of bugs. One of the more important problems was that the "export" function, at least of the 16-bit WinBib 3.x versions, was broken (it omits patent numbers, bails out with long texts, etc.). As the manufacturer was not able to provide a fixed version within reasonable time, I decided to stay on the safe side, i.e. with the old DOS version. But ... the data format of VCH Biblio is proprietary and not publicly available, thus my data were doomed to be usable only with this - now obsolete - software.
To prevent loss of my precious data, I had to find a way to convert them to another format and this is why I finally wrote bib2ris ;-).
The program does exactly one thing: It takes VCH Biblio 2.2 "archive" files and converts them to the "RIS" format. That's it. Nothing else.
Please do not ask about any other version of VCH Biblio or about other output formats. The answer is "no, it doesn't".
This program was designed for a "one-shot" conversion of a bibliography with 1900+ references and the resulting file imported flawlessly into Reference Manager 10.0 ... in other words, it worked perfectly for me.
I am making this software available under the terms and conditions of the GNU General Public License (GPL). This means that the software is available free of charge, including the source code, and that any future version will also remain free.
Click here to download bib2ris (31 kB). The latest version is 20020906; detailed instructions are included in the README file.
If you want to verify if your software correctly imports a RIS file, try to import this short test file.
If you want to verify if your software correctly supports the complete charset (including Umlauts, accents etc.), try to import and re-export (!) this character set test file.