DJVU2HOCR(1) djvu2hocr manual DJVU2HOCR(1)

NAME

djvu2hocr - DjVu to hOCR converter

SYNOPSIS

djvu2hocr [option...] djvu-file
djvu2hocr {--version | --help | -h}

DESCRIPTION

djvu2hocr converts hidden text from a DjVu file to the hOCR[1] format.

OPTIONS

Input selection options

-p, --pages=page-range
Specifies pages to convert. page-range is a comma-separated list of sub-ranges. Each sub-range is either a single page (e.g. 17) or a contiguous range of pages (e.g. 37-42). Pages are numbered from 1.
 
The default is to convert all pages.
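
The page-range syntax described above can be illustrated with a small Python sketch. It is not djvu2hocr's own parser. The function name parse_page_range is hypothetical, and sorting and de-duplicating the result is an assumption made for the illustration.

```python
def parse_page_range(spec):
    """Expand a page-range string such as "17,37-42" into a sorted list of page numbers.

    Illustrative only: sorting and de-duplication are assumptions,
    not documented djvu2hocr behaviour.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # Contiguous sub-range, e.g. "37-42"
            first, last = (int(x) for x in part.split("-", 1))
        else:
            # Single page, e.g. "17"
            first = last = int(part)
        # Pages are numbered from 1, and a range must not run backwards.
        if first < 1 or last < first:
            raise ValueError(f"invalid sub-range: {part!r}")
        pages.update(range(first, last + 1))
    return sorted(pages)


print(parse_page_range("17,37-42"))  # [17, 37, 38, 39, 40, 41, 42]
```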

Text segmentation options

--word-segmentation=simple
Use the same word segmentation as found in the DjVu file.
 
This is the default.
--word-segmentation=uax29
Use the Unicode Text Segmentation[2] algorithm to break lines into words, possibly fixing word segmentation found in the DjVu file.

HTML output options

--title=title
Specifies the document title.
 
The default is “DjVu hidden text layer”.
--css=style
Add the specified CSS style to the document.
 
For example, --css='.ocrx_line { display: block; }' can be used to visually preserve line breaks.

Other options

--version
Output version information and exit.
-h, --help
Display help and exit.
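
As a rough sketch of how the options above combine, the following Python helper assembles a djvu2hocr argument list using only the flags documented here. The helper name build_djvu2hocr_command and the file name book.djvu are hypothetical, and the helper only builds the command without running it.

```python
import shlex


def build_djvu2hocr_command(djvu_file, pages=None, word_segmentation=None,
                            title=None, css=None):
    """Build an argument list for djvu2hocr from the documented options."""
    cmd = ["djvu2hocr"]
    if pages is not None:
        cmd.append(f"--pages={pages}")
    if word_segmentation is not None:
        # Only these two values are documented.
        if word_segmentation not in ("simple", "uax29"):
            raise ValueError(f"unknown word segmentation: {word_segmentation!r}")
        cmd.append(f"--word-segmentation={word_segmentation}")
    if title is not None:
        cmd.append(f"--title={title}")
    if css is not None:
        cmd.append(f"--css={css}")
    # The DjVu file comes last, as in the SYNOPSIS.
    cmd.append(djvu_file)
    return cmd


cmd = build_djvu2hocr_command(
    "book.djvu",  # hypothetical input file
    pages="17,37-42",
    word_segmentation="uax29",
    css=".ocrx_line { display: block; }",
)
# Show the command as it could be typed in a POSIX shell.
print(shlex.join(cmd))
```

Passing the options as a list (for instance to subprocess.run) avoids shell quoting problems with values such as the CSS rule, which contains spaces and braces.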

PORTABILITY

djvu2hocr uses a custom extension to hOCR to retain characters that cannot be directly represented in an HTML/XML document. For example, the control character BEL (^G, U+0007) is converted into the following HTML chunk:
<span class="djvu_char" title="#x07"> </span>
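
A consumer of the output may want to turn such spans back into the original characters. The Python sketch below shows one way to do that. It assumes every djvu_char span has exactly the shape of the BEL example, with a title of "#x" followed by hexadecimal digits; this is inferred from that single example rather than from a specification. The function name restore_djvu_chars is hypothetical.

```python
import re

# Matches spans shaped like the BEL example; attribute order and quoting
# are assumed to be exactly as shown there.
DJVU_CHAR = re.compile(
    r'<span class="djvu_char" title="#x([0-9A-Fa-f]+)">.*?</span>'
)


def restore_djvu_chars(html):
    """Replace each djvu_char span with the character its title encodes."""
    return DJVU_CHAR.sub(lambda m: chr(int(m.group(1), 16)), html)


sample = 'ding<span class="djvu_char" title="#x07"> </span>dong'
print(repr(restore_djvu_chars(sample)))  # 'ding\x07dong'
```

A regular expression is enough for a sketch. A real tool would more likely walk the document with an HTML parser so that differences in attribute order or whitespace do not matter.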

BUGS

Please report bugs at: https://github.com/jwilk/ocrodjvu/issues

SEE ALSO

djvu(1), hocr2djvused(1), ocrodjvu(1)

NOTES

1. hOCR
https://docs.google.com/View?docid=dfxcv4vc_67g844kf
2. Unicode Text Segmentation
http://unicode.org/reports/tr29/

2017-02-07 djvu2hocr 0.10.2