Tools for scanning into PDF and performing OCR

Contents

Scanning documents to PDF (using vFlat)
OCRmyPDF (Tool for making scanned PDFs searchable)
Reducing size of scanned PDFs
Convert epub to pdf
Convert folder of jpg files to pdf (OCR and small pdf size)
One-liners
Fix problem with numbering
Fix problem with numbering (single line command)
References

Scanning documents to PDF (using vFlat)

vFlat is an Android/iPhone app that makes it very fast to scan books to JPG and export them to PDF in either color or grayscale. I tried other apps, but this one is the easiest to use and it is free. The app now limits its PDF conversion functionality, but you can still export to JPG files and then follow the steps in "Convert folder of jpg files to pdf (OCR and small pdf size)" to convert them to PDF. https://play.google.com/store/apps/details?id=com.voyagerx.scanner&hl=en&gl=US

OCRmyPDF (Tool for making scanned PDFs searchable)

This section briefly explains how to install and use OCRmyPDF; for detailed instructions, see the OCRmyPDF Documentation.

Installation on Mac

Install using the following command: brew install ocrmypdf

Installation on Ubuntu

Install using the following commands (the last two add the Japanese language data for Tesseract, horizontal and vertical text):

sudo apt-get install ocrmypdf
sudo apt install tesseract-ocr-jpn
sudo apt install tesseract-ocr-jpn-vert

Installation on Windows

Install “Ubuntu” (WSL) from the Windows Store to get an Ubuntu terminal on Windows, then install OCRmyPDF from that terminal using the Ubuntu instructions above.

NOTE: when using Ubuntu from Windows, the Windows C: drive is mounted at /mnt/c/ (important if you want to convert files stored in the Windows filesystem, e.g. on your Desktop).
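
As a rough illustration of the path mapping described in the note, the sketch below converts a Windows-style path to the place where WSL mounts it. The function name win_to_wsl is hypothetical and it only handles simple drive-letter paths; inside WSL itself, the wslpath utility should cover the same need.

```shell
# Hypothetical helper: map a Windows path such as C:\Users\me\Desktop
# to the location where WSL mounts it (/mnt/<drive letter>/...).
win_to_wsl() {
  local winpath="$1" drive rest
  drive="${winpath%%:*}"                          # text before the first colon, e.g. C
  rest="${winpath#*:}"                            # text after the first colon
  drive=$(printf '%s' "$drive" | tr 'A-Z' 'a-z')  # WSL uses a lowercase drive letter
  rest=$(printf '%s' "$rest" | tr '\\' '/')       # backslashes become forward slashes
  printf '/mnt/%s%s\n' "$drive" "$rest"
}

win_to_wsl 'C:\Users\me\Desktop'
```

The printed path can be passed to cd before running the conversion commands.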

Usage

Change to the directory containing the file you want to convert

cd DIRECTORY

Execute OCRmyPDF with the following command (replace input.pdf and output.pdf with the appropriate names):

ocrmypdf input.pdf --output-type pdf output.pdf

For Japanese (or other languages), add the -l option with the appropriate Tesseract language code:

ocrmypdf -l jpn input.pdf --output-type pdf output.pdf
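
If you run these commands often, a small wrapper can derive the output name for you. This is only a sketch: the function name ocr_pdf, the "-ocr" suffix and the DRY_RUN switch are assumptions made for illustration, not part of OCRmyPDF.

```shell
# Hypothetical wrapper around the ocrmypdf command shown above.
# Usage: ocr_pdf LANGUAGE INPUT.pdf   (writes INPUT-ocr.pdf next to the input)
# Set DRY_RUN=1 to print the command instead of running it.
ocr_pdf() {
  local lang_code="$1" input="$2" output
  output="${input%.pdf}-ocr.pdf"      # book.pdf -> book-ocr.pdf
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "ocrmypdf -l $lang_code $input --output-type pdf $output"
  else
    ocrmypdf -l "$lang_code" "$input" --output-type pdf "$output"
  fi
}

# Preview the command for a Japanese scan without running OCR.
DRY_RUN=1 ocr_pdf jpn book.pdf
```

Drop DRY_RUN=1 once the printed command looks right.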

Reducing size of scanned PDFs

Reference: https://pandemoniumillusion.wordpress.com/2008/05/07/compress-a-pdf-with-pdftk/

pdf2ps large.pdf very_large.ps; ps2pdf very_large.ps small.pdf
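
The round trip through PostScript does not guarantee a smaller file, so it can be worth checking the result. A minimal sketch, assuming a hypothetical helper named percent_saved that compares the two files by byte count:

```shell
# Hypothetical helper: print how much smaller the second file is, in percent.
# A negative number means the "compressed" file is actually larger.
percent_saved() {
  local before after
  before=$(wc -c < "$1")    # size of the original file in bytes
  after=$(wc -c < "$2")     # size of the converted file in bytes
  echo $(( ($before - $after) * 100 / $before ))
}

# Example, after running the pdf2ps/ps2pdf commands:
#   percent_saved large.pdf small.pdf
```

The helper assumes the first file is not empty, since it divides by its size.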

Convert epub to pdf

sudo apt-get install calibre

ebook-convert file.epub file.pdf

Convert folder of jpg files to pdf (OCR and small pdf size)

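First, rename the files so that their numbers have leading zeros; otherwise the pages are not in the correct order when the PDF is created from the JPGs. The 10# prefix forces base 10, so a number such as 08 is not read as octal.

for f in *.jpg; do num=$(echo "$f" | grep -o -E '[0-9]+' | head -n 1); [ -n "$num" ] || continue; newnum=$(printf "%03d" "$((10#$num))"); [ "$num" = "$newnum" ] || mv "$f" "${f/$num/$newnum}"; done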
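
Before renaming real scans, it can help to preview what the padding rule produces. The sketch below is a dry run: pad_name is a hypothetical helper, the file names are made-up examples, and it pads only the first number it finds in each name.

```shell
# Hypothetical helper: print NAME with its first number zero-padded to 3 digits.
pad_name() {
  local name="$1" num stripped padded
  num=$(printf '%s\n' "$name" | grep -o -E '[0-9]+' | head -n 1)
  if [ -z "$num" ]; then
    printf '%s\n' "$name"          # no number in the name: leave it unchanged
    return
  fi
  stripped=$(printf '%s\n' "$num" | sed 's/^0*//')   # 08 -> 8, avoids octal parsing
  padded=$(printf '%03d' "${stripped:-0}")
  printf '%s\n' "$name" | sed "s/$num/$padded/"      # replace the first occurrence only
}

# Preview the renames without touching any file.
for f in page1.jpg page10.jpg 08.jpg; do
  echo "$f -> $(pad_name "$f")"
done
```

Three digits are enough for up to 999 pages; a longer book would need a wider format such as %04d.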

Convert JPGs to PDF (scaling each image to 50% of its original width and height)
```
convert *.jpg -resize 50% p50.pdf
```
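
To make the effect of -resize 50% concrete: the percentage applies to both width and height, so the pixel count drops to roughly a quarter. The sketch below only does that arithmetic; scaled_dims is a hypothetical helper and the 3000x4000 page size is an assumed example. ImageMagick may round odd dimensions differently from this integer division.

```shell
# Hypothetical helper: approximate size of a WIDTH x HEIGHT image after -resize PCT%.
scaled_dims() {
  local w="$1" h="$2" pct="$3"
  echo "$(( $w * $pct / 100 ))x$(( $h * $pct / 100 ))"
}

scaled_dims 3000 4000 50
```

If the text becomes hard to read or OCR quality drops, a larger percentage is the first thing to try.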

Reduce size of PDF

pdf2ps p50.pdf large.ps; ps2pdf large.ps small.pdf

Perform OCR (run only the command for your language)

ocrmypdf -l eng small.pdf --output-type pdf small-ocr.pdf

ocrmypdf -l jpn small.pdf --output-type pdf small-ocr.pdf

One-liners

These are the same steps as above, combined into a single command line.

English

for f in *.jpg; do num=$(echo "$f" | grep -o -E '[0-9]+' | head -n 1); [ -n "$num" ] || continue; newnum=$(printf "%03d" "$((10#$num))"); [ "$num" = "$newnum" ] || mv "$f" "${f/$num/$newnum}"; done; convert *.jpg -resize 50% p50.pdf; pdf2ps p50.pdf large.ps; ps2pdf large.ps small.pdf; ocrmypdf -l eng small.pdf --output-type pdf small-ocr.pdf

Japanese

for f in *.jpg; do num=$(echo "$f" | grep -o -E '[0-9]+' | head -n 1); [ -n "$num" ] || continue; newnum=$(printf "%03d" "$((10#$num))"); [ "$num" = "$newnum" ] || mv "$f" "${f/$num/$newnum}"; done; convert *.jpg -resize 50% p50.pdf; pdf2ps p50.pdf large.ps; ps2pdf large.ps small.pdf; ocrmypdf -l jpn small.pdf --output-type pdf small-ocr.pdf
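
Since the two one-liners differ only in the language code, they could also be written once as a function. This is a sketch under a few assumptions: the names run and jpgs_to_ocr_pdf, the DRY_RUN switch and the language suffix in the output name are all invented for illustration, and the JPGs are assumed to be zero-padded already. Unlike the one-liners, it chains the steps with && so that a failed step stops the rest.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical function version of the one-liners.
# Usage: jpgs_to_ocr_pdf LANGUAGE   (eng, jpn, ...; defaults to eng)
jpgs_to_ocr_pdf() {
  local lang_code="${1:-eng}"
  run convert *.jpg -resize 50% p50.pdf &&
  run pdf2ps p50.pdf large.ps &&
  run ps2pdf large.ps small.pdf &&
  run ocrmypdf -l "$lang_code" small.pdf --output-type pdf "small-ocr-$lang_code.pdf"
}
```

Calling DRY_RUN=1 jpgs_to_ocr_pdf jpn in the folder of JPGs prints the four commands without executing them.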

Fix problem with numbering

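Use this when a PDF was built from JPGs numbered without leading zeros, so that its pages ended up in lexicographical order (1, 10, 11, ..., 2, ...) instead of numerical order. Work in a directory that contains only the PDF.

Rename the PDF to file.pdf, then convert it to JPGs (one per page):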
```
convert file.pdf f-%d.jpg
```

Renumber the pages starting from 1 instead of 0

NUM_PAGES=$(ls -1 f-*.jpg | wc -l)
for i in $(seq 0 $((NUM_PAGES-1))); do mv "f-${i}.jpg" "g-$((i+1)).jpg"; done

Rename each page to its true page number (the i-th page of the PDF is the i-th entry of 1..NUM_PAGES in lexicographical order)

i=1; for j in $(seq 1 $NUM_PAGES | sort); do mv "g-${i}.jpg" "h-${j}.jpg"; i=$((i+1)); done
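
To see why this mapping recovers the right page numbers, it helps to print the lexicographical order for a small example. The sketch below assumes a 12-page scan; lex_order is a hypothetical helper, and LC_ALL=C is set only to make the sort order predictable.

```shell
# Hypothetical helper: print the page numbers 1..N in lexicographical order,
# which is the order this renaming step assumes the PDF pages are in.
lex_order() {
  seq 1 "$1" | LC_ALL=C sort
}

# Position in the PDF -> true page number, for a 12-page scan.
pos=1
for page in $(lex_order 12); do
  echo "PDF page $pos is really page $page"
  pos=$((pos + 1))
done
```

For 12 pages the order is 1, 10, 11, 12, 2, 3, and so on, so the second page of the PDF is really page 10.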

Rename with leading zeros

for i in $(seq 1 $NUM_PAGES); do new_num=$(printf "%03d" "$i"); mv "h-${i}.jpg" "i-${new_num}.jpg"; done

Convert to PDF again, reduce the size, and run OCR (jpn in this case)

convert i-*.jpg p50.pdf; pdf2ps p50.pdf large.ps; ps2pdf large.ps small.pdf; ocrmypdf -l jpn small.pdf --output-type pdf small-ocr.pdf

Fix problem with numbering (single line command)

English

convert file.pdf f-%d.jpg; NUM_PAGES=$(ls -1 f-*.jpg | wc -l); for i in $(seq 0 $((NUM_PAGES-1))); do mv "f-${i}.jpg" "g-$((i+1)).jpg"; done; i=1; for j in $(seq 1 $NUM_PAGES | sort); do mv "g-${i}.jpg" "h-${j}.jpg"; i=$((i+1)); done; for i in $(seq 1 $NUM_PAGES); do new_num=$(printf "%03d" "$i"); mv "h-${i}.jpg" "i-${new_num}.jpg"; done; convert i-*.jpg p50.pdf; pdf2ps p50.pdf large.ps; ps2pdf large.ps small.pdf; ocrmypdf -l eng small.pdf --output-type pdf small-ocr-eng.pdf

Japanese

convert file.pdf f-%d.jpg; NUM_PAGES=$(ls -1 f-*.jpg | wc -l); for i in $(seq 0 $((NUM_PAGES-1))); do mv "f-${i}.jpg" "g-$((i+1)).jpg"; done; i=1; for j in $(seq 1 $NUM_PAGES | sort); do mv "g-${i}.jpg" "h-${j}.jpg"; i=$((i+1)); done; for i in $(seq 1 $NUM_PAGES); do new_num=$(printf "%03d" "$i"); mv "h-${i}.jpg" "i-${new_num}.jpg"; done; convert i-*.jpg p50.pdf; pdf2ps p50.pdf large.ps; ps2pdf large.ps small.pdf; ocrmypdf -l jpn small.pdf --output-type pdf small-ocr-jpn.pdf
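
The three renaming passes in the middle of these commands can also be collapsed into one, which may be easier to read and test. The following is a sketch, not a drop-in replacement: fix_order is a hypothetical function name, and it assumes the directory already holds the f-0.jpg, f-1.jpg, ... files produced by the first convert step.

```shell
# Hypothetical helper: DIR holds f-0.jpg ... f-(N-1).jpg, extracted from a PDF
# whose pages are in lexicographical order. Each file is renamed directly to
# i-001.jpg, i-002.jpg, ... so that its name carries the true page number.
fix_order() {
  local dir="$1" n i j
  n=$(ls -1 "$dir"/f-*.jpg 2>/dev/null | wc -l)
  n=$((n + 0))                     # normalise the count printed by wc
  [ "$n" -gt 0 ] || return 0       # no extracted pages: nothing to do
  i=0
  for j in $(seq 1 "$n" | LC_ALL=C sort); do
    # The file at PDF position i is really page j.
    mv "$dir/f-$i.jpg" "$dir/$(printf 'i-%03d.jpg' "$j")"
    i=$((i + 1))
  done
}
```

After fix_order . has run in the working directory, the i-*.jpg files can be converted back to a PDF with the remaining commands.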

References

https://play.google.com/store/apps/details?id=com.voyagerx.scanner&hl=en&gl=US
OCRmyPDF Documentation: https://ocrmypdf.readthedocs.io/
https://askubuntu.com/questions/113544/how-can-i-reduce-the-file-size-of-a-scanned-pdf-file