Scan


Nobody likes bureaucracy: letters, invoices, mortgage, car, rent, etc. I’m also somebody who doesn’t like paper. Storing it, searching through piles of it? Please no! I also don’t like the “story” section of blog posts, so I’ll get to the point.

I’ve written a handy bash script that scans documents, straightens and cleans them up, and runs OCR on them.

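For this to work properly you need to install scanimage (part of SANE), imagemagick and tesseract. The script also calls pamtilt (from netpbm), tiffcp (from the libtiff tools) and perl, and it uses the amazing zathura to open the result for inspection.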
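If you want to check that everything is installed before the first scan, a small sketch like this would do. The helper name missing_tools is made up for illustration, and the tool list is simply what the script calls (convert comes with imagemagick, pamtilt with netpbm, tiffcp with the libtiff tools):

```shell
# Print the name of every given tool that is not found on $PATH.
# (missing_tools is a hypothetical helper, not part of the scan script.)
missing_tools() {
	for tool in "$@"; do
		command -v "$tool" >/dev/null 2>&1 || echo "$tool"
	done
}

# Tools the scan script calls; adjust the list to your setup.
missing=$(missing_tools scanimage convert pamtilt perl tiffcp tesseract zathura)
if [ -n "$missing" ]; then
	echo "missing tools: $(echo $missing)" >&2
fi
```

And here is the scan script itself: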

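#!/bin/bash
# usage: scan NAME [PAGES]

pages=${2:-1}                   # number of pages to scan, defaults to 1
crop_geometry=2400x3480+0+0     # area to keep, in pixels at 300 dpi
skew=0
# flood-fill the scanner border with white, starting from the pixel at 2540,3500
deborder_cmd=(-fuzz 10% -fill white -draw "color 2540,3500 floodfill" )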

for (( i=1; i<=pages; i++ ))
do
	echo "$i"
	# scan one page at 300 dpi
	scanimage --format tiff --resolution 300 > "/tmp/scan$i.tiff"

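	# estimate the page tilt in degrees with pamtilt
	skew=$(convert -quiet "/tmp/scan$i.tiff" \
		-crop "$crop_geometry" \
		+repage \
		-level 10%,80%,1 \
		-monochrome \
		"pnm:-" | pamtilt)

	# only correct tilts between 0.1 and 3 degrees, otherwise leave the page alone
	skew=$(perl -e "if (abs($skew)>0.1 && abs($skew)<3.0) { print (- $skew) } else { print 0 }")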

	# rotate back, whiten the border, stretch the contrast and crop
	convert "/tmp/scan$i.tiff" \
		-background white \
		-rotate "$skew" \
		+matte "${deborder_cmd[@]}" \
		-level 10%,80%,1 \
		-crop "$crop_geometry" \
		+repage \
		+matte \
		-format tiff \
		"/tmp/scan_converted$i.tiff"
done

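# merge the pages in numeric order (a plain glob would sort page 10 before page 2)
tiffcp $(printf '/tmp/scan_converted%d.tiff ' $(seq 1 "$pages")) /tmp/scan_merged.tiff
# OCR the merged TIFF into a searchable PDF
tesseract /tmp/scan_merged.tiff - pdf > "$1.pdf"
rm /tmp/scan*
zathura "$1.pdf"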

Save this as scan somewhere in your $PATH (don’t forget to chmod +x). Calling it is easy: pass a filename (without the .pdf extension) and the number of pages, which defaults to 1:

$ scan mortgage_$(date -I) 4

This will scan four pages and save them as mortgage_<CURRENT_DATE>.pdf.
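The script trusts whatever you pass it. If you would rather have it complain about a missing name or a bogus page count, a check along these lines could go near the top. This is only a sketch: check_args and its messages are names I made up, not something the script already has.

```shell
# Validate the two arguments the scan script expects:
#   $1 = output name without extension (required)
#   $2 = number of pages (optional, defaults to 1)
# (check_args is a hypothetical helper, not part of the original script.)
check_args() {
	name=$1
	pages=${2:-1}
	if [ -z "$name" ]; then
		echo "usage: scan NAME [PAGES]" >&2
		return 1
	fi
	case $pages in
		# reject non-digits and a leading zero (which also rules out 0)
		*[!0-9]*|0*) echo "PAGES must be a positive integer" >&2; return 1 ;;
	esac
	# report the output file and page count that would be used
	echo "$name.pdf $pages"
}
```

Inside the script you would call it as check_args "$@" >/dev/null || exit 1 before the scanning loop starts.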

OK, you might ask, this solves the “piles of paper” problem, but how do I search in this mess? Easy: just use my (quickly put together, poorly documented) smart little CLI tool, pdfq.

Install it using go install git.sr.ht/~ghost08/pdfq@latest. Then navigate to your scan pile directory and create an index:

$ pdfq index

pdfq uses bleve for its full-text search index, and searching your documents is easy:

$ pdfq search mortgage
/home/vlado/scanpile/mortgage_20220214.pdf
/home/vlado/scanpile/mortgage_20220321.pdf
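Since the search prints one path per line, it chains nicely with other tools. As a sketch, here is a tiny wrapper that opens the first hit directly. The function name pdfq_open and the PDF_VIEWER variable are my own inventions for this example, and it assumes pdfq search keeps printing plain paths as shown above:

```shell
# Open the first pdfq search hit in a PDF viewer.
# Uses zathura unless PDF_VIEWER (a hypothetical variable) names another viewer.
pdfq_open() {
	hit=$(pdfq search "$@" | head -n 1)
	if [ -z "$hit" ]; then
		echo "no match for: $*" >&2
		return 1
	fi
	"${PDF_VIEWER:-zathura}" "$hit"
}
```

With that in your shell config, pdfq_open mortgage would open the first matching scan.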

Managing bureaucracy with some hacking, Linux and open source is a breeze :)