Nobody likes bureaucracy. Letters, invoices, mortgage, car, rent, etc. I’m also somebody who doesn’t like paper. Storing it and searching in piles of it. Please no! Also I don’t like this “story” section of any blog post, so I’ll get to it.
I’ve written a handy bash script to scan documents, clean them up and run them through OCR.
For this to work properly you need to install scanimage, imagemagick and tesseract. The script also calls pamtilt (from Netpbm), tiffcp (from libtiff) and perl. I’m also using the amazing zathura to open the result for inspection.
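If you want to check what is still missing before the first scan, something like the following should do. It is only a sketch: the function name missing_tools is made up for this example, and the tool list is simply what the script below calls.

```shell
#!/usr/bin/env bash
# Print every command from the argument list that is not found in $PATH.
# (missing_tools is a hypothetical helper, not part of the scan script.)
missing_tools() {
    local tool
    for tool in "$@"; do
        # command -v returns non-zero when the tool cannot be found
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# convert ships with ImageMagick, pamtilt with Netpbm, tiffcp with libtiff.
missing=$(missing_tools scanimage convert pamtilt perl tiffcp tesseract zathura)
if [ -n "$missing" ]; then
    # unquoted on purpose: one "missing:" line per tool
    printf 'missing: %s\n' $missing >&2
fi
```

Anything it prints has to be installed through your distribution's package manager; the package names differ between distributions.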
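#!/usr/bin/env bash
# usage: scan <name> [pages]   (writes <name>.pdf)
pages=${2:-1}
crop_geometry=2400x3480+0+0
skew=0
# flood-fill the scanner border with white, starting at pixel 2540,3500
deborder_cmd=(-fuzz 10% -fill white -draw "color 2540,3500 floodfill")
converted=()
for (( i=1; i<=pages; i++ ))
do
    echo "$i"
    scanimage --format tiff --resolution 300 > "/tmp/scan$i.tiff"
    # estimate the skew angle from a cropped, high-contrast monochrome copy
    skew=$(convert -quiet "/tmp/scan$i.tiff" \
        -crop "$crop_geometry" \
        +repage \
        -level 10%,80%,1 \
        -monochrome \
        "pnm:-" | pamtilt)
    # only correct angles between 0.1 and 3 degrees, otherwise don't rotate
    skew=$(perl -e "if (abs($skew)>0.1 && abs($skew)<3.0) { print (- $skew) } else { print 0 }")
    # rotate, remove the border, stretch the contrast and crop to the page
    convert "/tmp/scan$i.tiff" \
        -background white \
        -rotate "$skew" \
        +matte "${deborder_cmd[@]}" \
        -level 10%,80%,1 \
        -crop "$crop_geometry" \
        +repage \
        +matte \
        -format tiff \
        "/tmp/scan_converted$i.tiff"
    converted+=("/tmp/scan_converted$i.tiff")
done
# merge in scan order (a glob would sort page 10 before page 2)
tiffcp "${converted[@]}" /tmp/scan_merged.tiff
tesseract /tmp/scan_merged.tiff - pdf > "$1.pdf"
rm /tmp/scan*
zathura "$1.pdf"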
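The skew handling is the least obvious part: the measured angle is only used when it is between 0.1 and 3 degrees in magnitude, and the page is then rotated by the opposite angle. If you would rather not depend on perl for that, the same rule can be sketched with awk. This is an alternative I'm showing for illustration, not what the script above does, and the name rotation_for_skew is made up. It also falls back to 0 when the angle is empty.

```shell
#!/usr/bin/env bash
# Print the rotation (in degrees) to apply for a measured skew angle.
# Angles whose magnitude is not between 0.1 and 3.0 give 0 (no rotation).
# (rotation_for_skew is a hypothetical name used only in this sketch.)
rotation_for_skew() {
    awk -v s="${1:-0}" 'BEGIN {
        a = (s < 0) ? -s : s          # absolute value of the angle
        if (a > 0.1 && a < 3.0)
            print -1 * s              # rotate against the skew
        else
            print 0
    }'
}

rotation_for_skew 1.5
rotation_for_skew 5
```

In the loop it would be called on the pamtilt output, with the result passed to -rotate.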
Save this somewhere in $PATH (don’t forget to chmod +x). Calling it is then easy. Just pass a filename and the number of pages:
$ scan mortgage_`date -I` 4
This will scan four pages and save them as mortgage_<CURRENT_DATE>.pdf.
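The script itself doesn't check its arguments, so a forgotten filename gives you a file called ".pdf". A small guard at the top could catch that. This is a sketch with a made-up function name (check_args) and my own choice of error messages:

```shell
#!/usr/bin/env bash
# Validate the two arguments of the scan script: a name and a page count.
# (check_args is a hypothetical helper, not part of the original script.)
check_args() {
    local name="$1" pages="${2:-1}"
    if [ -z "$name" ]; then
        echo "usage: scan <name> [pages]" >&2
        return 1
    fi
    case "$pages" in
        # reject empty values, non-digits, zero and leading zeros
        ''|*[!0-9]*|0*)
            echo "pages must be a positive integer" >&2
            return 1
            ;;
    esac
}

check_args mortgage 4 && echo "arguments look fine"
```

In the script it would be used as check_args "$@" || exit 1 before the first scan.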
OK, you might ask, this solves the “piles of paper” problem, but how do I search in this mess? Easy: just use my (quickly put together, poorly documented) smart little CLI tool, pdfq.
Install it using go install git.sr.ht/~ghost08/pdfq@latest. Then navigate to your scan pile directory and create an index:
$ pdfq index
pdfq uses bleve for its full-text search index, and searching documents is easy:
$ pdfq search mortgage
/home/vlado/scanpile/mortgage_20220214.pdf
/home/vlado/scanpile/mortgage_20220321.pdf
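Since the search prints one path per line, the results pipe nicely into other tools. As a sketch, here is a helper that opens the first hit in zathura. The name open_first_hit is made up, and it assumes the output format shown above.

```shell
#!/usr/bin/env bash
# Open the first search result in zathura.
# Assumes `pdfq search` prints one matching path per line, as shown above.
# (open_first_hit is a hypothetical helper, not part of pdfq.)
open_first_hit() {
    local hit
    hit=$(pdfq search "$@" | head -n 1)
    if [ -z "$hit" ]; then
        echo "no match for: $*" >&2
        return 1
    fi
    zathura "$hit"
}
```

Put it in your shell rc file and call it as open_first_hit mortgage.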
Managing bureaucracy with some hacking, Linux and open source is a breeze :)