When "play" is pressed after a pause (manual or caused by an incoming call), the player should rewind to the start of the current phrase by seeking back to the nearest silence. If that is not possible, it should rewind 2-3 seconds and then start playing, so that the context of the last words is kept.
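To make the requested behaviour concrete, here is a rough sketch of the rule in bash arithmetic. It is only an illustration, not how the player works: the function name resume_position, the millisecond units, the 2500 ms fallback, the 10-second search window, and the assumption that silence positions are already known are all mine.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: choose where playback should resume after a pause.
# Usage: resume_position PAUSE_MS [SILENCE_MS ...]
# All positions are in milliseconds. Silence positions are assumed to be
# detected elsewhere (for example by the player while decoding).
resume_position() {
    local pause_ms=$1; shift
    local fallback_ms=2500   # assumed fallback: rewind about 2-3 seconds
    local window_ms=10000    # assumed limit on how far back to search
    local best=-1 s

    # Find the latest silence at or before the pause point, within the window.
    for s in "$@"; do
        if (( s <= pause_ms && pause_ms - s <= window_ms && s > best )); then
            best=$s
        fi
    done

    # No usable silence: fall back to a fixed rewind, clamped to the start.
    if (( best < 0 )); then
        best=$(( pause_ms - fallback_ms ))
        if (( best < 0 )); then
            best=0
        fi
    fi
    echo "$best"
}
```

For example, `resume_position 10000 3000 8200 9100 12000` picks the silence at 9100 ms, while `resume_position 10000` with no known silences falls back to 7500 ms.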
Also, the cover image should not simply be scaled to 4x3; it should keep its aspect ratio. In my bash script I use something like:
convert "$picture" -geometry 240x320 \
-gravity center \
-background '#4444aa' \
-extent 240x320 \
"$outdir/$outdir.jpg"
but IMHO it would be better for the player itself to do this, rather than the data preparation program.
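The size calculation the player would need is small. A minimal sketch in bash follows; the function name fit_size and its argument order are hypothetical, and it covers only the arithmetic, not the actual drawing:

```bash
#!/usr/bin/env bash
# Hypothetical sketch: compute the largest size that fits inside a box
# while keeping the source aspect ratio.
# Usage: fit_size SRC_W SRC_H BOX_W BOX_H   -> prints "W H"
fit_size() {
    local src_w=$1 src_h=$2 box_w=$3 box_h=$4
    local w h
    # Compare aspect ratios by cross-multiplication to avoid floating point.
    if (( src_w * box_h >= src_h * box_w )); then
        # Source is relatively wider than the box: width is the limit.
        w=$box_w
        h=$(( src_h * box_w / src_w ))
    else
        # Source is relatively taller than the box: height is the limit.
        h=$box_h
        w=$(( src_w * box_h / src_h ))
    fi
    echo "$w $h"
}
```

For a square 500x500 cover and a 240x320 box this gives 240x240. The image would then be centered and the remaining area filled with a background color, as the -gravity, -background and -extent options do in the convert command above.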
Anyway, the application is exactly what I need :)