code:start
NDAC PDF scrape - bulk-downloads the PDFs.
NDAC txt scrape - extracts the text from the downloaded PDFs.
PDF to TXT - helper used by NDAC txt scrape.
The following script merged the individual sections into single-file chapters:
- chaptermerge.sh
#!/bin/bash
BASE=$HOME/tmp/ndac-2020-10-new
for i in `find ./[0-9]* -type d -name '*-*-*' -print` ; do
    echo $i
    CHAPTER=`echo ${i}|cut -d '/' -f 4`
    echo chapter = $CHAPTER
    cd $BASE/$i
    echo "in directory:"
    pwd
    echo "creating ${CHAPTER}.txt"
    echo "====== Chapter ${CHAPTER} ======" > ${CHAPTER}.txt
    cat ${CHAPTER}-*.txt >> ${CHAPTER}.txt;
done
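To make the merge step easier to try in isolation, here is a minimal sketch of the same idea as a function that handles one chapter directory without changing the working directory. The helper name `merge_chapter` and the demo directory and file names are assumptions made for illustration; they are not part of chaptermerge.sh or the real NDAC tree.

```shell
#!/bin/bash
# Sketch only: merge_chapter is a hypothetical helper, not part of chaptermerge.sh.
merge_chapter() {
    local dir=$1
    local chapter
    chapter=$(basename "$dir")          # chapter name = last path component
    local out="$dir/$chapter.txt"

    # Header line in the same format that chaptermerge.sh writes.
    echo "====== Chapter ${chapter} ======" > "$out"

    # Append each section file in glob (lexical) order. The pattern requires
    # a hyphen after the chapter name, so the output file is never matched.
    local f
    for f in "$dir/$chapter"-*.txt; do
        if [ -f "$f" ]; then
            cat "$f" >> "$out"
        fi
    done
}

# Demo on a throwaway tree (made-up directory and file names).
demo=$(mktemp -d)
mkdir -p "$demo/1/1-2-3"
echo "section one" > "$demo/1/1-2-3/1-2-3-1.txt"
echo "section two" > "$demo/1/1-2-3/1-2-3-2.txt"
merge_chapter "$demo/1/1-2-3"
cat "$demo/1/1-2-3/1-2-3.txt"
```

Quoting the variables keeps the function safe for paths containing spaces, and writing to an explicit output path avoids depending on the current directory. Sections are still concatenated in plain lexical order, as in the original script, so a section numbered 10 would sort before section 2.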
code/start.txt · Last modified: 2022/03/04 10:31 by 127.0.0.1