code:start

NDAC PDF scrape - use this to bulk-download the PDFs.

NDAC txt scrape - this extracts the text from the downloaded PDFs.

PDF to TXT - this is used by NDAC txt scrape.

The following script merged the individual section files into single-file chapters:

chaptermerge.sh
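#!/bin/bash
# Merge the per-section .txt files in each chapter directory into one <chapter>.txt.

BASE=$HOME/tmp/ndac-2020-10-new

# The find below uses relative paths, so run it from BASE.
cd "$BASE" || exit 1

for i in $(find ./[0-9]* -type d -name '*-*-*' -print) ; do
  echo "$i"
  # The chapter name is the 4th '/'-separated field of the path.
  CHAPTER=$(echo "$i" | cut -d '/' -f 4)
  echo "chapter = $CHAPTER"
  cd "$BASE/$i" || continue
  echo "in directory:"
  pwd
  echo "creating ${CHAPTER}.txt"
  # Write the chapter heading, then append every section file in glob order.
  echo "====== Chapter ${CHAPTER} ======" > "${CHAPTER}.txt"
  cat "${CHAPTER}"-*.txt >> "${CHAPTER}.txt"
done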
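
The shell expands the `-*.txt` glob in lexical order, so a section file ending in `-10.txt` is concatenated before one ending in `-2.txt`. If the section files carry unpadded numeric suffixes (an assumption; the actual file names are not shown here), a variant along these lines may keep the sections in numeric order. `merge_chapter` is a hypothetical helper name, and `sort -V` requires GNU coreutils.

```shell
#!/bin/bash
# Sketch only: merge one chapter directory's section files in version order.
# Assumes files are named <chapter>-<suffix>.txt and contain no newlines in their names.
merge_chapter() {
  local dir=$1
  local chapter
  chapter=$(basename "$dir")
  local out="$dir/$chapter.txt"

  # Write the chapter heading first (overwrites any earlier merge result).
  echo "====== Chapter $chapter ======" > "$out"

  # List the section files, sort them so -2 comes before -10, then append each.
  printf '%s\n' "$dir/$chapter"-*.txt | sort -V | while IFS= read -r f; do
    [ -e "$f" ] || continue   # skip the literal pattern when nothing matched
    cat "$f" >> "$out"
  done
}
```

It would be called once per chapter directory, for example `merge_chapter "$BASE/$i"` inside the loop.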
code/start.txt · Last modified: 2020/10/25 12:57 (external edit)