Using Perl to extract specific text from multiple files

After I finish this book I will need to review it again. I will try to get 80% or more retention before I sit the test. To help me do that, I am following the review spreadsheet provided and will do the mind-map activities.
Another step I will add is to use my vocabulary app to help remember the key terms from the text. They are provided at the end of each chapter:
keywords.png
I have an HTML copy of this book that I read on my tablet, so all I need to do is write a Perl script like this:
[code language="perl"]
use Modern::Perl;
use HTML::Strip;

# Set the relevant UTF-8 modes.
binmode(STDOUT, ':utf8');
binmode(STDIN, ':utf8');

# Append the extracted terms to the word list.
open(my $MYFILE, '>>', 'ccnawordlist.txt')
    or die "Error opening ccnawordlist.txt: $!";
my $strip = HTML::Strip->new();

# Read all files named in @ARGV (or STDIN) into one string.
my $line = '';
while (<<>>) {
    $line .= $_;
}

# This HTML file is hard-wrapped at 72 characters. A line ending in "=" is a
# soft line break (MHTML bodies are usually quoted-printable encoded).
# \R matches any newline sequence, so this joins the wrapped lines back up.
$line =~ s/=\R//g;

# Match everything between these two phrases.
my @matches = $line =~ /Key Terms You Should Know(.*?)Command References/msg
    or die "No key terms section found";
for my $match (@matches) {
    # Strip all HTML formatting.
    my $cleantext = $strip->parse($match);
    $strip->eof;
    print $MYFILE $cleantext;
}
close($MYFILE) or die "Error closing ccnawordlist.txt: $!";
[/code]
Now this code DOESN’T work. I think it’s because of a bug in Perl: as the string
gets too large, Perl stops matching what’s in the regexp and matches everything!
So I wrote this shell script to run it on a single file at a time.
[code language="bash"]
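#!/bin/bash
# Run the Perl script once per file. Quoting "$file" keeps file names
# with spaces intact, so IFS does not need to be changed.
for file in ./*.mhtml; do
    perl build.pl "$file"
done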
[/code]
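The same one-file-at-a-time idea could also be done inside Perl itself, without the shell loop. This is only a minimal sketch using core Perl: `extract_terms` is a name I made up for illustration, and it prints the raw matched HTML rather than running it through HTML::Strip.

```perl
use strict;
use warnings;

# Hypothetical helper: return every block of text found between the two
# marker phrases used in the script above.
sub extract_terms {
    my ($text) = @_;
    # Drop each trailing "=" and its line break, as the original script does.
    $text =~ s/=\R//g;
    # In list context, /g with one capture group returns one element per match.
    return $text =~ /Key Terms You Should Know(.*?)Command References/sg;
}

# Slurp and search each file on its own, so a match can never cross
# from one file into the next.
for my $path (@ARGV) {
    open(my $in, '<', $path) or die "Cannot open $path: $!";
    my $text = do { local $/; <$in> } // '';   # undef $/ reads the whole file
    close($in);
    my @terms = extract_terms($text);
    warn "No key terms section found in $path\n" unless @terms;
    print for @terms;
}
```

Saved under a name of your choosing, it would be run as `perl yourscript.pl ./*.mhtml`. Files without a key terms section only produce a warning instead of stopping the whole run.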
outputlog.png
I will check whether the original program runs on the latest version of Perl, and
if the bug still exists I will report it.
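Before reporting anything, it may be worth ruling out an explanation that is not a Perl bug at all. If one file is missing the closing phrase, a non-greedy match that starts in that file can only finish in a later one once everything is joined into a single string. The sketch below uses two made-up strings in place of real chapter files. It is only a guess at the cause, not something I have confirmed against the book.

```perl
use strict;
use warnings;

# Made-up stand-ins for two chapter files; the first has no closing phrase.
my $chapter1 = "Key Terms You Should Know<p>router</p>Review Questions ";
my $chapter2 = "Intro text Key Terms You Should Know<p>switch</p>Command References";

my $re = qr/Key Terms You Should Know(.*?)Command References/s;

# Searched as one string, the match that starts in chapter 1 only ends
# at the closing phrase in chapter 2, swallowing everything in between.
my @joined = ($chapter1 . $chapter2) =~ /$re/g;

# Searched separately, chapter 1 gives nothing and chapter 2 gives its own terms.
my @separate;
push @separate, $_ =~ /$re/g for $chapter1, $chapter2;

print "Joined:   $joined[0]\n";
print "Separate: $separate[0]\n";
```

If the real files behave like this, running the script on one file at a time would avoid the problem, whatever the version of Perl.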
So at the end of it I only have 400 or so words to work through!
ccnawordlist
But it just goes to show how a little bit of knowledge of a programming language
can help.
