1 Regular Expressions
A regular expression is a pattern that is searched for in a string.
Unix shell wildcards, as in "ls *.*", are similar to regular expressions,
but the syntax of regular expressions is more elaborate.
Several Unix programs (grep, sed, awk, ed, vi, emacs) use regular
expressions and many modern programming languages (such as
Java) also support them.
$line =~/the /
searches for an occurrence of the four-character sequence "the " in the
string in $line. If "!~" were used instead of "=~", the test would
succeed only for strings that do not contain the four-character
sequence "the ".
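The two operators can be compared side by side. The following is a minimal sketch; the sample lines are made up for illustration and do not come from alice.txt. Note that the pattern is not tied to whole words, so "the " is also found inside "bathe ".

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the contents of a file.
my @lines = (
    "Alice saw the rabbit.\n",
    "Down, down, down.\n",
    "They bathe daily.\n",
);

# =~ is true when the pattern occurs somewhere in the string.
my @with    = grep { $_ =~ /the / } @lines;
# !~ is true when the pattern does not occur in the string.
my @without = grep { $_ !~ /the / } @lines;

print "with:    $_" for @with;      # first and third line
print "without: $_" for @without;   # second line
```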
The list of special characters summarises
characters that have special meanings or need to be escaped in regular expressions.
1.1 Example:
#!/usr/local/bin/perl
#
# Regular expressions
#
# reading a file (stop with a message if it cannot be opened):
open(ALICE, "alice.txt") or die "Cannot open alice.txt: $!";
@lines = <ALICE>;
close(ALICE);
# searching the file content line by line:
foreach $line (@lines) {
    if ($line =~ /the /) {
        print $line;
    } # end of if
} # end of foreach
1.2 Exercises
For the exercises you can use the alice.txt file.
(Note for MS Windows users:
It is best to save this file by copying and pasting
it into a Unix editor. If you save the file under DOS/Windows, each
line ends with "\r\n" instead of "\n". You need two chop commands
instead of one chomp command if you want to remove these characters. )
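As an alternative to two chop commands, a single substitution (the s/// operator described in section 2) can strip either kind of line ending. This is only a sketch with made-up sample strings; the variable name @raw is an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical lines: one with a DOS/Windows ending, one with a Unix
# ending, and one with no line ending at all.
my @raw = ("Alice\r\n", "was\n", "beginning");

foreach my $line (@raw) {
    # Remove an optional "\r" followed by "\n" at the end of the line.
    # $line is an alias for the array element, so @raw itself is changed.
    $line =~ s/\r?\n$//;
}

print join("|", @raw), "\n";   # Alice|was|beginning
```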
1) Retrieve all lines from alice.txt that do not contain /the /.
Retrieve all lines that contain "the" with lower or upper case
letters (hint: inserting an "i" after the
expression means "ignore case":/the /i).
2) a) Retrieve lines that contain a word of any length that starts with
t and ends with e. Modify this so that the word has at least
three characters.
b) Retrieve lines that start with a. Retrieve lines that start with a and
end with n. Hint: You need to specify the beginning of the line,
"a", any number of any characters in the middle, "n", end of line.
c) Retrieve blank lines. Think of at least two ways of doing this.
d) Retrieve lines that contain a word that starts with an upper case letter.
3) What is the difference between the following expressions?
a) abc* and (abc)*
b) !/yes/ and /[^y][^e][^s]/
c) [A-Z][a-z]* and [A-Z][a-z]+
2 Substitution and Transliteration
Using
$string_variable =~ s/search_pattern/replace_string/
a search_pattern in $string_variable can be replaced by a replace_string.
s/search_pattern/replace_string/g
stands for global replacement, i.e. not only the first occurrence is
replaced.
s/search_pattern/replace_string/i
stands for ignore case. "gi" stands for global replacement, ignore case.
2.1 Examples
a) s/[Ll][Oo][Nn][Dd][Oo][Nn]/London/g
replaces LOndoN or loNDON etc by London.
This is equivalent to s/london/London/gi.
b) s/Alice/Mary/
replaces the first occurrence of Alice by Mary. To replace every
occurrence, add the g modifier: s/Alice/Mary/g.
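The effect of the modifiers can be checked on a short string. A minimal sketch, assuming a made-up sample sentence; each substitution works on its own copy so that the results can be compared.

```perl
use strict;
use warnings;

# Hypothetical sample text.
my $text = "Alice met alice; ALICE met Alice.";

# Copy $text into a new variable, then substitute in the copy.
(my $first = $text) =~ s/Alice/Mary/;     # no modifier: first match only
(my $all   = $text) =~ s/Alice/Mary/g;    # g: every exact match
(my $any   = $text) =~ s/alice/Mary/gi;   # gi: every match, ignoring case

print "$first\n";   # Mary met alice; ALICE met Alice.
print "$all\n";     # Mary met alice; ALICE met Mary.
print "$any\n";     # Mary met Mary; Mary met Mary.
```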
2.2 Optional: Transliteration
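There is also an operation called "transliteration", which replaces
single characters rather than strings.
$string_variable =~ tr/character_sequence/character_sequence/
Most of the regular expression
special characters are not valid for transliteration, but "-" can
be used to give a range, as in tr/a-z//d, which deletes all lower case
letters. The "d" modifier is required for deletion: without it,
tr/a-z// leaves the string unchanged and only counts the matching characters.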
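The following sketch shows three common uses of tr on a made-up sample word: mapping one range onto another, deleting characters with the d modifier, and counting characters with an empty replacement list.

```perl
use strict;
use warnings;

my $word = "Wonderland";   # hypothetical sample string

(my $upper = $word) =~ tr/a-z/A-Z/;   # map each lower case letter to upper case
(my $kept  = $word) =~ tr/a-z//d;     # d modifier: delete lower case letters
my $count  = ($word =~ tr/a-z//);     # empty replacement, no d: count only

print "$upper\n";   # WONDERLAND
print "$kept\n";    # W
print "$count\n";   # 9
print "$word\n";    # Wonderland (unchanged by the counting form)
```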
2.3 The script to be used for the exercises below
#!/usr/local/bin/perl
#
# Substitution
#
# reading a file (stop with a message if it cannot be opened):
open(ALICE, "alice.txt") or die "Cannot open alice.txt: $!";
@lines = <ALICE>;
close(ALICE);
# replacing in the file content line by line:
foreach $line (@lines) {
    $line =~ s/T/t/g;
    print $line;
} # end of foreach
2.4 Exercises
4) Using the alice.txt file:
a) Replace all upper case A by lower case a.
b) Delete all words with more than 3 characters.
3 Non-greedy Multipliers and Patterns
By default the multipliers * and + are "greedy", which means they
match as many characters as possible. For example,
/(\b.+\b)/ matches from the start of the first word in a line to the
end of the last word.
A question mark after a multiplier forces it to be non-greedy.
Therefore /(\b.+?\b)/ matches only the first word in a line.
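The difference is easiest to see by capturing with both patterns on the same line. A minimal sketch, assuming a made-up sample sentence:

```perl
use strict;
use warnings;

my $line = "The cat sat on the mat.";   # hypothetical sample line

# In list context a match returns the captured groups.
my ($greedy) = $line =~ /(\b.+\b)/;    # runs on to the last word boundary
my ($lazy)   = $line =~ /(\b.+?\b)/;   # stops at the next word boundary

print "greedy: '$greedy'\n";   # greedy: 'The cat sat on the mat'
print "lazy:   '$lazy'\n";     # lazy:   'The'
```

The greedy capture stops before the final full stop because there is no word boundary after it.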
3.1 Exercise
5) Write a replace statement that deletes all HTML markup from a file.
You need non-greedy multipliers because otherwise
the text between tags may be deleted in a line that contains several tags.
3.2 Remembering Patterns
Patterns within parentheses are remembered. Using \1, \2, etc they can
be referred to within the same regular expression (or search expression)
and using $1, $2, etc they can be referred to in a print statement (or
replace string).
3.3 Examples
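/(t.*e)/;
print "$1";
prints the matched string, which starts with "t" and ends with "e".
Because .* is greedy, the match runs from the first "t" to the last "e"
that follows it.
s/(t.*e)/:$1:/g;
places a ":" in front of and after each matched string t...e.
/(...)\1/
matches a three-character string that is immediately repeated.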
s/^(.)(.*)(.)$/$3$2$1/
switches the first and last character of a line.
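The two ways of referring to a remembered pattern can be tried out as follows. This is a sketch with made-up sample strings: \1 is used inside the search expression, and $1, $2, $3 are used afterwards or in the replace string.

```perl
use strict;
use warnings;

my $word = "banana";   # hypothetical sample word

# \1 inside the pattern: two characters followed at once by the same two.
my $repeated = $word =~ /(..)\1/ ? $1 : "";

# $1 and $2 in the replace string: swap two words.
(my $swapped = "Alice Liddell") =~ s/^(\w+) (\w+)$/$2 $1/;

# The last example above: exchange the first and last characters.
(my $switched = "dog") =~ s/^(.)(.*)(.)$/$3$2$1/;

print "$repeated\n";   # an
print "$swapped\n";    # Liddell Alice
print "$switched\n";   # god
```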
3.4 Exercise
6) Print double characters within parentheses "()".
For example, replace "arrived" by "a(rr)ived".
3.5 Optional: Special variables:
$` | contains the string before the pattern |
$& | contains the pattern that is matched |
$' | contains the string after the pattern |
For example:
$line="The cat that sat on the mat.";
$line =~ /c.t/;
$` contains "The "
$& contains "cat"
$' contains " that sat on the mat."
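The example can be run as a small script. In this sketch the variables are copied only inside the "if", since they are only meaningful after a successful match; the names $before, $match and $after are chosen here for illustration.

```perl
use strict;
use warnings;

my $line = "The cat that sat on the mat.";
my ($before, $match, $after) = ("", "", "");

if ($line =~ /c.t/) {
    # Copy the special variables straight away: the next successful
    # match would overwrite them.
    ($before, $match, $after) = ($`, $&, $');
}

print "before: '$before'\n";   # before: 'The '
print "match:  '$match'\n";    # match:  'cat'
print "after:  '$after'\n";    # after:  ' that sat on the mat.'
```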
4 Split and Join
Using @personal = split(/:/, $line);
a string such as
$line = "Caine:Michael:Actor:14, Leafy Drive";
can be split into an array such as
@personal = ("Caine", "Michael", "Actor", "14, Leafy Drive");
Other examples:
@chars = split(//, $word);
@words = split(/ /, $sentence);
@sentences = split(/\./, $paragraph);
"join" does the opposite of "split":
$bigstring = join(":",@personal);
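The round trip from string to array and back can be checked with a short script. A minimal sketch using the string from above; the last two lines use a made-up string to show that split(/ /, ...) splits at every single space, so two spaces in a row produce an empty element, whereas /\s+/ treats a run of whitespace as one separator.

```perl
use strict;
use warnings;

my $line = "Caine:Michael:Actor:14, Leafy Drive";

my @personal = split(/:/, $line);
print scalar(@personal), " fields\n";   # 4 fields
print "$personal[1] $personal[0]\n";    # Michael Caine

# join with the same separator restores the original string.
my $bigstring = join(":", @personal);
print "round trip ok\n" if $bigstring eq $line;

# Hypothetical string with two spaces between the words.
my @single = split(/ /,   "to  be");    # ("to", "", "be")
my @run    = split(/\s+/, "to  be");    # ("to", "be")
print scalar(@single), " ", scalar(@run), "\n";   # 3 2
```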
4.1 Exercises
7) Read the alice.txt file into an array. Chomp it. Using "join"
concatenate it into one string. Then split it into
words (or sentences) and print it one word (sentence) per line.
8) Write a script that takes an HTML source file as input and prints
it so that a newline follows only "closing tags", i.e. tags that are
of the form </...>.