Thursday, February 5, 2009

Ruby - Regex Non-Greedy Operator (?) - Using Ruby to easily scrape/search/spider a web page for "things" - Part 2

Hello again,

This is part 2 of the multi-part post "Using Ruby to easily scrape/search/spider a web page for 'things'".

In this post I will explain the use of the non-greedy regex operator. In the spider example used in part 1, I scraped a page for every HTML heading element (h1, h2, h3, and so on up to h6). More specifically, I was looking for all matches of the regex: <h[0-9]>(.*?)</h[0-9]>

The following breaks down and explains the above regex:
<h[0-9]> → Find <h followed by one digit ([0-9]) followed by >
.*? → Find any character (.) zero or more times (*) non-greedily (?)
</h[0-9]> → Find </h followed by one digit ([0-9]) followed by >

The parentheses () mark what I want to capture from my regex match. In this case, that is the heading text itself; I don't want to capture the <h[0-9]> or the </h[0-9]>.
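To make the capture group concrete, here is a minimal sketch in Ruby. I am assuming a small hard-coded string (the html variable below is made up for illustration) rather than a page fetched the way part 1 did it. When the regex contains a capture group, String#scan returns one array per match holding only the captured text:

```ruby
# Hypothetical sample input; a real spider would fetch this from a page.
html = "<h1>Main Title</h1>\n<p>Some text</p>\n<h2>Sub Title</h2>"

# With a capture group, scan returns an array of arrays:
# one inner array per match, containing just the captured text.
headings = html.scan(/<h[0-9]>(.*?)<\/h[0-9]>/)

headings.each { |captures| puts captures[0] }
```

Note that the forward slash in the closing tag has to be escaped (\/) when the regex is written between /.../ delimiters.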

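The non-greedy operator (?) changes how much a quantifier such as * consumes. A greedy .* grabs as many characters as it can and only gives characters back (backtracks) when the rest of the regex would otherwise fail. A non-greedy .*? does the opposite: it matches as few characters as it can and only extends when the rest of the regex cannot match yet. In the above example, the non-greedy operator stops the .* from running past the first </h[0-9]> to the last one on the line, which would lump several headings into a single match. The following examples demonstrate greedy vs non-greedy:

Example 1: Greedy:
Regex: <h[0-9]>.*</h[0-9]>
Input: <h3>My Title</h3><h3>Other Title</h3>
<h[0-9]> matches: <h3>
.* matches: My Title</h3><h3>Other Title
</h[0-9]> matches: </h3> (the last one)

Example 2: Non-Greedy:
Regex: <h[0-9]>.*?</h[0-9]>
Input: <h3>My Title</h3><h3>Other Title</h3>
<h[0-9]> matches: <h3>
.*? matches: My Title
</h[0-9]> matches: </h3> (the first one)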
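
If you want to see the difference for yourself, the following is a small sketch you can paste into irb. The input strings are made up for illustration; I am assuming two headings on the same line, because that is where the greedy and non-greedy forms give different results. With a single heading, both forms end up capturing the same text, since the greedy .* backtracks until </h[0-9]> can match.

```ruby
# Hypothetical inputs for illustration only.
one_heading  = "<h3>My Title</h3>"
two_headings = "<h3>My Title</h3><h3>Other Title</h3>"

greedy_regex     = /<h[0-9]>(.*)<\/h[0-9]>/
non_greedy_regex = /<h[0-9]>(.*?)<\/h[0-9]>/

# flatten turns scan's array of arrays into a plain list of captures.
single_greedy     = one_heading.scan(greedy_regex).flatten
single_non_greedy = one_heading.scan(non_greedy_regex).flatten
greedy            = two_headings.scan(greedy_regex).flatten
non_greedy        = two_headings.scan(non_greedy_regex).flatten

p single_greedy      # ["My Title"]
p single_non_greedy  # ["My Title"]
p greedy             # ["My Title</h3><h3>Other Title"]
p non_greedy         # ["My Title", "Other Title"]
```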

Further Reading:
Ruby regex, quick reference guide
Ruby-doc, user's guide to regex
Ruby API: Regexp
Rubular: A Ruby regular expression editor (Interactive)

Comments:

simon said...

To save some typing, use \d instead of [0-9]. (There are a few of those, e.g. \s (whitespace), \S (non-whitespace), \w (word char).)

Robert Pyke said...

I will add a link to further reading with regards to regex. That said, I am intentionally choosing not to use many shortcut regex tokens in my post. This is because I am trying to give a "simple" tutorial. I imagine it is easier for a regex newcomer to read regex with character classes in it rather than the regex shortcuts.

Wolf said...

Regular expressions are really wonderful for parsing HTML or matching patterns. I use them a lot when I code. Actually, when I learn any new language, the first thing I check is whether it supports regex or not. I feel at ease when I find that it does.

http://icfun.blogspot.com/2008/04/ruby-regular-expression-handling.html

Here is a post about Ruby regex. I wrote it when I was first learning Ruby regex, so it should be helpful for new coders.
