This is part 2 of the "Using Ruby to easily scrape/search/spider a web page for "things"" multi-part post.
In this post I will explain the use of the non-greedy regex operator. In the spider example used in part 1, I scraped for all uses of the HTML heading element. The HTML heading element includes h1, h2, h3, etc. More specifically, I was looking for all matches of the regex:
<h[0-9]>(.*?)</h[0-9]>

The following breaks down and explains the above regex:

<h[0-9]> → Find <h followed by one digit ([0-9]) followed by >
.*? → Find any character except a newline (.) zero or more times (*) non-greedily (?)
</h[0-9]> → Find </h followed by one digit ([0-9]) followed by >

The parentheses, (), mark what I want to capture from my regex match. In this case, it is the actual heading text; I don't want to capture the <h[0-9]> or the </h[0-9]>.

The non-greedy operator
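(?) means that the quantifier before it should not be greedy: .*? matches as few characters as possible and stops at the first place where the rest of the regex can match. A greedy .* does the opposite; it grabs as many characters as possible and only gives characters back when the rest of the regex would otherwise fail. In the above example, the non-greedy operator was used to prevent the .* from running past the first </h[0-9]> when more than one heading sits on the same line. On an input with a single heading, such as <h3>My Title</h3>, both versions capture My Title. The following examples use an input with two headings to demonstrate greedy vs non-greedy:

Example 1: Greedy:
Regex:
<h[0-9]>.*</h[0-9]>
Input:
<h3>My Title</h3><h4>My Subtitle</h4>
<h[0-9]> matches: <h3>
.* matches: My Title</h3><h4>My Subtitle
</h[0-9]> matches: </h4>

Example 2: Non-Greedy:
Regex:
<h[0-9]>.*?</h[0-9]>
Input:
<h3>My Title</h3><h4>My Subtitle</h4>
<h[0-9]> matches: <h3>
.*? matches: My Title
</h[0-9]> matches: </h3>

Further Reading: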
Ruby regex, quick reference guide
Ruby-doc, user's guide to regex
Ruby API: Regexp
Rubular: A Ruby regular expression editor (Interactive)
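
To see the greedy vs non-greedy difference discussed in this post for yourself, here is a minimal sketch that can be pasted into irb. The sample input, the constant names and the captures helper are made up for illustration; they are not taken from the spider in part 1.

```ruby
# Hypothetical names for the two regexes discussed in this post.
GREEDY     = /<h[0-9]>(.*)<\/h[0-9]>/
NON_GREEDY = /<h[0-9]>(.*?)<\/h[0-9]>/

# Hypothetical helper: returns the text captured by the parentheses
# for every match of regex in html.
def captures(html, regex)
  html.scan(regex).flatten
end

# Made-up sample input: two headings on the same line.
sample = "<h3>My Title</h3><h4>My Subtitle</h4>"

# Greedy: .* runs on to the last closing tag, so there is only one match.
p captures(sample, GREEDY)
# => ["My Title</h3><h4>My Subtitle"]

# Non-greedy: .*? stops at the first closing tag, so each heading is its own match.
p captures(sample, NON_GREEDY)
# => ["My Title", "My Subtitle"]
```

String#scan returns one array per match when the regex contains a capture group, which is why the helper flattens the result.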
4 comments:
To save some typing, use \d instead of [0-9]. (There are a few of those, e.g. \s (whitespace), \S (non-whitespace), \w (word char).)
I will add a link to further reading with regards to regex. That said, I am intentionally choosing not to use many shortcut regex tokens in my post. This is because I am trying to give a "simple" tutorial. I imagine it is easier for a regex newcomer to read regex with character classes in it rather than the regex shortcuts.
Regular expressions are really wonderful for parsing HTML or matching patterns. I use them a lot when I code. Actually, when I learn any new language, the first thing I check is whether it supports regex. I feel at ease when I find that it does.
http://icfun.blogspot.com/2008/04/ruby-regular-expression-handling.html
Here is a post about Ruby regex. I wrote it when I was first learning Ruby regex, so it should be helpful for new coders.