This is an excerpt from the latest version of perlfaq9.pod, which
comes with the standard Perl distribution. These postings aim to
reduce the number of repeated questions as well as allow the community
to review and update the answers. The latest version of the complete
perlfaq is at
http://faq.perl.org .
--------------------------------------------------------------------
9.5: How do I extract URLs?
You can easily extract all sorts of URLs from HTML with
"HTML::SimpleLinkExtor" which handles anchors, images, objects, frames,
and many other tags that can contain a URL. If you need anything more
complex, you can create your own subclass of "HTML::LinkExtor" or
"HTML::Parser". You might even use "HTML::SimpleLinkExtor" as an example
for something specifically suited to your needs.
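As a minimal sketch of the module-based approach, assuming "HTML::SimpleLinkExtor" is installed from CPAN, something like the following should collect the links from a string of HTML. The sample markup and variable names are made up for illustration.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Sample markup, invented for this example.
my $html = <<'HTML';
<html><body>
<a href="http://www.example.com/index.html">Home</a>
<img src="http://www.example.com/logo.png" alt="logo">
<a href="/relative/page.html">Relative</a>
</body></html>
HTML

my $extor = HTML::SimpleLinkExtor->new;
$extor->parse($html);

my @all_links = $extor->links;   # URLs from every tag the module knows about
my @anchors   = $extor->a;       # URLs from <a> tags only
my @images    = $extor->img;     # URLs from <img> tags only

print "$_\n" for @all_links;
```

Relative links come back as written unless you give the constructor a base URL, so resolve them yourself if you need absolute URLs.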
You can use URI::Find to extract URLs from an arbitrary text document.
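A rough sketch of that, assuming "URI::Find" is installed from CPAN, might look like this. The sample text and the @found array are illustrative only. The callback receives a URI object and the original matched text, and whatever it returns replaces the match, so returning the original text leaves the document unchanged.

```perl
use strict;
use warnings;
use URI::Find;

# Sample text, invented for this example.
my $text = <<'TEXT';
The FAQ lives at http://faq.perl.org and the modules are on
https://metacpan.org/ for anyone who wants them.
TEXT

my @found;
my $finder = URI::Find->new(sub {
    my ($uri, $original_text) = @_;
    push @found, "$uri";       # stringify the URI object
    return $original_text;     # leave the text as it was
});

# find() takes a reference to the text and returns how many URIs it saw.
my $count = $finder->find(\$text);

print "$_\n" for @found;
```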
Less complete solutions involving regular expressions can save you a lot
of processing time if you know that the input is simple. One solution
from Tom Christiansen runs 100 times faster than most module-based
approaches but only extracts URLs from anchors where the first attribute
is HREF and there are no other attributes.
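To show the kind of trade-off involved, here is a sketch of a regex with those same restrictions. It is an illustration written for this answer, not necessarily the code referred to above, and the simple_hrefs name and sample markup are hypothetical. It only recognizes anchors of the form <A HREF="...">, with a quoted value and nothing else in the tag.

```perl
use strict;
use warnings;

# Return the URLs of anchors where HREF is the first and only attribute.
# Values must be quoted and may not themselves contain quotes or angle
# brackets; anything else is silently skipped.
sub simple_hrefs {
    my ($html) = @_;
    my @urls;
    while ($html =~ m{
        < \s* A \s+ HREF \s* = \s*    # start of the tag, any letter case
        (["']) ([^"'<>]*) \1          # quoted value, same quote at both ends
        \s* >                         # the tag must end right after the value
    }gsix) {
        push @urls, $2;
    }
    return @urls;
}

# Sample markup, invented for this example.
my $sample = <<'HTML';
<a href="http://www.perl.org/">Perl</a>
<A HREF='http://www.cpan.org/'>CPAN</A>
<a class="nav" href="http://example.com/missed">href is not first</a>
<a href="http://example.com/also-missed" title="x">extra attribute</a>
HTML

my @urls = simple_hrefs($sample);
print "$_\n" for @urls;
```

The last two anchors in the sample are skipped on purpose: that is the price of the speed. If your input can contain such tags, use one of the modules instead.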