Ruby Spider

From Schmid.wiki

REXML, Ruby's XML package, is complex and only parses well-formed XML documents, which many web pages are not. For crawling the web, a quick regex hack like the following may be more suitable:

require 'open-uri'

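# Open the URL, read the page, and scan it for link targets.
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer handles URLs.
# Note the '.*?' non-greedy wildcard: it stops at the first closing quote.
URI.open("http://www.stuff.things").read.scan(/<a.*?href="(.*?)"/).each do |match|
  puts match[0] # match[0] is the captured href value
end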
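
A spider usually needs absolute URLs before it can follow the links it finds. The following is a minimal sketch of one way to do that, not part of the original hack: the helper name extract_links, the sample HTML and the example.com base URL are all made up for illustration. It takes the page as a string, so it can be tried without network access, and it uses a slightly stricter pattern that requires whitespace after "<a".

```ruby
require 'uri'

# Hypothetical helper (not from the original page): collect the href
# values of anchor tags in an HTML string and resolve each one against
# base_url, so relative links become absolute URLs.
def extract_links(html, base_url)
  links = html.scan(/<a\s[^>]*?href="(.*?)"/i).map do |match|
    begin
      URI.join(base_url, match[0]).to_s
    rescue URI::Error
      nil # skip hrefs that are not valid URIs
    end
  end
  links.compact.uniq # drop failures and duplicate links
end

# Illustrative input only; in a real spider this would be the page body.
sample = '<p><a href="/about">About</a> ' \
         '<a class="ext" href="http://other.example/page">Other</a></p>'

extract_links(sample, "http://www.example.com/index.html").each do |link|
  puts link
end
```

Like the one-liner above, this only sees double-quoted href attributes, so it remains a hack rather than a real HTML parser.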