Overview
Ruby's URI module provides classes and methods for handling Uniform Resource Identifiers. The module contains multiple classes representing different URI schemes, with URI::Generic serving as the base class and specialized classes like URI::HTTP, URI::HTTPS, URI::FTP handling specific protocols.
The primary entry points are the URI() and URI.parse() methods, both of which parse a string into a URI object. These objects expose components such as scheme, host, port, path, query, and fragment through accessor methods. Parsing validates syntax and raises on malformed input; percent-encoding and normalization are explicit steps (URI.encode_www_form, #normalize) rather than automatic ones.
require 'uri'
uri = URI('https://example.com/path?query=value#fragment')
# => #<URI::HTTPS https://example.com/path?query=value#fragment>
uri.scheme # => "https"
uri.host # => "example.com"
uri.path # => "/path"
uri.query # => "query=value"
URI objects are mutable: each component has a setter (uri.path = ..., uri.query = ...), so call dup first when the original must stay unchanged. The module also includes utilities for encoding and decoding URI components, resolving relative references, and comparing URIs.
# Building URIs programmatically
uri = URI::HTTPS.build(host: 'api.example.com', path: '/v1/users', query: 'limit=10')
# => #<URI::HTTPS https://api.example.com/v1/users?limit=10>
# Joining paths
base = URI('https://example.com/api/')
relative = URI('users/123')
combined = base + relative
# => #<URI::HTTPS https://example.com/api/users/123>
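Because the component setters change the receiver, taking a copy with dup is the usual way to derive a variant of a URI. The following is a minimal sketch; the URLs and variable names are illustrative only.

```ruby
require 'uri'

original = URI('https://example.com/api/users?limit=10')

# dup returns an independent copy, so the setters below leave `original` alone.
modified = original.dup
modified.path = '/api/accounts'
modified.query = URI.encode_www_form(limit: 25)

puts original # https://example.com/api/users?limit=10
puts modified # https://example.com/api/accounts?limit=25
```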
By default the module parses URIs according to RFC 3986. It does not convert internationalized domain names itself: hosts containing non-ASCII characters must first be converted to their ASCII (Punycode) form by a separate library.
Basic Usage
URI parsing handles most common URL formats through the URI() method or URI.parse(). Both methods accept strings and return an instance of the URI subclass that matches the detected scheme.
# Different URI schemes
http_uri = URI('http://example.com') # => URI::HTTP
https_uri = URI('https://example.com') # => URI::HTTPS
ftp_uri = URI('ftp://files.example.com') # => URI::FTP
file_uri = URI('file:///path/to/file.txt') # => URI::File
mailto_uri = URI('mailto:user@example.com') # => URI::MailTo
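Schemes that have no dedicated class still parse; they come back as URI::Generic and carry no default port. A short sketch, using a made-up ssh URL as the example input:

```ruby
require 'uri'

# ssh has no dedicated class in the URI module, so the result is URI::Generic.
ssh_uri = URI('ssh://git@example.com/repo.git')

puts ssh_uri.class        # URI::Generic
puts ssh_uri.user         # git
puts ssh_uri.host         # example.com
puts ssh_uri.port.inspect # nil (no default port is known for this scheme)
```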
Accessing URI components uses straightforward accessor methods. Each accessor returns a String, except port, which returns an Integer. Components absent from the URI are nil, apart from path, which is an empty string:
uri = URI('https://user:pass@example.com:8080/api/v1/resource?search=ruby&limit=50#results')
uri.scheme # => "https"
uri.userinfo # => "user:pass"
uri.user # => "user"
uri.password # => "pass"
uri.host # => "example.com"
uri.port # => 8080
uri.path # => "/api/v1/resource"
uri.query # => "search=ruby&limit=50"
uri.fragment # => "results"
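Two details of the accessors are easy to miss: host keeps the square brackets around an IPv6 literal while hostname strips them, and port falls back to the scheme's default when the URI does not state one. A small sketch with illustrative URLs:

```ruby
require 'uri'

ipv6 = URI('http://[::1]:3000/status')
puts ipv6.host     # [::1]  (brackets kept)
puts ipv6.hostname # ::1    (brackets removed, suitable for socket APIs)
puts ipv6.port     # 3000

plain = URI('https://example.com')
puts plain.port            # 443 (default for https)
puts plain.query.inspect   # nil (no query present)
puts plain.path.inspect    # "" (path is an empty string, not nil)
```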
Building URIs programmatically uses the build class method. Each URI class accepts different parameters relevant to its scheme:
# HTTP URIs
http_uri = URI::HTTP.build(
host: 'api.example.com',
port: 3000,
path: '/users',
query: 'active=true'
)
# => #<URI::HTTP http://api.example.com:3000/users?active=true>
# HTTPS with authentication
secure_uri = URI::HTTPS.build(
userinfo: 'api_key:secret',
host: 'secure-api.com',
path: '/data'
)
# => #<URI::HTTPS https://api_key:secret@secure-api.com/data>
Query strings are not parsed automatically: uri.query returns the raw string. Decode it with CGI.parse (shown here) or URI.decode_www_form, and build one with URI.encode_www_form:
uri = URI('https://example.com/search?q=ruby+programming&type=tutorial')
query_string = uri.query # => "q=ruby+programming&type=tutorial"
# Manual query parsing
require 'cgi'
params = CGI.parse(query_string)
# => {"q"=>["ruby programming"], "type"=>["tutorial"]}
# Building query strings
new_params = { search: 'rails', category: 'web development' }
uri.query = URI.encode_www_form(new_params)
# => "search=rails&category=web+development"
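The URI module can also decode a query string on its own with URI.decode_www_form, which returns an array of key/value pairs rather than a hash. The sketch below shows two ways to turn those pairs into a hash; the repeated type parameter is an assumption added to illustrate the difference.

```ruby
require 'uri'

uri = URI('https://example.com/search?q=ruby+programming&type=tutorial&type=guide')

# Pairs keep their original order and repeated keys.
pairs = URI.decode_www_form(uri.query)
p pairs # [["q", "ruby programming"], ["type", "tutorial"], ["type", "guide"]]

# to_h keeps only the last value for a repeated key.
p pairs.to_h # {"q"=>"ruby programming", "type"=>"guide"}

# Grouping keeps every value, similar in shape to CGI.parse.
grouped = pairs.group_by(&:first).transform_values { |kv| kv.map(&:last) }
p grouped # {"q"=>["ruby programming"], "type"=>["tutorial", "guide"]}
```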
Path manipulation centers on resolving relative references against a base URI. Resolution also collapses . and .. segments in the reference. #normalize does not: it only lowercases the scheme and host and turns an empty path into /:
base = URI('https://example.com/api/v1/')
relative_path = 'users/123/posts'
full_uri = base + relative_path
# => #<URI::HTTPS https://example.com/api/v1/users/123/posts>
# Dot segments are removed when the path is resolved as a reference
messy_path = '/api/../admin/./users/../roles'
clean_uri = URI('https://example.com/') + messy_path
# => #<URI::HTTPS https://example.com/admin/roles>
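Reference resolution follows the same rules a browser applies to links, so a trailing slash on the base and a leading slash on the reference both change the result. A brief sketch with illustrative URLs:

```ruby
require 'uri'

# Without a trailing slash, the last base segment is replaced.
puts URI('https://example.com/api/v1') + 'users'   # https://example.com/api/users

# With a trailing slash, the reference is appended.
puts URI('https://example.com/api/v1/') + 'users'  # https://example.com/api/v1/users

# A leading slash on the reference replaces the whole base path.
puts URI('https://example.com/api/v1/') + '/users' # https://example.com/users

# URI.join applies the same rules to each argument in turn.
puts URI.join('https://example.com/api/v1/', 'users/', '123')
# https://example.com/api/v1/users/123
```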
Error Handling & Debugging
URI parsing raises URI::InvalidURIError for malformed URI strings. The exception message quotes the offending string, which is usually enough to locate the problem:
begin
  uri = URI('https://example.com:invalid_port/path')
rescue URI::InvalidURIError => e
  puts "Invalid URI: #{e.message}"
  # Prints something like:
  #   Invalid URI: bad URI(is not URI?): "https://example.com:invalid_port/path"
  # (the exact wording varies between Ruby versions)
end
# Common parsing failures, plus one input that parses despite looking wrong
candidates = [
  'http://[invalid-ipv6',                 # Malformed IPv6 literal: raises
  'http://exam ple.com',                  # Space in hostname: raises
  'https://example.com/path with spaces', # Unencoded space in path: raises
  'https://example.com:99999'             # Out-of-range port: parses, so check the range yourself
]
candidates.each do |uri_string|
  begin
    URI(uri_string)
    puts "#{uri_string}: parsed"
  rescue URI::InvalidURIError => e
    puts "#{uri_string}: #{e.class}"
  end
end
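When a malformed URL is an expected input rather than an exceptional one, the begin/rescue can be folded into a small helper. The sketch below uses a hypothetical helper name, parse_or_nil, and only handles String input.

```ruby
require 'uri'

# Hypothetical helper: returns a URI object, or nil when the string is malformed.
def parse_or_nil(uri_string)
  URI.parse(uri_string)
rescue URI::InvalidURIError
  nil
end

good = parse_or_nil('https://example.com/docs')
bad  = parse_or_nil('http://exam ple.com')

puts good.host   # example.com
puts bad.inspect # nil
```

Returning nil keeps call sites short, at the cost of discarding the error message; keep the explicit rescue where the message matters.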
Encoding issues arise when URIs contain special characters. The module provides encoding and decoding methods for different URI components:
# Form encoding: spaces become +, and / is escaped
path_component = "files/my document.pdf"
form_encoded = URI.encode_www_form_component(path_component)
# => "files%2Fmy+document.pdf"
# Path encoding: spaces become %20, and / is kept
uri_path_encoded = URI::RFC2396_Parser.new.escape(path_component)
# => "files/my%20document.pdf"
# Query parameter encoding
query_params = { search: "ruby & rails", filter: "type=tutorial" }
encoded_query = URI.encode_www_form(query_params)
# => "search=ruby+%26+rails&filter=type%3Dtutorial"
uri = URI("https://example.com/search?#{encoded_query}")
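Form encoding writes a space as +, which is only meaningful inside a query string. One way to percent-encode a single path segment without the older escape API is to form-encode it and then rewrite + as %20. The helper name below is hypothetical, and the approach is a sketch rather than the module's own path encoder.

```ruby
require 'uri'

# Hypothetical helper: percent-encode one path segment.
# A literal "+" in the input is already encoded as %2B, so every "+" left
# in the output stands for a space and can safely become %20.
def encode_path_segment(segment)
  URI.encode_www_form_component(segment).gsub('+', '%20')
end

path = ['files', 'my document.pdf'].map { |s| encode_path_segment(s) }.join('/')
puts path # files/my%20document.pdf

# Decoding form data reverses the form encoding.
puts URI.decode_www_form_component('ruby+%26+rails') # ruby & rails
```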
Host validation catches many common mistakes but doesn't verify DNS resolution or reachability:
def validate_uri_host(uri_string)
uri = URI(uri_string)
# Check for required components
raise ArgumentError, "Missing scheme" unless uri.scheme
raise ArgumentError, "Missing host" unless uri.host
# Validate port range
if uri.port && (uri.port < 1 || uri.port > 65535)
raise ArgumentError, "Port out of valid range: #{uri.port}"
end
# Defensive check; the parser normally rejects whitespace before this point
if uri.host.match?(/\s/)
raise ArgumentError, "Host contains whitespace: #{uri.host}"
end
uri
rescue URI::InvalidURIError => e
raise ArgumentError, "Malformed URI: #{e.message}"
end
# Usage
begin
validate_uri_host('https://example.com:8080/path') # Valid
validate_uri_host('https://bad host.com') # Raises ArgumentError
rescue ArgumentError => e
puts "Validation failed: #{e.message}"
end
Debugging URI construction problems often involves inspecting intermediate steps:
def debug_uri_build(components)
puts "Building URI with components: #{components.inspect}"
components.each do |key, value|
puts " #{key}: #{value.inspect} (#{value.class})"
end
begin
uri = URI::HTTPS.build(components)
puts "Result: #{uri}"
puts "Absolute URI: #{uri.to_s}"
return uri
rescue => e
puts "Error: #{e.class}: #{e.message}"
raise
end
end
# Debug problematic URI construction
debug_uri_build(
host: 'example.com',
port: '8080', # String instead of Integer - potential issue
path: '/api/users',
query: { search: 'ruby' } # Hash instead of String - will cause error
)
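The two issues flagged in the call above can be removed before build is called. The following sketch uses a hypothetical helper, normalize_components, that converts the port to an Integer and a Hash query to an encoded string.

```ruby
require 'uri'

# Hypothetical helper: coerce loosely typed components into what build expects.
def normalize_components(components)
  normalized = components.dup
  normalized[:port] = Integer(normalized[:port]) if normalized[:port]
  if normalized[:query].is_a?(Hash)
    normalized[:query] = URI.encode_www_form(normalized[:query])
  end
  normalized
end

uri = URI::HTTPS.build(
  normalize_components(
    host: 'example.com',
    port: '8080',
    path: '/api/users',
    query: { search: 'ruby' }
  )
)
puts uri # https://example.com:8080/api/users?search=ruby
```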
Production Patterns
Web applications frequently process user-supplied URLs requiring validation and normalization. A common pattern involves creating URI wrapper classes that handle validation and provide safe defaults:
class SafeURI
attr_reader :uri, :errors
ALLOWED_SCHEMES = %w[http https].freeze
BLOCKED_HOSTS = %w[localhost 127.0.0.1 0.0.0.0 ::1].freeze
def initialize(uri_string, allowed_schemes: ALLOWED_SCHEMES)
@errors = []
@uri_string = uri_string.to_s.strip
validate_and_parse(allowed_schemes)
end
def valid?
# Return a strict boolean rather than the URI object itself
@errors.empty? && !@uri.nil?
end
def to_s
@uri&.to_s
end
def sanitize_for_redirect
return nil unless valid?
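# Remove credentials for security; clear both parts explicitly,
# since assigning nil to userinfo may leave them in place
sanitized = @uri.dup
sanitized.password = nil
sanitized.user = nil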
# Normalize path
sanitized.path = sanitized.path.squeeze('/')
sanitized.normalize
end
private
def validate_and_parse(allowed_schemes)
if @uri_string.empty?
@errors << 'URI cannot be empty'
return
end
begin
@uri = URI(@uri_string)
rescue URI::InvalidURIError => e
@errors << "Invalid URI format: #{e.message}"
return
end
validate_scheme(allowed_schemes)
validate_host
validate_port
end
def validate_scheme(allowed_schemes)
unless allowed_schemes.include?(@uri.scheme)
@errors << "Scheme '#{@uri.scheme}' not allowed"
end
end
def validate_host
if @uri.host.nil? || @uri.host.empty?
@errors << 'Host is required'
# hostname strips the brackets from IPv6 literals, so "[::1]" matches "::1"
elsif BLOCKED_HOSTS.include?(@uri.hostname.downcase)
@errors << "Host '#{@uri.host}' is blocked"
end
end
def validate_port
if @uri.port && (@uri.port < 1 || @uri.port > 65535)
@errors << "Port #{@uri.port} is out of valid range"
end
end
end
# Usage in web applications
def redirect_to_external(url_param)
safe_uri = SafeURI.new(url_param)
if safe_uri.valid?
redirect_to safe_uri.sanitize_for_redirect.to_s
else
logger.warn "Invalid redirect URL: #{safe_uri.errors.join(', ')}"
redirect_to root_path
end
end
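A fixed list of blocked hosts misses other internal addresses such as 10.0.0.5 or 169.254.169.254. As a hedged sketch, the check can be widened with the standard ipaddr library. The helper name internal_host? is hypothetical, the IPAddr predicates assume Ruby 2.5 or later, and the check does not resolve DNS names, so a public hostname that points at an internal address still passes.

```ruby
require 'uri'
require 'ipaddr'

# Hypothetical helper: true when the URI's host is a literal internal address
# (loopback, private range, or link-local) or the name "localhost".
def internal_host?(uri)
  host = uri.hostname
  return false if host.nil? || host.empty?

  ip = IPAddr.new(host)
  ip.loopback? || ip.private? || ip.link_local?
rescue IPAddr::InvalidAddressError
  # Not an IP literal, so only the well-known local name is treated as internal.
  host.casecmp?('localhost')
end

puts internal_host?(URI('http://127.0.0.1/admin'))   # true
puts internal_host?(URI('http://[::1]/admin'))       # true
puts internal_host?(URI('http://10.0.0.5/metrics'))  # true
puts internal_host?(URI('http://localhost:3000/'))   # true
puts internal_host?(URI('https://example.com/'))     # false
```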
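API clients benefit from centralizing base URL handling, authentication, and parameter encoding in one class:
require 'uri'
require 'net/http'
require 'json'
# Error classes raised by handle_response
class APIError < StandardError; end
class AuthenticationError < APIError; end
class NotFoundError < APIError; end
class APIClient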
def initialize(base_url, api_key: nil)
@base_uri = URI(base_url)
@api_key = api_key
@http = Net::HTTP.new(@base_uri.host, @base_uri.port)
@http.use_ssl = @base_uri.scheme == 'https'
end
def get(path, params: {})
uri = build_request_uri(path, params)
request = Net::HTTP::Get.new(uri)
add_authentication(request)
response = @http.request(request)
handle_response(response)
end
def post(path, data: {}, params: {})
uri = build_request_uri(path, params)
request = Net::HTTP::Post.new(uri)
request.body = JSON.generate(data)
request['Content-Type'] = 'application/json'
add_authentication(request)
response = @http.request(request)
handle_response(response)
end
private
def build_request_uri(path, params)
# Clean and join paths
clean_base_path = @base_uri.path.chomp('/')
clean_path = path.start_with?('/') ? path : "/#{path}"
full_path = "#{clean_base_path}#{clean_path}"
# Build complete URI
uri = @base_uri.dup
uri.path = full_path
uri.query = URI.encode_www_form(params) unless params.empty?
uri.request_uri
end
def add_authentication(request)
if @api_key
request['Authorization'] = "Bearer #{@api_key}"
end
end
def handle_response(response)
case response.code
when '200', '201'
JSON.parse(response.body)
when '401'
raise AuthenticationError, 'API key invalid or missing'
when '404'
raise NotFoundError, 'Resource not found'
else
raise APIError, "HTTP #{response.code}: #{response.message}"
end
end
end
# Usage
client = APIClient.new('https://api.example.com/v2', api_key: ENV['API_KEY'])
users = client.get('users', params: { limit: 100, active: true })
new_user = client.post('users', data: { name: 'John', email: 'john@example.com' })
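The path and query logic of such a client can be checked without any network access, because request_uri returns exactly the request target that is sent on the wire. The sketch below repeats the joining steps in isolation, with an illustrative base URL and parameters.

```ruby
require 'uri'

base = URI('https://api.example.com/v2')

# Same joining steps as a request builder: trim the base path, append the
# resource path, then encode the parameters into the query.
uri = base.dup
uri.path = "#{base.path.chomp('/')}/users"
uri.query = URI.encode_www_form(limit: 100, active: true)

puts uri.request_uri # /v2/users?limit=100&active=true
puts uri             # https://api.example.com/v2/users?limit=100&active=true
```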
Web scraping and crawling applications need robust URL resolution and duplicate detection:
require 'uri'
require 'set' # Needed explicitly on Ruby versions before 3.2
class URLCrawler
def initialize(base_url, max_depth: 3)
@base_uri = URI(base_url)
@visited = Set.new
@queue = [[@base_uri, 0]]
@max_depth = max_depth
end
def crawl(&block)
while @queue.any?
uri, depth = @queue.shift
next if visited?(uri) || depth > @max_depth
mark_visited(uri)
yield uri, depth
# Extract and queue new URLs (implementation would fetch page)
found_urls = extract_urls_from_page(uri)
queue_new_urls(found_urls, depth + 1)
end
end
private
def visited?(uri)
@visited.include?(normalize_for_comparison(uri))
end
def mark_visited(uri)
@visited.add(normalize_for_comparison(uri))
end
def normalize_for_comparison(uri)
# Normalize for deduplication
normalized = uri.normalize
# Default ports (80 for http, 443 for https) need no handling here:
# to_s already omits a port that matches the scheme's default
# Remove trailing slash from path
normalized.path = normalized.path.chomp('/') if normalized.path != '/'
# Remove fragment
normalized.fragment = nil
normalized.to_s
end
def queue_new_urls(urls, depth)
urls.each do |url_string|
begin
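# Resolve relative URLs. This resolves against the crawl root; a complete
# crawler would resolve against the page on which the link was found.
absolute_uri = @base_uri + url_string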
# Only queue URLs from same domain
if same_domain?(absolute_uri)
@queue << [absolute_uri, depth]
end
rescue URI::InvalidURIError
# Skip malformed URLs
next
end
end
end
def same_domain?(uri)
uri.host == @base_uri.host
end
def extract_urls_from_page(uri)
# Placeholder - would actually fetch and parse HTML
[]
end
end
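The deduplication step can be tried on its own. This sketch uses a hypothetical canonical helper that applies the same kind of normalization (lowercase scheme and host, no fragment, no trailing slash) and stores the result in a Set.

```ruby
require 'uri'
require 'set'

# Hypothetical helper: reduce a URL to a canonical string for comparison.
def canonical(uri_string)
  uri = URI(uri_string).normalize # lowercases scheme and host
  uri.fragment = nil
  uri.path = uri.path.chomp('/') unless uri.path == '/'
  uri.to_s                        # to_s omits a port equal to the default
end

seen = Set.new
urls = [
  'http://example.com/docs',
  'HTTP://Example.COM:80/docs/#intro', # same page, written differently
  'http://example.com/blog'
]

urls.each do |url|
  key = canonical(url)
  # Set#add? returns nil when the key is already present.
  puts(seen.add?(key) ? "new:       #{key}" : "duplicate: #{key}")
end
```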
Reference
Core Classes
Class | Purpose | Example |
---|---|---|
URI::Generic | Base class for all URI types | - |
URI::HTTP | HTTP scheme URIs | URI::HTTP.build(host: 'example.com') |
URI::HTTPS | HTTPS scheme URIs | URI::HTTPS.build(host: 'secure.com') |
URI::FTP | FTP scheme URIs | URI::FTP.build(host: 'ftp.example.com') |
URI::File | File scheme URIs | URI::File.build(path: '/path/to/file') |
URI::MailTo | Mailto scheme URIs | URI::MailTo.build(to: 'user@example.com') |
Parsing Methods
Method | Parameters | Returns | Description |
---|---|---|---|
URI(string) | string (String) | URI::Generic subclass | Parse string into appropriate URI object |
URI.parse(string) | string (String) | URI::Generic subclass | Same parsing as URI() |
URI.join(base, *paths) | base (String/URI), paths (String) | URI::Generic subclass | Join base URI with relative paths |
URI.extract(string, schemes = nil) | string (String), schemes (Array) | Array<String> | Extract URI strings from text (obsolete since Ruby 2.2) |
Building Methods
Method | Parameters | Returns | Description |
---|---|---|---|
URI::HTTP.build(hash) | Hash with :userinfo, :host, :port, :path, :query, :fragment | URI::HTTP | Build HTTP URI from components |
URI::HTTPS.build(hash) | Hash with :userinfo, :host, :port, :path, :query, :fragment | URI::HTTPS | Build HTTPS URI from components |
URI::Generic.build(hash) | Hash with URI components | URI::Generic | Build generic URI from components |
Component Access Methods
Method | Returns | Description |
---|---|---|
#scheme | String or nil | URI scheme (http, https, ftp) |
#host | String or nil | Hostname or IP address |
#port | Integer or nil | Port number |
#path | String | Path component (never nil, defaults to "") |
#query | String or nil | Query string without leading ? |
#fragment | String or nil | Fragment without leading # |
#userinfo | String or nil | Complete user information (user:pass) |
#user | String or nil | Username portion of userinfo |
#password | String or nil | Password portion of userinfo |
Comparison and Normalization
Method | Parameters | Returns | Description |
---|---|---|---|
#normalize | None | URI::Generic subclass | Return normalized copy of URI |
#normalize! | None | self | Normalize URI in place |
#==(other) | other (URI) | Boolean | Compare URIs for equality |
#eql?(other) | other (URI) | Boolean | Strict equality comparison |
#hash | None | Integer | Hash value for URI |
String Conversion
Method | Parameters | Returns | Description |
---|---|---|---|
#to_s | None | String | Complete URI as string |
#request_uri | None | String | Path and query (no fragment) for HTTP requests; HTTP and HTTPS URIs only |
Encoding and Decoding
Method | Parameters | Returns | Description |
---|---|---|---|
URI.encode_www_form(params) | params (Array/Hash) | String | Encode parameters as form data |
URI.encode_www_form_component(str) | str (String) | String | Encode single component for forms |
URI.decode_www_form(str) | str (String) | Array<Array> | Decode form data to array of pairs |
URI.decode_www_form_component(str) | str (String) | String | Decode single form component |
Path Operations
Method | Parameters | Returns | Description |
---|---|---|---|
#+(path) | path (String/URI) | URI::Generic subclass | Join URI with relative path |
#merge(path) | path (String/URI) | URI::Generic subclass | Merge with relative or absolute URI |
#route_from(base) | base (URI) | URI::Generic subclass | Calculate relative path from base |
#route_to(target) | target (URI) | URI::Generic subclass | Calculate relative path to target |
Exception Classes
Exception | Inherits From | Raised When |
---|---|---|
URI::Error | StandardError | Base class for the exceptions below |
URI::InvalidURIError | URI::Error | URI string is malformed |
URI::InvalidComponentError | URI::Error | A component passed to a setter or to build is invalid |
URI::BadURIError | URI::Error | A valid URI is used incorrectly, such as merging two relative URIs |
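Since URI::InvalidURIError, URI::InvalidComponentError, and URI::BadURIError all descend from URI::Error, a single rescue clause can cover parsing, building, and merging. The helper name describe_failure below is hypothetical and exists only to show which class each operation raises.

```ruby
require 'uri'

# Hypothetical helper: run the block and report the URI error class, if any.
def describe_failure
  yield
  'ok'
rescue URI::Error => e
  e.class.name
end

# Malformed string
puts describe_failure { URI('http://exam ple.com') }
# URI::InvalidURIError

# Invalid component: an HTTP path must start with a slash
puts describe_failure { URI::HTTP.build(host: 'example.com', path: 'no-slash') }
# URI::InvalidComponentError

# Bad usage: neither side of the merge is absolute
puts describe_failure { URI('a/b') + 'c' }
# URI::BadURIError

puts describe_failure { URI('https://example.com/') }
# ok
```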
Default Ports
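Scheme | Default Port |
---|---|
http | 80 |
https | 443 |
ftp | 21 |
ldap | 389 |
ldaps | 636 |
Schemes without a dedicated URI class, such as ssh (22) and telnet (23), have no built-in default: #port returns nil unless the URI states a port.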
Common Regular Expressions
# RFC 2396 patterns are exposed as strings through URI::RFC2396_Parser#pattern
# (the older URI::REGEXP::PATTERN constants are obsolete)
patterns = URI::RFC2396_Parser.new.pattern
patterns[:ABS_URI] # Absolute URI pattern
patterns[:REL_URI] # Relative URI pattern
patterns[:HOST] # Host pattern
patterns[:UNRESERVED] # Unreserved characters