URI Module

Overview

Ruby's URI module provides classes and methods for handling Uniform Resource Identifiers. The module contains multiple classes representing different URI schemes, with URI::Generic serving as the base class and specialized classes like URI::HTTP, URI::HTTPS, URI::FTP handling specific protocols.

The primary entry points are URI() and URI.parse(), both of which parse a string into a URI object. These objects expose components such as scheme, host, port, path, query, and fragment through accessor methods. Parsing validates syntax; encoding and normalization are explicit steps, performed with helpers such as URI.encode_www_form and URI#normalize.

require 'uri'

uri = URI('https://example.com/path?query=value#fragment')
# => #<URI::HTTPS https://example.com/path?query=value#fragment>

uri.scheme  # => "https"
uri.host    # => "example.com"
uri.path    # => "/path"
uri.query   # => "query=value"

URI objects are mutable: setters such as path= and query= change the object in place, so call dup first when the original must be preserved. The module also includes utilities for encoding and decoding URI components, resolving relative references, and comparing URIs.

# Building URIs programmatically
uri = URI::HTTPS.build(host: 'api.example.com', path: '/v1/users', query: 'limit=10')
# => #<URI::HTTPS https://api.example.com/v1/users?limit=10>

# Joining paths
base = URI('https://example.com/api/')
relative = URI('users/123')
combined = base + relative
# => #<URI::HTTPS https://example.com/api/users/123>
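
As a small sketch of how modification behaves, the following copies a URI before changing it. Setters modify the receiver in place, so the dup call is what keeps the original intact. The host and paths are placeholders.

```ruby
require 'uri'

original = URI('https://example.com/api/users?page=1')

# dup gives an independent copy; the setters below change only the copy
copy = original.dup
copy.path = '/api/posts'
copy.query = 'page=2'

puts original  # https://example.com/api/users?page=1
puts copy      # https://example.com/api/posts?page=2
```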

URI() and URI.parse() have used an RFC 3986 parser since Ruby 2.2. The module does not convert internationalized domain names itself: a host containing non-ASCII characters must first be converted to Punycode by a separate library.

Basic Usage

URI parsing handles most common URL formats through the URI() method or URI.parse(). Both methods accept strings and return appropriate URI subclass instances based on the detected scheme.

# Different URI schemes
http_uri = URI('http://example.com')          # => URI::HTTP
https_uri = URI('https://example.com')        # => URI::HTTPS  
ftp_uri = URI('ftp://files.example.com')      # => URI::FTP
file_uri = URI('file:///path/to/file.txt')    # => URI::File
mailto_uri = URI('mailto:user@example.com')   # => URI::MailTo
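
Schemes without a dedicated class still parse. As a brief illustration (the ssh URL is only a placeholder), such URIs come back as URI::Generic, and the scheme lookup does not depend on letter case:

```ruby
require 'uri'

# No dedicated class is registered for ssh, so the generic class is used
ssh_uri = URI('ssh://git@example.com/repo.git')
ssh_uri.class   # => URI::Generic
ssh_uri.port    # => nil (no default port is known for this scheme)

# Scheme lookup is case-insensitive
upper = URI('HTTPS://example.com')
upper.class     # => URI::HTTPS
```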

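Each component is available through an accessor method. Components return a String, except port, which returns an Integer. Components absent from the URI return nil, although port falls back to the scheme's default when one is known: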

uri = URI('https://user:pass@example.com:8080/api/v1/resource?search=ruby&limit=50#results')

uri.scheme      # => "https"
uri.userinfo    # => "user:pass"
uri.user        # => "user"
uri.password    # => "pass"
uri.host        # => "example.com"
uri.port        # => 8080
uri.path        # => "/api/v1/resource"
uri.query       # => "search=ruby&limit=50"
uri.fragment    # => "results"
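
Two accessor details are easy to miss, so here is a short sketch of them: port falls back to the scheme default when the URI does not specify one, and host keeps the square brackets around an IPv6 literal while hostname removes them. The URLs are placeholders.

```ruby
require 'uri'

# With no explicit port, port falls back to the scheme default
plain = URI('https://example.com/status')
plain.port          # => 443
plain.default_port  # => 443

# host keeps the brackets around an IPv6 literal; hostname removes them
local = URI('http://[::1]:3000/health')
local.host      # => "[::1]"
local.hostname  # => "::1"
local.port      # => 3000
```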

Building URIs programmatically uses the build class method. Each URI class accepts different parameters relevant to its scheme:

# HTTP URIs
http_uri = URI::HTTP.build(
  host: 'api.example.com',
  port: 3000,
  path: '/users',
  query: 'active=true'
)
# => #<URI::HTTP http://api.example.com:3000/users?active=true>

# HTTPS with authentication
secure_uri = URI::HTTPS.build(
  userinfo: 'api_key:secret',
  host: 'secure-api.com',
  path: '/data'
)
# => #<URI::HTTPS https://api_key:secret@secure-api.com/data>

Query strings are not parsed into a hash automatically, but URI.decode_www_form and URI.encode_www_form handle decoding and encoding:

uri = URI('https://example.com/search?q=ruby+programming&type=tutorial')
query_string = uri.query  # => "q=ruby+programming&type=tutorial"

# Decode into key/value pairs, then into a hash
pairs = URI.decode_www_form(query_string)
# => [["q", "ruby programming"], ["type", "tutorial"]]
params = pairs.to_h
# => {"q"=>"ruby programming", "type"=>"tutorial"}

# Building query strings
new_params = { search: 'rails', category: 'web development' }
uri.query = URI.encode_www_form(new_params)
# => "search=rails&category=web+development"
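
Query strings may repeat a key, which a plain hash conversion would silently collapse. The sketch below, using a made-up query string, groups values per key with URI.decode_www_form and then round-trips them through URI.encode_www_form:

```ruby
require 'uri'

query = 'tag=ruby&tag=rails&page=2'

# decode_www_form keeps every pair, including repeated keys
pairs = URI.decode_www_form(query)
# => [["tag", "ruby"], ["tag", "rails"], ["page", "2"]]

# Group values by key so repeated keys are not overwritten
grouped = pairs.each_with_object(Hash.new { |h, k| h[k] = [] }) do |(key, value), acc|
  acc[key] << value
end
# => {"tag"=>["ruby", "rails"], "page"=>["2"]}

# encode_www_form expands array values back into repeated keys
round_trip = URI.encode_www_form(grouped)
# => "tag=ruby&tag=rails&page=2"
```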

Relative references are resolved against a base URI with + (an alias for merge), which also collapses . and .. segments in the resulting path:

base = URI('https://example.com/api/v1/')
relative_path = 'users/123/posts'
full_uri = base + relative_path
# => #<URI::HTTPS https://example.com/api/v1/users/123/posts>

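# Path normalization
# URI#normalize lowercases the scheme and host but keeps dot segments;
# merging the path back onto the URI resolves "." and ".."
messy_path = URI('https://example.com/api/../admin/./users/../roles')
clean_path = messy_path + messy_path.path
# => #<URI::HTTPS https://example.com/admin/roles>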
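
The result of a join depends on whether the base path ends in a slash, which is a frequent source of surprises. A few illustrative cases with URI.join (placeholder URLs):

```ruby
require 'uri'

# Without a trailing slash, the last segment of the base is replaced
no_slash = URI.join('https://example.com/api/v1', 'users').to_s
# => "https://example.com/api/users"

# With a trailing slash, the reference is appended
with_slash = URI.join('https://example.com/api/v1/', 'users').to_s
# => "https://example.com/api/v1/users"

# A leading slash in the reference resets the path to the host root
rooted = URI.join('https://example.com/api/v1/', '/users').to_s
# => "https://example.com/users"

# ".." segments are resolved during the merge
parent = URI.join('https://example.com/api/v1/', '../v2/users').to_s
# => "https://example.com/api/v2/users"
```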

Error Handling & Debugging

URI parsing raises URI::InvalidURIError for malformed URI strings. The exception message includes the offending string, which helps when debugging:

begin
  uri = URI('https://example.com:invalid_port/path')
rescue URI::InvalidURIError => e
  puts "Invalid URI: #{e.message}"
  # => Invalid URI: bad URI(is not URI?): https://example.com:invalid_port/path
  # (the exact wording varies between Ruby versions)
end

# Common problem inputs
candidates = [
  'http://[invalid-ipv6',                  # Malformed IPv6 literal: raises
  'http://exam ple.com',                   # Space in hostname: raises
  'https://example.com/path with spaces',  # Unencoded spaces: raises
  'https://example.com:99999'              # Out-of-range port: parses without error
]

candidates.each do |uri_string|
  begin
    uri = URI(uri_string)
    puts "#{uri_string}: parsed as #{uri.class}"
  rescue URI::InvalidURIError => e
    puts "#{uri_string}: #{e.class}"
  end
end
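
When malformed input is expected, for example with user-supplied links, a small wrapper that returns nil can be more convenient than rescuing at every call site. This is a minimal sketch; parse_or_nil is a hypothetical helper name, not part of the URI module.

```ruby
require 'uri'

# Hypothetical helper: returns a URI, or nil when the string cannot be parsed
def parse_or_nil(string)
  URI.parse(string)
rescue URI::InvalidURIError
  nil
end

good = parse_or_nil('https://example.com/docs')     # => #<URI::HTTPS https://example.com/docs>
bad  = parse_or_nil('https://example.com/my docs')  # => nil
```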

Encoding issues arise when URIs contain special characters. The module provides encoding and decoding methods for different URI components:

# URL encoding for different components
path_component = "files/my document.pdf"
encoded_path = URI.encode_www_form_component(path_component)
# => "files%2Fmy+document.pdf"

# Different encoding for URI paths vs form data
# The RFC 2396 parser escapes unsafe characters such as spaces but keeps "/"
uri_path_encoded = URI::RFC2396_Parser.new.escape(path_component)
# => "files/my%20document.pdf"

# Query parameter encoding
query_params = { search: "ruby & rails", filter: "type=tutorial" }
encoded_query = URI.encode_www_form(query_params)
# => "search=ruby+%26+rails&filter=type%3Dtutorial"

uri = URI("https://example.com/search?#{encoded_query}")
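
One way to encode a path while keeping its slashes is to percent-encode each segment separately. The sketch below assumes ERB::Util.url_encode from the standard library is acceptable for this purpose; encode_path is a hypothetical helper name.

```ruby
require 'erb'
require 'uri'

# Hypothetical helper: percent-encode each path segment, keeping "/" as the separator
def encode_path(path)
  path.split('/', -1).map { |segment| ERB::Util.url_encode(segment) }.join('/')
end

encoded = encode_path('/files/my document.pdf')
# => "/files/my%20document.pdf"

file_uri = URI::HTTPS.build(host: 'example.com', path: encoded)
file_uri.to_s
# => "https://example.com/files/my%20document.pdf"
```

Encoding per segment also means a literal slash inside a file name would be escaped as %2F instead of being read as a separator.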

Host validation catches many common mistakes but doesn't verify DNS resolution or reachability:

def validate_uri_host(uri_string)
  uri = URI(uri_string)
  
  # Check for required components
  raise ArgumentError, "Missing scheme" unless uri.scheme
  raise ArgumentError, "Missing host" unless uri.host
  
  # Validate port range
  if uri.port && (uri.port < 1 || uri.port > 65535)
    raise ArgumentError, "Port out of valid range: #{uri.port}"
  end
  
  # Check for suspicious characters
    if uri.host.match?(/\s/)
    raise ArgumentError, "Host contains whitespace: #{uri.host}"
  end
  
  uri
rescue URI::InvalidURIError => e
  raise ArgumentError, "Malformed URI: #{e.message}"
end

# Usage
begin
  validate_uri_host('https://example.com:8080/path')  # Valid
  validate_uri_host('https://bad host.com')           # Raises ArgumentError
rescue ArgumentError => e
  puts "Validation failed: #{e.message}"
end

Debugging URI construction problems often involves inspecting intermediate steps:

def debug_uri_build(components)
  puts "Building URI with components: #{components.inspect}"
  
  components.each do |key, value|
    puts "  #{key}: #{value.inspect} (#{value.class})"
  end
  
  begin
    uri = URI::HTTPS.build(components)
    puts "Result: #{uri}"
    puts "Absolute URI: #{uri.to_s}"
    return uri
  rescue => e
    puts "Error: #{e.class}: #{e.message}"
    raise
  end
end

# Debug problematic URI construction
debug_uri_build(
  host: 'example.com',
  port: '8080',  # String instead of Integer - accepted and converted to 8080
  path: '/api/users',
  query: { search: 'ruby' }  # Hash instead of String - will cause error
)
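
Setters validate their arguments as well, and they report problems with URI::InvalidComponentError instead of the URI::InvalidURIError raised by parsing. Both inherit from URI::Error, so rescuing that class covers either case. A minimal sketch follows; uri_error_for is a hypothetical helper.

```ruby
require 'uri'

uri = URI('https://example.com/')

# Hypothetical helper: run a block and return the URI error class it raises, if any
def uri_error_for
  yield
  nil
rescue URI::Error => e
  e.class
end

bad_port  = uri_error_for { uri.port = 'not-a-port' }  # => URI::InvalidComponentError
good_port = uri_error_for { uri.port = 8443 }          # => nil
uri.to_s                                               # => "https://example.com:8443/"
```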

Production Patterns

Web applications frequently process user-supplied URLs requiring validation and normalization. A common pattern involves creating URI wrapper classes that handle validation and provide safe defaults:

class SafeURI
  attr_reader :uri, :errors
  
  ALLOWED_SCHEMES = %w[http https].freeze
  BLOCKED_HOSTS = %w[localhost 127.0.0.1 0.0.0.0 ::1].freeze
  
  def initialize(uri_string, allowed_schemes: ALLOWED_SCHEMES)
    @errors = []
    @uri_string = uri_string.to_s.strip
    
    validate_and_parse(allowed_schemes)
  end
  
  def valid?
    @errors.empty? && !@uri.nil?
  end
  
  def to_s
    @uri&.to_s
  end
  
  def sanitize_for_redirect
    return nil unless valid?
    
    # Remove userinfo for security
    sanitized = @uri.dup
    sanitized.userinfo = nil
    
    # Normalize path
    sanitized.path = sanitized.path.squeeze('/')
    sanitized.normalize
  end
  
  private
  
  def validate_and_parse(allowed_schemes)
    if @uri_string.empty?
      @errors << 'URI cannot be empty'
      return
    end
    
    begin
      @uri = URI(@uri_string)
    rescue URI::InvalidURIError => e
      @errors << "Invalid URI format: #{e.message}"
      return
    end
    
    validate_scheme(allowed_schemes)
    validate_host
    validate_port
  end
  
  def validate_scheme(allowed_schemes)
    unless allowed_schemes.include?(@uri.scheme)
      @errors << "Scheme '#{@uri.scheme}' not allowed"
    end
  end
  
  def validate_host
    if @uri.host.nil? || @uri.host.empty?
      @errors << 'Host is required'
    # hostname strips the brackets from IPv6 literals, so "[::1]" matches "::1"
    elsif BLOCKED_HOSTS.include?(@uri.hostname.downcase)
      @errors << "Host '#{@uri.host}' is blocked"
    end
  end
  
  def validate_port
    if @uri.port && (@uri.port < 1 || @uri.port > 65535)
      @errors << "Port #{@uri.port} is out of valid range"
    end
  end
end

# Usage in web applications
def redirect_to_external(url_param)
  safe_uri = SafeURI.new(url_param)
  
  if safe_uri.valid?
    redirect_to safe_uri.sanitize_for_redirect.to_s
  else
    logger.warn "Invalid redirect URL: #{safe_uri.errors.join(', ')}"
    redirect_to root_path
  end
end
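
A fixed list of blocked hosts misses other internal addresses such as 10.0.0.0/8 or 127.0.0.2. As a sketch of one possible extension, assuming the standard ipaddr library is available, the check below classifies IP literals by range. internal_ip_literal? is a hypothetical helper, and it does not resolve DNS names, so a hostname that points at an internal address is not detected.

```ruby
require 'ipaddr'
require 'uri'

# Hypothetical helper: true when the host is a literal loopback, private,
# or link-local IP address. Hostnames are not resolved here.
def internal_ip_literal?(uri)
  address = IPAddr.new(uri.hostname)
  address.loopback? || address.private? || address.link_local?
rescue IPAddr::InvalidAddressError
  false  # not an IP literal (for example, a DNS name)
end

internal_ip_literal?(URI('http://127.0.0.1/admin'))  # => true
internal_ip_literal?(URI('http://10.1.2.3/'))        # => true
internal_ip_literal?(URI('http://[::1]/'))           # => true
internal_ip_literal?(URI('https://example.com/'))    # => false
```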

API clients benefit from a single class that handles base URLs, authentication, and parameter encoding consistently:

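require 'net/http'
require 'json'

# Error classes raised by APIClient#handle_response
class APIError < StandardError; end
class AuthenticationError < APIError; end
class NotFoundError < APIError; end

class APIClient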
  def initialize(base_url, api_key: nil)
    @base_uri = URI(base_url)
    @api_key = api_key
    @http = Net::HTTP.new(@base_uri.host, @base_uri.port)
    @http.use_ssl = @base_uri.scheme == 'https'
  end
  
  def get(path, params: {})
    uri = build_request_uri(path, params)
    request = Net::HTTP::Get.new(uri)
    add_authentication(request)
    
    response = @http.request(request)
    handle_response(response)
  end
  
  def post(path, data: {}, params: {})
    uri = build_request_uri(path, params)
    request = Net::HTTP::Post.new(uri)
    request.body = JSON.generate(data)
    request['Content-Type'] = 'application/json'
    add_authentication(request)
    
    response = @http.request(request)
    handle_response(response)
  end
  
  private
  
  def build_request_uri(path, params)
    # Clean and join paths
    clean_base_path = @base_uri.path.chomp('/')
    clean_path = path.start_with?('/') ? path : "/#{path}"
    full_path = "#{clean_base_path}#{clean_path}"
    
    # Build complete URI
    uri = @base_uri.dup
    uri.path = full_path
    uri.query = URI.encode_www_form(params) unless params.empty?
    uri.request_uri
  end
  
  def add_authentication(request)
    if @api_key
      request['Authorization'] = "Bearer #{@api_key}"
    end
  end
  
  def handle_response(response)
    case response.code
    when '200', '201'
      JSON.parse(response.body)
    when '401'
      raise AuthenticationError, 'API key invalid or missing'
    when '404'
      raise NotFoundError, 'Resource not found'
    else
      raise APIError, "HTTP #{response.code}: #{response.message}"
    end
  end
end

# Usage
client = APIClient.new('https://api.example.com/v2', api_key: ENV['API_KEY'])
users = client.get('users', params: { limit: 100, active: true })
new_user = client.post('users', data: { name: 'John', email: 'john@example.com' })

Web scraping and crawling applications need robust URL resolution and duplicate detection:

require 'set'

class URLCrawler
  def initialize(base_url, max_depth: 3)
    @base_uri = URI(base_url)
    @visited = Set.new
    @queue = [[@base_uri, 0]]
    @max_depth = max_depth
  end
  
  def crawl(&block)
    while @queue.any?
      uri, depth = @queue.shift
      next if visited?(uri) || depth > @max_depth
      
      mark_visited(uri)
      yield uri, depth
      
      # Extract and queue new URLs (implementation would fetch page)
      found_urls = extract_urls_from_page(uri)
      queue_new_urls(found_urls, depth + 1)
    end
  end
  
  private
  
  def visited?(uri)
    @visited.include?(normalize_for_comparison(uri))
  end
  
  def mark_visited(uri)
    @visited.add(normalize_for_comparison(uri))
  end
  
  def normalize_for_comparison(uri)
    # Normalize for deduplication
    normalized = uri.normalize
    
    # Remove default ports
    if (normalized.scheme == 'http' && normalized.port == 80) ||
       (normalized.scheme == 'https' && normalized.port == 443)
      normalized = normalized.dup
      normalized.port = nil
    end
    
    # Remove trailing slash from path
    normalized.path = normalized.path.chomp('/') if normalized.path != '/'
    
    # Remove fragment
    normalized.fragment = nil
    
    normalized.to_s
  end
  
  def queue_new_urls(urls, depth)
    urls.each do |url_string|
      begin
        # Resolve relative URLs against the start URL
        # (a real crawler would resolve against the page the link came from)
        absolute_uri = @base_uri + url_string
        
        # Only queue URLs from same domain
        if same_domain?(absolute_uri)
          @queue << [absolute_uri, depth]
        end
      rescue URI::InvalidURIError
        # Skip malformed URLs
        next
      end
    end
  end
  
  def same_domain?(uri)
    uri.host == @base_uri.host
  end
  
  def extract_urls_from_page(uri)
    # Placeholder - would actually fetch and parse HTML
    []
  end
end
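
To see what the deduplication step buys, here is a standalone sketch of the same normalization idea applied to two spellings of one address. canonical_key is a hypothetical name, and the URLs are placeholders.

```ruby
require 'uri'

# Standalone version of the normalization idea, for illustration
def canonical_key(url)
  uri = URI(url).normalize  # returns a copy with lowercased scheme and host
  uri.fragment = nil        # fragments do not change the fetched resource
  uri.path = uri.path.chomp('/') unless uri.path == '/'
  uri.to_s                  # to_s omits a port equal to the scheme default
end

canonical_key('HTTP://Example.com:80/docs/#intro')  # => "http://example.com/docs"
canonical_key('http://example.com/docs')            # => "http://example.com/docs"
```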

Reference

Core Classes

| Class | Purpose | Example |
| --- | --- | --- |
| URI::Generic | Base class for all URI types | - |
| URI::HTTP | HTTP scheme URIs | URI::HTTP.build(host: 'example.com') |
| URI::HTTPS | HTTPS scheme URIs | URI::HTTPS.build(host: 'secure.com') |
| URI::FTP | FTP scheme URIs | URI::FTP.build(host: 'ftp.example.com') |
| URI::File | File scheme URIs | URI::File.build(path: '/path/to/file') |
| URI::MailTo | Mailto scheme URIs | URI::MailTo.build(to: 'user@example.com') |

Parsing Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| URI(string) | string (String) | URI::Generic subclass | Parse string into appropriate URI object |
| URI.parse(string) | string (String) | URI::Generic subclass | Identical to URI() method |
| URI.join(base, *paths) | base (String/URI), paths (String) | URI::Generic subclass | Join base URI with relative paths |
| URI.extract(string, schemes = nil) | string (String), schemes (Array) | Array<String> | Extract URI strings from text (obsolete since Ruby 2.2) |

Building Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| URI::HTTP.build(hash) | Hash with :scheme, :host, :port, :path, :query | URI::HTTP | Build HTTP URI from components |
| URI::HTTPS.build(hash) | Hash with :scheme, :host, :port, :path, :query | URI::HTTPS | Build HTTPS URI from components |
| URI::Generic.build(hash) | Hash with URI components | URI::Generic | Build generic URI from components |

Component Access Methods

| Method | Returns | Description |
| --- | --- | --- |
| #scheme | String or nil | URI scheme (http, https, ftp) |
| #host | String or nil | Hostname or IP address |
| #port | Integer or nil | Port number |
| #path | String | Path component (never nil, defaults to "") |
| #query | String or nil | Query string without leading ? |
| #fragment | String or nil | Fragment without leading # |
| #userinfo | String or nil | Complete user information (user:pass) |
| #user | String or nil | Username portion of userinfo |
| #password | String or nil | Password portion of userinfo |

Comparison and Normalization

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| #normalize | None | URI::Generic subclass | Return normalized copy of URI |
| #normalize! | None | Unspecified | Normalize URI in place; do not rely on the return value |
| #==(other) | other (URI) | Boolean | Compare URIs for equality |
| #eql?(other) | other (URI) | Boolean | Strict equality comparison |
| #hash | None | Integer | Hash value for URI |

String Conversion

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| #to_s | None | String | Complete URI as string |
| #request_uri | None | String | Path and query, without the fragment, for HTTP requests (URI::HTTP and URI::HTTPS only) |

Encoding and Decoding

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| URI.encode_www_form(params) | params (Array/Hash) | String | Encode parameters as form data |
| URI.encode_www_form_component(str) | str (String) | String | Encode single component for forms |
| URI.decode_www_form(str) | str (String) | Array<Array> | Decode form data to array of pairs |
| URI.decode_www_form_component(str) | str (String) | String | Decode single form component |

Path Operations

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| #+(path) | path (String/URI) | URI::Generic subclass | Join URI with relative path |
| #merge(path) | path (String/URI) | URI::Generic subclass | Merge with relative or absolute URI |
| #route_from(base) | base (URI) | URI::Generic subclass | Calculate relative path from base |
| #route_to(target) | target (URI) | URI::Generic subclass | Calculate relative path to target |

Exception Classes

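| Exception | Inherits From | Raised When |
| --- | --- | --- |
| URI::Error | StandardError | Base class for all URI exceptions |
| URI::InvalidURIError | URI::Error | URI string is malformed |
| URI::InvalidComponentError | URI::Error | A component passed to a setter or to build is invalid |
| URI::BadURIError | URI::Error | URI is used in an invalid way, such as merging two relative URIs |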

Default Ports

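| Scheme | Default Port |
| --- | --- |
| http | 80 |
| https | 443 |
| ftp | 21 |
| ldap | 389 |
| ldaps | 636 |

Schemes without a dedicated class, such as ssh and telnet, parse as URI::Generic and report a nil port unless the URI specifies one.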

Common Regular Expressions

# Pattern source strings from the RFC 2396 parser
# (URI::REGEXP is an obsolete alias for URI::RFC2396_REGEXP)
URI::RFC2396_REGEXP::PATTERN::ABS_URI      # Absolute URI pattern
URI::RFC2396_REGEXP::PATTERN::REL_URI      # Relative URI pattern
URI::RFC2396_REGEXP::PATTERN::HOST         # Host pattern
URI::RFC2396_REGEXP::PATTERN::UNRESERVED   # Unreserved characters