RSS/Atom Feeds

Overview

Ruby supports RSS and Atom feeds through the rss library, which handles parsing, generation, and manipulation of syndication feeds. Since Ruby 3.0 the library ships as a bundled gem rather than a default gem, so projects managed with Bundler need to list rss in their Gemfile. The library supports RSS 0.9x, 1.0, 2.0, and Atom 1.0, detects the format when parsing, and can regenerate a feed in another format.

The RSS module serves as the primary interface, containing parser classes for different feed formats and maker classes for feed generation. The parser detects the feed format and returns a format-specific object: RSS::Rss for RSS 0.9x/2.0, RSS::RDF for RSS 1.0, and RSS::Atom::Feed for Atom. Accessor names and return types differ between these classes.

require 'rss'
require 'open-uri'

# Parse a feed from URL
feed = RSS::Parser.parse(URI.open('https://example.com/feed.xml'))
puts feed.channel.title
puts feed.items.first.title

The library handles XML namespace resolution and the character encoding declared in the XML prolog. Feed objects mirror the structure of their format: RSS elements return plain strings and Time values, while Atom elements return objects whose text is read with content (or href for links).

# Parse from string content
xml_content = File.read('feed.xml')
feed = RSS::Parser.parse(xml_content)

# RSS items return plain values; Atom entries need .content / .href instead
feed.items.each do |item|
  puts item.title
  puts item.link
  puts item.description
end

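Ruby's RSS implementation includes validation, format conversion, and extension support for common RSS modules like Dublin Core and Content. With validation enabled (the default), the parser raises an exception at the first structural problem; disabling validation lets it read feeds that are well-formed XML but not strictly valid. Malformed XML raises RSS::NotWellFormedError regardless of the validation setting.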

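# Parse with validation (older rss releases take the flag as a positional
# second argument instead of the validate: keyword)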
begin
  feed = RSS::Parser.parse(xml_content, validate: true)
rescue RSS::InvalidRSSError => e
  puts "Feed validation failed: #{e.message}"
  # Parse without validation for error recovery
  feed = RSS::Parser.parse(xml_content, validate: false)
end

Basic Usage

Feed parsing begins with the RSS::Parser.parse method, which accepts XML strings, IO objects, URIs, or file paths. The parser detects the format and returns an RSS::Rss, RSS::RDF, or RSS::Atom::Feed object accordingly.

require 'rss'
require 'open-uri'

# Parse from URL
feed = RSS::Parser.parse(URI.open('https://feeds.example.com/news.xml'))

# Parse from file
feed = RSS::Parser.parse(File.read('local_feed.xml'))

# Parse from string
xml_string = '<rss version="2.0">...</rss>'
feed = RSS::Parser.parse(xml_string)

Feed objects provide structured access to metadata and items. RSS feeds expose channel information through the channel property, while Atom feeds expose feed-level elements directly on the feed object. Atom text elements are objects, so read their text with content and link targets with href.

# Access feed metadata
puts feed.channel.title          # RSS format
puts feed.channel.description
puts feed.channel.link
puts feed.channel.language

# Atom feeds access metadata directly
puts feed.title.content          # Atom format
puts feed.subtitle.content
puts feed.link.href

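Item processing iterates over the items collection (named entries in Atom). RSS items return plain strings and Time values, while Atom entries return element objects, so the accessors differ by format and calling an RSS accessor such as pubDate on an Atom entry raises NoMethodError.

feed.items.each do |item|
  if feed.is_a?(RSS::Atom::Feed)
    puts "Title: #{item.title.content}"
    puts "Link: #{item.link.href}"
    puts "Date: #{item.updated.content}"
    puts "Summary: #{item.summary&.content}"
  else
    # RSS 2.0 accessors
    puts "Title: #{item.title}"
    puts "Link: #{item.link}"
    puts "Date: #{item.pubDate}"
    puts "Summary: #{item.description}"
  end
  puts "---"
end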
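Because the accessors differ by format, applications often normalize items into one shape before further processing. The following is a minimal sketch of that idea, not part of the rss library: normalize_items and the two inline sample documents are hypothetical, and only RSS 2.0 and Atom feeds are handled.

```ruby
require 'rss'

# Hypothetical sample documents used only for this illustration.
RSS_SAMPLE = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>RSS Sample</title>
      <link>https://example.com/</link>
      <description>Sample RSS 2.0 feed</description>
      <item>
        <title>RSS item</title>
        <link>https://example.com/rss-item</link>
        <description>From RSS</description>
        <pubDate>Mon, 01 Jan 2024 12:00:00 GMT</pubDate>
      </item>
    </channel>
  </rss>
XML

ATOM_SAMPLE = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <feed xmlns="http://www.w3.org/2005/Atom">
    <title>Example Feed</title>
    <link href="http://example.org/"/>
    <updated>2003-12-13T18:30:02Z</updated>
    <author><name>John Doe</name></author>
    <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
    <entry>
      <title>Atom-Powered Robots Run Amok</title>
      <link href="http://example.org/2003/12/13/atom03"/>
      <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
      <updated>2003-12-13T18:30:02Z</updated>
      <summary>Some text.</summary>
    </entry>
  </feed>
XML

# Convert RSS 2.0 items or Atom entries into plain hashes with the same keys.
def normalize_items(feed)
  case feed
  when RSS::Atom::Feed
    feed.entries.map do |entry|
      {
        title: entry.title&.content,
        link: entry.link&.href,
        date: entry.updated&.content,
        summary: entry.summary&.content
      }
    end
  when RSS::Rss
    feed.items.map do |item|
      { title: item.title, link: item.link, date: item.pubDate, summary: item.description }
    end
  else
    raise ArgumentError, "unsupported feed class: #{feed.class}"
  end
end

rss_items  = normalize_items(RSS::Parser.parse(RSS_SAMPLE))
atom_items = normalize_items(RSS::Parser.parse(ATOM_SAMPLE))

(rss_items + atom_items).each do |item|
  puts "#{item[:title]} <#{item[:link]}>"
end
```

Keeping the format check in one helper means the rest of the application only deals with hashes and never needs to know which feed class produced them.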

Feed generation uses maker classes that provide programmatic feed construction. RSS::Maker.make takes a target format version and yields a maker object whose channel and items are configured through setter methods inside the block.

require 'rss'

# Create RSS 2.0 feed
feed = RSS::Maker.make("2.0") do |maker|
  maker.channel.title = "My Blog"
  maker.channel.description = "Latest blog posts"
  maker.channel.link = "https://myblog.example.com"
  maker.channel.language = "en"
  
  # Add items
  maker.items.new_item do |item|
    item.title = "First Post"
    item.link = "https://myblog.example.com/post/1"
    item.description = "This is my first blog post"
    item.pubDate = Time.now
  end
end

puts feed.to_s
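In practice the items usually come from a collection rather than hand-written calls. The sketch below builds an RSS 2.0 feed from an array of hashes and parses the generated XML back as a sanity check. The posts array and its keys are hypothetical stand-ins for application data.

```ruby
require 'rss'

# Hypothetical post data; in an application this would come from a database.
posts = [
  { title: "First Post",  url: "https://myblog.example.com/post/1",
    summary: "This is my first blog post", published: Time.utc(2024, 1, 1) },
  { title: "Second Post", url: "https://myblog.example.com/post/2",
    summary: "A follow-up post", published: Time.utc(2024, 1, 2) }
]

feed = RSS::Maker.make("2.0") do |maker|
  maker.channel.title = "My Blog"
  maker.channel.description = "Latest blog posts"
  maker.channel.link = "https://myblog.example.com"

  posts.each do |post|
    maker.items.new_item do |item|
      item.title = post[:title]
      item.link = post[:url]
      item.description = post[:summary]
      item.updated = post[:published]  # emitted as pubDate in RSS 2.0 output
    end
  end
end

xml = feed.to_s

# Round-trip: the generated XML should parse back into the same items.
reparsed = RSS::Parser.parse(xml)
reparsed.items.each { |item| puts "#{item.title} -> #{item.link}" }
```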

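Format conversion goes through the maker interface: parse an existing feed, then copy its fields into a maker for the target format. Atom requires an id, an author, and an updated timestamp at the feed level, so supply them when the RSS source has no equivalent; otherwise the maker raises an error for the missing values.

# Convert RSS to Atom (rss_content holds RSS 2.0 XML loaded elsewhere)
rss_feed = RSS::Parser.parse(rss_content)

atom_feed = RSS::Maker.make("atom") do |maker|
  maker.channel.title = rss_feed.channel.title
  maker.channel.description = rss_feed.channel.description
  maker.channel.link = rss_feed.channel.link
  # Required by Atom; the fallback values below are placeholders
  maker.channel.about = rss_feed.channel.link
  maker.channel.author = rss_feed.channel.managingEditor || "Unknown"
  maker.channel.updated = rss_feed.channel.lastBuildDate || Time.now
  
  rss_feed.items.each do |rss_item|
    maker.items.new_item do |item|
      item.title = rss_item.title
      item.link = rss_item.link
      item.description = rss_item.description
      item.updated = rss_item.pubDate || Time.now
    end
  end
end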

Error Handling & Debugging

RSS parsing encounters various error conditions including malformed XML, invalid feed structures, encoding issues, and network failures. Ruby's RSS library provides specific exception classes for different error types and validation levels.

require 'rss'
require 'open-uri'

def parse_feed_safely(source)
  RSS::Parser.parse(source, validate: true)
rescue RSS::InvalidRSSError => e
  puts "Invalid RSS structure: #{e.message}"
  # Attempt parsing without validation
  begin
    feed = RSS::Parser.parse(source, validate: false)
    puts "Parsed with validation disabled"
    feed
  rescue RSS::Error => e
    puts "RSS parsing failed completely: #{e.message}"
    nil
  end
rescue RSS::NotWellFormedError => e
  # Malformed XML cannot be recovered by turning validation off
  puts "Malformed XML: #{e.message}"
  nil
rescue OpenURI::HTTPError => e
  puts "HTTP error fetching feed: #{e.message}"
  nil
rescue SocketError => e
  puts "Network error: #{e.message}"
  nil
rescue StandardError => e
  puts "Unexpected error: #{e.message}"
  nil
end

Encoding problems occur frequently with international feeds. The XML parser relies on the encoding declared in the XML prolog, so feeds that declare the wrong encoding or mix encodings can fail to parse. Re-encoding the raw bytes before parsing helps with such feeds.

def handle_encoding_issues(xml_content)
  # Try parsing with detected encoding
  begin
    return RSS::Parser.parse(xml_content)
  rescue RSS::NotWellFormedError => e
    puts "Encoding issue detected: #{e.message}"
  end
  
  # Reinterpret the bytes as UTF-8 (dup so the caller's string is not mutated)
  begin
    utf8_content = xml_content.dup.force_encoding('UTF-8')
    return RSS::Parser.parse(utf8_content)
  rescue RSS::Error
    # Try common problematic encodings
    ['ISO-8859-1', 'Windows-1252'].each do |encoding|
      begin
        converted = xml_content.encode('UTF-8', encoding, invalid: :replace, undef: :replace)
        return RSS::Parser.parse(converted)
      rescue RSS::Error
        next
      end
    end
  end
  
  raise "Unable to parse feed with any encoding"
end
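The re-encoding step can also be separated from parsing, which makes it easier to test. The sketch below is one possible approach using only core String methods: to_utf8 and its fallbacks parameter are hypothetical names, and the fallback order is an assumption that suits Western European feeds.

```ruby
# Return a valid UTF-8 copy of raw, guessing the source encoding if needed.
def to_utf8(raw, fallbacks: ['Windows-1252', 'ISO-8859-1'])
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  fallbacks.each do |name|
    begin
      # Interpret the bytes as the fallback encoding, then transcode to UTF-8.
      return raw.dup.force_encoding(name).encode(Encoding::UTF_8)
    rescue EncodingError
      next # bytes are not valid in this encoding; try the next candidate
    end
  end

  # Last resort: keep the text but replace bytes that cannot be interpreted.
  utf8.scrub('?')
end

latin1_bytes = "caf\xE9".b            # "café" encoded as ISO-8859-1 / Windows-1252
puts to_utf8(latin1_bytes)
puts to_utf8("already valid UTF-8")
```

If the XML prolog still declares the old encoding after conversion, the parser may try to transcode again, so the declaration may need to be rewritten to UTF-8 as well.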

XML parsing errors result from malformed markup, unclosed tags, or invalid characters. The RSS library wraps the underlying XML parser's error in RSS::NotWellFormedError. The message may include position information, but its wording and accuracy depend on the XML backend, so code that extracts it should handle its absence.

def debug_xml_structure(xml_content)
  RSS::Parser.parse(xml_content, validate: true)
rescue RSS::NotWellFormedError => e
  # Position details come from the XML backend; treat them as a hint only
  line_num = e.message[/line:?\s*(\d+)/i, 1]&.to_i
  column_num = e.message[/column:?\s*(\d+)/i, 1]&.to_i
  lines = xml_content.split("\n")

  if line_num && line_num.between?(1, lines.length)
    problematic_line = lines[line_num - 1]

    puts "XML error near line #{line_num}:"
    puts problematic_line
    puts " " * (column_num - 1) + "^" if column_num&.between?(1, problematic_line.length)
    puts "Context:"

    # Show surrounding lines
    start_line = [0, line_num - 3].max
    end_line = [lines.length - 1, line_num + 2].min

    (start_line..end_line).each do |i|
      marker = i == line_num - 1 ? ">>> " : "    "
      puts "#{marker}#{i + 1}: #{lines[i]}"
    end
  else
    puts "XML error (no usable position reported): #{e.message}"
  end

  raise
end
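To see what a well-formedness failure looks like in practice, the following sketch feeds the parser a deliberately broken document (a title element that is never closed) and captures the resulting exception. The sample string is made up for the illustration.

```ruby
require 'rss'

# <title> is never closed, so this is not well-formed XML.
broken = '<rss version="2.0"><channel><title>Broken</channel></rss>'

error = begin
  RSS::Parser.parse(broken)
  nil
rescue RSS::NotWellFormedError => e
  e
end

puts error.class
puts error.message
```

Turning validation off does not help here, because the XML parser fails before any feed-level validation takes place.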

Validation debugging involves examining feed structure compliance with RSS and Atom specifications. When strict parsing fails, the parser raises a specific exception for the first violation it encounters, identifying the missing required element, the unexpected tag, or the missing attribute. Later violations are not reported until the first one is fixed.

def validate_feed_structure(xml_content)
  errors = []
  
  begin
    feed = RSS::Parser.parse(xml_content, validate: true)
    puts "Feed validation successful"
    return true
  rescue RSS::MissingTagError => e
    errors << "Missing required tag: #{e.tag}"
  rescue RSS::TooMuchTagError => e
    errors << "Too many instances of tag: #{e.tag}"
  rescue RSS::MissingAttributeError => e
    errors << "Missing required attribute: #{e.attribute} in tag #{e.tag}"
  rescue RSS::UnknownTagError => e
    errors << "Unknown tag: #{e.tag}"
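  rescue RSS::InvalidRSSError => e
    errors << "Invalid RSS structure: #{e.message}"
  rescue RSS::NotWellFormedError => e
    # Not a validation problem: the XML itself is malformed
    errors << "Malformed XML: #{e.message}"
  end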
  
  puts "Validation errors found:"
  errors.each { |error| puts "  - #{error}" }
  
  # Check if parseable without validation
  begin
    RSS::Parser.parse(xml_content, validate: false)
    puts "Feed is parseable but not strictly valid"
  rescue RSS::Error => e
    puts "Feed is completely unparseable: #{e.message}"
  end
  
  false
end
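A concrete case helps when reading these exceptions. The sketch below parses a made-up RSS 2.0 document whose channel omits the required description element, first with validation (the default) and then without it.

```ruby
require 'rss'

# RSS 2.0 requires <title>, <link>, and <description> inside <channel>;
# this hypothetical sample leaves out <description>.
incomplete = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Incomplete Feed</title>
      <link>https://example.com/</link>
      <item>
        <title>Only item</title>
        <link>https://example.com/1</link>
      </item>
    </channel>
  </rss>
XML

strict_error = begin
  RSS::Parser.parse(incomplete)   # validation is on by default
  nil
rescue RSS::InvalidRSSError => e
  e
end

puts "#{strict_error.class}: #{strict_error.message}"
# Tag-related errors carry the offending tag name.
puts "Tag: #{strict_error.tag}" if strict_error.respond_to?(:tag)

# The same document is readable once validation is switched off.
lenient = RSS::Parser.parse(incomplete, validate: false)
puts lenient.channel.title
puts lenient.items.size
```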

Production Patterns

RSS feed processing in production environments requires robust error handling, caching strategies, performance optimization, and monitoring. Production systems handle feed updates, content extraction, and integration with web applications and background job systems.

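require 'rss'
require 'open-uri'

# extract_content, extract_date, extract_guid, and store_item are
# application-specific helpers that the host application must define.
class FeedProcessor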
  attr_reader :url, :last_updated, :etag, :last_modified
  
  def initialize(url)
    @url = url
    @last_updated = nil
    @etag = nil
    @last_modified = nil
  end
  
  def fetch_updates
    headers = {}
    headers['If-None-Match'] = @etag if @etag
    headers['If-Modified-Since'] = @last_modified if @last_modified
    
    begin
      response = URI.open(@url, headers)
      
      # Update cache headers
      @etag = response.meta['etag']
      @last_modified = response.meta['last-modified']
      @last_updated = Time.now
      
      parse_and_process(response.read)
      
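    rescue OpenURI::HTTPError => e
      # e.io.status is [code, reason], e.g. ["304", "Not Modified"];
      # matching the code avoids false hits on digits elsewhere in the message
      case e.io.status.first
      when '304'
        puts "Feed not modified since last fetch"
        return :not_modified
      when '404'
        puts "Feed not found: #{@url}"
        return :not_found
      else
        puts "HTTP error: #{e.message}"
        return :error
      end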
    end
  end
  
  private
  
  def parse_and_process(content)
    feed = RSS::Parser.parse(content, validate: false)
    process_items(feed.items)
  rescue RSS::Error => e
    puts "Feed parsing error: #{e.message}"
    return :parse_error
  end
  
  def process_items(items)
    items.each do |item|
      # Extract and store item data
      item_data = {
        title: item.title,
        link: item.link,
        content: extract_content(item),
        published_at: extract_date(item),
        guid: extract_guid(item)
      }
      
      store_item(item_data)
    end
  end
end
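FeedProcessor calls extract_content, extract_date, and extract_guid without defining them. One possible shape for these helpers is sketched below for RSS 2.0 items only. The ItemExtraction module name and the fallback rules (description when there is no content:encoded, link when there is no guid) are assumptions, not something the source prescribes.

```ruby
require 'rss'

# Hypothetical helpers for RSS 2.0 items.
module ItemExtraction
  module_function

  # Prefer full content (content:encoded) and fall back to the description.
  def extract_content(item)
    full = item.respond_to?(:content_encoded) ? item.content_encoded : nil
    full.nil? || full.empty? ? item.description : full
  end

  # Prefer pubDate and fall back to Dublin Core's dc:date when available.
  def extract_date(item)
    item.pubDate || (item.respond_to?(:dc_date) ? item.dc_date : nil)
  end

  # Prefer the guid text and fall back to the link as an identifier.
  def extract_guid(item)
    guid = item.guid
    guid && guid.content ? guid.content : item.link
  end
end

sample = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Sample</title>
      <link>https://example.com/</link>
      <description>Sample feed</description>
      <item>
        <title>With guid</title>
        <link>https://example.com/1</link>
        <description>Summary one</description>
        <guid isPermaLink="false">post-1</guid>
        <pubDate>Mon, 01 Jan 2024 12:00:00 GMT</pubDate>
      </item>
      <item>
        <title>Without guid</title>
        <link>https://example.com/2</link>
        <description>Summary two</description>
      </item>
    </channel>
  </rss>
XML

items = RSS::Parser.parse(sample).items
items.each do |item|
  puts [ItemExtraction.extract_guid(item),
        ItemExtraction.extract_date(item),
        ItemExtraction.extract_content(item)].inspect
end
```

Falling back to the link keeps the uniqueness check on guid workable for feeds that omit the element, at the cost of treating a changed URL as a new item.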

Background job integration handles feed processing asynchronously to avoid blocking web requests. Jobs manage feed fetching, parsing, content extraction, and database updates with proper error handling and retry logic.

class FeedUpdateJob
  include Sidekiq::Worker
  sidekiq_options retry: 3, dead: false
  
  def perform(feed_id)
    feed = Feed.find(feed_id)
    processor = FeedProcessor.new(feed.url)
    
    result = processor.fetch_updates
    
    case result
    when :not_modified
      feed.touch(:last_checked_at)
    when :not_found
      feed.increment!(:not_found_count)
      disable_feed_if_needed(feed)
    when :error, :parse_error
      feed.increment!(:error_count)
      schedule_retry_if_needed(feed)
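    else
      # Record the check time too, otherwise the feed stays "due for update"
      feed.update!(
        last_checked_at: Time.current,
        last_successful_fetch_at: Time.current,
        error_count: 0,
        not_found_count: 0
      )
    end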
    
  rescue StandardError => e
    Rails.logger.error "Feed update failed for feed #{feed_id}: #{e.message}"
    raise e
  end
  
  private
  
  def disable_feed_if_needed(feed)
    if feed.not_found_count >= 5
      feed.update!(active: false)
      NotificationMailer.feed_disabled(feed).deliver_now
    end
  end
  
  def schedule_retry_if_needed(feed)
    if feed.error_count < 10
      delay = [feed.error_count * 30, 3600].min
      FeedUpdateJob.perform_in(delay.seconds, feed.id)
    end
  end
end
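The retry delay in schedule_retry_if_needed grows linearly with the error count and is capped at one hour. The arithmetic can be checked in isolation with a small sketch; retry_delay_seconds and its keyword parameters are hypothetical names that mirror the constants used in the job.

```ruby
# Linear backoff: step seconds per recorded error, capped at cap seconds.
def retry_delay_seconds(error_count, step: 30, cap: 3600)
  [error_count * step, cap].min
end

[1, 5, 9, 200].each do |count|
  puts "#{count} errors -> #{retry_delay_seconds(count)}s"
end
```

Since the job only reschedules while error_count is below 10, the longest delay it actually uses is 270 seconds and the one-hour cap is never reached. Raising the step or switching to exponential growth would space out retries for persistently failing feeds.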

Rails integration involves creating models for feeds and items with proper associations, validations, and callback handling. ActiveRecord provides persistence while background jobs handle the actual feed processing.

class Feed < ApplicationRecord
  has_many :items, dependent: :destroy
  
  validates :url, presence: true, uniqueness: true
  validates :title, presence: true
  
  scope :active, -> { where(active: true) }
  # Include feeds that have never been checked (last_checked_at is NULL)
  scope :due_for_update, -> { where('last_checked_at IS NULL OR last_checked_at < ?', 1.hour.ago) }
  
  after_create :schedule_initial_fetch
  
  def fetch_updates!
    FeedUpdateJob.perform_async(id)
  end
  
  def self.schedule_updates
    active.due_for_update.find_each(&:fetch_updates!)
  end
  
  private
  
  def schedule_initial_fetch
    FeedUpdateJob.perform_async(id)
  end
end

class Item < ApplicationRecord
  belongs_to :feed
  
  validates :title, presence: true
  validates :guid, uniqueness: { scope: :feed_id }
  
  scope :recent, -> { order(published_at: :desc) }
  scope :published_since, ->(date) { where('published_at > ?', date) }
  
  before_create :extract_content_preview
  
  private
  
  def extract_content_preview
    if content.present?
      self.preview = ActionController::Base.helpers.strip_tags(content).truncate(200)
    end
  end
end

Monitoring and alerting track feed health, processing performance, and error rates. Production systems need visibility into feed update frequency, parsing success rates, and content quality metrics.

class FeedMonitor
  def self.health_check
    active_count = Feed.active.count
    active_items = Item.joins(:feed).where(feeds: { active: true }).count

    stats = {
      total_feeds: Feed.count,
      active_feeds: active_count,
      feeds_due_for_update: Feed.due_for_update.count,
      feeds_with_recent_errors: Feed.where('error_count > 0').count,
      # Avoid dividing by zero when no feeds are active
      average_items_per_feed: active_count.zero? ? 0.0 : active_items.to_f / active_count,
      last_successful_update: Feed.maximum(:last_successful_fetch_at)
    }
    
    # Check for concerning metrics
    alerts = []
    alerts << "Many feeds due for update" if stats[:feeds_due_for_update] > stats[:active_feeds] * 0.5
    alerts << "High error rate" if stats[:feeds_with_recent_errors] > stats[:active_feeds] * 0.1
    # last_successful_update is nil until the first successful fetch
    last_update = stats[:last_successful_update]
    alerts << "No recent updates" if last_update.nil? || last_update < 2.hours.ago
    
    { stats: stats, alerts: alerts }
  end
  
  def self.performance_metrics
    {
      # average_processing_time is an application-defined method, not part of Sidekiq
      avg_processing_time: FeedUpdateJob.average_processing_time,
      job_queue_size: Sidekiq::Queue.new('default').size,
      failed_jobs_count: Sidekiq::RetrySet.new.size,
      items_processed_today: Item.where('created_at > ?', 1.day.ago).count
    }
  end
end

Reference

Core Classes and Modules

Class/Module | Purpose | Key Methods
--- | --- | ---
RSS | Main module containing all RSS functionality | ::Parser, ::Maker
RSS::Parser | Feed parsing functionality | ::parse(source, validate: true)
RSS::Maker | Feed generation functionality | ::make(version, &block)
RSS::Rss | RSS format feed objects | #channel, #items, #version
RSS::Atom::Feed | Atom format feed objects | #title, #entries, #updated

Parser Methods

Method | Parameters | Returns | Description
--- | --- | --- | ---
RSS::Parser.parse(source, validate: true) | source (String/URI/IO), validate (Boolean) | Feed object | Parse RSS/Atom feed from source
RSS::Parser.parse(source, false) | source (String/URI/IO), do_validate (positional Boolean) | Feed object | Legacy positional form of the validation flag

Feed Object Properties

Property | RSS Access | Atom Access | Returns | Description
--- | --- | --- | --- | ---
Title | feed.channel.title | feed.title.content | String | Feed title
Description | feed.channel.description | feed.subtitle.content | String | Feed description
Link | feed.channel.link | feed.link.href | String | Feed homepage URL
Language | feed.channel.language | feed.lang | String | Feed language code
Copyright | feed.channel.copyright | feed.rights.content | String | Copyright information
Items | feed.items | feed.entries | Array | Collection of feed items

Item Object Properties

Property | RSS Access | Atom Access | Returns | Description
--- | --- | --- | --- | ---
title | item.title | entry.title.content | String | Item title
link | item.link | entry.link.href | String | Item URL
description | item.description | entry.summary.content | String | Item summary/description
content | item.content_encoded | entry.content.content | String | Full item content
pubDate | item.pubDate | entry.published.content | Time | Publication date
guid | item.guid.content | entry.id.content | String | Unique identifier
author | item.author | entry.author.name.content | String | Item author
category | item.category | entry.category.term | String | Item category/tag

Maker Interface

Method | Parameters | Returns | Description
--- | --- | --- | ---
RSS::Maker.make(version, &block) | version (String), block | Feed object | Create new feed of specified version
maker.channel.title = value | value (String) | String | Set feed title
maker.channel.description = value | value (String) | String | Set feed description
maker.channel.link = value | value (String) | String | Set feed link
maker.items.new_item(&block) | block | Item object | Add new item to feed

Supported Feed Versions

Version String | Feed Format | Description
--- | --- | ---
"0.91" | RSS 0.91 | Early RSS format
"0.92" | RSS 0.92 | Enhanced RSS 0.91
"1.0" | RSS 1.0 | RDF-based RSS
"2.0" | RSS 2.0 | Most common RSS format
"atom" | Atom 1.0 | IETF Atom Syndication Format

Exception Hierarchy

Exception | Parent | Description
--- | --- | ---
RSS::Error | StandardError | Base RSS exception
RSS::InvalidRSSError | RSS::Error | Invalid feed structure
RSS::NotWellFormedError | RSS::Error | Malformed XML (not a validation error, so rescuing RSS::InvalidRSSError does not catch it)
RSS::MissingTagError | RSS::InvalidRSSError | Required tag missing
RSS::TooMuchTagError | RSS::InvalidRSSError | Too many tag instances
RSS::MissingAttributeError | RSS::InvalidRSSError | Required attribute missing
RSS::UnknownTagError | RSS::InvalidRSSError | Unrecognized tag found

Common Validation Options

Option | Type | Default | Description
--- | --- | --- | ---
validate | Boolean | true | Enable strict RSS/Atom validation
do_validate | Boolean | true | Legacy positional form of validate (second argument)
ignore_unknown_element | Boolean | true | Skip unknown XML elements instead of raising RSS::UnknownTagError
compatible | Boolean | false | Enable compatibility mode for malformed feeds

Content Extraction Patterns

Pattern | Usage | Description
--- | --- | ---
Plain text extraction | item.description | Standard RSS description field
HTML content | item.content_encoded | Full HTML content from RSS
Atom content | entry.content.content | Atom entry content
Summary text | entry.summary.content | Atom entry summary
CDATA handling | Automatic | XML CDATA sections parsed automatically

Date Handling

Format | RSS Field | Atom Field | Ruby Conversion
--- | --- | --- | ---
RFC 822 | pubDate | N/A | Time.parse(date_string)
ISO 8601 | N/A | published, updated | Time.iso8601(date_string)
Custom parsing | Various | Various | DateTime.strptime(date, format)

Feed Detection Patterns

# Detect feed format from content
def detect_feed_format(content)
  case content
  when /<rss/i
    'RSS'
  when /<feed\b[^>]*xmlns[^>]*atom/i   # [^>]* also spans line breaks inside the tag
    'Atom'
  when /<rdf:RDF/i
    'RSS 1.0'
  else
    'Unknown'
  end
end
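Regex sniffing works on raw text before parsing. Once a document has been parsed, the class of the returned object is a more reliable signal. The sketch below shows that alternative: feed_format is a hypothetical helper, and the two inline documents are minimal samples parsed with validation off.

```ruby
require 'rss'

# Map the object returned by RSS::Parser.parse to a format label.
def feed_format(feed)
  case feed
  when RSS::Atom::Feed  then 'Atom'
  when RSS::Atom::Entry then 'Atom entry'
  when RSS::RDF         then 'RSS 1.0'
  when RSS::Rss         then 'RSS'
  else 'Unknown'
  end
end

rss_xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Sample</title>
      <link>https://example.com/</link>
      <description>Sample feed</description>
    </channel>
  </rss>
XML

atom_xml = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <feed xmlns="http://www.w3.org/2005/Atom">
    <title>Sample</title>
    <id>urn:example:feed</id>
    <updated>2024-01-01T00:00:00Z</updated>
    <author><name>Example</name></author>
  </feed>
XML

rss_label  = feed_format(RSS::Parser.parse(rss_xml, validate: false))
atom_label = feed_format(RSS::Parser.parse(atom_xml, validate: false))

puts rss_label
puts atom_label
```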

Performance Considerations

Aspect | Recommendation | Impact
--- | --- | ---
Validation | Disable for production parsing | Faster parsing (the 2-3x figure is indicative; benchmark with your own feeds)
Encoding | Specify encoding when known | Reduces encoding detection overhead
Memory usage | Process items iteratively | Reduces memory footprint for large feeds
Network timeouts | Set reasonable timeouts | Prevents hanging requests
Caching | Cache parsed feed objects | Reduces parsing overhead