CrackedRuby - Scanf

Overview

Scanf provides pattern-based string parsing in Ruby through format specifiers similar to those used in C's scanf function. It extracts data from strings by matching them against a format template and converting the matched substrings into Ruby data types. The library adds scanf methods to the String and IO classes and must be loaded explicitly with require 'scanf'. Since Ruby 2.7 it is no longer bundled with the standard library and has to be installed as the scanf gem.
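
Because the library may not be installed, one way to load it defensively is sketched below. The constant name SCANF_AVAILABLE is chosen for this illustration only and is not part of the library.

```ruby
# scanf stopped shipping with Ruby in 2.7; on newer versions it is a separate
# gem (`gem install scanf`). Guard the require so a missing gem is reported
# clearly instead of failing later at the first scanf call.
begin
  require 'scanf'
  SCANF_AVAILABLE = true
rescue LoadError
  SCANF_AVAILABLE = false
end

if SCANF_AVAILABLE
  p "123 45.6 hello".scanf("%d %f %s")
else
  warn "scanf is not installed; run `gem install scanf` first"
end
```

In a project with a Gemfile, listing the gem there and letting Bundler load it is probably the simpler route.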

The primary method String#scanf accepts a format string containing conversion specifiers and returns an array of parsed values. Each specifier defines the expected data type and format of input segments. The scanning process consumes characters from the input string according to the format pattern, converting matched portions into integers, floats, strings, or other specified types.

require 'scanf'

"123 45.6 hello".scanf("%d %f %s")
# => [123, 45.6, "hello"]

"John,25,Engineer".scanf("%[^,],%d,%s") 
# => ["John", 25, "Engineer"]

Scanf operates differently from regular expressions by providing structured data extraction with automatic type conversion. The format string acts as a template describing expected input structure, while the method handles parsing and conversion automatically. This approach works particularly well for processing structured text data like CSV files, log entries, or configuration files where data appears in predictable formats.
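
To make the contrast concrete, here is a minimal sketch of the same kind of extraction written with a regular expression in plain Ruby. The conversions that %d and %f perform implicitly have to be spelled out by hand. The pattern and the method name parse_with_regex are illustrative choices, not part of scanf.

```ruby
# Rough regex counterpart of the format "%d %f %s"
FIELD_PATTERN = /\A\s*([-+]?\d+)\s+([-+]?\d+(?:\.\d+)?)\s+(\S+)/

def parse_with_regex(line)
  match = line.match(FIELD_PATTERN)
  return [] unless match

  # Captures are always Strings; convert explicitly to get typed values
  [Integer(match[1], 10), Float(match[2]), match[3]]
end

p parse_with_regex("123 45.6 hello")
```

Unlike scanf, this version is all-or-nothing: it returns an empty array when any field fails to match instead of a partial result.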

Scanning stops when the input no longer matches the format or when the entire format string has been processed. String#scanf does not modify the receiver; it returns the values parsed successfully up to the point of failure and ignores the rest of the input.

Basic Usage

String scanning requires format specifiers that define expected data patterns. The most common specifiers include %d for integers, %f for floating-point numbers, and %s for strings. Each specifier consumes input characters for as long as they fit its type, stopping at the first character that does not fit or at its width limit.

require 'scanf'

# Basic numeric parsing
data = "Temperature: 23.5 degrees"
temperature = data.scanf("Temperature: %f degrees")
# => [23.5]

# Multiple values with different types
record = "ID:1001 Score:95.5 Grade:A"
parsed = record.scanf("ID:%d Score:%f Grade:%s")
# => [1001, 95.5, "A"]

String specifiers provide additional control over parsing behavior. The %s specifier reads until whitespace, while bracket notation %[...] defines custom character sets for matching. The caret operator %[^...] creates exclusion sets, reading characters until encountering any character in the specified set.

# String extraction with custom delimiters
email = "user@example.com"
parts = email.scanf("%[^@]@%s")
# => ["user", "example.com"]

# Reading until specific characters
path = "/home/user/documents/file.txt"
components = path.scanf("/%[^/]/%[^/]/%[^/]/%s")
# => ["home", "user", "documents", "file.txt"]

Width specifiers limit the number of characters consumed by each format element. This prevents over-reading and allows precise control over field boundaries in fixed-width formats.

# Fixed-width field parsing
record = "001234567John    Doe     25"
fields = record.scanf("%3d%6d%8s%8s%2d")
# => [1, 234567, "John", "Doe", 25]

# Limiting string length
input = "VeryLongStringHere"
result = input.scanf("%5s")
# => ["VeryL"]

The scanning process handles whitespace automatically for most specifiers. Numeric specifiers skip leading whitespace, while string specifiers typically treat whitespace as delimiters. This behavior simplifies parsing of space-separated data formats.

# Automatic whitespace handling
messy_data = "  123   45.6    hello   "
clean_result = messy_data.scanf("%d %f %s")
# => [123, 45.6, "hello"]

# Tab-separated values
tsv_line = "name\tage\tcity"
fields = tsv_line.scanf("%s\t%s\t%s")
# => ["name", "age", "city"]

Error Handling & Debugging

Scanf parsing failures occur silently: the method returns fewer elements than expected rather than raising an exception. Detecting incomplete parsing therefore requires explicitly checking the length of the returned array. A failure partway through yields a shortened array, and a failure on the first specifier yields an empty array.

require 'scanf'

# Detecting parsing failures
input = "123 invalid 456"
result = input.scanf("%d %d %d")
# => [123]  # Only first value parsed successfully

# Validating expected number of fields
expected_fields = 3
if result.length != expected_fields
  puts "Parsing failed: expected #{expected_fields}, got #{result.length}"
  # Approximate: assumes parsed values print exactly as they appeared in the input
  puts "Remaining input (approximate): #{input[result.join(' ').length..-1].inspect}"
end

Type conversion errors manifest as parsing termination rather than exceptions. When scanf encounters data that doesn't match the expected format specifier, it stops processing and returns the values parsed up to that point. Debugging requires examining the returned array length and estimating where in the input parsing stopped, since String#scanf does not report that position.

# Debugging type conversion failures
problematic_input = "123.45 67 text"
result = problematic_input.scanf("%d %d %d")
# => [123]  # Stops at decimal point in first number

# Creating debugging wrapper
def debug_scanf(string, format)
  result = string.scanf(format)
  # Estimate only: assumes single-space separators and values that print as they appeared
  consumed = result.map(&:to_s).join(' ').length
  remaining = string[consumed..-1]
  
  puts "Format: #{format}"
  puts "Result: #{result.inspect}"
  puts "Consumed: '#{string[0...consumed]}'"
  puts "Remaining: '#{remaining}'"
  
  result
end

debug_scanf("123.45 67", "%d %d")
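
When the exact stopping position matters, one alternative is to swap in StringScanner from Ruby's strscan library, which tracks its position explicitly. This is a different tool from scanf, shown only as a sketch; the method name scan_integers and the shape of the returned hash are assumptions made for this example.

```ruby
require 'strscan'

# Read up to `count` whitespace-separated integers and report where scanning stopped.
def scan_integers(input, count)
  scanner = StringScanner.new(input)
  values = []

  count.times do
    scanner.skip(/\s+/)                  # skip separators, like the space in "%d %d"
    token = scanner.scan(/[-+]?\d+/)     # nil when the next token is not an integer
    break unless token
    values << token.to_i
  end

  { values: values, stopped_at: scanner.pos, rest: scanner.rest }
end

p scan_integers("123 invalid 456", 3)
```

The rest entry holds the exact unparsed remainder, which the length-based estimate in debug_scanf can only approximate.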

Format string validation prevents many parsing issues. Mismatched format specifiers and input data cause silent failures that can be difficult to trace. Creating validation functions that check format string syntax and expected input patterns improves reliability.

# Format validation helper
def validate_scanf_result(input, format, expected_count)
  result = input.scanf(format)
  
  case
  when result.length < expected_count
    raise "Insufficient fields: expected #{expected_count}, got #{result.length}"
  when result.any?(&:nil?)
    raise "Conversion failed: nil values in result"
  else
    result
  end
end

# Usage with validation
begin
  data = validate_scanf_result("123 456", "%d %d", 2)
  puts "Success: #{data}"
rescue => e
  puts "Error: #{e.message}"
end

Complex format strings require systematic testing with edge cases. Whitespace handling, delimiter matching, and type conversion all present potential failure points. Creating comprehensive test cases that cover boundary conditions prevents production parsing failures.

# Edge case testing framework
test_cases = [
  ["123 456", "%d %d", [123, 456]],
  ["  123   456  ", "%d %d", [123, 456]],
  ["123", "%d %d", [123]],  # Incomplete input
  ["abc 456", "%d %d", []],  # Invalid first field
  ["123 abc", "%d %d", [123]]  # Invalid second field
]

test_cases.each do |input, format, expected|
  result = input.scanf(format)
  success = result == expected
  puts "#{success ? 'PASS' : 'FAIL'}: '#{input}' -> #{result.inspect}"
end

Common Pitfalls

Format specifier precedence creates unexpected parsing behavior when multiple specifiers could match the same input segment. The %s specifier consumes characters greedily until whitespace, potentially preventing subsequent specifiers from matching expected data. Understanding specifier behavior prevents parsing conflicts.

require 'scanf'

# Problematic greedy matching
input = "filename.txt 1024"
result = input.scanf("%s %d")
# => ["filename.txt", 1024]  # Works correctly

# But this fails unexpectedly
input = "file name.txt 1024"  # Space in filename
result = input.scanf("%s %d")
# => ["file"]  # Stops at first space, doesn't parse number

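# %[^ ] does not help: it also stops at the first space
input = "file name.txt 1024"
result = input.scanf("%[^ ] %d")
# => ["file"]

# Solution: use a delimiter that cannot appear in the filename, such as a tab
input = "file name.txt\t1024"
result = input.scanf("%[^\t]\t%d")
# => ["file name.txt", 1024]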

Whitespace handling differs between format specifiers, which causes surprises in mixed-format data. Numeric specifiers and %s skip leading whitespace, and whitespace in the format string matches any amount of input whitespace, including none. Bracket specifiers and %c treat whitespace like any other character rather than skipping it. Literal text in the format never contributes a value, so the result alone cannot show whether a trailing literal matched.

# Whitespace in the format is flexible
input = "123  abc"  # Extra spaces between fields
result1 = input.scanf("%d %s")
# => [123, "abc"]  # Works fine

# Literal text produces no value, so a match and a mismatch look the same
result2 = "123  abc".scanf("%d abc")
# => [123]
result3 = "123  xyz".scanf("%d abc")
# => [123]

# Solution: put a specifier after the literal so a mismatch shortens the result
result4 = "123  abc 7".scanf("%d abc %d")
# => [123, 7]
result5 = "123  xyz 7".scanf("%d abc %d")
# => [123]

Bracket notation character sets require careful ordering. Characters such as ], ^, and - have special meanings inside bracket expressions, so a misplaced dash or caret changes what the set matches.

# Character set pitfalls
input = "a-z test"

# %[a-z] is the range 'a' through 'z', so the literal dash is not in the set
result1 = input.scanf("%[a-z] %s")
# => ["a", "-z"]  # Set stops at the dash; %s picks up the rest of the token

# Same problem: the dash in a-c defines a range
input = "a-b-c test"
result2 = input.scanf("%[a-c] %s")
# => ["a", "-b-c"]  # Matches 'a' from the range, stops at the dash

# Correct: place the dash at the end
result3 = input.scanf("%[abc-] %s")
# => ["a-b-c", "test"]  # Dash treated literally

# Correct: Dash at beginning
result4 = input.scanf("%[-abc] %s")
# => ["a-b-c", "test"]  # Dash treated literally
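
The dash rule mirrors regular-expression character classes, so it can be checked in plain Ruby without scanf. The sketch below assumes that equivalence and only demonstrates the regex side; leading_run is a hypothetical helper name.

```ruby
# Return the longest leading run of characters drawn from the given class body.
def leading_run(text, char_class)
  text[/\A[#{char_class}]+/]
end

input = "a-b-c test"

p leading_run(input, "a-c")   # range a..c: the dash is not a member
p leading_run(input, "abc-")  # trailing dash is literal
p leading_run(input, "-abc")  # leading dash is literal
```

Trying a set this way before embedding it in a format string is a quick check on what it will actually accept.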

Return value interpretation mistakes occur when developers expect consistent array lengths. Scanf returns arrays with varying lengths depending on parsing success, not fixed-size arrays matching format specifier count. Code that assumes specific array indices without length validation fails unpredictably.

# Dangerous index assumptions
def parse_record(line)
  parts = line.scanf("%s %d %f")
  name, age, salary = parts[0], parts[1], parts[2]  # Unsafe!
  # parts[2] could be nil if parsing fails
end

# Safe destructuring approach
def parse_record_safe(line)
  parts = line.scanf("%s %d %f")
  return nil unless parts.length == 3
  
  name, age, salary = parts
  { name: name, age: age, salary: salary }
end

# Even safer with validation
def parse_record_validated(line)
  parts = line.scanf("%s %d %f")
  
  case parts.length
  when 3
    { name: parts[0], age: parts[1], salary: parts[2] }
  when 0
    raise "No fields parsed from: #{line}"
  else
    raise "Partial parsing: expected 3 fields, got #{parts.length} from: #{line}"
  end
end
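
The length check can be factored out so that it does not have to be repeated per record type. A minimal sketch, assuming a hypothetical helper named label_fields that receives the array returned by scanf:

```ruby
# Pair field names with parsed values, refusing partial results.
# `parts` stands in for whatever array scanf returned.
def label_fields(keys, parts)
  return nil unless parts.length == keys.length

  keys.zip(parts).to_h
end

complete = ["Ann", 30, 5200.0]   # shape of a full "%s %d %f" result
partial  = ["Ann"]               # shape of a result that stopped early

p label_fields(%i[name age salary], complete)
p label_fields(%i[name age salary], partial)
```

With the library loaded, a call would look like label_fields(%i[name age salary], line.scanf("%s %d %f")).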

Numeric conversion edge cases create subtle bugs when input contains valid numbers in unexpected formats. Scanf handles scientific notation, leading zeros, and sign characters differently than developers might expect, leading to parsing surprises in production data.

# Numeric conversion surprises
test_numbers = [
  "007",      # Leading zeros
  "+123",     # Explicit positive sign  
  "1.23e4",   # Scientific notation
  "0x1A",     # Hexadecimal (doesn't work with %d)
  "1,234"     # Thousands separator
]

test_numbers.each do |num|
  result = num.scanf("%d")
  puts "'#{num}' -> #{result.inspect}"
end

# Output:
# '007' -> [7]          # Leading zeros stripped
# '+123' -> [123]       # Sign handled correctly
# '1.23e4' -> [1]       # Stops at decimal point
# '0x1A' -> [0]         # Stops after '0', doesn't parse hex
# '1,234' -> [1]        # Stops at comma
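
If input may arrive in any of these forms, one option is to read the token as a string and convert it with Ruby's own Integer and Float methods. The sketch below is an assumption about how such a step might look, not something scanf provides; normalize_number is a hypothetical name.

```ruby
# Convert a numeric token that may use separators, hex, or scientific notation.
def normalize_number(text)
  cleaned = text.delete(",")               # drop thousands separators

  case cleaned
  when /\A[-+]?0[xX]\h+\z/ then Integer(cleaned, 16)
  when /\A[-+]?\d+\z/      then Integer(cleaned, 10)   # base 10 keeps "007" decimal
  else Float(cleaned)                                  # raises ArgumentError if invalid
  end
end

%w[007 +123 1.23e4 0x1A 1,234].each do |token|
  puts "'#{token}' -> #{normalize_number(token).inspect}"
end
```

Unlike %d, this raises on input it cannot convert, which may be preferable when truncated values such as 1 for "1,234" would go unnoticed.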

Reference

Format Specifiers

Specifier	Type	Description	Example Input	Result
`%d`	Integer	Decimal integer	`"123"`	`123`
`%i`	Integer	Integer; a `0x` prefix is read as hexadecimal and a leading `0` as octal	`"0x1A"`	`26`
`%o`	Integer	Octal integer	`"755"`	`493`
`%x`, `%X`	Integer	Hexadecimal integer	`"ff"`	`255`
`%f`, `%g`, `%e`	Float	Floating-point number	`"3.14"`	`3.14`
`%s`	String	String (until whitespace)	`"hello"`	`"hello"`
`%c`	String	Single character	`"A"`	`"A"`
`%[chars]`	String	Character set match (for example `%[abc]`)	`"cabx"`	`"cab"`
`%[^chars]`	String	Character set exclusion (for example `%[^,]`)	`"a b,c"`	`"a b"`

Width Specifiers

Format	Description	Example	Input	Result
`%5d`	Maximum 5 digits	`"12345678".scanf("%5d")`	`"12345678"`	`[12345]`
`%3s`	Maximum 3 characters	`"hello".scanf("%3s")`	`"hello"`	`["hel"]`
`%10[^,]`	Max 10 chars until comma	`"verylongname,data".scanf("%10[^,]")`	`"verylongname,data"`	`["verylongna"]`

Assignment Suppression

Format	Description	Example	Input	Result
`%*d`	Skip integer	`"123 456".scanf("%*d %d")`	`"123 456"`	`[456]`
`%*s`	Skip string	`"skip this".scanf("%*s %s")`	`"skip this"`	`["this"]`
`%*[^,]`	Skip until delimiter	`"skip,keep".scanf("%*[^,],%s")`	`"skip,keep"`	`["keep"]`

Character Set Patterns

Pattern	Description	Matches
`%[abc]`	Any of specified chars	'a', 'b', or 'c'
`%[a-z]`	Character range	Lowercase letters
`%[A-Za-z0-9]`	Multiple ranges	Alphanumeric characters
`%[^abc]`	Exclusion set	Any char except 'a', 'b', 'c'
`%[^\n]`	Until newline	Everything except newline
`%[^,;]`	Until delimiters	Until comma or semicolon

Return Value Patterns

Scenario	Input	Format	Result	Notes
Complete match	`"123 456"`	`"%d %d"`	`[123, 456]`	All specifiers matched
Partial match	`"123 abc"`	`"%d %d"`	`[123]`	Second specifier failed
No match	`"abc 456"`	`"%d %d"`	`[]`	First specifier failed
Type mismatch	`"12.34"`	`"%d"`	`[12]`	Conversion stopped at decimal

Common Method Patterns

# String scanning
require 'scanf'
result = "data string".scanf("format string")

# IO scanning  
File.open("data.txt") do |file|
  result = file.scanf("format string")
end

# Line-by-line scanning for multiple records (process_record is a placeholder)
"record1\nrecord2\nrecord3".each_line do |line|
  fields = line.scanf("%s %d %f")
  process_record(fields) if fields.length == 3
end
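
String#scanf also accepts a block, in which case the format is applied repeatedly and the block's return values are collected. The sketch below guards against the gem being absent; doubled_pairs is a hypothetical name, and it returns nil when scanf is not available.

```ruby
begin
  require 'scanf'   # separate gem on Ruby 2.7 and later
rescue LoadError
  warn "scanf gem not installed; block-form example skipped"
end

# Apply "%d%s" repeatedly across the string and transform each matched pair.
def doubled_pairs(text)
  return nil unless text.respond_to?(:scanf)

  text.scanf("%d%s") { |num, word| [num * 2, word.upcase] }
end

p doubled_pairs("123 abc 456 def 789 ghi")
```

This form suits input where the same record layout repeats on a single line; for line-oriented data the each_line pattern above is usually easier to validate.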

Error Conditions

Error Type	Symptom	Example	Solution
Format mismatch	Fewer results than expected	`"abc".scanf("%d")` returns `[]`	Validate result array length
Type conversion	Partial results	`"12.34".scanf("%d %d")` returns `[12]`	Use appropriate format specifiers
Missing delimiters	Adjacent fields split silently	`"123abc".scanf("%d %s")` returns `[123, "abc"]`	Include literal delimiters in format
Greedy matching	Over-consumption	`"%s %s"` on `"a b c"` returns `["a", "b"]`	Use bracket notation for control
