Overview
Scanf provides pattern-based string parsing in Ruby through format specifiers similar to those used in C's scanf function. It extracts data from strings by matching them against a format template and converting the matched substrings into appropriate Ruby data types. Ruby implements Scanf as a module that extends the String and IO classes, and it must be loaded explicitly with require 'scanf'. Since Ruby 2.7 the library is no longer bundled with the standard library and has to be installed as the scanf gem first.
The primary method, String#scanf, accepts a format string containing conversion specifiers and returns an array of parsed values. Each specifier defines the expected data type and format of an input segment. The scanning process consumes characters from the input string according to the format pattern, converting matched portions into integers, floats, strings, or other specified types.
require 'scanf'
"123 45.6 hello".scanf("%d %f %s")
# => [123, 45.6, "hello"]
"John,25,Engineer".scanf("%[^,],%d,%s")
# => ["John", 25, "Engineer"]
Scanf operates differently from regular expressions by providing structured data extraction with automatic type conversion. The format string acts as a template describing expected input structure, while the method handles parsing and conversion automatically. This approach works particularly well for processing structured text data like CSV files, log entries, or configuration files where data appears in predictable formats.
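To make the contrast with regular expressions concrete, the sketch below parses the same kind of record using only a Regexp from Ruby's core library. The pattern, the record layout, and the helper name parse_with_regex are illustrative assumptions, not part of the scanf library. The point is that every capture comes back as a String and has to be converted by hand, which is the step the %d and %f specifiers perform automatically.

```ruby
# Hypothetical helper: parse "<int> <float> <word>" without scanf.
# Returns [Integer, Float, String], or [] when the line does not match.
def parse_with_regex(line)
  pattern = /\A\s*([-+]?\d+)\s+([-+]?\d+(?:\.\d+)?)\s+(\S+)/
  match = pattern.match(line)
  return [] unless match

  # Captures are always strings, so each field is converted explicitly.
  [Integer(match[1], 10), Float(match[2]), match[3]]
end

p parse_with_regex("123 45.6 hello")   # => [123, 45.6, "hello"]
p parse_with_regex("abc 45.6 hello")   # => []
```

Unlike scanf, this sketch is all-or-nothing: it returns an empty array on any mismatch rather than the fields parsed so far.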
The scanning process stops when the input no longer matches the format specification or when the entire format string has been processed. String#scanf does not modify the string it is called on; it returns all values parsed successfully up to the point of failure and ignores the rest of the input.
Basic Usage
String scanning requires format specifiers that define expected data patterns. The most common specifiers include %d for integers, %f for floating-point numbers, and %s for strings. Each specifier consumes characters from the input until it encounters a delimiter or reaches a format boundary.
require 'scanf'
# Basic numeric parsing
data = "Temperature: 23.5 degrees"
temperature = data.scanf("Temperature: %f degrees")
# => [23.5]
# Multiple values with different types
record = "ID:1001 Score:95.5 Grade:A"
parsed = record.scanf("ID:%d Score:%f Grade:%s")
# => [1001, 95.5, "A"]
String specifiers provide additional control over parsing behavior. The %s specifier reads until whitespace, while bracket notation %[...] defines custom character sets for matching. The caret operator %[^...] creates exclusion sets, reading characters until encountering any character in the specified set.
# String extraction with custom delimiters
email = "user@example.com"
parts = email.scanf("%[^@]@%s")
# => ["user", "example.com"]
# Reading until specific characters
path = "/home/user/documents/file.txt"
components = path.scanf("/%[^/]/%[^/]/%[^/]/%s")
# => ["home", "user", "documents", "file.txt"]
Width specifiers limit the number of characters consumed by each format element. This prevents over-reading and allows precise control over field boundaries in fixed-width formats.
# Fixed-width field parsing
record = "001234567John Doe 25"
fields = record.scanf("%3d%6d%8s%8s%2d")
# => [1, 234567, "John", "Doe", 25]
# Limiting string length
input = "VeryLongStringHere"
result = input.scanf("%5s")
# => ["VeryL"]
The scanning process handles whitespace automatically for most specifiers. Numeric specifiers skip leading whitespace, while string specifiers typically treat whitespace as delimiters. This behavior simplifies parsing of space-separated data formats.
# Automatic whitespace handling
messy_data = " 123 45.6 hello "
clean_result = messy_data.scanf("%d %f %s")
# => [123, 45.6, "hello"]
# Tab-separated values
tsv_line = "name\tage\tcity"
fields = tsv_line.scanf("%s\t%s\t%s")
# => ["name", "age", "city"]
Error Handling & Debugging
Scanf parsing failures occur silently, returning fewer elements than expected rather than raising exceptions. This behavior requires explicit validation of the returned array's length to detect incomplete parsing. A failed conversion ends the scan: the result is a shorter array, or an empty array when the very first specifier fails.
require 'scanf'
# Detecting parsing failures
input = "123 invalid 456"
result = input.scanf("%d %d %d")
# => [123] # Only first value parsed successfully
# Validating expected number of fields
expected_fields = 3
if result.length != expected_fields
  puts "Parsing failed: expected #{expected_fields}, got #{result.length}"
  # Rough estimate of the unparsed tail: skip the text of the first parsed value
  puts "Remaining input: #{input[result.first.to_s.length..-1]}"
end
Type conversion errors manifest as parsing termination rather than exceptions. When scanf encounters data that doesn't match the expected format specifier, it stops processing and returns values parsed up to that point. Debugging requires examining both the returned array length and the input string position where parsing stopped.
# Debugging type conversion failures
problematic_input = "123.45 67 text"
result = problematic_input.scanf("%d %d %d")
# => [123] # Stops at decimal point in first number
# Creating debugging wrapper
def debug_scanf(string, format)
  result = string.scanf(format)
  # Approximation: assumes the parsed values were separated by single spaces
  consumed = result.map(&:to_s).join(' ').length
  remaining = string[consumed..-1]
  puts "Format: #{format}"
  puts "Result: #{result.inspect}"
  puts "Consumed: '#{string[0...consumed]}'"
  puts "Remaining: '#{remaining}'"
  result
end
debug_scanf("123.45 67", "%d %d")
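Because scanf does not report where it stopped, the wrapper above can only estimate the consumed text. When the exact stopping position matters, one option is to step through the input with StringScanner from Ruby's standard strscan library instead. The sketch below is an alternative technique, not part of scanf; the helper name scan_integers and its return shape are assumptions made for illustration. It reads whitespace-separated integers, much like a run of %d specifiers, and returns the unparsed remainder alongside the values.

```ruby
require 'strscan'

# Hypothetical helper: read up to `count` whitespace-separated integers.
# Returns [values, rest], where rest is the input left unparsed.
def scan_integers(input, count)
  scanner = StringScanner.new(input)
  values = []

  count.times do
    start = scanner.pos
    scanner.skip(/\s+/)                 # skip leading whitespace, as %d does
    token = scanner.scan(/[-+]?\d+/)    # nil when the next text is not an integer
    if token.nil?
      scanner.pos = start               # keep the skipped whitespace in rest
      break
    end
    values << token.to_i
  end

  [values, scanner.rest]
end

values, rest = scan_integers("123 invalid 456", 3)
p values  # => [123]
p rest    # => " invalid 456"
```

The remainder shows exactly which text blocked the next conversion, which the length-based estimate cannot guarantee when the input contains leading zeros, signs, or irregular spacing.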
Format string validation prevents many parsing issues. Mismatched format specifiers and input data cause silent failures that can be difficult to trace. Creating validation functions that check format string syntax and expected input patterns improves reliability.
# Format validation helper
def validate_scanf_result(input, format, expected_count)
  result = input.scanf(format)
  case
  when result.length < expected_count
    raise "Insufficient fields: expected #{expected_count}, got #{result.length}"
  when result.any?(&:nil?)
    raise "Conversion failed: nil values in result"
  else
    result
  end
end
# Usage with validation
begin
  data = validate_scanf_result("123 456", "%d %d", 2)
  puts "Success: #{data}"
rescue => e
  puts "Error: #{e.message}"
end
Complex format strings require systematic testing with edge cases. Whitespace handling, delimiter matching, and type conversion all present potential failure points. Creating comprehensive test cases that cover boundary conditions prevents production parsing failures.
# Edge case testing framework
test_cases = [
  ["123 456", "%d %d", [123, 456]],
  [" 123 456 ", "%d %d", [123, 456]],
  ["123", "%d %d", [123]],      # Incomplete input
  ["abc 456", "%d %d", []],     # Invalid first field
  ["123 abc", "%d %d", [123]]   # Invalid second field
]
test_cases.each do |input, format, expected|
  result = input.scanf(format)
  success = result == expected
  puts "#{success ? 'PASS' : 'FAIL'}: '#{input}' -> #{result.inspect}"
end
Common Pitfalls
Format specifier precedence creates unexpected parsing behavior when multiple specifiers could match the same input segment. The %s specifier consumes characters greedily until whitespace, potentially preventing subsequent specifiers from matching expected data. Understanding specifier behavior prevents parsing conflicts.
require 'scanf'
# Problematic greedy matching
input = "filename.txt 1024"
result = input.scanf("%s %d")
# => ["filename.txt", 1024] # Works correctly
# But this fails unexpectedly
input = "file name.txt 1024" # Space in filename
result = input.scanf("%s %d")
# => ["file"] # Stops at first space, doesn't parse number
# Note: %[^ ] also stops at the first space, so it does not help here
# Solution: Use a delimiter that cannot appear in the filename
input = "file name.txt,1024"
result = input.scanf("%[^,],%d")
# => ["file name.txt", 1024]
Whitespace handling differs between format specifiers, which causes parsing failures in mixed-format data. Numeric specifiers and %s skip leading whitespace automatically, while bracket specifiers and %c treat whitespace literally. This mismatch creates unexpected behavior when parsing data with irregular spacing.
# Whitespace handling mismatch
input = "123 abc" # Extra spaces between fields
result1 = input.scanf("%d %s")
# => [123, "abc"] # Works fine
# Literal text in the format yields no value, so the result alone cannot confirm it matched
input = "123 abc"
result2 = input.scanf("%d abc")
# => [123]
# To consume separators explicitly, use a suppressed character set
input = "123 abc"
result3 = input.scanf("%d%*[ \t]abc") # %*[ \t] consumes spaces/tabs without storing them
# => [123]
Bracket notation character set definitions require careful escaping and ordering. Special characters like ], ^, and - have special meanings within bracket expressions. Incorrect character set definitions lead to unexpected matching behavior or parsing failures.
# Character set pitfalls
input = "a-z test"
# Wrong: [a-z] is the range 'a' through 'z', not the characters 'a', '-', and 'z'
result1 = input.scanf("%[a-z] %s")
# => ["a", "-z"] # Set stops at the dash; %s then reads "-z"
# Wrong: Unescaped dash creates range
input = "a-b-c test"
result2 = input.scanf("%[a-c] %s")
# => ["a", "-b-c"] # Matches 'a' from range, stops at dash; %s reads the rest of the word
# Correct: Escape dash or place at end
result3 = input.scanf("%[abc-] %s")
# => ["a-b-c", "test"] # Dash treated literally
# Correct: Dash at beginning
result4 = input.scanf("%[-abc] %s")
# => ["a-b-c", "test"] # Dash treated literally
Return value interpretation mistakes occur when developers expect consistent array lengths. Scanf returns arrays with varying lengths depending on parsing success, not fixed-size arrays matching format specifier count. Code that assumes specific array indices without length validation fails unpredictably.
# Dangerous index assumptions
def parse_record(line)
  parts = line.scanf("%s %d %f")
  name, age, salary = parts[0], parts[1], parts[2] # Unsafe!
  # parts[2] could be nil if parsing fails
end
# Safe destructuring approach
def parse_record_safe(line)
  parts = line.scanf("%s %d %f")
  return nil unless parts.length == 3
  name, age, salary = parts
  { name: name, age: age, salary: salary }
end
# Even safer with validation
def parse_record_validated(line)
  parts = line.scanf("%s %d %f")
  case parts.length
  when 3
    { name: parts[0], age: parts[1], salary: parts[2] }
  when 0
    raise "No fields parsed from: #{line}"
  else
    raise "Partial parsing: expected 3 fields, got #{parts.length} from: #{line}"
  end
end
Numeric conversion edge cases create subtle bugs when input contains valid numbers in unexpected formats. Scanf handles scientific notation, leading zeros, and sign characters differently than developers might expect, leading to parsing surprises in production data.
# Numeric conversion surprises
test_numbers = [
  "007",     # Leading zeros
  "+123",    # Explicit positive sign
  "1.23e4",  # Scientific notation
  "0x1A",    # Hexadecimal (doesn't work with %d)
  "1,234"    # Thousands separator
]
test_numbers.each do |num|
  result = num.scanf("%d")
  puts "'#{num}' -> #{result.inspect}"
end
# Output:
# '007' -> [7] # Leading zeros stripped
# '+123' -> [123] # Sign handled correctly
# '1.23e4' -> [1] # Stops at decimal point
# '0x1A' -> [0] # Stops after '0', doesn't parse hex
# '1,234' -> [1] # Stops at comma
Reference
Format Specifiers
Specifier | Type | Description | Example Input | Result |
---|---|---|---|---|
%d | Integer | Decimal integer | "123" | 123 |
%i | Integer | Integer; as in C, also accepts a 0x prefix (hexadecimal) or a leading 0 (octal) | "0x1A" | 26 |
%o | Integer | Octal integer | "755" | 493 |
%x, %X | Integer | Hexadecimal integer | "ff" | 255 |
%f, %g, %e | Float | Floating-point number | "3.14" | 3.14 |
%s | String | String (until whitespace) | "hello" | "hello" |
%c | String | Single character | "A" | "A" |
%[chars] | String | Character set match | "%[abc]" | Matches 'a', 'b', or 'c' |
%[^chars] | String | Character set exclusion | "%[^,]" | Matches until comma |
Width Specifiers
Format | Description | Example | Input | Result |
---|---|---|---|---|
%5d | Maximum 5 digits | "12345678".scanf("%5d") | "12345678" | [12345] |
%3s | Maximum 3 characters | "hello".scanf("%3s") | "hello" | ["hel"] |
%10[^,] | Max 10 chars until comma | "verylongname,data".scanf("%10[^,]") | "verylongname,data" | ["verylongna"] |
Assignment Suppression
Format | Description | Example | Input | Result |
---|---|---|---|---|
%*d | Skip integer | "123 456".scanf("%*d %d") | "123 456" | [456] |
%*s | Skip string | "skip this".scanf("%*s %s") | "skip this" | ["this"] |
%*[^,] | Skip until delimiter | "skip,keep".scanf("%*[^,],%s") | "skip,keep" | ["keep"] |
Character Set Patterns
Pattern | Description | Matches |
---|---|---|
%[abc] | Any of specified chars | 'a', 'b', or 'c' |
%[a-z] | Character range | Lowercase letters |
%[A-Za-z0-9] | Multiple ranges | Alphanumeric characters |
%[^abc] | Exclusion set | Any char except 'a', 'b', 'c' |
%[^\n] | Until newline | Everything except newline |
%[^,;] | Until delimiters | Until comma or semicolon |
Return Value Patterns
Scenario | Input | Format | Result | Notes |
---|---|---|---|---|
Complete match | "123 456" | "%d %d" | [123, 456] | All specifiers matched |
Partial match | "123 abc" | "%d %d" | [123] | Second specifier failed |
No match | "abc 456" | "%d %d" | [] | First specifier failed |
Type mismatch | "12.34" | "%d" | [12] | Conversion stopped at decimal |
Common Method Patterns
# String scanning
require 'scanf'
result = "data string".scanf("format string")
# IO scanning
File.open("data.txt") do |file|
  result = file.scanf("format string")
end
# Scanning multiple records line by line
"record1\nrecord2\nrecord3".each_line do |line|
  fields = line.scanf("%s %d %f")
  # process_record is a placeholder for application code
  process_record(fields) if fields.length == 3
end
Error Conditions
Error Type | Symptom | Example | Solution |
---|---|---|---|
Format mismatch | Fewer results than expected | "abc".scanf("%d") returns [] | Validate result array length |
Type conversion | Partial results | "12.34".scanf("%d %d") returns [12] | Use appropriate format specifiers |
Missing delimiters | Unexpected parsing | "123abc".scanf("%d %s") returns [123, "abc"], splitting where the digits end | Include literal delimiters in format |
Greedy matching | Over-consumption | "%s %s" on "a b c" returns ["a", "b"] | Use bracket notation for control |