askill
edge-case-handling

edge-case-handlingSafety 95Repository

The html-to-markdown converter handles numerous edge cases and adversarial inputs with robust validation, error recovery, and fallback mechanisms. This skill documents binary detection, encoding issues, malformed HTML recovery, and other robustness strategies.

484 stars
9.7k downloads
Updated 1/28/2026

Package Files

Loading files...
SKILL.md

Edge Case Handling in html-to-markdown

Overview

The html-to-markdown converter handles numerous edge cases and adversarial inputs with robust validation, error recovery, and fallback mechanisms. This skill documents binary detection, encoding issues, malformed HTML recovery, and other robustness strategies.

Binary File Detection

Multi-Layer Binary Detection Strategy

The converter implements multiple detection techniques to prevent binary data from being processed as HTML.

Layer 1: Magic Prefix Detection

Located in /crates/html-to-markdown/src/lib.rs (lines 50-57, 111-118):

const BINARY_MAGIC_PREFIXES: &[(&[u8], &str)] = &[
    (b"\x1F\x8B", "gzip-compressed data"),      // gzip
    (b"\x28\xB5\x2F\xFD", "zstd-compressed data"), // zstd
    (b"PK\x03\x04", "zip archive"),             // ZIP
    (b"PK\x05\x06", "zip archive"),
    (b"PK\x07\x08", "zip archive"),
    (b"%PDF-", "PDF data"),                      // PDF
];

fn detect_binary_magic(bytes: &[u8]) -> Option<&'static str> {
    for (prefix, label) in BINARY_MAGIC_PREFIXES {
        if bytes.starts_with(prefix) {
            return Some(*label);
        }
    }
    None
}

Detection Flow:

Input bytes
    |
    +-- Check for known magic signatures (first 4-8 bytes)
    |
    +-- Match found? Return error immediately
    |
    +-- No match? Continue to next detection layer

Example:

// Gzip-compressed file
let html = b"\x1F\x8B\x08\x00...gzipped content...".to_vec();
let result = convert(&String::from_utf8_lossy(&html), None);
// Error: "binary data detected (gzip-compressed data)"

Layer 2: UTF-16 Encoding Detection

Located in /crates/html-to-markdown/src/lib.rs (lines 120-147):

BOM (Byte Order Mark) Detection:

// UTF-16LE BOM
if bytes.starts_with(b"\xFF\xFE") {
    return Some("UTF-16LE BOM");
}

// UTF-16BE BOM
if bytes.starts_with(b"\xFE\xFF") {
    return Some("UTF-16BE BOM");
}

Heuristic UTF-16 Detection (without BOM):

const BINARY_UTF16_NULL_RATIO: f64 = 0.2;  // 20% null bytes

// Count null bytes in sample
let nul_ratio = nul_count as f64 / sample_len as f64;
if nul_ratio < BINARY_UTF16_NULL_RATIO {
    return None;  // Not enough nulls for UTF-16
}

// UTF-16 has consistent null pattern (even or odd positions)
let dominant_ratio = (even_nul_count.max(odd_nul_count) as f64) / nul_count as f64;
if dominant_ratio >= 0.9 {
    return Some("UTF-16 data without BOM");
}

Examples:

UTF-16LE without BOM: <\0h\0t\0m\0l\0>\0
- Nulls at even indices (2, 4, 6, 8, 10, 12)
- Detected as UTF-16 heuristic

UTF-16BE without BOM: \0<\0h\0t\0m\0l\0>
- Nulls at odd indices (1, 3, 5, 7, 9, 11)
- Detected as UTF-16 heuristic

Layer 3: Control Character Detection

Located in /crates/html-to-markdown/src/lib.rs (lines 71-106):

const BINARY_SCAN_LIMIT: usize = 8192;      // Scan first 8KB
const BINARY_CONTROL_RATIO: f64 = 0.3;      // 30% threshold

fn validate_input(html: &str) -> Result<()> {
    let bytes = html.as_bytes();
    if bytes.is_empty() {
        return Ok(());
    }

    // Check magic prefixes first
    if let Some(label) = detect_binary_magic(bytes) {
        return Err(ConversionError::InvalidInput(format!(
            "binary data detected ({label}); decode/decompress to UTF-8 HTML first"
        )));
    }

    let sample_len = bytes.len().min(BINARY_SCAN_LIMIT);
    let mut control_count = 0usize;

    for &byte in bytes[..sample_len].iter() {
        let is_control = (byte < 0x09) || (0x0E..0x20).contains(&byte);
        if is_control {
            control_count += 1;
        }
    }

    let control_ratio = control_count as f64 / sample_len as f64;
    if control_ratio > BINARY_CONTROL_RATIO {
        return Err(ConversionError::InvalidInput(
            "binary data detected (excess control bytes)".to_string(),
        ));
    }

    Ok(())
}

Control Character Ranges:

  • 0x00-0x08: NUL, SOH, STX, ETX, EOT, ENQ, ACK, BEL, BS
  • 0x0E-0x1F: Shift Out through Unit Separator (except TAB 0x09, LF 0x0A, CR 0x0D)
  • Threshold: If > 30% are control characters in sample, reject as binary

Examples:

HTML with 35% control chars: REJECTED
"Hello\x00\x01\x02World\x03\x04\x05..."
     ^^^^^^         ^^^^^^^ = 6/8 sample = 75% > 30%

Text with \r\n newlines: ACCEPTED
"Hello\r\nWorld\r\n" = 2 CR, 2 LF within normal ranges (0x0D, 0x0A allowed)

Detection Performance

  • Magic prefix: O(1), first 8 bytes checked
  • Full sample scan: O(8192) = O(1) constant
  • Total: < 1ms for all checks
  • Position: Happens before any HTML parsing

Error Messages

All binary detection errors return ConversionError::InvalidInput:

pub enum ConversionError {
    InvalidInput(String),  // Binary, encoding, or malformed
    ParseError(String),
    ConfigError(String),
    Panic(String),
    Other(String),
}

Example Error Flow:

let html = read_file("archive.zip");  // Binary ZIP file
match convert(&html, None) {
    Err(ConversionError::InvalidInput(msg)) => {
        eprintln!("Cannot convert: {}", msg);
        // Error: "binary data detected (zip archive); decode/decompress to UTF-8 HTML first"
    }
    _ => {}
}

UTF-16 Encoding Issues

Why UTF-16 Causes Problems

  1. Two bytes per character minimum: ASCII "a" = \x61\x00 in UTF-16LE or \x00\x61 in UTF-16BE
  2. HTML requires UTF-8 or explicit charset: HTTP headers or <meta charset> specify UTF-8
  3. Common mistake: Saving HTML from Windows Notepad in UTF-16 without BOM
  4. JavaScript/web incompatibility: Browsers expect UTF-8 HTML

Conversion Strategy for UTF-16

Since detection happens before parsing:

// User receives error
Err("binary data detected (UTF-16LE BOM); decode to UTF-8 HTML first")

// User must decode first:
let utf16_html = fs::read("utf16.html").unwrap();  // Vec<u8>
let utf8_html = String::from_utf16_le(&utf16_html)
    .expect("decode UTF-16LE to UTF-8")
    .into_string();
let markdown = convert(&utf8_html, None)?;

Python Example (Common Case)

# Common mistake: file saved as UTF-16
with open('document.html', 'rb') as f:
    raw_bytes = f.read()

# Detect and convert
try:
    # Try UTF-16LE (Windows)
    html = raw_bytes.decode('utf-16-le')
except UnicodeDecodeError:
    # Try UTF-16BE (Mac)
    html = raw_bytes.decode('utf-16-be')

# Now convert
markdown = html_to_markdown.convert(html)

Malformed HTML Recovery

Parser Fallback Strategy

The converter uses two parsers with different robustness levels:

astral-tl (fast, primary):

  • Works well for well-formed HTML
  • Fails gracefully on malformed markup
  • Used by default in /crates/html-to-markdown/src/converter.rs

html5ever (robust, fallback):

  • Full HTML5 spec compliance
  • Automatic error recovery
  • Tag closure, namespace handling
  • Used when astral-tl insufficient (feature-gated)

Malformed HTML Examples

Example 1: Unclosed Tags

<!-- Input -->
<div>
  <p>Paragraph 1
  <p>Paragraph 2
</div>

<!-- astral-tl behavior: Treats as flat text flow -->
<!-- html5ever behavior: Auto-closes <p>, handles nesting -->

Output:

Paragraph 1

Paragraph 2

Example 2: Mismatched Nesting

<!-- Input: Divs inside table cells (invalid structure) -->
<table>
  <tr>
    <td>
      <div>Content in div</div>
    </td>
</table>
<!-- Missing closing </tr>, </td> -->

<!-- astral-tl: Best-effort recovery -->
<!-- html5ever: Proper reconstruction of table structure -->

Example 3: Invalid Characters in Attributes

<!-- Input: Unescaped quotes in attribute values -->
<a href="page?a=1&b=2">Link</a>

<!-- astral-tl: Attribute parsing stops at first "
<a href="page?a=1">
<!-- html5ever: Proper entity decoding (&amp; → &) -->
<a href="page?a=1&amp;b=2">

Recovery Mechanisms

1. Whitespace Normalization

Before parsing, normalize line endings:

fn normalize_line_endings(html: &str) -> Cow<'_, str> {
    if html.contains('\r') {
        Cow::Owned(html.replace("\r\n", "\n").replace('\r', "\n"))
    } else {
        Cow::Borrowed(html)
    }
}

Input: <p>Line1\r\nLine2\r<br></p> (mixed CRLF, CR, LF) Output: <p>Line1\nLine2\n<br></p> (normalized to LF)

2. Fast Text Path

For text-only inputs (no < character), skip HTML parsing entirely:

fn fast_text_only(html: &str, options: &ConversionOptions) -> Option<String> {
    if html.contains('<') {
        return None;  // Has HTML, can't use fast path
    }

    // Apply text processing directly
    let decoded = text::decode_html_entities_cow(html);
    let normalized = if options.whitespace_mode == WhitespaceMode::Normalized {
        text::normalize_whitespace_cow(&decoded)
    } else {
        Cow::Borrowed(&decoded)
    };

    Some(normalized.into_owned() + "\n")
}

Benefit: Plain text documents converted without parsing overhead

3. Entity Decoding

Handle HTML entities even with malformed markup:

// From text.rs
pub fn decode_html_entities_cow(text: &str) -> Cow<'_, str> {
    // Handles:
    // &amp; → &
    // &#123; → {
    // &#x1F; → (Unicode char)
    // &nbsp; → (non-breaking space)
    // ... 100+ HTML5 entities
}

Examples:

&lt;script&gt;  → <script>
&quot;test&quot; → "test"
&#8364;         → (Euro symbol)
&#x1F600;       → (Emoji)

4. Attribute Error Recovery

When parser encounters malformed attributes, graceful degradation:

<!-- Input -->
<img src="image.jpg" alt=Unquoted width=100 data-x='mixed quote>

<!-- Outcome: Some attributes lost, image still converts -->
![Unquoted](image.jpg)

Panic Recovery

Located in /crates/html-to-markdown/src/error.rs:

pub enum ConversionError {
    Panic(String),  // Catches panics in conversion pipeline
}

Panic Catching Strategy

Bindings wrap conversion in catch-panic:

// From Python bindings
#[pyfunction]
fn convert(html: String, options: Option<PyObject>) -> PyResult<String> {
    let result = std::panic::catch_unwind(|| {
        html_to_markdown_rs::convert(&html, None)
    });

    match result {
        Ok(Ok(md)) => Ok(md),
        Ok(Err(e)) => Err(PyErr::new::<PyException, _>(e.to_string())),
        Err(_) => Err(PyErr::new::<PyException, _>("panic in conversion")),
    }
}

Safe Error Paths

Never unwrap in conversion code:

// WRONG - Can panic
let tag_name = element.name().expect("tag name");

// CORRECT - Returns Result
match element.name() {
    Some(name) => { /* use name */ },
    None => { /* handle missing name gracefully */ },
}

Empty/Null Input Handling

Validation Rules

fn validate_input(html: &str) -> Result<()> {
    let bytes = html.as_bytes();
    if bytes.is_empty() {
        return Ok(());  // Empty string is valid
    }

    // Continue with other checks
    // ...
}

Output for Empty Input

convert("", None)     → Ok("\n")           // Single newline
convert("   ", None)  → Ok("\n")           // Whitespace stripped
convert(null, None)   → Error (from binding)  // Python None → error

Large Document Handling

Memory Constraints

The converter handles large documents efficiently:

Tested sizes:

  • Typical document: 1-10 MB HTML → < 100 MB peak memory
  • Large document: 50+ MB HTML → proportional memory growth
  • Maximum: Limited by system RAM, no inherent limit

Streaming Considerations

For extremely large documents (> 100 MB), consider:

  1. Streaming parsing (future enhancement):

    • Process HTML in chunks
    • Emit markdown incrementally
  2. Document splitting:

    • Break HTML by <div> sections
    • Convert each part separately
    • Concatenate results
  3. Chunked metadata:

    • Extract only document metadata (< 1KB)
    • Skip headers/links for size savings

Memory Optimization

// Use pre-allocated buffers
let mut output = String::with_capacity(html.len());

// LRU cache for common patterns
static PATTERN_CACHE: Lazy<Mutex<LruCache<String, String>>> =
    Lazy::new(|| Mutex::new(LruCache::new(NonZeroUsize::new(1024).unwrap())));

// Metadata size-bounded
config.max_structured_data_size = 10 * 1024 * 1024;  // 10 MB limit

Malformed Table Recovery

Invalid Table Structures

<!-- Missing tbody, thead -->
<table>
  <tr><td>Cell 1</td></tr>
</table>

<!-- Cells spanning beyond row width -->
<table>
  <tr>
    <td colspan="5">Wide cell</td>
    <td>Normal</td>
  </tr>
</table>

<!-- Nested tables (complex) -->
<table>
  <tr>
    <td>
      <table><tr><td>Inner</td></tr></table>
    </td>
  </tr>
</table>

Conversion Behavior

Single-cell table:

<table><tr><td>Data</td></tr></table>
→ | Data |
  |------|

Colspan expansion:

<table>
  <tr><td colspan="3">Wide</td></tr>
  <tr><td>A</td><td>B</td><td>C</td></tr>
</table>
→ | Wide | | |
  |------|---|---|
  | A | B | C |

Nested tables: Converts outer table, treats inner table as cell content:

<table><tr><td>Cell with [nested table] inside</td></tr></table>
→ Outer table with nested table markdown in cell

Quote and Escape Edge Cases

Special Character Escaping

Based on ConversionOptions:

pub struct ConversionOptions {
    pub escape_misc: bool,           // \ & < ` [ > ~ # = + | -
    pub escape_asterisks: bool,      // *
    pub escape_underscores: bool,    // _
    pub escape_ascii: bool,          // All ASCII punctuation
}

Problematic Patterns

Input: "Price: $10 & shipping"
escape_misc=true  → "Price: $10 \& shipping"

Input: "*not bold*"
escape_asterisks=true → "\*not bold\*"

Input: "Text_with_underscore"
escape_underscores=true → "Text\_with\_underscore"

Input: "1. First point"
No explicit option, but auto-escaped to "1\\. First point" for ordered list safety

Performance Edge Cases

Slow Cases

  1. Deep nesting (h1 inside 50 nested divs)

    • O(depth) memory for context stack
    • ~1ms per 10 levels
  2. Large attribute maps

    • Elements with 50+ attributes
    • O(n) linear lookup in visitor pattern
  3. Regex escaping

    • Many special characters in text
    • Multiple regex replacements sequential

Optimization Advice

// Cache compiled regex patterns (already done in text.rs)
static ESCAPE_MISC_RE: Lazy<Regex> =
    Lazy::new(|| Regex::new(r"([\\&<`\[\]>~#=+|\-])").unwrap());

// Use iterators instead of allocating vectors
for byte in bytes.iter() {
    // Process in-place
}

// Limit depth checks in visitor
if ctx.depth > 100 {
    return VisitResult::Skip;  // Avoid stack exhaustion
}

Error Types and Recovery

ConversionError Variants

pub enum ConversionError {
    InvalidInput(String),    // Binary, encoding, empty (recoverable)
    ParseError(String),      // HTML parsing failed (unrecoverable)
    ConfigError(String),     // Invalid config (unrecoverable)
    Panic(String),          // Unexpected panic (unrecoverable)
    Other(String),          // Misc errors (mixed)
}

Recovery Strategy by Error Type

ErrorCauseRecovery
InvalidInputUTF-16, gzip, control charsUser must fix input, then retry
ParseErrorHTML unparseableFall back to html5ever or give up
ConfigErrorBad optionsFix config, retry
PanicCode bugReport as issue, workaround TBD

Testing Edge Cases

Location: /crates/html-to-markdown/src/lib.rs (lines 722-765)

#[test]
fn test_binary_input_rejected() {
    let html = "PDF\0DATA";
    let result = convert(html, None);
    assert!(matches!(result, Err(ConversionError::InvalidInput(_))));
}

#[test]
fn test_binary_magic_rejected() {
    let html = String::from_utf8_lossy(b"\x1F\x8B\x08\x00gzip").to_string();
    let result = convert(&html, None);
    assert!(matches!(result, Err(ConversionError::InvalidInput(_))));
}

#[test]
fn test_utf16_hint_rejected() {
    let html = String::from_utf8_lossy(b"\xFF\xFE<\0h\0t\0m\0l\0>\0").to_string();
    let result = convert(&html, None);
    assert!(matches!(result, Err(ConversionError::InvalidInput(_))));
}

#[test]
fn test_plain_text_allowed() {
    let result = convert("Just text", None).unwrap();
    assert!(result.contains("Just text"));
}

Quick Reference: Edge Cases

Input TypeDetectionOutcomeUser Action
Empty stringWhitespace checkOk("\n")None needed
Plain text (no HTML)fast_text_only()Fast path OKNone
UTF-16 BOMMagic prefixErrorDecode to UTF-8 first
UTF-16 heuristicNull byte patternErrorDecode to UTF-8 first
Gzip/ZIPMagic prefixErrorDecompress first
PDFMagic prefixErrorExtract text first
35% control charsRatio > 0.3ErrorCheck encoding
Malformed HTMLastral-tl best-effortBest-effort conversionCheck for quality
Deep nestingDepth trackingSlowdown (acceptable)Consider chunking
Large documentMemory allocationProportional memoryMonitor RAM

Install

Download ZIP
Requires askill CLI v1.0+

AI Quality Score

95/100Analyzed 2/4/2026

An exceptionally detailed technical reference for handling edge cases in an HTML-to-Markdown converter, covering binary detection, malformed HTML recovery, and performance considerations.

95
100
70
100
95

Metadata

Licenseunknown
Version-
Updated1/28/2026
Publisherkreuzberg-dev

Tags

apici-cdtesting