Hi,
I’m currently using diffArrays with a custom tokenizer to compare two HTML documents, treating tags and text separately. Here’s a simplified version of my setup:
- I tokenize the HTML into words and tags (preserving spaces).
- I use `diffArrays()` with a custom comparator that tries to ignore formatting-only changes.
- I wrap added/removed tokens in `<span>` tags to render diffs inline in the browser.
However, I’m running into a few key issues:
- Formatting-only changes (like switching from `<b>` to `<strong>`) are still flagged as additions/removals, even when they are semantically equivalent.
- Sentence-level changes are broken down into many small word-level diffs instead of being treated as grouped phrases.
- I saw in the README that extending the `Diff` class could allow deeper customization, but I couldn't find any examples or guidance on how to use it in this context.
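
For the first point, one option would be to canonicalize tag names before comparing them, so that `<b>` and `<strong>` end up as the same token. The following is only a minimal sketch: `EQUIVALENT_TAGS`, `canonicalizeTag`, and `tagsEqual` are hypothetical names, and which tags count as equivalent is an assumption that depends on the documents being compared.

```typescript
// Hypothetical equivalence table; which tags count as "the same" is an assumption.
const EQUIVALENT_TAGS: Record<string, string> = {
  b: 'strong',
  i: 'em',
}

// Rewrite a tag token such as "<b>" or "</B>" to its canonical form.
// Tokens that are not tags are returned unchanged.
const canonicalizeTag = (token: string): string => {
  const match = /^<(\/?)([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>$/.exec(token)
  if (!match) return token
  const [, slash, name, rest] = match
  const lower = name.toLowerCase()
  const canonical = EQUIVALENT_TAGS[lower] ?? lower
  return `<${slash}${canonical}${rest}>`
}

// Could stand in for the exact `left === right` tag comparison in the comparator.
const tagsEqual = (left: string, right: string): boolean =>
  canonicalizeTag(left) === canonicalizeTag(right)
```

Attributes are kept as-is, so `<b class="x">` and `<strong>` would still be reported as different; whether attributes should be ignored as well is a separate decision.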
Here’s the code I’m using:
import * as jsdiff from 'diff'

export const compareTwoDocuments = (original: string, modified: string) => {
  // Minify HTML by removing extra whitespace and normalizing newlines
  const minifyHtml = (html: string): string => {
    return html
      .replace(/&nbsp;/g, ' ') // Replace &nbsp; entities with a space
      .replace(/\s+/g, ' ') // Replace multiple spaces with single space
      .replace(/>\s+</g, '><') // Remove spaces between tags
      .replace(/\s+>/g, '>') // Remove spaces before closing tags
      .replace(/<\s+/g, '<') // Remove spaces after opening tags
      .trim()
  }
  // Custom tokenizer that separates HTML tags and content while preserving spaces
  const tokenizeHtml = (text: string): string[] => {
    const tokens: string[] = []
    let currentIndex = 0
    const tagRegex = /<[^>]+>/g // Matches opening, closing, and self-closing tags
    let match: RegExpExecArray | null
    while ((match = tagRegex.exec(text)) !== null) {
      // Add content before the tag if it exists
      if (match.index > currentIndex) {
        const content = text.slice(currentIndex, match.index)
        if (content) {
          // Split content into words and spaces
          const words = content.split(/(\s+)/)
          tokens.push(...words.filter((w) => w.length > 0))
        }
      }
      // Add the tag
      tokens.push(match[0])
      currentIndex = match.index + match[0].length
    }
    // Add any remaining content after the last tag
    if (currentIndex < text.length) {
      const remainingContent = text.slice(currentIndex)
      if (remainingContent) {
        // Split remaining content into words and spaces
        const words = remainingContent.split(/(\s+)/)
        tokens.push(...words.filter((w) => w.length > 0))
      }
    }
    return tokens
  }
  const originalTokens = tokenizeHtml(minifyHtml(original))
  const modifiedTokens = tokenizeHtml(minifyHtml(modified))
  const differences = jsdiff.diffArrays(originalTokens, modifiedTokens, {
    comparator: (left, right) => {
      // Treat single-space tokens as equal (because we have so many of them)
      if (left === ' ' && right === ' ') return true
      // Compare HTML tags exactly
      if (left.startsWith('<') && right.startsWith('<')) {
        return left === right
      }
      // For content, compare case-insensitively and normalize whitespace
      const normalize = (str: string) => str.replace(/\s+/g, ' ').trim().toLowerCase()
      return normalize(left) === normalize(right)
    },
  })
  const finalHtmlResult = differences
    .map((part) => {
      const value = part.value.join('')
      if (part.added) return `<span class="text-green-600 bg-green-100">${value}</span>`
      if (part.removed) return `<span class="text-red-600 bg-red-100 line-through">${value}</span>`
      return value
    })
    .join('')
  return finalHtmlResult
}

My questions:
- Is there a recommended way to improve grouping of similar phrases/sentences instead of word-by-word diffs?
- Is there any documentation or example for subclassing `Diff` to improve diff scoring or heuristics?
- Would it make sense to preprocess HTML into block-level elements and diff those instead? Or is there a more robust way to make jsdiff HTML-aware?
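
To illustrate the first question, a coarser tokenizer could emit sentence-sized tokens instead of words, so that a changed sentence shows up as one removal plus one addition. This is a rough sketch under simple assumptions: `splitIntoSentences` and `tokenizeHtmlBySentence` are hypothetical helpers, and the sentence boundary rule (split after `.`, `!`, or `?`) is naive and will mis-split things like decimal numbers or abbreviations.

```typescript
// Split a run of plain text into sentence-sized tokens. Trailing whitespace stays
// attached to each sentence, so joining the tokens reproduces the input exactly.
const splitIntoSentences = (content: string): string[] =>
  content.match(/[^.!?]+[.!?]*\s*|[.!?]+\s*/g) ?? []

// Tokenize HTML into tags and sentence-level text tokens.
const tokenizeHtmlBySentence = (html: string): string[] => {
  const tokens: string[] = []
  // Splitting on a capturing group keeps the matched tags in the result array.
  for (const piece of html.split(/(<[^>]+>)/)) {
    if (piece.length === 0) continue
    if (piece.startsWith('<') && piece.endsWith('>')) {
      tokens.push(piece) // tag token, kept whole
    } else {
      tokens.push(...splitIntoSentences(piece))
    }
  }
  return tokens
}
```

The resulting array has the same shape as the word-level tokens (`string[]`), so it could be passed to `diffArrays()` in the same way; the trade-off is that a one-word edit then marks the whole sentence as changed.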
 
Any guidance or references would be really appreciated.
Thanks.
smohammadhn