Graphtage: A New Semantic Diffing Tool

blog.trailofbits.com

104 points by mdelias 5 years ago · 15 comments

Hideous screenshots aside, Graphtage itself looks very useful. Can it generate Git-compatible diffs for use as a Git difftool?

Also, for "standalone" tools like this that are written in Python, I highly recommend Pipx for installing them: https://pipxproject.github.io/pipx/. It installs each tool into its own self-contained virtual environment and symlinks the executable into a "bin" directory, which prevents tools with different dependencies from conflicting.

ivansavz 5 years ago

This is very interesting and a much needed tool. I have been searching for a tool like this for a long time. There are so many tree-like structures that I'm sure there will be interesting use cases...

I was recently working on a similar tool[1] but specific to the domain of "content trees" that consist of content nodes organized into a hierarchical structure. In my case each tree node has a persistent `content_id` that is associated with the underlying content file and is independent of the node's position within the tree, which allows me to detect "move" operations[2] (a node with the same `content_id` appearing in a different place in the tree).

The use case is for educational content: Kolibri channels[3] are these huge trees that consist of thousands of nodes and it's difficult to know what has changed when we create new versions of the channels. I tried all kinds of general-purpose diffing tools and failed miserably so I started working on treediffer. It's almost done; I hope to finish it later this fall, and will look at graphtage to see how it works.

[1] https://github.com/learningequality/treediffer
[2] https://treediffer.readthedocs.io/en/latest/diff_formats.htm...
[3] https://kolibri-demo.learningequality.org/en/learn/#/topics
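
The move detection described in this comment can be illustrated with a minimal sketch. This is not treediffer's actual code: the node shape (`content_id`, `children`) and the function names `parent_index` and `detect_moves` are assumptions made for illustration, and it assumes each `content_id` appears at most once per tree. A node counts as moved when its `content_id` exists in both trees but under a different parent.

```python
def parent_index(node, parent_id=None, index=None):
    """Map every content_id in the tree to the content_id of its parent."""
    if index is None:
        index = {}
    index[node["content_id"]] = parent_id
    for child in node.get("children", []):
        parent_index(child, node["content_id"], index)
    return index


def detect_moves(old_tree, new_tree):
    """Return (content_id, old_parent, new_parent) for nodes whose parent changed."""
    old_parents = parent_index(old_tree)
    new_parents = parent_index(new_tree)
    common = old_parents.keys() & new_parents.keys()
    return sorted(
        (cid, old_parents[cid], new_parents[cid])
        for cid in common
        if old_parents[cid] != new_parents[cid]
    )


# Hypothetical trees: node "x" moves from under "a" to under "b".
old_tree = {"content_id": "root", "children": [
    {"content_id": "a", "children": [{"content_id": "x"}]},
    {"content_id": "b"},
]}
new_tree = {"content_id": "root", "children": [
    {"content_id": "a"},
    {"content_id": "b", "children": [{"content_id": "x"}]},
]}

moves = detect_moves(old_tree, new_tree)
print(moves)
```

Keying on the parent's `content_id` rather than on a path of titles means that renaming a folder does not flag all of its descendants as moved.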

lewisjoe 5 years ago

Has anybody gone through React's HTML diffing algorithm? If this one's good, we could write a JS version and use it for HTML diffing in browsers.

brunoqc 5 years ago

Graphtage could be compiled to wasm and used in a browser.

hinkley 5 years ago

I was staring at a diff today and longing for better semantic diffing.

I’d changed a shell script, with a chain of commands. I added a second call to the same command with different args and the diff was just... bad.

    something && fizz foo && another

    something && fizz bar && fizz foo && another

It decided that “bar && fizz” was my edit, and I just stared at it (it was already a tough day). Even if they had just weighted punctuation characters differently, it would have gotten the right answer, as it would with adding new functions or array entries, which it always gets wrong too.

Sort it out please.
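
A rough sketch of the idea in this comment, assuming the `&&` operator is treated as a delimiter rather than as ordinary characters. This is not how Graphtage or any particular diff tool works; `split_commands` and `command_diff` are hypothetical helper names, and the split is naive (it ignores quoting). It uses Python's standard `difflib.SequenceMatcher` on the list of commands instead of on raw characters.

```python
from difflib import SequenceMatcher


def split_commands(line):
    """Split a shell line on '&&' (naive: does not handle quoting)."""
    return [part.strip() for part in line.split("&&")]


def command_diff(old_line, new_line):
    """Diff two shell lines command by command, returning only the changes."""
    old_cmds = split_commands(old_line)
    new_cmds = split_commands(new_line)
    matcher = SequenceMatcher(a=old_cmds, b=new_cmds, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, old_cmds[i1:i2], new_cmds[j1:j2]))
    return changes


old = "something && fizz foo && another"
new = "something && fizz bar && fizz foo && another"
changes = command_diff(old, new)
print(changes)  # [('insert', [], ['fizz bar'])]
```

With whole commands as the unit of comparison, the edit is reported as a single inserted command, `fizz bar`, rather than as the fragment `bar && fizz`.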

tingletech 5 years ago

Interesting: "This tool was partially developed with funding from the Defense Advanced Research Projects Agency (DARPA) on the SafeDocs project."

I like the idea that it can do semantic diffs across different formats.

setpatchaddress 5 years ago

I would recommend deleting the screenshots, though. I looked at them and thought "so what? that's been done many times before" until I read the text more carefully.

hinkley 5 years ago

I would recommend reshooting the screenshots. Navy blue on a jet black background? Removing new lines in the initial example but not in the diffs? Fixing those would get the point across better.
Also, turn the saturation down. That’s the greenest green and the reddest red next to the darkest blue. My eyes.
- throwaway_pdp09 5 years ago
  
  I can't see a problem - there's no pic. I guess they need JS to show images.
  Back on point, I see so much of this grey-on-grey type of thing. A little common sense would suggest it's very poor practice, but it keeps happening.
  - hinkley 5 years ago
    
    Exhibit A:
    https://i1.wp.com/blog.trailofbits.com/wp-content/uploads/20...
    
    - throwaway_pdp09 5 years ago
      
      First thought was you'd given me a nethack screenshot by accident, but thanks! Interesting project.

sendbits 5 years ago

Super cool. I've worked on related problems independently (tree-based file compression and arbitrary graph-based file comparison) and am currently searching for a better way to compare web scrapes over time.

Kudos for putting the two concepts together. Will give it a go.

anotheryou 5 years ago

I want one that can also find non-perfectly matching moved lines :)

looks cool already though, got to try it some time.

idubrov 5 years ago

At my previous job I built a tool that was capable of doing that (we were merging XMLs with form definitions). The main idea was an interactive mode.
Initially, the tool would merge based on a series of heuristics, and then the user would manually adjust "matching" nodes (the user could say "actually, this A on the left and B on the right are the same, it's just that it was heavily modified").
- hinkley 5 years ago
  
  It seems like if the editor produced hints this would work better, but your target audience also shrinks.
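
The two-stage approach idubrov describes (heuristic matching first, manual corrections second) might look roughly like the sketch below. It is an illustration only, not the tool from that comment: the function `match_nodes`, the `pinned` argument, the node names, and the 0.6 similarity threshold are all assumptions. User-pinned pairs are applied first, and the remaining nodes are paired by text similarity using `difflib.SequenceMatcher.ratio()`.

```python
from difflib import SequenceMatcher


def match_nodes(left, right, pinned=None, threshold=0.6):
    """Pair node names from `left` with node names from `right`.

    left, right: dicts mapping a node name to its text.
    pinned: user-supplied {left_name: right_name} pairs that override heuristics.
    """
    matches = dict(pinned or {})
    taken = set(matches.values())
    for lname, ltext in left.items():
        if lname in matches:
            continue  # the user already decided this one
        best, best_score = None, threshold
        for rname, rtext in right.items():
            if rname in taken:
                continue
            score = SequenceMatcher(a=ltext, b=rtext).ratio()
            if score >= best_score:
                best, best_score = rname, score
        if best is not None:
            matches[best and lname] = best
            taken.add(best)
    return matches


# Hypothetical nodes: "A" was rewritten so heavily that no heuristic links it to "B".
left = {"A": "email", "C": "save-button-primary"}
right = {"B": "contact-address-v2", "C2": "save-button-primary"}

automatic = match_nodes(left, right)
corrected = match_nodes(left, right, pinned={"A": "B"})
print(automatic)
print(corrected)
```

In this sketch the heuristic pass only finds the unchanged node, and the pinned pair supplies the match that similarity alone misses, which is the role the interactive mode plays in the comment.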
