Dgsh – Directed graph shell

dmst.aueb.gr

178 points by nerdlogic 9 years ago · 52 comments

chubot 9 years ago

This looks pretty interesting, although I'll have to dig more into the examples to see why they chose this set of primitives (multipipes, multipipe blocks, and stored values).

Here is a 2009 paper, "Composing and executing parallel data-flow graphs with shell pipes", which is also a bash extension. (I'm impressed with anyone who successfully enhances bash's source code.)

It has a completely different model, though, and I think one that is more suitable for "big data".

https://scholar.google.com/scholar?cluster=98697598478714306...

http://dl.acm.org/citation.cfm?id=1645175

In this paper we extend the concept of shell pipes to incorporate forks, joins, cycles, and key-value aggregation.

I have a printout of this paper, but unfortunately it doesn't appear to be online :-(

xiaq 9 years ago

I've always thought about integrating this functionality into elvish https://github.com/elves/elvish but cannot come up with a good syntax. dgsh has a good one, but unfortunately using & breaks its traditional semantics. Does anyone have an idea for a tradition-compatible grammar?

Also, to nitpick, this is more accurately called a directed acyclic graph shell, or simply a DAG shell. The language doesn't seem to allow cycles. dagsh reads nicer than dgsh too.

mtrn 9 years ago

I've worked with and looked at a lot of data-processing helpers: tools that try to help you build data pipelines for the sake of performance, reproducibility, or simply code uniformity.

What I've found so far: most tools that invent a new language or try to cram complex processes into less suited syntactic environments are not loved much.

A few people like XSLT; most seem to dislike it, although it has a nice functional core hidden under a syntax that seems to come from a time when the answer to everything was XML. There are big-data orchestration frameworks that use XML as a configuration language, which can be OK if you have clear processing steps.

Every time a tool invents a DSL for data processing, I grab my list of ugly real-world use cases, and most of the tools fail quickly, if not immediately. That's a pity.

Programming languages can be effective as they are, and with the exceptions that unclean data brings, you want to have a programming language at your disposal anyway.

I'll give dgsh a try. The tool-reuse approach and the UNIX spirit seem nice. But my initial impression of the "C code metrics" example from the site is mixed: it reminds me of awk, about which one of the authors said that it's a beautiful language, but if your programs get longer than a hundred lines, you might want to switch to something else.

Two libraries that have a great grip on the plumbing aspect of data-processing systems are Airflow and Luigi. They are Python libraries, and with them you have a concise syntax and basically all Python libraries, plus non-Python tools with a command-line interface, at your fingertips.

I am curious: what kind of process orchestration tools do people use and recommend?

  • samuell 9 years ago

    Exactly our experience too, from complex machine learning workflows in various aspects of drug discovery.

    We basically did not find any of the popular DSL-based bioinformatics pipeline tools (Snakemake, bpipe, etc.) to fit the bill. Nextflow came close, but it in fact allows quite a lot of custom code too.

    What worked for us was to use Spotify's Luigi, which is a Python library rather than a DSL.

    The only thing was that we had to develop a flow-based-programming-inspired API on top of Luigi's more functional-programming-based one, in order to make defining dependencies fluent and easy enough for our complex workflows.

    Our flow-based-inspired Luigi API (SciLuigi) for complex workflows is available at:

    https://github.com/pharmbio/sciluigi

    We wrote up a paper on it as well, detailing a lot of the design decisions behind it:

    http://dx.doi.org/10.1186/s13321-016-0179-6

    Lately we have been working on a pure Go alternative to Luigi/SciLuigi, since we realized that with the flow-based paradigm we could just as well rely on Go channels and goroutines to create an "implicit scheduler" very simply and robustly. This is work in progress, but a lot of example workflows already work well (it has a third of the LOC of a recent bioinformatics pipeline tool written in Python and put into production). Code is available at:

    https://github.com/scipipe/scipipe

    It is also very much a programming library rather than a DSL.

    In fact, it even implements streaming via named pipes, seemingly allowing operations somewhat similar to dgsh's, probably with a bit more code, but with the (apparent) benefit of somewhat easier handling of multiple inputs and outputs (via the flow-based-programming ports concept).

    dgsh looks really interesting for simpler operations where there is one main input and output, though - which occur a lot in ad-hoc shell work, in our experience. Will have to test it out for sure!

    • baq 9 years ago

      Have you checked out airflow? Any opinions?

      • samuell 9 years ago

        I have looked a bit at code examples for Airflow, but was worried that it seems to have a similar problem to a lot of other pipeline tools: in the main workflow specification, dependencies are specified between tasks only, not between the individual inputs and outputs of each task (between tasks rather than between data).

        This means that this info needs to be implemented "manually" in some less declarative manner somewhere else, breaking the declarative-ness of the workflow specification.

        I have posted about it some time ago here, mentioning AirFlow specifically: http://bionics.it/posts/workflows-dataflow-not-task-deps

        • pptyp 9 years ago

          We wrote a package to go with our Airflow installation to borrow some of the data flow (as opposed to Airflow's exclusive task deps flow you mention) concepts we liked from Make/Drake/Luigi. You may be interested: github.com/industrydive/fileflow

          • samuell 9 years ago

            That's nice! Didn't know Airflow did in-memory passing (as I now understand it does?), so I can see that this must be needed for larger data items, right?

            Does it also help with making it easier to route individual multiple outputs to separate downstream components etc?

        • baq 9 years ago

          thanks!

  • dwhitena 9 years ago

    Thanks for sharing your experience. I work with Pachyderm, which is an open-source data pipelining and data versioning framework. Some things that might be relevant to this conversation are the fact that Pachyderm is language agnostic and that it keeps analyses in sync with data (because it triggers off of commits to data versioning). This makes it distinct from Airflow or Luigi, for example.

    • samuell 9 years ago

      Pachyderm, with its "git for big data" approach, is one of the coolest things, if not THE coolest thing, I learned about in 2016.

      I only hope to get time to test it out in more depth sooner rather than later (it is one of my top goals for 2017).

      Also, the pipeline feature in Pachyderm does not suffer from the "dependencies between tasks rather than data" problem that I mentioned in another post here, but properly identifies separate inputs and outputs declaratively.

      Pachyderm specifies workflows in a kind of DSL AFAIK, and I'm very much interested to see if it could natively fit the bill for our complex workflows. But if not, I think we can always use it in a lightweight way to fire off scipipe workflows (instead of the applications directly), and so let scipipe take care of the complex data wiring.

      We would still like to benefit from the seemingly groundbreaking "git for big data" paradigm, and auto-executed workflow on updated data, which should enable something as impactful as on-line data analyses (auto-updated upon new data) in a manageable way.

    • mtrn 9 years ago

      Thanks for the pachyderm pointer. I just installed it and will give it a try.

  • steveb 9 years ago

    We've been working on a directed graph execution engine called Converge https://github.com/asteris-llc/converge.

    In this case the task resource http://converge.aster.is/0.5.0/resources/task/ might help, as it allows you to create a directed graph using any kind of interpreter (for example, Python or Ruby) instead of having to use the DSL.

    • mtrn 9 years ago

      Nice, thanks for the pointer. It's nice to see templated shell calls, as these can be a powerful bridge between orchestration and execution.

    • nerdponx 9 years ago

      You call this a configuration management tool on GitHub. Does that make it a competitor to Ansible, etc. as well?

      • steveb 9 years ago

        Yes, we have designed it to deploy things like Kubernetes and Mesos clusters via integrations with Terraform and Packer.

  • nerdponx 9 years ago

    This post is making me think it would be a great educational exercise to construct equivalent data processing flows in some popular tools: Make, Airflow, Luigi, Snakemake, Rake, others?

    • samuell 9 years ago

      Indeed, not only for education, but also as a tool to evaluate tools for various use cases, I think. Have been thinking the same and looked hard for anything like a set of evaluation workflows, incorporating various specific "motifs" if you like (such as nested parameter sweeps).

      Unfortunately haven't found anything, so for our use cases in bioinformatics, I basically took an example workflow that was used in a course in next-gen sequencing analysis as a starting point:

      https://github.com/NBISweden/workflow-tools-evaluation/tree/...

      Only partly implemented it in Common Workflow Language [1] and SciPipe [2] so far ... the implementation turned out to take a tremendous amount of work :P

      I'd be much interested if anyone has found or created a more general set of such example workflows.

      [1] http://commonwl.org

      [2] https://github.com/scipipe/scipipe

      • nerdponx 9 years ago

        Yes, thank you! I'll see if maybe I can throw something similar together for a social science data project, like a Titanic dataset run-through.

  • rtpg 9 years ago

    I've been thinking about this space a lot too, would you mind listing out some of the messier use cases that you have?

    • mtrn 9 years ago

      > I've been thinking about this space a lot

      Me too, for better or for worse.

      As for the issues, there are many. Just quickly a few:

      * Data provider has an FTP server, most files are automatically generated, some are hand-named (with inconsistencies). How do you handle (without a lot of effort) a list of exceptions along with the regular files?

      * Data provider has a good, strict XML schema, but the relevant information for a single item is spread across three files inside a tar archive. Since there are 500k files inside the archive, you'd best not extract it, but process it on the fly.

      * Data provider chooses a layout that saves every item as a single XML file, inside 2-3 levels of directories. There are 20M of them. Unzipping the archive alone takes more than a day with default system settings and the usual tools. How do you process these things fast?

      There are more subtle issues as well:

      * U+FFFD (the Unicode replacement character) regularly occurs in natural-language strings. Can you correct these strings?

      * File has a .csv extension and looks like CSV at first glance, but all the standard RFC-compliant parsers choke on it.

      * XML file with elements that have RTF tags embedded in them. You need to parse the RTF in the elements, because there is relevant information there that you need to add to the transformed version.

      * Date issues. Inconsistent formats and almost-valid dates.

      * Combine data coming from an API with data fetched from ten different servers to produce a transformed version with a legacy command-line application (which might be slow, so you have to split your data first, parallelize the work, combine it, and make sure it's complete).

      I am thinking about a longer article or even a short book about these kinds of data-handling and quality questions and the ways there are to address them. Would you read a book like this, and what topic would be the most pressing or relevant?

      • DenisM 9 years ago

        That kind of book would be a great service to humanity. I don't know if you will sell many, but anyone inventing a new ETL tool would be served well by reading it. Perhaps a paper for a journal like ACM would be a better format. Or you could make it into a wiki. Or an "ETL Nightmares monthly" newsletter, with best user submissions.

      • voltagex_ 9 years ago

        This is what an "ETL" (Extract-Transform-Load) tool is for. Something like FME Server [1] would handle the first two points and the last point well.

        For unzipping something that crazy, I'm interested in your solution - I think I'd have to write a custom zip library and use a RAMdisk or similar.

        1: https://www.safe.com/fme/fme-server/

        • mtrn 9 years ago

          Yes, that's ETL. Classic ETL dealt with databases; the modern variant has relaxed this constraint.

          As for the zip: We simply "unzip -p" and stream process it carefully (with a custom program reading XML and transforming it). Cuts processing time from hours (extracting the zip and creating all directories, then visiting each file) to minutes (read from a single file).
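
          For illustration, here is a rough sketch of that streaming approach; "archive.zip" and "transform-xml" are hypothetical names for the archive and the custom XML-reading program:

              # stream the archive members through the transformer without extracting them
              unzip -p archive.zip '*.xml' | transform-xml > output.tsv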

    • rcthompson 9 years ago

      Here's one example where I had to use a kind of ugly hack to make it work with Snakemake, a Python Makefile-style "DAG-of-rules" workflow tool: https://github.com/DarwinAwardWinner/CD4-csaw

      Basically, I need to first fetch the metadata on all the samples, and then later group them by treatment based on that metadata. In other words, the structure of later parts of the DAG depends on the results of executing earlier parts of the DAG, so the full structure of the DAG is not known initially. The solution I used was to split the workflow in two: a "pre-workflow workflow" that fetches the sample metadata and then the main workflow which reads the metadata and builds the DAG based on it. See here: https://github.com/DarwinAwardWinner/CD4-csaw/blob/master/Sn...

      This is a common pattern that I see when putting together bioinformatics workflows: the full DAG of actions to execute cannot be known until part of the way through executing that DAG. Most workflow tools can't handle this gracefully. Another Python DAG executor, called doit, can handle this case by specifying that some rules should not be evaluated until after others have finished running. But it doesn't have some features that I wanted from Snakemake (e.g. compute cluster execution), so I ended up with the above solution instead.

  • cturner 9 years ago

    Something I have found fun in the past: using XSLT where the underlying document is not XML. In order for XSLT to work (in a Java setting, with the Apache libraries) you do not need an underlying XML document, just something that satisfies the appropriate Java interface. For example, you could wrap a filesystem directory structure.

    • wfunction 9 years ago

      Is it possible to show what XSLT is and why it's useful in like 5 minutes? I've always wanted a transformation language of some sort, but I've never managed to figure out XSLT (probably because I've never needed it) so I don't know what problems it solves or doesn't solve.

      • fatihpense 9 years ago

        For example, it is very easy to wrap some XML in other XML. Selecting XML nodes with XPath is also powerful. You don't have to write boilerplate Java, etc. However, its template logic has a learning curve, and it is only useful in work related to XML.
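
        For a quick taste from the shell, xmllint (part of libxml2) can evaluate XPath expressions directly. Here "payments.xml" is a hypothetical file containing payment elements with ccy and amt attributes:

            # extract the USD payment amounts via XPath, then sum them with standard tools
            xmllint --xpath '//payment[@ccy="USD"]/@amt' payments.xml |
                grep -Eo '[0-9]+' | paste -sd+ - | bc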

      • cturner 9 years ago

             I've always wanted a transformation language of some
             sort, but I've never managed to figure out XSLT
             (probably because I've never needed it) so I don't
             know what problems it solves or doesn't solve.
        
        You'd be familiar with SQL. SQL is a declarative language for interacting with a relational structure. You say what you want and from where. It outputs to a table. Or, with significant effort, to more complex forms.

        XPath + XSLT are declarative languages for interacting with a tree structure. You say what you want, and how you want the results laid out. It's particularly useful in integration scenarios. (I need to take data in horror format X, and then transform it into the completely different horror format Y)

        Example A: you have a directory full of XML files that represent streets, estates, houses, buildings and apartments across the nation. The depth of data in these nodes is inconsistent: houses are generally top-level; apartments are nested within buildings within estates. You need to (1) select one- or two-bedroom homes that are in a particular set of postcodes, and (2) capture some facts about each of those homes in a completely different XML format.

        XPath is useful for finding the things. XSLT is your tool for manipulating the results from the XPath query into the output document format.

        Example B: a vendor sends you accounting data. They are set up to send you one nasty format only. You have a third-party internal finance system that requires a separate specific format.

        Source example:

            <document>
                <account name="1234">
                    <payment date="20150808" ccy="USD" amt="500" />
                    <payment date="20150810" ccy="USD" amt="600" />
                    <payment date="20150810" ccy="NZD" amt="700" />
                </account>
            </document>
        
        Destination example:

            <document>
                <date="20150808" account="1234">
                    <trans>USD 500</trans>
                </date>
                <date="20150810" account="1234>
                    <trans>USD 600</trans>
                    <trans>NZD 700</trans>
                </date>
            </document>
        
        You could definitely knock something up that did this transform in python or perl. Particularly if you were confident where the newlines would be. There are situations where this makes sense: XML tooling is not as strong in those platforms as Java/C#, and you may want colleagues to be able to maintain this stuff without them having to learn entirely new technology stacks.

        However, once you're dealing with a complex problem, XSLT+XPath are what you want. If you wrote perl or python to do this, your perl or python would evolve to 80% of a slow, ill-conceived, badly implemented ripoff of the apache XPath+XSLT. And you'd run into all kinds of problems with edge-case stuff like unicode.

        If I was building an editorial pipeline for a newspaper or publisher, it'd be XML+XSLT all the way. But there's a lot of places where I would avoid XML and not need XSLT.

        XML is flawed for the domain where it gets the most action: system APIs. XML encourages complex, monolithic, document-separated interfaces. To correct for this, the community has layered yet more complex schema systems on top of it.

        System interfaces should steer towards being tight, flat, specific, discoverable and stream-oriented. System interfaces with those qualities are easier to build and maintain and learn.

        In place of XML, I prefer the approach below. At the start of your feed, assert the interface you think the other person should be receiving on. Then send messages over those vectors.

            # i lines assert the interface (emphasis: this is an assertion, /not/ an IDL)
            i ccy h
            i account h name
            i trans h date account_h ccy_h amount
                i leg trans_h amount
            #
            # now send your data stream over those vectors
            ccy USD
            ccy NZD
            account account/1234 "John Smith"
            trans trans/0 20150808 account/1234 USD 500
            trans trans/1 20150810 account/1234 USD 600
            trans trans/2 20150810 account/1234 NZD 700
                leg trans/2 200
                leg trans/2 500
        
        If the receiver disagrees with the interface, then it errors at startup and not half way through the stream.

        In this format, tree structures are possible. But you have to work for them. This nudges interfaces towards flat forms that are more greppable and awkable.
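
        As a small, hedged illustration (assuming the sample stream above is saved in a hypothetical file feed.txt): field 5 of a trans line is the currency and field 6 the amount, so the ordinary tools already go a long way:

            grep '^trans ' feed.txt                  # list every transaction
            awk '$1 == "trans" && $5 == "USD" { sum += $6 } END { print sum }' feed.txt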

        Imagine a complex business where all the interchange formats were captured in this interface script. A studious non-developer could quickly learn to really dance with it and think in terms of their data flows. They could discover things, and respond to emergencies with a text editor. You could give trusted users access to a kind of power that is rarely shared with non-developers. Users who have worked on systems like this talk of them in hushed tones that acknowledge the respect and power that was shown to them.

        With XML it's harder to make reliable inferences about the schema, and harder to debug entry errors. For this reason you generally can't trust end-users with it.

        Why have I gone through this? Because: if you're careful about designing your serialisation mechanisms, you can get further along before you need to resort to XSLT.

        There are python3 parsing and producer mechanisms for interface script at github.com/cratuki/solent in package solent.util.interface_script (or: pip3 install solent). It wouldn't be much work to write Java/C# SAX interfaces to it.

  • timthelion 9 years ago

    I downvoted your comment, because it doesn't seem to me that you read the article and are responding to its contents. You are simply responding with a pre-formed opinion. Conversations only work when you read first, then think, and finally respond. But I guess conversations cannot happen on HN, because everything has to be so FAST in Silicon Valley.

    • samuell 9 years ago

      This is not really a post. Rather a documentation website. Not sure if it makes sense to have to read through the full documentation to make any comment.

karlmdavis 9 years ago

This is perhaps a bit off-topic, but what I really wish more data processing/ETL tools supported is the concept of transactional units. Too many of them seem to start with the worldview that "we need to shove in as many of the separate bits as we possibly can."

What's often needed for robust systems, instead, is solid support for error handling such that "if this bit doesn't make it in, then neither does that bit." Data is always messy and dirty, and too many ETL systems don't seem architected to cope with that reality.

Of course, maybe I just haven't found the right tools. Anyone know of tools that handle this particularly well?

visarga 9 years ago

I write complex shell commands every day, but when they get longer than 2-3 lines I switch to a text editor and write them in Perl instead. I see no need to push bash to that complexity; it doesn't look good in the terminal.

A poor man's version of multiple pipes is to write intermediate results into files, then "cat" the files as many times as needed for the following processes. I use short file names like "o1" and "o2", standing for output-1 and output-2, and treat them as temp variables.
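
A minimal sketch of that approach, with hypothetical commands and file names:

    grep ERROR app.log | sort > o1                # output-1: sorted error lines
    cut -d' ' -f1 o1 | uniq -c | sort -rn > o2    # output-2: counts per first field
    head -5 o2                                    # reuse the stored outputs ...
    wc -l o1 o2                                   # ... as many times as needed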

  • vinceguidry 9 years ago

    This is what it comes down to for me too. Using the shell to do programming seems to me like putting your job on hard mode.

    When I had to do a lot of data processing at my last job, I started building up tools in Ruby. If I had time, I'd hack the workflow so that the next time I needed it, I could just run the tool from the command line.

    Eventually I had a pluggable architecture that I could use to pull data from any number of sources and mix it with any other data. Do that with a shell? Why?

    • DSpinellis 9 years ago

      The advantage of using the shell is the hundreds of powerful command-line tools you can use. Increasingly, there are Perl/Python/Ruby packages that offer similar functionality, but these require some ceremony to use and therefore prohibit rapid prototyping and experimentation.

db48x 9 years ago

Funny, just two or three weeks ago I was saying that I really needed a DAG of pipes in a shell script I was writing...

tingletech 9 years ago

Interesting, this seems to be from a couple of people at the Information Systems Technology Laboratory (ISTLab) at the Athens University of Economics and Business. I wonder what the motivation is: security, or does it utilize multiple processor cores better than traditional pipes?

  • ufo 9 years ago

    The impression I got is that it is still using traditional Unix tools and pipes under the hood, so I would expect the same efficiency as now. I think the big difference here is the syntax. Traditional shells are great if you have a linear dataflow where each program has one standard input and one standard output. However, if you want programs receiving multiple inputs from pipes or writing to multiple pipes, then the `|` syntax is not enough.
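
    For a flavour of what that buys you, here is a rough sketch in the style of the dgsh examples: a multipipe block fans the same input out to three counters, and a dgsh-aware cat gathers the results (check the project documentation for the exact syntax):

        #!/usr/bin/env dgsh
        cat "$1" |
        {{
            wc -l &                                   # number of lines
            wc -w &                                   # number of words
            tr -cs A-Za-z '\n' | sort -u | wc -l &    # number of unique words
        }} |
        cat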

mtdewcmu 9 years ago

This looks like a potentially great tool. It might be helpful if the author showed the code examples alongside the equivalent code in bash, so it's easy to see both what the example code is doing and how much effort is saved by doing it in dgsh.

  • nerdponx 9 years ago

    It doesn't look all that different to me. Seems like it just saves you the hassle of assigning function inputs and outputs to shell variables. Otherwise it just looks like piping stuff around between functions.

    • DSpinellis 9 years ago

      You can write many of the examples we provide in bash using tee and tee >(process) syntax when you pipe data into multipipe blocks. To collect the data from multipipe blocks you need to construct Unix domain named pipes and use them in exactly the right order. It quickly gets complicated and ugly. This is our fourth stab at the problem. The earlier ones generated bash scripts, which looked awful and were unreliable.
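
      To make the comparison concrete, here is a minimal plain-bash sketch (not taken from the dgsh examples; input.txt is a hypothetical file) of a two-way fan-out/fan-in using tee, process substitution, and named pipes. Note the ordering and cleanup bookkeeping that dgsh takes care of for you:

          dir=$(mktemp -d)
          mkfifo "$dir/chars" "$dir/words"
          # start the fan-in first, or the writers below can block on the FIFOs
          paste "$dir/chars" "$dir/words" &
          # fan out: tee into two process substitutions, each writing to its own FIFO
          tee >(wc -c > "$dir/chars") >(wc -w > "$dir/words") < input.txt > /dev/null
          wait                                      # wait for paste to print the combined line
          rm -r "$dir"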

    • mtdewcmu 9 years ago

      I think my main reason for posting was to suggest showing the equivalent bash so it was easier to see what the tool did. I just threw in the "looks potentially great" as a little sweetener. ;)

be21 9 years ago

I am not familiar with the project. What are the advantages of Dgsh in comparison to pipexec: https://github.com/flonatel/pipexec

  • DSpinellis 9 years ago

    Pipexec offers a versatile pipeline construction syntax, where you specify the topology of arbitrary graphs through the numbering of pipe descriptors. Dgsh offers a declarative directed graph construction syntax and automatically connects the parts for you. Also dgsh comes with familiar tools (tee, cat, paste, grep, sort) written to support the creation of such graphs.

CDokolas 9 years ago

Author's page: http://www.dmst.aueb.gr/dds/index.en.html

haddr 9 years ago

I wonder if there is any performance benchmark of this graph shell? Especially on some complex pipelines running huge datasets?

  • DSpinellis 9 years ago

    We have measured many of the examples against the use of temporary files and the web report one against (single-threaded) implementations in Perl and Java. In almost all cases dgsh takes less wall clock time, but often consumes more CPU resources.

nerdponx 9 years ago

Fun fact: "dgsh" is also the name of a CLI tool for DMs to manage RPG campaigns: http://dgsh.sourceforge.net/
