A pure-Go library and CLI that converts documents to Markdown. Go port of the Python markitdown library.
Features
- Pure Go, no CGO, no external runtime dependencies
- 12 format converters: PDF, DOCX, PPTX, XLSX, XLS, HTML, RSS/Atom, CSV, EPUB, Jupyter, plain text, ZIP
- Deterministic output with golden test suite
- PDF extraction via PDFium (WebAssembly, no CGO) with heading/bold/italic detection
Supported formats
| Format | Extensions | Notes |
|---|---|---|
| PDF | .pdf | Text extraction via PDFium (WebAssembly, no CGO) |
| Word | .docx | Headings, tables, lists, hyperlinks, comments, math (OMML to LaTeX) |
| PowerPoint | .pptx | Slides, tables, notes, image alt text |
| Excel | .xlsx | Multi-sheet markdown tables |
| Excel (legacy) | .xls | Multi-sheet markdown tables |
| HTML | .html, .htm | Full HTML-to-Markdown conversion |
| RSS/Atom | .xml, .rss, .atom | Feed items with titles, dates, content |
| CSV | .csv | Markdown table with auto charset detection |
| EPUB | .epub | Metadata, table of contents, chapter content |
| Jupyter | .ipynb | Markdown + fenced code cells with output |
| Plain text | .txt, .md, .json, .jsonl | Charset detection and UTF-8 conversion |
| ZIP | .zip | Recursively converts supported files inside |
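To make the CSV row of the table concrete, here is a minimal, standard-library-only sketch of turning CSV text into a Markdown table. It is not the library's converter: `csvToMarkdown` is a hypothetical helper, it assumes the input is already UTF-8 (no charset detection), and it treats the first record as the header row.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// cellEscaper keeps cell content from breaking the table layout:
// pipes are escaped and embedded newlines become spaces.
var cellEscaper = strings.NewReplacer("|", `\|`, "\n", " ")

// csvToMarkdown is a hypothetical helper (not part of the markitdown API).
// It renders CSV text as a Markdown table, using the first record as the header.
func csvToMarkdown(input string) (string, error) {
	records, err := csv.NewReader(strings.NewReader(input)).ReadAll()
	if err != nil {
		return "", err
	}
	if len(records) == 0 {
		return "", nil
	}

	var b strings.Builder
	writeRow := func(cells []string) {
		b.WriteString("| ")
		for i, c := range cells {
			if i > 0 {
				b.WriteString(" | ")
			}
			b.WriteString(cellEscaper.Replace(c))
		}
		b.WriteString(" |\n")
	}

	// Header row, then the separator row, then the data rows.
	writeRow(records[0])
	sep := make([]string, len(records[0]))
	for i := range sep {
		sep[i] = "---"
	}
	writeRow(sep)
	for _, r := range records[1:] {
		writeRow(r)
	}
	return b.String(), nil
}

func main() {
	md, err := csvToMarkdown("name,qty\nwidget,3\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(md)
}
```

The real converter additionally detects the input charset, which this sketch leaves out.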
Install
go get github.com/conductor-oss/markitdown
Library quick start
package main

import (
	"fmt"
	"log"

	markitdown "github.com/conductor-oss/markitdown"
)

func main() {
	m := markitdown.New()

	// Convert a local file
	result, err := m.ConvertFile("report.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(result.Markdown)
}
More examples
// Convert a URL
result, err := m.ConvertURL("https://example.com/page.html")

// Convert with auto-detection (file path or URL)
result, err = m.Convert("report.pdf")
result, err = m.Convert("https://example.com/page.html")

// Convert from a reader with metadata hints
f, err := os.Open("data.csv")
if err != nil {
	log.Fatal(err)
}
defer f.Close()
result, err = m.ConvertReader(f, markitdown.StreamInfo{
	Extension: ".csv",
	MIMEType:  "text/csv",
	Charset:   "shift_jis",
})

// Options (passed when constructing the converter)
m := markitdown.New(
	markitdown.WithKeepDataURIs(true), // preserve base64 data URIs in output
)
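The `WithKeepDataURIs` option implies that, by default, base64 data URIs are not kept in full. The sketch below shows one way such shortening could work on Markdown text, using only the standard library. `truncateDataURIs` is a hypothetical helper, and the shortened form (everything after `;base64` replaced by `...`) is an assumption for illustration, not a statement of what this library emits.

```go
package main

import (
	"fmt"
	"regexp"
)

// dataURIPattern matches a base64 data URI: the "data:<mime>;base64" prefix
// is captured, and the comma plus payload after it is what gets dropped.
var dataURIPattern = regexp.MustCompile(`(data:[a-zA-Z0-9.+/-]+;base64),[A-Za-z0-9+/=]+`)

// truncateDataURIs is a hypothetical helper (not part of the markitdown API).
// It replaces each base64 payload with "..." and keeps the MIME prefix.
func truncateDataURIs(md string) string {
	return dataURIPattern.ReplaceAllString(md, "${1}...")
}

func main() {
	in := "![logo](data:image/png;base64,iVBORw0KGgo=)"
	fmt.Println(truncateDataURIs(in))
}
```

Keeping the MIME prefix preserves the information that an embedded image was present while keeping the Markdown output small.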
CLI quick start
Build:
go build -o markitdown ./cmd/markitdown
Convert a file to stdout:
./markitdown report.pdf
Convert and write to a file:
./markitdown -o output.md report.docx
Convert from stdin with format hint:
cat data.csv | ./markitdown -x csv
Convert a URL:
./markitdown https://example.com/page.html
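Both the CLI and the library's `Convert` accept either a file path or a URL as the source. How the library tells them apart is not described here; the following is a minimal sketch of one plausible check using `net/url`. `isURL` is a hypothetical helper, and limiting it to `http` and `https` is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// isURL is a hypothetical helper (not part of the markitdown API).
// It reports whether source looks like an http(s) URL rather than a file path.
func isURL(source string) bool {
	u, err := url.Parse(source)
	if err != nil {
		return false
	}
	scheme := strings.ToLower(u.Scheme)
	// Require a host so that strings such as "http:notes.txt" count as paths.
	return (scheme == "http" || scheme == "https") && u.Host != ""
}

func main() {
	for _, src := range []string{"report.pdf", "./docs/page.html", "https://example.com/page.html"} {
		fmt.Printf("%s -> URL: %v\n", src, isURL(src))
	}
}
```

Requiring both a known scheme and a host avoids misreading Windows-style paths such as `C:\docs\report.pdf`, where `C` would otherwise parse as a scheme.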
CLI flags
Usage: markitdown [flags] [source]
Arguments:
source File path or URL to convert (reads stdin if omitted)
Flags:
-o, --output string Output file (default: stdout)
-x, --extension string File extension hint for stdin input (e.g. "pdf", ".csv")
-m, --mime-type string MIME type hint
-c, --charset string Charset hint (e.g. "shift_jis", "utf-8")
-v, --version Show version
--keep-data-uris Keep full base64-encoded data URIs in output
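The `-x` help text shows the extension hint both with and without a leading dot (`"pdf"`, `".csv"`). A minimal sketch of normalizing such a hint into the dotted form used by `StreamInfo.Extension` in the library example might look like this. `normalizeExt` is a hypothetical helper, and lowercasing the hint is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeExt is a hypothetical helper (not part of the markitdown API).
// It trims whitespace, lowercases the hint, and ensures a single leading dot.
// An empty hint stays empty, meaning "no hint given".
func normalizeExt(hint string) string {
	hint = strings.ToLower(strings.TrimSpace(hint))
	if hint == "" {
		return ""
	}
	if !strings.HasPrefix(hint, ".") {
		hint = "." + hint
	}
	return hint
}

func main() {
	for _, h := range []string{"pdf", ".csv", " XLSX "} {
		fmt.Printf("%q -> %q\n", h, normalizeExt(h))
	}
}
```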
Notes
- PDF extraction is text-based; image-only PDFs produce no output without OCR.
- DOCX math equations (OMML) are converted to LaTeX notation.
- CJK charset detection works without hints but is most reliable when Charset is provided in StreamInfo.
Acknowledgements
This project is a Go port of Microsoft's markitdown Python library. The original project provides the reference implementation, test fixtures, and design that this port is based on.