A Python library and CLI tool for inspecting EPUB files from the terminal.
Features
- Complete EPUB Support - Parse both EPUB 2.0.1 and EPUB 3.0+ specifications with container, package, manifest, spine, and table of contents inspection
- Rich Metadata Extraction - Extract Dublin Core metadata (title, author, language, publisher) with key-value, XML, and raw output formats for easy scripting
- Content Analysis - Access document content by manifest ID or file path, with plain text extraction for content analysis and word counting
- File System Navigation - Browse and extract any file within EPUB archives (XHTML, CSS, images, fonts) with detailed file information including sizes and compression ratios
- Multiple Output Formats - XML with syntax highlighting, raw content, key-value pairs, plain text, and formatted tables to suit different workflows
- CLI and Python API - Comprehensive command-line tool for terminal workflows plus a clean Python library for programmatic access
- Standards Compliance - Built-in validation capabilities and adherence to W3C/IDPF specifications for reliable EPUB processing
- Performance Optimized - Lazy loading, efficient ZIP parsing, and optional lxml support for handling large EPUB collections
Installation
epub-utils is available as a PyPI package and can be installed with pip:

    pip install epub-utils
Use as a CLI tool
The basic format is:
epub-utils EPUB_PATH COMMAND [OPTIONS]
Commands
- container - Display the container.xml contents

      # Show container.xml with syntax highlighting
      epub-utils book.epub container

      # Show container.xml as raw content
      epub-utils book.epub container --format raw

      # Show container.xml with pretty formatting
      epub-utils book.epub container --pretty-print
- package - Display the package OPF file contents

      # Show package.opf with syntax highlighting
      epub-utils book.epub package

      # Show package.opf as raw content
      epub-utils book.epub package --format raw
- toc - Display the table of contents file contents

      # Show toc.ncx/nav.xhtml with syntax highlighting (auto-detect)
      epub-utils book.epub toc

      # Show toc.ncx/nav.xhtml as raw content
      epub-utils book.epub toc --format raw

      # Force NCX format (EPUB 2 navigation control file)
      epub-utils book.epub toc --ncx

      # Force Navigation Document (EPUB 3 navigation file)
      epub-utils book.epub toc --nav
- metadata - Display the metadata information from the package file

      # Show metadata with syntax highlighting
      epub-utils book.epub metadata

      # Show metadata as key-value pairs
      epub-utils book.epub metadata --format kv

      # Show metadata with pretty formatting
      epub-utils book.epub metadata --pretty-print
- manifest - Display the manifest information from the package file

      # Show manifest with syntax highlighting
      epub-utils book.epub manifest

      # Show manifest as raw content
      epub-utils book.epub manifest --format raw
- spine - Display the spine information from the package file

      # Show spine with syntax highlighting
      epub-utils book.epub spine

      # Show spine as raw content
      epub-utils book.epub spine --format raw
- content - Display the content of a document by its manifest item ID

      # Show content with syntax highlighting
      epub-utils book.epub content chapter1

      # Show raw HTML/XML content
      epub-utils book.epub content chapter1 --format raw

      # Show plain text content (HTML tags stripped)
      epub-utils book.epub content chapter1 --format plain
- files - List all files in the EPUB archive or display content of a specific file

      # List all files in table format (default)
      epub-utils book.epub files

      # List all files as simple paths
      epub-utils book.epub files --format raw

      # Display content of a specific file by path
      epub-utils book.epub files OEBPS/chapter1.xhtml

      # Display XHTML file content in different formats
      epub-utils book.epub files OEBPS/chapter1.xhtml --format raw
      epub-utils book.epub files OEBPS/chapter1.xhtml --format xml --pretty-print
      epub-utils book.epub files OEBPS/chapter1.xhtml --format plain

      # Display non-XHTML files (CSS, images, etc.)
      epub-utils book.epub files OEBPS/styles/main.css
      epub-utils book.epub files META-INF/container.xml
Options
- -h, --help - Show help message and exit
- -v, --version - Show program version and exit
- -fmt, --format - Output format (default: xml)
  - xml - Display with XML syntax highlighting (default)
  - raw - Display raw content without formatting
  - plain - Display plain text content (HTML tags stripped, for content command only)
  - kv - Display key-value pairs (where supported)
- -pp, --pretty-print - Pretty-print XML output (applies to xml and raw formats only)

      # Display as raw content
      epub-utils book.epub package --format raw

      # Display with XML syntax highlighting (default)
      epub-utils book.epub package --format xml

      # Display as key-value pairs (for supported commands)
      epub-utils book.epub metadata --format kv

      # Display plain text content (content command only)
      epub-utils book.epub content chapter1 --format plain

      # Pretty-print XML with proper indentation
      epub-utils book.epub package --pretty-print

      # Combine format and pretty-print options
      epub-utils book.epub metadata --format raw --pretty-print
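For scripting, the `EPUB_PATH COMMAND [OPTIONS]` shape described above can be assembled programmatically before handing it to a subprocess. The following is a minimal sketch; `build_command` and its parameter names are hypothetical helpers made up for this example, not part of epub-utils. Only the flags documented above (`--format`, `--pretty-print`) are used.

```python
import shlex
from typing import List, Optional


def build_command(
    epub_path: str,
    command: str,
    target: Optional[str] = None,
    fmt: Optional[str] = None,
    pretty_print: bool = False,
) -> List[str]:
    """Assemble an argv list in the EPUB_PATH COMMAND [OPTIONS] shape.

    Hypothetical helper for illustration; not part of epub-utils.
    """
    argv = ["epub-utils", epub_path, command]
    if target is not None:
        # A manifest item ID for `content`, or an archive path for `files`.
        argv.append(target)
    if fmt is not None:
        argv += ["--format", fmt]
    if pretty_print:
        argv.append("--pretty-print")
    return argv


argv = build_command("my book.epub", "metadata", fmt="kv")
# shlex.join quotes arguments so the line can be pasted into a shell.
print(shlex.join(argv))
```

Passing the list form to `subprocess.run(argv, capture_output=True, text=True)` avoids shell quoting problems with paths that contain spaces.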
Use as a Python library
    from epub_utils import Document

    # Load an EPUB document
    doc = Document("path/to/book.epub")
Basic Document Access
Access the main components of an EPUB document:
    # Get container information
    container = doc.container
    print(container.to_xml())  # Formatted XML with syntax highlighting
    print(container.to_str())  # Raw XML content

    # Get package information
    package = doc.package
    print(package.to_xml())  # Formatted XML with syntax highlighting
    print(package.to_str())  # Raw XML content

    # Get table of contents
    toc = doc.toc
    if toc:  # TOC might be None if not present
        print(toc.to_xml())  # Formatted XML with syntax highlighting
        print(toc.to_str())  # Raw XML content

    # Access specific navigation formats
    ncx = doc.ncx  # NCX format (EPUB 2 or EPUB 3 with NCX)
    if ncx:
        print("NCX navigation available")
        print(ncx.to_xml())

    nav = doc.nav  # Navigation Document (EPUB 3 only)
    if nav:
        print("Navigation Document available")
        print(nav.to_xml())
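To make the relationship between the container and the package concrete: in the EPUB container format, `META-INF/container.xml` names the package document through a `rootfile` element with a `full-path` attribute. The sketch below shows that lookup with the standard library only. It is an illustration of the file format, not a description of how epub-utils implements `doc.container` or `doc.package`; `find_package_path` and the sample archive are made up for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

# Sample data for the demo below (an assumption, not taken from a real book).
CONTAINER_XML = (
    '<container version="1.0" '
    'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
    "<rootfiles>"
    '<rootfile full-path="OEBPS/content.opf" '
    'media-type="application/oebps-package+xml"/>'
    "</rootfiles>"
    "</container>"
)


def find_package_path(epub_file) -> str:
    """Return the archive path of the package document named in container.xml."""
    with zipfile.ZipFile(epub_file) as archive:
        root = ET.fromstring(archive.read("META-INF/container.xml"))
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> element")
    return rootfile.attrib["full-path"]


# Demo on a tiny in-memory archive (not a complete, valid EPUB).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("META-INF/container.xml", CONTAINER_XML)

package_path = find_package_path(buffer)
print(package_path)
```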
Working with Metadata
Access and format metadata information:
    # Access package metadata
    metadata = doc.package.metadata

    # Basic Dublin Core elements
    print(f"Title: {metadata.title}")
    print(f"Creator: {metadata.creator}")
    print(f"Identifier: {metadata.identifier}")
    print(f"Language: {metadata.language}")
    print(f"Publisher: {metadata.publisher}")
    print(f"Date: {metadata.date}")

    # Dynamic attribute access for any metadata field
    isbn = getattr(metadata, 'isbn', 'Not available')
    series = getattr(metadata, 'series', 'Not available')

    # Get formatted metadata output
    print(metadata.to_xml())  # Formatted XML with syntax highlighting
    print(metadata.to_str())  # Raw XML content
    print(metadata.to_kv())   # Key-value format for easy parsing
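For readers curious what Dublin Core extraction involves at the XML level, here is a rough standard-library sketch that collects every element in the Dublin Core namespace from a package document. It is not the epub-utils implementation, and the `name: value` layout it prints is only illustrative; the exact text produced by `to_kv()` may differ. `dublin_core_pairs` and `OPF_SAMPLE` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set v1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Minimal sample package document (an assumption for the demo).
OPF_SAMPLE = (
    '<package xmlns="http://www.idpf.org/2007/opf" version="3.0">'
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Example Book</dc:title>"
    "<dc:creator>Jane Doe</dc:creator>"
    "<dc:language>en</dc:language>"
    '<meta property="dcterms:modified">2024-01-01T00:00:00Z</meta>'
    "</metadata>"
    "</package>"
)


def dublin_core_pairs(opf_xml: str) -> list:
    """Collect (name, value) pairs for every Dublin Core element in the OPF."""
    root = ET.fromstring(opf_xml)
    pairs = []
    for element in root.iter():
        # ElementTree spells qualified names as "{namespace}local".
        if element.tag.startswith("{" + DC_NS + "}"):
            name = element.tag.split("}", 1)[1]
            pairs.append((name, (element.text or "").strip()))
    return pairs


for name, value in dublin_core_pairs(OPF_SAMPLE):
    print(f"{name}: {value}")
```

The `<meta property="dcterms:modified">` element is skipped here because it lives in the OPF namespace rather than the Dublin Core one.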
Working with Manifest
Access the manifest to see all files in the EPUB:
    # Get manifest information
    manifest = doc.package.manifest

    # Access all manifest items
    for item in manifest.items:
        print(f"ID: {item['id']}")
        print(f"File: {item['href']}")
        print(f"Type: {item['media_type']}")
        print(f"Properties: {item['properties']}")

    # Find specific items
    nav_item = manifest.find_by_property('nav')
    chapter = manifest.find_by_id('chapter1')
    xhtml_items = manifest.find_by_media_type('application/xhtml+xml')

    # Get formatted manifest output
    print(manifest.to_xml())  # Formatted XML with syntax highlighting
    print(manifest.to_str())  # Raw XML content
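Because each manifest item is accessed like a dictionary with `id`, `href`, and `media_type` keys, ordinary Python tools work on the item list. As a small sketch, the hypothetical helper below groups file paths by media type; it runs on plain dictionaries shaped like the items above, and the sample data is invented for the demo.

```python
from collections import defaultdict


def group_by_media_type(items):
    """Map each media type to the hrefs of the items that declare it."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["media_type"]].append(item["href"])
    return dict(grouped)


# Stand-in for manifest.items (made-up values).
sample_items = [
    {"id": "chapter1", "href": "chapter1.xhtml", "media_type": "application/xhtml+xml"},
    {"id": "chapter2", "href": "chapter2.xhtml", "media_type": "application/xhtml+xml"},
    {"id": "css", "href": "styles/main.css", "media_type": "text/css"},
]

for media_type, hrefs in group_by_media_type(sample_items).items():
    print(f"{media_type}: {len(hrefs)} file(s)")
```

With a loaded document, the same call would presumably be `group_by_media_type(doc.package.manifest.items)`.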
Working with Spine
Access the spine to see the reading order:
    # Get spine information
    spine = doc.package.spine

    # Access spine properties
    print(f"TOC reference: {spine.toc}")
    print(f"Page progression: {spine.page_progression_direction}")

    # Access spine items in reading order
    for itemref in spine.itemrefs:
        print(f"ID: {itemref['idref']}")
        print(f"Linear: {itemref['linear']}")
        print(f"Properties: {itemref['properties']}")

    # Find specific spine item
    spine_item = spine.find_by_idref('chapter1')

    # Get formatted spine output
    print(spine.to_xml())  # Formatted XML with syntax highlighting
    print(spine.to_str())  # Raw XML content
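The spine only lists manifest IDs, so turning the reading order into file paths means joining each `idref` against the manifest. A minimal sketch of that join follows, written against plain dictionaries with the `idref`, `id`, and `href` keys shown above. `reading_order` is a hypothetical helper, and the sample data is invented.

```python
def reading_order(itemrefs, items):
    """Return manifest hrefs in spine order by resolving each idref."""
    href_by_id = {item["id"]: item["href"] for item in items}
    order = []
    for itemref in itemrefs:
        idref = itemref["idref"]
        if idref not in href_by_id:
            raise ValueError(f"spine references unknown manifest id: {idref}")
        order.append(href_by_id[idref])
    return order


# Stand-ins for spine.itemrefs and manifest.items (made-up values).
sample_itemrefs = [{"idref": "cover"}, {"idref": "chapter1"}, {"idref": "chapter2"}]
sample_items = [
    {"id": "chapter2", "href": "text/chapter2.xhtml"},
    {"id": "cover", "href": "text/cover.xhtml"},
    {"id": "chapter1", "href": "text/chapter1.xhtml"},
]

print(reading_order(sample_itemrefs, sample_items))
```

Note that the manifest order is irrelevant to the result; only the spine determines the sequence.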
Content Extraction
Extract content from specific documents within the EPUB:
    # Access content by manifest item ID
    try:
        content = doc.find_content_by_id('chapter1')

        # Get content in different formats
        print(content.to_xml())    # Formatted XHTML with syntax highlighting
        print(content.to_str())    # Raw XHTML content
        print(content.to_plain())  # Plain text with HTML tags stripped

        # Access the parsed content tree for advanced processing
        tree = content.tree
        inner_text = content.inner_text

    except ValueError as e:
        print(f"Content not found: {e}")

    # Find publication resources by ID (for non-spine items)
    try:
        resource = doc.find_pub_resource_by_id('cover-image')
    except ValueError as e:
        print(f"Resource not found: {e}")
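Plain-text output lends itself to simple content analysis such as word counting. The sketch below operates on any string, for instance the value returned by `content.to_plain()`. `word_stats` and its tokenization rule are assumptions made for this example, not something epub-utils provides.

```python
import re
from collections import Counter

# A word is a run of letters, optionally joined by apostrophes or hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def word_stats(text, top=3):
    """Return (total word count, most common lowercase words)."""
    words = [word.lower() for word in WORD_RE.findall(text)]
    return len(words), Counter(words).most_common(top)


sample_text = "The cat saw the other cat. The end."
total, common = word_stats(sample_text, top=2)
print(f"{total} words; most common: {common}")
```

Digits and punctuation are ignored by this pattern; a different definition of "word" would need a different regular expression.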
File Operations
List and access files directly by their paths in the EPUB archive:
    # Get information about all files
    files_info = doc.get_files_info()
    for file_info in files_info:
        print(f"Path: {file_info['path']}")
        print(f"Size: {file_info['size']} bytes")
        print(f"Compressed: {file_info['compressed_size']} bytes")
        print(f"Modified: {file_info['modified']}")

    # Access specific file by path
    try:
        # For XHTML files, returns XHTMLContent object
        xhtml_content = doc.get_file_by_path('OEBPS/chapter1.xhtml')
        print(xhtml_content.to_xml())
        print(xhtml_content.to_plain())

        # For other files, returns raw string content
        css_content = doc.get_file_by_path('OEBPS/styles/main.css')
        print(css_content)

    except ValueError as e:
        print(f"File not found: {e}")
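The `size` and `compressed_size` fields are enough to work out how well each file compresses. As a rough sketch, the hypothetical helper below computes the fraction of space saved per file and sorts the largest files first. Defining the ratio as "space saved" is an assumption of this example and may not match how epub-utils reports compression ratios; the sample data is invented.

```python
def compression_summary(files_info):
    """Return (path, size, compressed_size, fraction_saved) rows, largest first."""
    rows = []
    for info in files_info:
        size = info["size"]
        compressed = info["compressed_size"]
        # Guard against empty files to avoid dividing by zero.
        saved = 0.0 if size == 0 else 1 - compressed / size
        rows.append((info["path"], size, compressed, saved))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Stand-in for doc.get_files_info() (made-up values).
sample_files = [
    {"path": "mimetype", "size": 20, "compressed_size": 20},
    {"path": "OEBPS/chapter1.xhtml", "size": 1000, "compressed_size": 250},
    {"path": "OEBPS/empty.css", "size": 0, "compressed_size": 0},
]

for path, size, compressed, saved in compression_summary(sample_files):
    print(f"{path}: {size} -> {compressed} bytes ({saved:.0%} saved)")
```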
Output Formatting Options
All document components support flexible output formatting:
    # Pretty-printed XML output
    print(metadata.to_str(pretty_print=True))
    print(manifest.to_xml(pretty_print=True))

    # Syntax highlighting can be controlled
    print(package.to_xml(highlight_syntax=True))   # With highlighting (default)
    print(package.to_xml(highlight_syntax=False))  # Without highlighting
Industry Standards & Compliance
epub-utils supports the industry-standard EPUB specifications and related technologies, for broad compatibility across the digital publishing ecosystem.
Supported EPUB Standards
- EPUB 2.0.1 (IDPF, 2010)
  - Complete OPF 2.0 package document support
  - NCX navigation control file support
  - Dublin Core metadata extraction
  - Legacy EPUB compatibility
- EPUB 3.0+ (IDPF/W3C, 2011-present)
  - EPUB 3.3 specification compliance
  - HTML5-based content documents
  - Navigation document (nav.xhtml) support
  - Enhanced accessibility features
  - Media overlays and scripting support
Metadata Standards
- Dublin Core Metadata Initiative (DCMI)
  - Dublin Core Metadata Element Set v1.1
  - Dublin Core Metadata Terms (DCTERMS)
- Open Packaging Format (OPF)
  - OPF 2.0 specification (EPUB 2.0.1)
  - OPF 3.0 specification (EPUB 3.0+)
The library maintains strict adherence to published specifications while providing robust handling of real-world EPUB variations commonly found in commercial and open-source reading applications.