PDF2EPUB 📚
Convert PDF files to nicely structured Markdown and EPUB format with intelligent layout detection.
✨ Features
- 📖 Smart layout detection for books and academic papers
- 🔍 Advanced text extraction and OCR capabilities
- 📊 Table detection and formatting
- 🖼️ Image extraction and optimization
- 📝 Clean markdown output with preserved structure
- 📱 EPUB generation with customizable styling
- 🌍 Multi-language support
- 🚀 GPU acceleration support (NVIDIA & AMD)
- 🍎 Apple Silicon support
🛠️ Dependencies
- Python 3.9+
- PyTorch (with CUDA/ROCm support for GPU acceleration)
- marker-pdf==0.3.10
- transformers==4.45.2
- markdown==3.7
💻 Installation
- Install Python dependencies:
pip install -r requirements.txt
- Install PyTorch:
- For NVIDIA GPUs, install with CUDA support:
pip install torch torchvision torchaudio
- For AMD GPUs, install with ROCm support:
pip3 uninstall torch torchvision torchaudio pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
- For Apple Silicon, install with MPS support:
pip3 uninstall torch torchvision torchaudio pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
- Verify GPU support:
import torch print(torch.__version__) # PyTorch version print(torch.cuda.is_available()) # Should return True for NVIDIA print(torch.mps.is_available()) # Should return True for Apple Silicon print(torch.version.hip) # Should print ROCm version for AMD
🚀 Usage
Basic Usage
Convert a single PDF file:
Convert all PDFs in a directory:
python main.py input_directory/
Advanced Options
python main.py [input_path] [output_path] [options] Options: --batch-multiplier INT Batch size multiplier for memory/speed tradeoff (default: 2) --max-pages INT Maximum number of pages to process --start-page INT Page number to start from --langs STRING Comma-separated list of languages in document --skip-epub Skip EPUB generation, only create markdown --skip-md Skip markdown generation, use existing markdown files
Examples
Process a specific range of pages:
python main.py book.pdf --start-page 10 --max-pages 50
Process a multi-language document:
python main.py paper.pdf --langs "English,German"Convert to markdown only:
python main.py thesis.pdf --skip-epub
Output Structure
output_directory/
├── document_name/
│ ├── document_name.md
│ ├── document_name.epub
│ ├── document_name_metadata.json
│ └── images/
│ ├── image1.png
│ ├── image2.jpg
│ └── ...
🤝 Contributing
Contributions are welcome! Here's how you can help:
- Fork the repository
- Create a new branch for your feature
- Commit your changes
- Push to your branch
- Create a Pull Request
Please ensure your code follows the existing style and includes appropriate tests.
Development Setup
- Clone the repository:
git clone https://github.com/yourusername/pdf2epub.git
cd pdf2epub- Create a virtual environment:
python -m venv venv source venv/bin/activate # Linux/Mac venv\Scripts\activate # Windows
- Install development dependencies:
pip install -r requirements.txt
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🐛 Known Issues
- Some image embedding might need manual adjustment
- Some complex mathematical equations might not be perfectly converted
- Certain PDF layouts with multiple columns may require manual adjustment
- Font detection might be imperfect in some cases
🙏 Acknowledgments
This project builds upon several excellent open-source libraries:
- marker-pdf for PDF processing
- mark2epub for markdown conversion
- PyTorch for GPU acceleration
- Transformers for advanced text processing