Knowledge Management for Data Science (KMDS)
Capture, organize, and reuse knowledge from your data science experiments.
🌟 What is KMDS?
KMDS is a Python-based tool for systematic knowledge management in data science and analytics projects. It helps you document the incremental process of exploration, data preparation, and model development — capturing context, decisions, and rationale so that valuable insights are not lost over time.
The Problem It Solves
Experimental work generates a stream of decisions and findings. The context and rationale behind each step are often documented ad-hoc, if at all. When it is time to revisit a question or build on earlier work, the research trail has gone cold. KMDS fixes this by providing a structured, ontology-backed way to log, search, and share your findings.
Who Can Use KMDS?
KMDS was originally designed for data scientists writing Python. Recent additions to the CLI and natural-language tooling mean it is now practical for a broader set of users:
| User | How they interact with KMDS |
|---|---|
| Data scientist | Python API, notebooks, CLI — full access to all features |
| Software developer | CLI tools and Python API for automating knowledge capture in pipelines |
| Business analyst | CLI commands and natural-language ingestion — no ontology code required |
🎥 Watch a quick overview of KMDS: YouTube Video
✨ Key Features
- Structured Observation Capture: Log findings from exploration, data representation, modeling choice, and model selection stages using Python or the CLI.
- Natural Language Ingestion: Describe a finding in plain English — KMDS classifies it, extracts structured entities, and optionally logs it to the knowledge base. No ontology code required.
- Ontology-Backed Knowledge Base: Store and reload workflow knowledge as RDF/OWL artifacts that can be shared across projects and teams.
- Semantic Search: Build a vector index from your knowledge base and retrieve relevant findings with natural-language queries.
- LLM Search Orchestrator: Route natural-language questions to structured KMDS search templates with automatic semantic fallback.
- CLI-First Usability: Every major feature is accessible as a command-line tool — usable by developers and analysts without writing notebook code.
- Simple Reporting Surface: Load observations into tabular form for review, sharing, and downstream analysis.
🚀 Getting Started
1. Installation
Install KMDS in your Python environment:
2. Usage
As you work through your analysis, log your findings to KMDS. The examples below show how.
3. Quick Summary Logging (CLI)
KMDS now supports logging exploratory observations directly from a free-text project summary. This is useful for business analysts and other non-developers who want to capture findings quickly.
Application workflow example (explicit, non-interactive):
kmds-summary-log \
  --summary "This is a daily reporting workflow for support operations. Missing category labels were found in intake data." \
  --workflow-name "support_reporting_intake" \
  --workflow-type application \
  --project-file ./support_reporting_intake.xml \
  --create-project \
  --no-prompt
Ambiguous summary example (interactive prompt):
kmds-summary-log \
  --summary "Project kickoff notes for the upcoming quarter." \
  --workflow-name "quarterly_kickoff_notes" \
  --project-file ./quarterly_kickoff_notes.xml \
  --create-project
In the ambiguous case, KMDS will ask whether the workflow is application or experimental, then continue logging exploratory observations.
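When this command is called from a Python pipeline rather than a shell, a thin wrapper can assemble the argument list. The sketch below is one possible way to do that, using only the flags shown above. The helper names `build_summary_log_command` and `run_if_available` are hypothetical and not part of KMDS, and nothing is executed unless `run_if_available` is called and `kmds-summary-log` is found on the PATH.

```python
import shlex
import shutil
import subprocess


def build_summary_log_command(summary, workflow_name, project_file,
                              workflow_type=None, create_project=False,
                              no_prompt=False):
    """Assemble argv for kmds-summary-log (hypothetical helper, not a KMDS API)."""
    cmd = [
        "kmds-summary-log",
        "--summary", summary,
        "--workflow-name", workflow_name,
        "--project-file", project_file,
    ]
    # Omitting the workflow type leaves the choice to the interactive prompt.
    if workflow_type is not None:
        cmd += ["--workflow-type", workflow_type]
    if create_project:
        cmd.append("--create-project")
    if no_prompt:
        cmd.append("--no-prompt")
    return cmd


def run_if_available(cmd):
    """Run the command only when the executable is installed; return True if it ran."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


cmd = build_summary_log_command(
    summary="Missing category labels were found in intake data.",
    workflow_name="support_reporting_intake",
    project_file="./support_reporting_intake.xml",
    workflow_type="application",
    create_project=True,
    no_prompt=True,
)
# Print a shell-quoted preview without executing anything.
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when the summary text contains quotes or other special characters.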
4. Export Executive Summary (CLI)
You can export a non-technical executive summary from a KMDS project file.
kmds-exec-summary \
  --project-file ./support_reporting_intake.xml \
  --output-file ./support_reporting_exec_summary.txt
Optional LLM mode (falls back to local summary if API/model is unavailable):
kmds-exec-summary \
  --project-file ./support_reporting_intake.xml \
  --output-file ./support_reporting_exec_summary.txt \
  --use-llm \
  --model gemini-1.5-flash
Markdown output option:
kmds-exec-summary \
  --project-file ./support_reporting_intake.xml \
  --output-file ./support_reporting_exec_summary.md \
  --format markdown
5. Natural Language Observation Ingestion
KMDS can classify a free-form natural language statement into the existing KMDS observation schema, extract structured entities, and either return a summary or log the result into a KMDS knowledge base.
Summary mode example:
kmds-observe \
  --text "The model accuracy dropped by 5% after pruning on 2026-04-20." \
  --mode summary \
  --output-format json
Log mode example for a new project:
kmds-observe \
  --text "Missing values were observed in the customer_age field during intake validation." \
  --mode log \
  --workflow-name "support_reporting_intake" \
  --project-file ./support_reporting_intake.xml \
  --workflow-type application \
  --create-project
Python API example:
from kmds.utils.natural_language_observation import map_text_to_observation

mapping = map_text_to_observation(
    "We engineered a rolling 7 day demand feature from timestamped order counts."
)
print(mapping.workflow_family)
print(mapping.observation_type)
print(mapping.extracted_entities)
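To make the idea of "classify, then extract structured entities" concrete, here is a toy, standard-library illustration. It is not how KMDS works internally: the keyword table, the labels, and the `ToyMapping` type are all assumptions invented for this sketch, and the real classifier and its return type are not described in this README.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToyMapping:
    """Illustrative stand-in for a mapping result; not the KMDS return type."""
    observation_hint: str
    extracted_entities: dict = field(default_factory=dict)


# Hypothetical labels and keywords, chosen only for this example.
KEYWORD_HINTS = {
    "data quality": ("missing", "null", "duplicate"),
    "feature engineering": ("engineered", "feature", "rolling"),
    "model performance": ("accuracy", "precision", "recall"),
}


def toy_map_text(text):
    """Assign the first matching keyword label and pull out dates and percentages."""
    lowered = text.lower()
    hint = "unclassified"
    for label, keywords in KEYWORD_HINTS.items():
        if any(keyword in lowered for keyword in keywords):
            hint = label
            break
    entities = {
        # ISO-style dates such as 2026-04-20
        "dates": re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text),
        # Percentages such as 5% or 12.5%
        "percentages": re.findall(r"\b\d+(?:\.\d+)?%", text),
    }
    return ToyMapping(hint, entities)


mapping = toy_map_text("The model accuracy dropped by 5% after pruning on 2026-04-20.")
print(mapping.observation_hint)
print(mapping.extracted_entities)
```

The point of the sketch is the shape of the result: a free-form sentence becomes a category plus a dictionary of entities that can be stored and queried later.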
6. Semantic Search (CLI)
Build a vector index from a KMDS knowledge base and retrieve relevant findings with a natural-language query. No API key required.
kmds-search \
  --kb ./support_reporting_intake.xml \
  --query "What data quality issues were found?" \
  --n-results 5
Or from the Python API:
from kmds.search import SemanticIndex

idx = SemanticIndex()
idx.build("./support_reporting_intake.xml")
results = idx.search("What data quality issues were found?", n_results=5)
for r in results:
    print(r["obs_type"], "|", r["finding"])
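For readers new to vector search, the build-then-query pattern above can be illustrated with a toy index that needs only the standard library. This is a simplified stand-in, not the KMDS implementation: it scores findings by cosine similarity over raw word counts, whereas a real semantic index would normally use learned embeddings. The `ToyIndex` class and its methods are hypothetical names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """Bag-of-words index (illustrative only, not the KMDS SemanticIndex)."""

    def __init__(self):
        self._docs = []

    def build(self, findings):
        # Store each finding alongside its word-count vector.
        self._docs = [(f, Counter(tokenize(f))) for f in findings]

    def search(self, query, n_results=5):
        q = Counter(tokenize(query))
        scored = [(cosine(q, vec), text) for text, vec in self._docs]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Drop findings that share no words with the query.
        return [text for score, text in scored[:n_results] if score > 0]


findings = [
    "Missing values were observed in the customer_age field during intake validation.",
    "We engineered a rolling 7 day demand feature from timestamped order counts.",
]
idx = ToyIndex()
idx.build(findings)
for hit in idx.search("Which field had missing values?", n_results=5):
    print(hit)
```

Word-count matching only finds findings that share literal words with the query. A query such as "What data quality issues were found?" would miss the first finding here, which is the gap that embedding-based semantic search is meant to close.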
7. LLM Search Orchestrator (CLI)
Ask a free-form question. The orchestrator uses an LLM to route it to the best-matching KMDS observation-query template, executes the template, and synthesises a plain-English answer. It falls back to semantic search automatically.
export GOOGLE_API_KEY="your-api-key"
kmds-ask \
  --kb ./support_reporting_intake.xml \
  --query "What assumptions drove the final model selection?"
The full documentation covers custom LLM functions, available routing templates, and output formats.
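The route-then-fall-back control flow described in this section can be sketched in a few lines. In the sketch below, simple keyword matching stands in for the LLM router so that it runs offline, and the template names, handlers, and `toy_semantic_search` function are all hypothetical; the actual routing templates are listed in the full documentation.

```python
def route_question(question, templates, semantic_fallback):
    """Return (route_name, answer) for a question.

    Keyword matching stands in for the LLM router here (an assumption made
    so the sketch runs without an API key).
    """
    lowered = question.lower()
    for name, (keywords, handler) in templates.items():
        if any(keyword in lowered for keyword in keywords):
            return name, handler(question)
    # No template matched: hand the question to semantic search instead.
    return "semantic_fallback", semantic_fallback(question)


# Hypothetical template names and handlers, for illustration only.
TEMPLATES = {
    "model_selection": (
        ("model selection", "final model"),
        lambda q: "observations logged at the model selection stage",
    ),
    "data_representation": (
        ("feature", "encoding"),
        lambda q: "observations logged at the data representation stage",
    ),
}


def toy_semantic_search(question):
    """Placeholder for a semantic search call."""
    return f"nearest findings for: {question}"


print(route_question("What assumptions drove the final model selection?",
                     TEMPLATES, toy_semantic_search))
print(route_question("Who owns the intake dashboard?",
                     TEMPLATES, toy_semantic_search))
```

Keeping the fallback as an injected function means the structured path and the semantic path can be tested separately.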
This repository includes two detailed examples:
- Analytics Example: Evaluates the effectiveness of a ticket resolution help desk.
- Machine Learning Example: Uses Principal Component Analysis (PCA) to summarize online store sales activity.
🤝 Contributing
We welcome contributions! If you have an idea for a new feature or would like to report a bug, please open an issue. If you'd like to contribute code, please fork the repository and submit a pull request.
📄 License
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
📞 Contact
If you have questions or are interested in any of the following, please schedule a meeting:
- Help with a data analysis task for your use case.
- Developing a custom ontology-based solution.
- Integrating KMDS with other tools in your data science stack.