Show HN: Deidentification, Python tool for removing personal info using NLP

github.com

1 point by jftuga a year ago · 0 comments · 1 min read

I created a Python library and CLI to automatically identify and remove personal information from text documents using Natural Language Processing. It has been used to de-identify internal employee surveys and patient satisfaction surveys.

What my project does:

* Identifies and replaces person names using spaCy's transformer model

* Converts gender-specific pronouns to neutral alternatives

* Handles possessives and hyphenated names

* Offers HTML output with color-coded replacements

___

Here's a quick example:

    Input: John Smith's report was excellent. He clearly understands the topic.
    Output: [PERSON]'s report was excellent. HE/SHE clearly understands the topic.
___
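
The pronoun step in the example above could be sketched roughly as follows. This is only an illustration: the `NEUTRAL` table and the `neutralize_pronouns` function are hypothetical names, and the library's real mapping and handling of edge cases may differ.

```python
import re

# Hypothetical mapping for illustration only; not taken from the library.
NEUTRAL = {
    "he": "HE/SHE",
    "she": "HE/SHE",
    "him": "HIM/HER",
    "his": "HIS/HER",
    "her": "HIS/HER",  # ambiguous (object or possessive); simplified here
    "himself": "HIMSELF/HERSELF",
    "herself": "HIMSELF/HERSELF",
}

# \b anchors keep the pattern from matching inside words such as "The".
PRONOUN_RE = re.compile(r"\b(" + "|".join(NEUTRAL) + r")\b", re.IGNORECASE)


def neutralize_pronouns(text: str) -> str:
    """Replace gendered pronouns with a neutral placeholder."""
    return PRONOUN_RE.sub(lambda m: NEUTRAL[m.group(1).lower()], text)


print(neutralize_pronouns("He clearly understands the topic."))
# HE/SHE clearly understands the topic.
```

A plain lookup table cannot tell whether "her" is an object or a possessive, which is one reason a real tool might lean on part-of-speech information instead.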

This was a fun project to work on, especially solving the challenge of keeping character positions correct during replacements. Processing the matches from the end of the text backwards was a neat solution: replacing later spans first leaves the offsets of earlier spans untouched, so nothing has to be recalculated after each replacement.
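
A minimal sketch of the backwards processing idea, assuming the matches are already available as non-overlapping character spans. The `Span` type and `replace_spans` function are hypothetical names for illustration, not the library's API, and the offsets are written out by hand here rather than produced by an NLP model.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int        # inclusive character offset
    end: int          # exclusive character offset
    replacement: str


def replace_spans(text: str, spans: list[Span]) -> str:
    """Apply non-overlapping replacements, last span first."""
    # Working from the highest start offset down means each edit only
    # changes text after the spans still waiting to be processed, so
    # their offsets stay valid even when lengths differ.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + span.replacement + text[span.end:]
    return text


text = "John Smith's report was excellent. He clearly understands the topic."
spans = [
    Span(0, 10, "[PERSON]"),  # "John Smith"
    Span(35, 37, "HE/SHE"),   # "He"
]
print(replace_spans(text, spans))
# [PERSON]'s report was excellent. HE/SHE clearly understands the topic.
```

Going front to back instead would shift every later offset by the length difference of each replacement ("John Smith" to "[PERSON]" is 2 characters shorter), which is the bookkeeping the reverse order avoids.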

* blog post: https://gitgist.com/posts/introducing-deidentification-pytho...

* github: https://github.com/jftuga/deidentification

* PyPI: https://pypi.org/project/text-deidentification
