02/26/2026
I spent the last two days fighting PostgreSQL, Python, and my own stubbornness. But I have a new portfolio project to show for it!
What is it? A De-Identification Pipeline, built from scratch.
Working in the healthcare data space, I know how important it is to keep patient data secure. I had even started a project at my last company to de-identify its data, but I left before it was finished. This project let me explore what could have been done.
The idea was simple: simulate how real healthcare organizations protect patient data while still making it useful for analytics.
The execution wasn't so simple, though. Between Faker generating phone numbers in scientific notation, CSV import permission errors (just let me import the fields into the columns I want, regardless of the data type!!), and a SERIAL sequence starting at 9 instead of 1 because of those import permission errors *rolls eyes*, let's just say this project tested my patience as much as my skills.
But here's what I built:
- a raw_data schema with three normalized tables (patients, medical_records, insurance) holding full PII and PHI
- three de-identified analytics views using HIPAA Safe Harbor techniques:
* Masking (SSN, phone, names)
* Redaction (last name, address)
* Generalization (DOB > birth year, ZIP > first 3 digits)
* Tokenization (SHA-256 hashed IDs)
- a comment-heavy SQL script that walks through my process step by step and contains the code to create the schemas, tables, views, and role
- a detailed Python script that uses the Faker library to generate 50 fake patient records across three CSV files, ready for import into PostgreSQL
- Role-Based Access Control - a data_analyst role that can query the analytics schema but gets a permission error if it tries to query raw_data
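To make the four techniques in the views concrete, here's a minimal Python sketch of each one. This is not the code from the repo (the actual views are written in SQL); the function names, the "demo-salt" value, and the sample record are all made up for illustration, and salting the hash is my own assumption rather than something the project necessarily does.

```python
import hashlib
from datetime import date


def mask_ssn(ssn: str) -> str:
    """Masking: hide everything except the last four digits."""
    digits = [c for c in ssn if c.isdigit()]
    return "***-**-" + "".join(digits[-4:])


def mask_name(name: str) -> str:
    """Masking: keep the first letter, replace the rest with asterisks."""
    return name[:1] + "*" * (len(name) - 1)


def redact(_value: str) -> str:
    """Redaction: drop the value entirely and leave a fixed placeholder."""
    return "[REDACTED]"


def generalize_dob(dob: date) -> int:
    """Generalization: reduce a full date of birth to the birth year."""
    return dob.year


def generalize_zip(zip_code: str) -> str:
    """Generalization: keep only the first three digits of the ZIP code."""
    return zip_code[:3]


def tokenize_id(patient_id: int, salt: str = "demo-salt") -> str:
    """Tokenization: replace the ID with a SHA-256 hex digest.

    The salt is a hypothetical addition for this sketch.
    """
    return hashlib.sha256(f"{salt}:{patient_id}".encode("utf-8")).hexdigest()


def deidentify(record: dict) -> dict:
    """Apply one technique per field to build an analytics-safe row."""
    return {
        "patient_token": tokenize_id(record["patient_id"]),
        "first_name": mask_name(record["first_name"]),
        "last_name": redact(record["last_name"]),
        "ssn": mask_ssn(record["ssn"]),
        "birth_year": generalize_dob(record["dob"]),
        "zip3": generalize_zip(record["zip"]),
    }


# Entirely fake sample record, in the spirit of the Faker-generated data.
sample = {
    "patient_id": 1,
    "first_name": "Jane",
    "last_name": "Doe",
    "ssn": "123-45-6789",
    "dob": date(1984, 7, 12),
    "zip": "46201",
}

clean = deidentify(sample)
print(clean)
```

The same token always comes out for the same patient ID, so de-identified rows can still be joined across tables without exposing the original key.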
The best screenshot of this project? ERROR: permission denied for schema raw_data. I've never been happier to see an error in my life.
For those transitioning into this space or looking to demonstrate compliance awareness: understanding the WHY behind de-identification, not just the how, is what sets you apart.
Full project on GitHub: https://github.com/jlynne2004/hipaa-deidentification-pipeline