HDG #034: 100+ open-source health data sets

 

Happy new year, my friends!

Last month I announced that HDG would be going down to monthly for a number of reasons, one of which was to give me ample time to compose resource-rich, actionable articles + health data how-to’s.

This is one of those.

I also posted my 2024 healthcare predictions (since everyone else was) and toasted to better data, new data, and old data being put to better use in 2024! 🥂🥂🥂 

 
 

As such, you can imagine how excited I am to have this be our first edition of 2024 and very first monthly edition. In fact, I’m a little embarrassed to admit I’ve been working on compiling this for you for almost a year now, even with ChatGPT’s assistance. So without further ado…

. . .

I present to you 100+ open-source health data sets.

(🎉 that means free to the public, my absolute favorite thing! 🎉)

Visit the full-size table view here or download as a CSV above.

If you know of other data sets that should be on this list, submit them to me for review here.

. . .

This is a good starting point to hit when you’re looking for health-related datasets.

I learned about a ton of new data that exists out there.

And though this list is certainly not exhaustive and will probably grow over time as people ask me to add to it or I find new sources that belong on this list, it does provide a wide range of starter points.

A caveat: know that each of these comes with its own nuances.

  • I tried to ensure the majority had a way to download all the raw data in an easily accessible format, but some that didn’t still made the list because of their content or the importance of their information

  • Not all of them may be at the level of granularity you seek (state, county, city, zip code, or census tract, for instance)

  • While 99% are open source (free to the public), some include additional links where you can license or purchase additional, more granular detail. Again this is not my preference when it comes to public data, especially government, but I understand restrictions due to HIPAA, identifiable small numbers, and that it costs a lot of money to maintain, curate, and provide this kind of information. Businesses have to eat, and many of these government entities outsource/subcontract to a non-government entity to collect, maintain, curate, and make this information available.

    . . .

In compiling this, I saw commonalities among these public data sites.

These seem like low-hanging fruit to me to significantly improve the user experience.

  1. All this data creates a LOT of NOISE for people just trying to find even the most basic of information. Many of the sites are overwhelming, outdated, text dense, and hard to navigate. It is no wonder people can’t find the data they need and get overwhelmed trying. Many of the sites are so complex and have so much technical and healthcare jargon you must know in order to navigate and make use of the site or data, it is probably prohibitive to more people than we’d like to admit, even those working in healthcare or data.

  2. The actual button to download the dataset is extremely nested/hidden/buried under text/way at the bottom… if it is easy to find at all. This seems odd to me since I assume that 90%+ of traffic to these pages is coming to find and download the data. Why not make it big and obvious? Some of them are doing a good job putting a link to their visualization or dashboard front-and-center, but the option to download the raw data remains largely hard to find.

  3. Only about 50% of the sources cited had an accompanying, similarly easy-to-find Data Dictionary co-located with the data. A common issue I have with public sources like this is knowing what field names mean, what values/value sets are, and where the data comes from. This should be part and parcel of any public data set, imho.

  4. The Federal government seems to assume we all use SAS and Stata. While most of them have a download option for a spreadsheet, text table, or otherwise, it can often require more work/time.

  5. CMS’s website is full of broken links and announcements of old data tools, and it is damn near impossible to find what you’re looking for. This surprises no one reading this article, but it seems time they really invest in revising this. Easier said than done, as the sheer volume of information (current and historical) that needs to be on this site is enormous. Could LLM search tools help?

  6. Along those lines, data.cms.gov has gotten almost equally unwieldy. It seemed like a good idea at the time: centralize all the CMS data (of which there is commendably a lot) into a “one stop shop” for data users. But there are so many different types of data housed here that the data you’re looking for rarely comes to the top of the search, the lingo is very challenging for a non-healthcare user, and some of the most common datasets are buried or seemingly non-existent in the search by the name most of us would search for them (hospital compare, public use files, fee schedule/rvu list, etc). I may be off base in my assumption about how “most of us would search for them,” but these are very common vernacular for these data.

  7. Many sites are starting to publish their own visualizations, dashboards, and exploratory tools. This is great if you want to come find some quick info. If they could also make the download raw data option more easily accessible for those of us who need to go deeper, that would be ideal.

  8. Along those lines (again), some Federal sites known for publishing tons of public data (CDC, CMS, etc.) are now publishing “Fast Stats” or “Fast Facts” pages that collate the most common datasets into a list or provide quick high-level statistics commonly searched for their programs.

  9. Unrelated but worth mentioning since I relied on it to help me with this monumental task: ChatGPT’s output quality seems to be declining. Turns out, others are noticing and there are even some studies starting to pop up. There was a noticeable difference in the results and its level of comprehension of my detailed prompts between when I first started compiling and vetting this list back in April 2023 and now, December 2023. While some things may have slipped through, I still visited and explored every single website and copied every single link by hand to ensure that this information was as accurate and up-to-date as possible. And if I could find specific links to the data download or the actual page users would need to find to get the download button, I also included it for you. You’re welcome!
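
On point 3 above, here is a rough sketch of why a co-located data dictionary matters in practice: once you have one, turning coded values into readable labels takes only a few lines. Everything in this example is made up for illustration, including the dictionary layout (columns named field, code, and label), the field names, and the sample rows. Real data dictionaries vary a lot by source, so treat this as a starting point under those assumptions rather than a recipe for any specific dataset.

```python
import csv
import io

# Hypothetical data dictionary: one row per (field, code) pair.
# The column names "field", "code", and "label" are assumptions.
DICTIONARY_CSV = """field,code,label
SEX,1,Male
SEX,2,Female
INSURED,0,No
INSURED,1,Yes
"""

# Hypothetical raw extract that uses coded values.
DATA_CSV = """SEX,INSURED,AGE
1,1,34
2,0,58
9,1,41
"""


def load_value_labels(dictionary_file):
    """Build {field: {code: label}} from a data dictionary CSV."""
    labels = {}
    for row in csv.DictReader(dictionary_file):
        labels.setdefault(row["field"], {})[row["code"]] = row["label"]
    return labels


def decode_rows(data_file, labels):
    """Yield rows with coded values replaced by labels where one is known.

    Codes that are missing from the dictionary are left untouched so
    they stay visible instead of silently disappearing.
    """
    for row in csv.DictReader(data_file):
        yield {
            field: labels.get(field, {}).get(value, value)
            for field, value in row.items()
        }


labels = load_value_labels(io.StringIO(DICTIONARY_CSV))
decoded = list(decode_rows(io.StringIO(DATA_CSV), labels))
for row in decoded:
    print(row)
```

Note that the third sample row has a SEX code of 9, which the made-up dictionary does not define, so it passes through unchanged. That is exactly the kind of value you end up guessing at when a source publishes no dictionary at all.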

    . . .

Please send me a note if you find any inconsistencies, broken links, or have suggestions for other data sources that need to be on this list.

Let’s make this another public resource that we can use/share/improve to further our industry, our research, and the data available to us.

Because there is certainly a lot of data out there, but if our peers still feel that they don’t have the data they need, then either it is not hitting the mark, they don’t know it exists, or it is unusable in some way for their specific need.

And I don’t doubt it… but maybe this list can help us start bridging at least some of the gaps.

. . .

Until next time,

-Stefany



P.S. If you have any ideas, suggestions, feedback, or requests for specific topics, I’m always open. Hit reply and let me know!!

 

Like my content?

If you want to learn more about health data quickly so you can market yourself, your company, or just plain level up your health data game, I recommend subscribing to this newsletter and checking out my free Guides. Courses and more resources are coming soon, so check back often!

Want to work together?

I work with healthtech startups, investors, and health organizations who want to transform healthcare and achieve more tangible, equitable outcomes by using data in new ways. Book some time with me to talk health data, advice for healthtech startup and investment, team training+workshops, event speaking, or fractional support + analytics advisory.

And follow me on LinkedIn, Twitter, and Medium to stay up-to-date on resources and announcements!

 