The Design Data List
tl;dr: I've made a GitHub Repo for linking to known Design Data resources. Check it out and add to it if you or someone you know has design data somewhere.
One theme to come out during this year's IDETC Design Theory and Methodology session on Creativity and Ideation was the need for increased data sharing and reproducibility in our community. We do have a beginning culture of this, particularly in the CAD and Design Computing sub-fields. But cultural change takes time, and we have struggled to find a central place to share and store data like the Machine Learning community has managed to do with the UCI Machine Learning Repository and MLoss. Jami Shah's group has valiantly tried to create a central repository through the ASU Design Protocol Repository, and Rob Stone's group has provided a product repository for several years, but the practice of reproducibility and data sharing is not yet widely adopted.
I think this comes down to the perceived difference between benefits and costs:
- Benefits: The benefits of sharing code and data aren't as well understood within our community as they are in other communities (Computer Science, Psychology, Economics, etc.). This is something that will take time to educate the community about, and is not the purpose of this post (though this would be a good topic for a future post).
- Costs: Sharing data or code is seen as arduous (which might or might not be true depending on how it is done). Even if one wanted to follow Uri Simonsohn's advice to "Just Post It", it's often unclear how or where to post your data/code to maximize impact and minimize work. Do you just post a zip file on your website, transfer it to a central community repository, or go all-in by publishing your entire workflow on something like the Open Science Framework? This is what this post attempts to address.
After chatting with folks at the past two IDETCs about this (notably Alex Burnap and Karthik Ramani), I decided to address the Cost side of the equation by borrowing an idea I came across while trying to keep track of the ever-evolving landscape of Machine Learning Libraries: rather than asking folks to pick some standard method of sharing their data, I created a GitHub Repo that just aggregates links to all of the design data sources that I know about. This works for me for several reasons:
- Contributing is easy: it's just a set of text files. The original authors or any third party can add a link. It's completely open.
- Since it just provides links, researchers don't need to do anything fancy to share their data: they can choose to just post it wherever is convenient, or they can use more formal mechanisms like central repositories. It's all treated equally; I (or anyone else) just need to add a link to it.
- It's a low-effort way to highlight some of the great open-source work we do in our community, and to visualize our collective efforts in one place. In NSF parlance: it helps increase our work's broader impact beyond just the papers we write.
- GitHub's version tracking and commenting system give us a lightweight but potentially useful means to have a discussion around the purpose and organization of the list. Their pull request system means we can collaborate on it together and change it over time.
Will this be the permanent solution to bring reproducibility to the design community? I hope not. But it's a start; a way for folks to test the waters and establish a culture of reproducibility before trying their hands at something better-suited, like the OSF.