Human-Centered Data Science Lab

HDS Lab members advance to candidacy

Posted by Cecilia Aragon on January 26, 2016

HDS Lab members Ray Hong and Katie Kuksenok have passed their exams and advanced to PhD candidacy. Ray’s doctoral research in Human Centered Design and Engineering focuses on developing a methodology for distance cartograms, and Katie’s Computer Science dissertation is titled “Adoption and Adaptation of Programming Practices in Oceanography.”

Undercurrents at the DSE Summit

Posted by Cecilia Aragon on November 05, 2014

The Data Science Environment (DSE) Summit took place in beautiful Monterey, CA at the Asilomar Conference Center. The Summit brought together over a hundred participants across three universities (UW, UC Berkeley and NYU) involved in the Moore and Sloan Foundations’ Data Science Environment grant.

As a data science ethnographer, I typically take on the role of participant-observer of various data science events, but at the DSE Summit I ended up being more of a participant than an observer. The high degree of participation made it challenging at times to listen as closely as I would have wanted for underlying rhythms and patterns across the group. However, by participating in the discussion sessions and interactions, I identified some important undercurrents. I draw these undercurrents out into two main themes that I discuss in this post.

image of Monterey coastline

Photo credit: Kevin Koy

Imagining a Data Science Environment

I participated in particular sessions, including teaching and curriculum development, data science ethnography, and the ethnography and evaluation working group, which were all, to some degree, imagining a data science environment. Underlying these discussions were questions of where and what exactly is data science? Where is it located and when does it matter? What are the origins and the goals? These questions bubbled up in conversations imagining a curriculum for data science, career paths in data science, and the very structuring of a data science community within academia. As these various concepts were discussed and imagined, there was a fracturing and multiplication of perspectives around these questions that sparked a bit of confusion.

There was some clear discomfort with the uncertainty and messiness around these issues. Many people seemed to be craving concrete definitions and the move towards formalization of data science, while others seemed content to not know their specific position or where the ship was headed exactly, opening themselves up to being influenced by the experimentation yet to come. Allowing ourselves as a community to sit with the uncertainty and messiness while we try things out and wrap our heads around the implications is very much part of the goal of these five years. As an ethnographer, I am drawn to messiness, so I was feeling quite comfortable in these conversations.

I am going to attempt to disentangle a few strands of conversation in the unconference session on data science ethnography which focused on different approaches to data science. This discussion drew on a previous conversation about how different ways of approaching data science can emphasize a more individual-oriented view versus a community-oriented view. This conversation helped shift the focus from data science as residing within an individual to data science enacted at the community level.

This included thinking through the implications of different metaphors, such as T to pi (Π) to gamma (Γ), for characterizing the shape of data scientists of the future. The shape of a pi-shaped scientist implies an expectation that individuals have expert-level depth of knowledge in two domains, whereas a gamma-shaped scientist would have expert-level depth in one domain and be versed and proficient in another. The addition of other metaphors, such as gamma-shaped individuals, expands our imagination for what this data science environment might look like. It also has implications for how people may or may not identify as data scientists or as playing various roles within the data science community. This gets at the heart of the question, “What are we building?”

This question of “What are we building?” opens up other fractures in perspectives around how applied or theoretical data science is and “should” be. As the term data science has accumulated many meanings across industry and academia, there are distinctions many people wanted to make around data science in a research sense, data science in an applied sense, and data science in a professional sense. Was data science going to be its own discipline, its own department, or was it simply a new dimension of the work of all other domains? Is data science always applied? What would its body of theory look like? What are the political implications of these different imaginations of data science for issues like status and careers?

A related strand of conversation emerges around the question, “Where are we building it?” A strong current across the discussions imagined data science coming out of statistics and/or computer science, which alienated some people within the group who did not see it that way. Others wanted to frame data science as an integral part of all domain sciences in this data-intensive age. These different imaginations require a language and infrastructure we don’t yet have and must build. What would it mean to be neither and both of these in the institutional context of the university?

Another strand emerges around the question, “How and when is data science?” This strand of conversation was first dominated by talk about “data producers” and “data consumers”. This characterization implied ways that the work of data science was being divided up. But by the end of the conversation, these oversimplified categories fell short of describing a more complex ecosystem of data science. First, there are individuals and practices that embody both consumption and production. Second, these categories don’t encompass the mediator roles and mediation practices that are integral to the data science environment. These roles and practices involve the work of translating, connecting, and often innovating in the interstices.

The conversation around community-level data science and its relation to T, pi, and gamma appropriately ended with a move to focus on the “horizontal line”: the connections and intersections among these various disciplines, and the mediator roles and practices that support research translation. What emerged was that the conversation about the character and future of the “horizontal line” is perhaps as important as, or more important than, a conversation about the number of legs or their length (pi versus gamma). This focuses us on the translation work and the supporting infrastructures that function to forge and maintain connections across legs. Part of the important impact I think the ethnography group can have is in making visible the character of different horizontal lines and in better understanding how they function and what they imply for developing a data science environment.

Collaboration in Context

Throughout the Summit, collaboration was often referred to and leveraged as an abstract principle that reigned over all of the activities of the grant. People talked at high levels about the goal of collaboration mostly ungrounded in what, when, where, how, and why. The overbroad terms in which collaboration was talked about potentially obfuscated the many levels, layers, and conditions at which differently configured collaborations may occur. Collaboration is multiple things, and importantly, it is negotiated within a multitude of circumstances and values.

The spaghetti and marshmallow hack (the community building activity we did on Tuesday morning) aimed to have the group experientially engage with the inextricable relationship of collaboration and the performance of tasks at hand.

Marshmallow and spaghetti sculpture

Photo credit: Gina Neff

For example, in many cases there is a fragile balance between attending to the work of collaboration and getting things done. Collaboration does not exist within a vacuum without constraints or consequences. We hope that the goals of the collaboration are aligned with measures of performance, but this is not always the case. Further, as anyone who has ever collaborated knows, collaboration requires time and energy beyond the task itself. Yes, we want to incentivize more collaboration across domains and groups, but most importantly we want to learn about how we configure effective collaborations, what different roles are important, and how different forms of value can be strategically distributed across participants.

Collaboration as a goal may make a lot of sense at the level of an individual’s specific research question when this question requires multiple types of expertise to answer, but at other levels such as collaborating across institutions to support a data science environment, the value and the incentives of collaboration may be harder to assess and determine for individuals. What our three universities together with Moore/Sloan are trying to learn about and develop on an institutional level shifts everyone’s focus beyond the particular research at hand to the hard work of building the infrastructures and cultivating the relationships, cultural norms, and values that are necessary for supporting a thriving data science environment.

Interestingly, the concrete work of building collaboration (the how, where, when, and why, one might say) didn’t get discussed until the Wednesday morning unconference, when many had left and many who were still there were “raptured” in important high-level meetings. The group that was left was made up of a mix of graduate students, postdocs and research scientists, but no faculty. About 12 of us discussed exactly how connection and communication would continue after the Summit. What infrastructure would support the exchange of ideas and the conversations we had begun to have here?

We discussed the role of a chat room and the needs of different working groups for sharing and connecting across the campuses. This is work that needs to happen to ground any kind of collaboration. This was what the 12 people left on the final morning thought was most important to discuss over any other data science topic. These people didn’t just discuss these questions; they generated ideas, innovated around these ideas, and executed these ideas. There is now an MS-DSE chat room set up and multiple ongoing conversations about how to connect and communicate within and across campuses. I felt inspired by this session as I got the sense that these interactions represented the movement taking root and beginning to grow!

Hacked Ethnographic Fieldnotes from Astro Hack Week

Posted by Cecilia Aragon on October 29, 2014

First posted at the Astrohackweek blog

What is data science ethnography anyway?

As an ethnographer of data science, I immerse myself in particular communities to understand how they make sense of the world, how they communicate, what motivates them, and how they work together. I spent a week at astro data hack week, which might as well have been a foreign culture to me. I participated as an active listener, trying to sensitize myself to the culture and discern patterns that may not be self-evident to people within the community. Ethnography can have the effect of making the ordinary strange, such that the norms, objects, and practices that the community takes for granted become fascinating, informative sites for learning and discovery. Many of the astro hackers were probably thinking, “Why is this woman hanging around watching me code on my laptop? There is nothing interesting here.” But I assured them it was interesting to me because I was seeing their everyday practice in the context of a complex social and technical world that is in flux.

Ethnography can be thought of as a form of big data. Typically hundreds of pages of fieldnotes, interview transcripts, and artifacts from the field would be recorded over a long period of time until the ethnographer determines they have reached a point of saturation. The analysis process co-occurs with the data collection, iteratively shaping the focus of the research and observation strategy. Across this massive dataset with an abundance of unwieldy dimensions, the ethnographer has to make sense. The ethnographer works with members of the community to help them interpret what they are observing. Ethnographic insights, what many may term “findings”, emerge as patterns and themes are detected. Theory and new questions are generated, rather than tested. In this process I also acknowledge my own biases and prior assumptions and use them as ways to probe deeper and understand through them rather than ignore them. For instance, I came to astro data hack week not understanding much of anything people were talking about. It made me prone to feeling intimidated, and I recognized in this intimidation my own reticence to ask questions. My own experience with this feeling helped me identify others who were feeling variations of it, and also identify what helped transform that feeling over the week into a more comfortable and curious state.

I only spent 5 days among the community of astro hackers, but in the spirit of hacking, I have a few “hacked” fieldnotes to share. Sharing is a key component of the hack week and as a participant I feel it is important to follow suit. But bear in mind these thoughts are preliminary. So, what have I been working on this week?

Initial descriptive observations from an outsider (a little tongue-in-cheek, forgive me):

Astro hackers live in a very dusty, dirty, and noisy environment! Very hard to keep clean and elaborate measures are taken to obtain a signal. But when the signal is too strong or the data too clean, there is a feeling of mistrust.
The common language is Python, although there are many other dialects, some entirely made of acronyms, others sound like common names, such as George and Julia.
When talking there is always some form of data, documentation or model that mediates the conversation, whether it is on the white board, on the screen, or through representational gestures.
Although most people are studying something that has to do with astronomy, they can literally be operating on “different wavelengths”!
Astro hackers play with “toys” and “fake data” as much as “real world data”!
Coffee and beer fuel interactivity!

Themes

Josh Bloom teaches Machine Learning

Data science at the community level: From T to Pi to Gamma-shaped (Josh Bloom’s term) scientists

Across the group I heard over and over again, in various ways, reference and deference to those who are more expert, those who are smarter or those who know more than I do. Granted, this is a somewhat common occurrence in the culture of academia as we are continuously humbled by the expertise around us. However, I found this particularly acute and concentrated within this community. What I heard across students, postdocs, and research scientists was more than the typical imposter syndrome. It was the feeling that they are expected to be experts or at the very least fluent in a range of computing and statistical areas in addition to their own domain. While this motivates people to be at a hack week such as this, it can also have the unintended effect of making people intimidated and overwhelmed with having to know everything themselves. This can have a chilling effect across the community. The feeling that other people know more than they do is pervasive, and it often leads people to think their questions aren’t valuable for the rest of the group and therefore not worth sharing. This is a negative effect that we want to minimize. Not only is it bad for morale; it is bad for science. We should consider who feels comfortable taking a risk in these settings. A risk might be asking a question that they fear isn’t scientifically interesting for others, or sharing something that isn’t complete or isn’t perfect. If we take what Josh Bloom says, that we might be better off thinking about data science on the community level, happening in a more distributed way, rather than data science on the individual level, we can begin to paint a different picture and change some of the expectations that may trigger this negative effect.

Josh Bloom’s lecture on machine learning explained the popular idea of “Pi-shaped” individuals (a buzzword for the academic data science community) and his preference for talking about “Gamma-shaped” individuals. Rather than promote the idea that there is an expectation of individuals having expert-level depth in two domains, which is unrealistic for the majority of people, what if we thought of people as Gamma-shaped? These people would have expert-level depth in one domain and also be versed and proficient in other domains. Someone with their PhD in biology may be conversant in the language and culture of computer science enough to have conversations and collaborate, but they don’t necessarily need to be an expert in computer science to the extent that they are able to advance the discipline. These Gamma-shaped individuals can work with each other to bridge multiple domains of expertise. This Gamma symbol better reflects the makeup of individuals in this astro hack week community, and this view of data science allows for the expectations to shift to the community and to the collaborative interactions between people. This shift is important and has implications for thinking about how to better structure hack week. For instance, with these tweaked expectations a learning goal of the hack week might be working together across Gamma-shaped individuals.

Categorizing hacking interactions

I categorized the different kinds of hacking interactions I observed over the course of the week. This list is not meant to be exhaustive, but it might be helpful in understanding the diversity of interactions and how to facilitate the types of hacking interactions desired.

Resource Seeking: An individual works on their hack idea and uses other people as sources of expertise when they need help.
Asymmetrical Synergy: A pair or small group joins together to work on a hack idea in which one person is learning something, such as an algorithm, and the other has more advanced knowledge and is exploring what that algorithm can do. They are generating something together but getting different things out of the activity.
Symmetrical Synergy: A pair or small group joins together to work on a hack idea and iteratively discovers how their expertise informs the other, or how interests synergize. Then, they generate something new together.
Comparing Notes: An individual works on their hack idea and shares it with others based on their common interest, talking about the work more broadly and loosely.
Learning Collective: A semi-structured activity that draws multiple people in to learn something collectively, thus creating a learning collective.

The Importance of “Connective Tissue”

Across this community there is great diversity across institution, dataset, data source, methodology, computing tools and packages, statistical approach, status within academia, and level of knowledge in different arenas. This creates many opportunities for discovering connections, for sharing, and working together. Yet this also presents challenges for forging these connections especially within the broader academic environment which in many ways doesn’t incentivize collaboration and “failing fast”. Failing fast refers to the capacity to be highly experimental, to take risks, and invest a little bit often, such that when things don’t work, it is framed much more as part of the iterative process rather than as a significant loss. In a culture where people are failing fast, people are more likely to take risks and learning can happen more rapidly.

A key role that emerged this week was the set of capacities for facilitating connection across people and ideas, what Fernando Perez has called the “connective tissue”. There is a need for both the people and the organizational structure that support social and technical resonances across a wide range of people and can facilitate connections among them. These people can play a role of translation across ideas that might otherwise appear unrelated. They also provide coaching (as opposed to teaching) to help others both identify and achieve their goals. We should all be learning from these people so that we can all contribute to the connective tissue. This connective tissue developed further throughout the week. Specifically, the more semi-structured collective learning activities and the emphasis on working in pairs greatly increased the productivity across the group (there was more to show at the end of the day) and the interaction (fewer people with earphones in and more talking). I also observed many more small and big shared victories. I hadn’t yet seen a high five, and I saw two instances on Thursday, which reflected the overall sense that the victory was about more than an individual completing the hack; rather, it was shared and celebrated together.

This hack week performs as a kind of lab space where people can take risks and work together in new ways that they might not be incentivized to do otherwise. It is an opportunity to change the incentives for a short period of time. In fact, the frictions that we see emerge in this hack week (e.g., people needing to work towards publications) reflect some of the default incentives clashing with hack week incentives. For future hack weeks it might be important to advocate failing fast by normalizing it and facilitating a supportive environment for risk taking. In addition, part of the goal of a future hack week might be more explicitly to learn about how to work together and what it takes to develop connective tissue through incentivizing a range of different hacking interactions.