• March 28, 2016

    Meeting at Microsoft on Big Data Sharing Platform


    Last week Dr. Yinhai Wang, director of PacTrans, attended a meeting at Microsoft about trusted data platforms. Others in attendance included Mark Hallenbeck, director of the Washington State Transportation Center, as well as representatives from the City of Seattle and eCityGov Alliance. As many probably already know, the City of Seattle was not selected as one of the finalists for USDOT’s Smart City Challenge. The USDOT has pledged up to $40 million (funding subject to future appropriations) to one city to help it define what it means to be a “Smart City” and become the country’s first city to fully integrate innovative technologies – self-driving cars, connected vehicles, and smart sensors – into its transportation network.

    “But that shouldn’t stop us from continuing to try to make this a reality,” says Microsoft’s Brant Zwiefel. As part of Seattle’s proposal, Microsoft was going to develop a big data sharing platform that incorporated data from a vast network of agencies in an attempt to make analysis less expensive and more robust. This meeting was about how to continue pursuing this goal even though Seattle will not receive the USDOT grant.

    Zwiefel began by outlining several of the big challenges associated with data sharing. Many data collectors are set up to collect, aggregate, analyze and report in one step. That aggregated data then gets copied by other users, and with each round of copying it gets harder to tease out the raw data. Recently there has been a big shift toward leaving data in raw format and storing it in a “data lake.” Interested parties can then access the raw data and do their own aggregation and analytics downstream. The other big barrier discussed was that data holders almost universally worry about who will end up with the data if it is placed on a platform, because some of the data is inherently sensitive.

    Zwiefel went on to discuss Microsoft’s concept for a platform. At the base level, each group (i.e., the city, the university, the county, the private company, etc.) gets space in the platform to store and manage its own data and to set rules on who can access it (gatekeeping functions). At this stage there is a strong need to tag received data correctly, specifically with regard to sensitivity, which includes understanding the policies and promises that the data collectors made to the people the data was collected from about how their data would be used. Microsoft would then build “kitchens” into the platform for folding data into larger, useful databases.
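
    The gatekeeping and tagging step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Microsoft's design: the `Dataset` class, the tag names "open" and "sensitive", and the group names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """One group's dataset, with a sensitivity tag on every field (hypothetical model)."""
    name: str
    owner: str
    # Sensitivity tag per field, assigned by the owner when the data is received.
    field_tags: dict
    # For each sensitivity tag, the set of groups allowed to see fields with that tag.
    allowed: dict = field(default_factory=dict)

    def visible_fields(self, requester: str) -> list:
        """Return the field names the requester is allowed to see."""
        if requester == self.owner:
            return list(self.field_tags)
        return [
            name
            for name, tag in self.field_tags.items()
            if requester in self.allowed.get(tag, set())
        ]


# Made-up example: the city shares water use but keeps addresses to itself.
water = Dataset(
    name="household_water",
    owner="city",
    field_tags={"address": "sensitive", "gallons_per_month": "open"},
    allowed={"open": {"university", "county"}, "sensitive": set()},
)

print(water.visible_fields("university"))
print(water.visible_fields("city"))
```

    Here the owner sees every field, while other groups see only the fields whose tag they have been granted.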

    This is where it gets really exciting. Here you need to do two things: (1) make sure that sensitive information is not given to people who are not authorized to have it, and (2) identify common fields to link the data. As a very simple example, if one entity has data on household water use, another has data on household transit use, and you want to study links between water use and transit use, you need to be able to fold those sets together. The issue arises when the common identifier is sensitive information. In this example, the home address would likely be the only piece of overlapping data, so the address needs to be used to fold the data, but you aren’t authorized to receive the address. Here Microsoft has come up with some brilliant but simple options. You can encrypt a field identically in both datasets and then use the encrypted value to fold the data. The end user would see only a series of numbers that mean nothing to them, but they would now have a dataset that links water usage and transit usage to the same household. If you then need the original field, you have to push that information back to someone who is authorized to see the unencrypted field.
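
    The water and transit example above can be sketched as follows. This is a minimal illustration, not the platform's actual method: it swaps in a keyed hash (HMAC-SHA256) as a stand-in for whatever encryption Microsoft has in mind, and the key, the records, and the function names are all made up for the example.

```python
import hashlib
import hmac

# Hypothetical secret held by the platform and never shown to the analyst.
SECRET_KEY = b"platform-held-secret"


def pseudonymize(address: str, key: bytes = SECRET_KEY) -> str:
    """Map an address to an opaque token; equal addresses give equal tokens."""
    # Normalize case and whitespace so the same address matches across datasets.
    normalized = " ".join(address.lower().split())
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def tokenize(records, sensitive_field="address"):
    """Replace the sensitive field with its token before the data is shared."""
    out = []
    for rec in records:
        safe = {k: v for k, v in rec.items() if k != sensitive_field}
        safe["household_token"] = pseudonymize(rec[sensitive_field])
        out.append(safe)
    return out


def fold(left, right, on="household_token"):
    """Inner join of two tokenized datasets on the shared token."""
    index = {rec[on]: rec for rec in right}
    return [{**rec, **index[rec[on]]} for rec in left if rec[on] in index]


# Made-up records from two different data holders.
water_use = [
    {"address": "12 Pine St", "gallons_per_month": 3100},
    {"address": "48 Oak Ave", "gallons_per_month": 5200},
]
transit_use = [
    {"address": "12 pine st", "trips_per_month": 22},
    {"address": "77 Elm Rd", "trips_per_month": 9},
]

linked = fold(tokenize(water_use), tokenize(transit_use))
for row in linked:
    print(row)
```

    The analyst receives rows that link water and transit use for the same household through `household_token`, with no address in them. In this sketch, only someone holding the original records could map a token back to a household.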

    Many if not all of these functions could be built into the system so that they happen at the press of a button. More groups could then get faster and greater access to “unsiloed” data for analysis and reporting. Further, with regard to the security question, this platform allows for thorough record keeping and transparency about who took data, when they took it, and what they did with it. This trail of bread crumbs ensures that if someone inappropriately or maliciously gets ahold of data that they shouldn’t have, there is a record of it. This could be a huge step in the effort toward more intelligent transportation systems. PacTrans is excited to partner and see where this goes.
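
    The "trail of bread crumbs" idea can be illustrated with a small append-only access log. This is a sketch under assumptions, not a description of the platform: the `AuditLog` class, its fields, and the example users are hypothetical.

```python
from datetime import datetime, timezone


class AuditLog:
    """Append-only record of who touched which dataset, when, and how (hypothetical)."""

    def __init__(self):
        self._entries = []

    def record(self, user: str, dataset: str, action: str) -> dict:
        """Store one access event with a UTC timestamp and return it."""
        entry = {
            "user": user,
            "dataset": dataset,
            "action": action,
            "at": datetime.now(timezone.utc).isoformat(),
        }
        self._entries.append(entry)
        return entry

    def by_user(self, user: str) -> list:
        """Return every recorded event for one user, oldest first."""
        return [e for e in self._entries if e["user"] == user]


log = AuditLog()
log.record("analyst_a", "household_water", "download")
log.record("analyst_b", "household_transit", "download")
log.record("analyst_a", "water_transit_linked", "fold")

for event in log.by_user("analyst_a"):
    print(event["at"], event["dataset"], event["action"])
```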