Bob Gleichauf, Lab41, In-Q-Tel (DG'13)

Channel: Data Gotham Published: 2013-10-18 4,135 words Source: auto_caption

Transcript

So, I am not from New York, and I want to thank Hillary for allowing me to come and do a little introduction to something that I am here soliciting. But at the same time, I'm here to show something that may be of value to be replicated, which is what we call Lab41. You should leave this room thinking of it as a playground: a place where people can come together and experiment, see how things break, how things work, how they don't, and the fact that as soon as you finish building something you're probably going to have to rebuild it when the problem space changes on you, and that you need venues to constantly be preparing for the next thing. That's basically what I'm going to be trying to talk to you about today. So before I do this, I need to go back. The slides did not go in the order I want. Okay.

So where did Lab41 come from? It came from a place called In-Q-Tel. How many of you have not heard of In-Q-Tel before? All right. Thirteen years ago the CIA was headed by George Tenet. With his advisors, people in Congress, and people in the tech community, it was dawning on him that the IC was falling behind the pace of innovation in technology. They were in a PSTN-type environment, the internet was taking them by storm, and they were not able to keep up. A non-profit was formed at that time to create an investment arm of the intelligence community, to go out and track what's going on in the investment community and help the IC keep up with it.

It morphed over time. What the group really does: there are about 70 to 80 people at any given time working for In-Q-Tel. They come from a diverse range of backgrounds: VCs, tech, private equity, whatever it is. We're meeting with close to 800 or 900 companies a year, typically startups, figuring out what they do, matching what they do to the problem sets that the government has, and trying to figure out whether there are things those companies could do to accelerate development of features in their roadmaps or add extra features that
would be useful to the government. But the thing is, they have to be useful also to those companies in their existing marketplaces, because the whole point here is to leverage the very power of innovation that occurs in open markets, so the government doesn't get stuck in the classic system integrator thing where they get locked in with the technology, stuck with it, et cetera. That's the model.

I came from Cisco Systems four years ago. I was a CTO-type guy, and they brought me in, and I didn't fit real well, but I was sort of in-house to try to figure things out and stick my nose in whatever I want. About a year and a half into it, I was given a special project to work on. I worked on that special project with people in the government, people in In-Q-Tel, and people in industry. What happened is we had a model that worked: it's a pipeline of finding technologies, sourcing them, and delivering them to the government customers. But that's not sufficient. It's finding technologies and products, handing them over, and hoping that the government can use them.

Well, the problem is that, especially with big data (one of the skunkworks projects I was involved in was an aspect of big data), it becomes a systems problem: integrating what you have with all the other moving parts, the legacy stuff that isn't going to go away, the way that the async, horizontally scaling data infrastructure is going to work with your existing, highly vertically integrated stuff that's already there. All of that gets very complicated. That was one problem we were noticing.

The other problem we were noticing is shelfware. Somebody goes and buys something, it gets delivered, and it never gets deployed, because the very people in charge of deploying it were not consulted. They weren't factored into how you're going to successfully use it. So there are all these subtle dynamics, and if you're really going to succeed at scale on a big data problem, you have to start thinking of all these things. So we
wrote a little white paper that said: to get the full return on your investment in all the things you do, whether you're investing in products, whether you're going to universities, whether you're going through federally funded research and development organizations, you need this integration capability, and you need this new thing called a data scientist. What is a data scientist? It's this hybridized individual. So what we said is, let's start creating a venue, a playground, where we can bring people together from academia, industry, government, and the venture world (where In-Q-Tel is the primary representative), throw them in rooms or in a venue, and allow them to work on hard problems.

So what we tried to do is create this lab environment. It happens to be out in Menlo Park. In-Q-Tel's primary office is in D.C. We put it in Menlo Park because some of the decision makers decided they wanted a brick-and-mortar kind of place. I actually think it should be a virtual place. That's partially why we're here today: we're trying to extend the feelers of what the lab does and make this a virtual venue as well as a physical one. But for better or worse, there is a brick-and-mortar place out in Menlo Park.

We also tried to form it so that it doesn't overlap too much with good work that's already being done in startups, in academia, or in a place like MITRE or Sandia or whatever. So we call it a challenge lab. I had some success with this in industry before, where you take facets of a hard problem and you use that as a beachhead for wherever the work will go. You don't try to structure it too much and keep the bright people from really getting to what's important. It's sort of like in the old days when I was in school: I'd look up a word in the dictionary, I'd look at three other words on the way and get something useful out of it, and it might even change my whole focus. We want the lab to support that kind of innovation. So
we take facets of problems. Maybe you're doing a form of social network analysis and you have a simple problem with the graph database component. We take that as a starting point, we define a problem, and then we go find people and bring them together. So we're really big on begging and borrowing. We provide the venue. We don't pay anybody anything. We provide unclassed, open-source data, we provide clusters for people to work on, and OpenStack and things like that to manage it all. Then we just bring people together. I don't think there's anything relevant there beyond that, and I don't know if I'm going to go 20 minutes or over. We'll see.

So we're very ad hoc, full- and part-time, very iterative. The model is evolving. We're very much like a startup. We assumed initially, oh, we'll do a project, it'll be six months, we'll work on it for six months, and when we're done, we're done. Well, what's happening is we apply some process, and projects are of various sizes. We had a project we thought was going to be six months. It was an ETL-type problem. It ended up being four weeks all in, and we shut it down, because we realized by the time we had gotten into it that there were several facets to the problem we were working on. We could nail one facet that no one was addressing in the open literature and none of the companies were, but for a huge part of it there were three stealth startups we had discovered that were already doing it. So we said, we're done, don't waste any more time, and do a handoff. We introduced the sponsors of the challenge, who came from a specific agency, to these startups and let the In-Q-Tel machine take over, which is the sourcing of technology and products and helping fund those startups. That was the logical thing to do.

One of the subliminal lessons that was occurring here through the lab is: don't be afraid to change your mind midstream. Don't be afraid to change how you fund work. Just because you stood up a bunch of boys and girls to work on something,
don't sustain it because you have a bunch of boys and girls working on something. If it isn't appropriate, shut it down and refocus where you're going to get the best return for your investment. So fast fail is important.

Also, at the lab a lot of things break. In fact, in my history, when I was at Cisco, I did a lot of M&A as part of the work I did. What's important to me is not determining whether things fail. Everything has its design parameters, and it's going to fail. Particularly if you're running an enterprise network or you're a service provider, how you recover from failure is much more important to me than whether you can build something that doesn't fail. With a few exceptions: if you're running the space program, yeah, you want redundancy and all of that to eliminate failure. But in big data analytics there are a lot of cases where you want to understand how it is going to tail off, how your results are going to be affected by scale, step functions, all of that. So fast fail is very important to us, as well as fast handoff.

So what are we doing? And I apologize, I'm jumping around. The lab is an unclassed facility working on problems where everything is published via open source. We're trying to use the open source model to get around the intellectual property concerns everybody has when they collaborate with someone else. We're trying to say that when you come into the Lab41 environment, anything you say or do in that venue, whether it is the virtual email environment or the lab, is shared intellectual property. You may bring your proprietary stuff, and you may withhold certain aspects of your proprietary work, but what you do as part of the group will be published on GitHub. You can go to lab41.org and already see things we've started to do in our first six months of existence. We'll do a few white papers.

But the other thing here, and I'm kind of jumping around, is relationships. When we first started Lab41, it was done under the umbrella of: we're pursuing
technology, and we're going to understand how aspects of big data work and don't work. But the other aspect of this is that we're taking people out of government, out of the Beltway, and putting them in environments where a lot of innovation is occurring, Silicon Valley, and we have a venue that's pretty cool for people to hang out in. What we're trying to do is this: when someone cycles through the lab and goes back to their day job (they've been out there for a month or three weeks, or they come out for a series of three-week stints over six months), what they're taking back are the relationships. They know what experts they can get on the phone, or email, or text when they have a problem, and they understand the work dynamic of how it worked at Amazon, or how this latest visual analytics company is hacking and screwing around with things, which may not be the way it works back in the Beltway.

Through a slow process of osmosis, of rotating people through, we're trying to create a kind of insurrectional movement of the data scientists and analysts who get the opportunity to come through the lab. They go back, and when they're sitting in a room and someone says, "Can't do it that way," they go, "Well, when I was out in California there was this company doing X or Y, where they can do it that way." There are a lot of people, and it may not be the data scientist, it may be the guy or girl who's worked there for 15 years, who constantly say, "This is how it works, or doesn't." We're providing another way of saying, "Nah, that may not be so," so that they get out of some deadly embraces where they just don't make progress on things. I'll hopefully give an example of that in a sec.

So while we're doing this, you get this lab. It's unstructured, and it can be kind of scary if you leave it too unstructured. So we do have a conceptual framework for what we think the big data analytics challenges that we're working on should be,
and they go everything from classic data ingest to data sharing. Just to give a little background: this talk is broken into two parts. We're getting through the more formal part; the Bob G. part, my part, we may not get through.

But data sharing: if you think of any enterprise, they probably have lots of different databases, all islands unto themselves. How can you get a company to quickly start sharing all of it without reinventing the world? Companies already use networking, and they already rely on the Domain Name System, DNS, for example, to access all of their data. Why not reuse that scalable, robust system and add a URI, a universal record identifier, for data, so that you could use that system to resolve data across silos and treat it as a single global directory? That would be one way to quickly take the existing infrastructure and get something going. That's something we want to play with at the lab. That is a facet of a problem. You could expend a career on that. You could build a company around that. Who knows? I don't care. We want to play and allow things to happen from there.

Analytic services is a catch-all for machine learning and everything else that goes around with it.

Secure workspaces is an area that I don't want to get into deeply, but the way that people try to anonymize and encapsulate data may be dead on arrival. When you're crossing domains and joining data, you may actually get artifacts in your data or unreliable results, based on some preliminary work we've done. That is not widely pursued, and so we believe that the way secure workspaces and anonymization work may need to be rethought.

Data-driven analytics: somebody else said something about spreadsheets, the site that was sending out stuff all the time. In the government, Excel rules. It constrains very much how you look at data. That's not a bad thing. It works. It needs to be augmented, more and more, based on the plethora of types of data sources that are appearing, with data-driven methodologies. That even
affects how you store data and how you create your metadata forms for referencing the data. That gets into intelligent metadata, which is another huge part that I could talk about at length: we are trying to allow metadata to be where a lot of automation works, not the raw data, so that you invest much more heavily in a metadata layer to complement what's going on at the data layer.

I came from networking, and my primary role was security. Botnets were one of the hardest things to defend against. I actually would like to try to use botnets to drive analytics on our behalf. This gets into another thing that the lab does. Most of the people who come to us with a data problem say, "I've got 300 analysts, and they're doing X, Y, and Z." What we're trying to get them to understand is that you should actually be building for 300,000 agents working on behalf of your 300 analysts. What would be the kinds of work? Is it data pruning? Is it an aspect of accelerating the pipeline? Are there things that could be going on through automation? If you do that, you design differently. Your I/O changes. Your read and write contention across the cluster will change when you design for that kind of thing. People frequently design for their 300 users; they're not designing for the 300,000 agents that might be working on their behalf. That's the kind of stuff we're playing around with.

So, I have a degree in archaeology and anthropology. I did primatology. I figured I had to put something in here: one of my favorite tarsiers is this Philippine tarsier. I want to talk in terms of three metaphors. I've got a jillion metaphors, but I want to share a little bit, before I have to get off the stage, of the way we look at things.

ETL is a huge problem in the government. They collect data from everywhere. They collect from what are already large sources, and so they're a superset of all these other large sources. Data comes in
many times also in hard forms, and they don't know the content and they don't know its structure. They then have a problem: an analyst wants to look at something really quickly, and that analyst may sit cooling his or her heels for a week or two before getting access to the data, because there are these formal processes, which came from the formal enterprise vendors, about how you annotate and ingest data. It's dead on arrival. It needs to be rethought.

Instead of thinking of heavyweight data coming in that you can then look at, what if you brought it in like dust? You allow those pieces of data to come in and be addressable (remember my DNS concept), so that they come in as quickly as possible and are just referenceable or shareable, and you allow all the analytic processes (remember, not 300 analysts but 300,000 things operating on that data) to annotate it. Treat the data like dust bunnies, where it can reform and reshape. Don't predict, don't constrain how it looks. If you can get those cycle times fast enough for the re-annotation and applying of attributes to the data at a metadata layer, you might actually get a better quality of annotation than by relying on some front-end process with a cadre of forensic analysts, or whatever you want to call them, trying to presume they know what the right attributes should be, beyond basic provenance and stuff like that. So we have the concept of a dust bunny.

Another concept: the cup of coffee. I tend to measure my whole life in terms of cups of coffee. Everything associated with data these days is a cottage industry. It takes forever to do anything. The classic Target stores analyst who's doing sourcing of suppliers and things like that is waiting overnight for a data cube, the data mart, to be set up to be able to do his or her analysis. They have one cube to work on, and if the cube isn't right, they wait until the next day to get the next one. It takes too long. It's too heavyweight. We want to get to the
point where the time it takes to create an analytic workspace is the amount of time it takes to get a cup of coffee. Even if it's a French press cup, it shouldn't take that long. It should be possible to do it in such a way that you can create seven of these desktop analytic workspaces in the time it takes to get seven cups of coffee in parallel. It's been done.

Why is this important? Because it starts allowing analytics to approach the magic of what Google did with search. Google made the cost of asking a question near zero. That means I can ask a crappy question and end up with a good answer five questions later, because the cost of each question was so cheap it didn't matter. Analysts need to be freed to ask crappy questions, and lots of them, because the cost should be low. The cost is too high right now for us to ask questions of our data. The coffee cup metaphor permeates a lot of what we're doing. Another way to say it: you could say, "I need better scale and performance." Yes, but you need all the infrastructure to make it cheap to build these hypercubes. So a lot of what the lab is doing is the block and tackle of how you take a pipeline or a processing problem or an ETL problem that takes 40 hours and get it to less than 10 minutes. We've done work like that.

The last thing I want to give, in the little time I have, is a maze as a metaphor for a huge part of the focus for us. For many of you, I suspect, in the work you're doing the data comes in well structured, and you have a model that you're applying that is more or less well understood a priori, and then you're optimizing that model in various ways. If I go to Twitter and talk to the people there, that's basically what they're doing. They understand their models, and they're figuring out how you propagate Lady Gaga's tweet to 13 million followers faster. Many people in the government don't know the model for their data a priori. They're doing model discovery. When you walk into a maze, you don't
know the shortest path a priori, and there are various ways of solving for the shortest path. You have to understand that, for the constituents we work with, we're trying to help them do model discovery at scale. It's a metadata problem, it's a processing problem, it's a jiu-jitsu problem of the data, and you need various techniques to do that. Shortest-path solutions to mazes are a well-understood problem. It is a graph theory problem.

What I want you to understand, if you ever come talk to us at Lab41, is that we think of everything as graphs. We think graph databases need to be taken and expanded in their capabilities dramatically, because once you can start applying graphs at the billion-node scale and the billions-upon-billions-of-edges scale, it frees you up to do certain kinds of analyses that are much harder to do with non-graph databases. So if you go look at the blogs we're posting on lab41.org, which takes you to our GitHub blog area, you'll see all the kinds of things we're doing, including the challenge of how you get data to test graph databases at scale. We're investing in synthetic graph data using something called Kronecker graphs that you may find interesting.

I guess, given the time, I have a counter that's changed on me. I think I have a little more time than I thought I had. I'm going to end with just this: I'm here to invite you to come work with us. Check us out. I know it's New York and we're in Menlo Park. We're also in Washington, D.C. If there were enough interest in this concept and in us establishing a toehold in New York City, I would take that under advisement and go pitch that back to my minders. We are all about begging and borrowing any type of talent, whether you can program or whether you can come talk pretty to some of our customers and help them understand what we're doing. And with that, I think I'm gonna
