RAPIDS Academy - Out-of-core SQL (main)
Transcript
so for everyone who isn't familiar felipe is our cto and kind of the original architect of blazing sql and and everything that we started doing about i want to say like six years ago when we started playing with gpus for the first time um what he's going to talk about today is a little bit around our out of core um technology which basically enables us to scale up query execution beyond what uh what what's available in gpu memory um so yeah um that that's that's basically it so felipe you can um you can take it away and yes real fast for any of the the questions you can ask questions here in the chat there is actually a q a feature inside the zoom webinar so if you are inside zoom uh the best thing for you to do is to submit a question to the q a platform you can see it at the bottom of your of your webinar feed and that way we can manage the questions and answer them but if you if you can't find that the chat also works but the other is definitely better all right felipe with that it's your session mate all right um hi i'm felipe um as rodrigo said i'm the cto at blazing and what i want to do with everyone today is um run a couple of simple queries um look at their uh execution plans in blazing see what that graph looks like and get an idea of how blazing handles out of core um basically what it is that blazing does when we start having memory pressure right when we start running low on resources um that that's really gonna be like the the majority of my talk is i'd like to explain to you all of you um how we handle this and why we've made some of the decisions that we've made and of course to answer questions that you all might have about that so quick introduction that we always like to give people about rapids is uh rapids is a python ecosystem that helps accelerate um data science etl workloads with gpu accelerated libraries uh blazing sql is a sql component of that um but there are many other components uh many of them are open source and uh and right here we can kind of see like some examples of what those packages are for those of you who don't know um there's a we we all work off of the same in-memory gpu representation of data which is based on apache arrow and so all of the libraries here and many others including graphistry for example can interact with that gpu representation natively and don't have to do any serialization or deserialization to move information between blazing sequel and the codif function or a cool ml uh training session that you might want to have um yeah so that that's kind of like a real rapid introduction blazing sql is the sequel engine in this ecosystem we are the way that you could interact with these data frames on sql and uh the the this is kind of what not kind of this is exactly what it looks like to run uh blazing sql queries um you make a data frame you you don't have to only make a data frame you can also just pass the files directly to the create table statement or pass it a list of directories or use wildcards and what have you but the idea is that you create a table based on files right files that exist that you have somewhere and then you can run queries against them um and the result of this query is a codif it's just just like you know pandas like the when you run this query the output that you get is a a data frame in codif that has you know it's just a normal pdf um the output of this is the same type as the type df up here right the output of bc dot sql is the same as the output of data frame um we like sql for this kind of data manipulation because of many different reasons right it's accessible um concise um it's powerful it's optimizable it's much easier to optimize than than for example just straight up python code um and and we did a few things that that are gonna kind of help make this uh scale a little bit better and that will necessitate the need to be able to do out-of-court processing and that said we we access that data where it lies right so s3 hdfs we go after your information where it's lying so it can actually scale to very large storage systems um we look at it how you store it right it's if it's an orc or parquet csv we don't care we're reading the information in the format that you picked and we're trying to help you guys be as big as is needed um so the way that the blazing uh works is that when you start out with uh you start out with something that's like a query which gets converted via apache calcite into a relational algebra plan that relational algebra plan gets converted in blazing into a physical plan and that physical plan generates a graph that looks something like this right so we'll have um a scan here is a table scan and in between the project that's about to come after it which does something like row-wise operations like let's say i'm going to scan table a and then i'm going to sum the first and the second columns right the project would have the operation that does the summing of those columns and the scan is what actually reads the files from you know disk or from hdfs or which brings in a qdf data frame from another process and in between the scan in this project is a cache node um this cache node basically provides the the following functionality when scan generates a data frame that needs to go into the next kernel into the next project um we don't always know how much space we're going to have how much space we're going to need how much that project might explode so we we need to be able to kind of monitor our resource consumption and when things are pushed into these caches these green boxes we want to be able to to decide where they're going to go uh for temporary storage right so the project might not need it right away right it might be busy processing some other chunks so something that can happen in blazing is that in these green squares when a data frame comes in it can get moved either to the cpu or the disc it can also stay in gpu but uh it can basically get moved into any one of these three layers in order to be able to have um access to to more resources so that we don't have to rely just on gpu memory for our all of our interim materializations for the the the inputs and the outputs we can actually run this analysis um um by uh we can divide up these really large data frames into smaller pieces so that we can make sure that everything fits uh fits in memory so that we're able to you know solve bigger and bigger problems um i'm trying to see if i have any other points here nope that's it so i think after this what i'd like to do is show you guys what this kind of plan looks like in uh in blazing um by going into app.blazingsql.com and this is the point where um you guys can open up the the notebook that has been shared with you so if you've uh cloned that notebook i'm gonna give everyone just a minute or two to to get this open and then we can we can get started running uh running these queries and seeing what the relational algebra looks like and and whatnot right that's a good idea right we should wait a second here yeah i was i was trying to get myself off mute so if you haven't already you can clone the notebook via that command that i put both in the zoom chat i'm gonna put it one more time i don't know if people that just joined can see the history of the chat so i'm going to put it one more time in here it's also in the rapids go ai you can execute that from a notebook cell if you take out the exclamation mark you can just open up a terminal window in app.blazingsql.com and run that in the terminal um real fast also just to let people know where that notebook is because that notebook is a couple of um it's a couple of directories in you should see that there's a tutorials and then in tutorials there's multi-gpu and there you should be able to find um this notebook which will be called a little bit different than what felipe's is called it's called bsql underscore ra for rapids academy underscore tpch sf100 so if you can't find that right now please either raise your hand or chat us directly inside slack or here so that we can we can support you and philippe you don't worry about that leo and i or i can i'm monitoring that slack i was about to go check and i think no i'm not supposed to do that right now don't worry about it um so let's give people just one more minute real fast and then and then we can go over it um you know what while we're while we're waiting this is a perfect time for a question that came in so one of the first questions is why would i use blazing sequel if i can do things i want in pandas or qdf sure so there's a couple of answers to that question um one it's a lot easier for us to optimize sql and to avoid many interim materializations many places where if you use pdf you're going to be going to global memory where we can basically fuse those operations together the other point is that we handle out of core a much more uh i would say with a bit more robustness than happens with just uh codif for example when you're running when you're making codif uh when you're trying to do das codif stuff that's out of core it works perfectly fine um if it can kind of divide up that problem beforehand and divide up the outputs but your interim steps need to fit in gpu memory or it will go um and give you a big complaint um we don't have that issue the other thing is that for me i can express more data manipulation in fewer lines of code in sql than i can in uh in in what you call it in pandas most of the time that might not be true for another person you might be some pandas guru um that's that's not me yeah so just to add to that it's kind of a pick your poison that's what's so great about the data frame you can either go pandas like api or you can go sql i think that's enough time to give people felipe so why don't you um why don't you jump in and i'll try to triage anyone who's having issues sure oh and tell everyone how you opened up that gpu resources thing because someone literally just asked about it so it's right here you click on the system dashboards on the left it's the little thing that looks like i don't know what that looks like a gpu it looks like a gpu that's what it's supposed to be ah nice and then you click on gpu resources and that'll open that up great so uh i mean first thing we do usually is kind of see hey here we go we've got uh we got a t4 running uh that that's the gpu that we're gonna be running this on um and then the next step is that we're creating our context here um we're basically limiting the amount of memory that we're going to allow the context to use we're going to allow blazing to use and this is going to force it to start using the cache earlier than it normally would um just so that we can make sure when we look at the logs that we are definitely going into the cpu um when you create a here i'm setting pool equal to true i'm going to just it doesn't matter i left i think that you all have that um it it's not gonna be very important for this uh for this demo here but uh we'll go ahead and initialize that and um once we do that our context is ready right so that means that blazing is now initialized you can do things like create table register file systems and run queries um the first step um after we create our context is to register the s3 bucket that is going to have the data that we're going to be referencing um so we'll go ahead and do that and kind of basically just let us know that it was able to register that bucket successfully and then here we're gonna create tables based on all of this uh on all on all of the data that we have in s3 i never explained what kind of notebook we're looking at so let me back up one second this is a tpch demo so tpch for anyone who doesn't know is an industry standard benchmark for databases it's used to kind of see how databases perform in a wide a variety of situations and here we're looking at a tpth benchmark which is based on a 100 gigabyte data set so we're using um about 1.6 gigabytes of memory of this gpu or 1.5 so it's using 10 of whatever the total memory here is um and we're using that to query a 100 gigabyte data set um so first thing i'm going to do here is now that we've registered the s3 bucket we're going to go ahead and create tables um this is this kind of data is kind of like retail ish data right it's like oh i have a bunch of orders and each order has line items on it i've got customers parts and things like that um geographic information um this takes a little bit of time we're working on optimizing this by being able to leverage the underscore metadata directories that uh parquet sometimes offers but for now it's it's gonna take a little bit of time but after 23 seconds it's gone through and read all the metadata on those tables it knows where all the row groups are these are part k files if i'm not mistaken um and now that we've done that uh we're going to want to run our query so this is what a blazing query you know looks like it's just sql um the kind of sequel that that you're used to um you'll notice that i'm printing out the explain of this query so we want to look at the relational algebra um to get kind of an idea of like what it is that that that is happening in the graph and what that's going to look like uh compared to to some visualizations we're about to see so we're going to go ahead and run this i probably should have started this before i explained all that because this one takes a little bit of time but um you should start seeing both the the algebra and a result for this query after about 30 30 or 35 seconds so what blazing is doing here is it's taking all these files right we see that there's these three tables customers orders and oh i think we might have run out of memory because there's a bunch of people hitting the machine i'm on let me just gonna refresh here see if it puts me on a machine that's not quite as in use go ahead and recreate these tables and run this query again app.lazinessequal.com is uh uh something you know it's it's not new but it's definitely something that we've been uh working on a lot recently so we're we're definitely expecting to see a lot of improvements um and better scalability as more and more people get logged on to these uh to this shared resource i'm just going ahead and recreating these tables and running this query again i'm hoping that we don't run out of memory or something um yeah are there any rod you can tell if there are questions right so if there are any questions that you want to throw in while we're waiting for for results it's always helpful sorry i disconnected for a second what was that what did you ask are there any questions yeah yeah there are any questions i've got more questions yeah so how does blazing sql fit my use case with all data being stored in a postgres database it doesn't fit your particular use case at this time you would want to export it but rodrigo is going to say something better i think so yeah yeah so real real fast um like felipe says yeah you'll want to output that to cody f or pandas ideally or even arrow uh to then be able to run blazing sql queries off of it our long-term vision is to be able to create tables off of databases one of the first examples of that is with hive which we built out for our customer so for example you can create tables off of hive cursors so that you can query directly on databases but we don't have anything like that for postgresql your query just ran those so you might want to talk about it yeah so here uh we have two outputs here um this is the the relational algebra that we got from calcite um that we actually used to generate the graph and execute the query and here's the query and it ran um in 40.1 seconds that's the wall time these times add up to more i am still trying to figure out what this is based on um not very successfully but like i've timed it before with a clock it's the wall time is the accurate time in terms of how long this took so it's about uh you know not a huge result so we got about a million rows after you know processing all these where clauses and and ordering and whatnot um but um if we look real quick at uh at the logs and that'll be right here there's actually a place where we can see uh hold on actually this might be the wrong log there we go sorry so we can see here that it's actually you know caching a lot of this information into cpu i'm gonna see if i don't think it would have had to have hit the disk oh look it did the disk too so here it's cached both into disk and cpu so it's been able to go not only bigger than um gpu memory but system memory as well um so one one we're gonna take a quick pause on this demo and i'm going to kind of show you what this looks like on on the blazing side so here we have that relational algebra on on the left i'm going to make that a little bit bigger that you all can see and um the way that the relational algebra works is it it goes from deepest to to like it you start processing in the deepest note and kind of work your way up so this blazing table this bindable table scan here maps to this one right here this this top yellow circle um and um and so all of these kind of nodes in the relational algebra get converted into one or possibly multiple nodes um over here on the right hand side like you'll see that here we have a logical join that actually gets divided up into a partwise join uh and and followed by uh a project and a merge sometimes um here actually i think it's missing one of the nodes but that's that's just a bug in the visualization in the actual query it exists um so what we can see here is that you know from 100 uh from 1000 milliseconds to 2000 milliseconds of this query execution so in the first like in the second second of this query we've got 168 gigabytes of information being read from this customer table right at the same time in this other scan which corresponds to the line item table we have 167 megabytes of information coming in there and it's caching data why can't i scroll down there we go with an arrow key i can and it and this movement so the the the circles are the actual transformations here they map to these kernels that are actually making transforms and in between every kernel is this cache right and that that's that cache that we saw earlier is this this cache which allows us to move information into and out of the gpu as um as it's been read in so these arrows represent the speed that it's actually moving through the cache um we need to do a little bit of updating to the the thing that generates this graph in order to be able to visualize both the input speed and the output speed which can be different because one is the speed that it cached and the other one's the decaching speed um this is actually both put together so it'll be like half as fast as it'll look slower than it it's actually operating um but um we can actually go and see kind of how information starts flowing right so uh between two and three seconds right it's it's scanning both the orders and the line item table um there's a lot of filters on the orders table so it's probably able to get a lot more throughput that's why you're seeing the the higher the higher numbers there in terms of processing but that that kind of explains how you know like here we can all when i'm hitting these next and previous buttons i'm just going through different pictures which are different slices of the graph at different points in time so like between seconds five and six this is everything that's happening in the engine at at the same point in time you'll notice that um it starts becoming faster as the query is running that's because as the query is running it's gotten estimates from the first couple of chunks and it's able to make better decisions in terms of scheduling how it how how it moves information through this cache it'll start being more intelligent with how it um how it divides up partitions how it can combine partitions when it's doing things like um joining or projecting and when whenever it goes into the cache some of these caches actually concatenate things together to prevent things from getting too small or to prevent data frames from getting too small and we'll split them up when they're starting to get too big so that and this is that that first query that we ran um yeah so that that that kind of gives you a quick i hope you guys aren't seeing this right like i'm not covering it up am i hopefully not well not a lot of feedback when you're giving a live session yeah it's good you're good you can see it awesome so going back uh to the demo uh here we have another query and i'm guessing as i see this memory utilization go up and down like that i'm guessing that uh some of you are uh running uh out of memory and that might be causing some of those uh dips that i'm seeing if that's the case uh we will definitely uh work to make sure that this doesn't happen as often in the future i think we didn't scale up our our node right yeah there it goes it died so i'll get back i think maybe maybe i'm gonna go to well no we should we should do this again i guess so i'll uh i know what i'm gonna do i'm gonna open this up in incognito mode and get a new uh get put onto a different server because i might be just put onto a server that's particularly slow and that let's see and yeah i think this is going to be a better experience all right yeah we got okay we got lucky here we got put on to a new one um so here we shouldn't see the out of memory issues that we were having before so i'm basically just going through running the same cells that we ran before to create the tables i'm going to skip this query since we already ran it and i'm going to go down to self uh to this cell here where we're running query number four um it's basically uh you know a nested query where i'm running a very large query on the line item table so this is actually a pretty this is this is all the data in line item and then it's going to order and group by whatever comes out of that just clarify line item is the largest table inside the tpc-h data set and represents about 90 or so of the or i think it might be 85 of the total data so that one's like 85 gigs or so yeah so we've already created the tables and now we're running query number four and it takes i can't remember how long this one takes but it takes like another i think this one's like 35 40 seconds this is actually a lot of data right um um so we're gonna wait for that to finish executing um while you wait there's a great question go ahead so when i run blazing tpch testing my own gpu server with 36 cpu cores i noticed that a quarter of the cpus are also busy can you explain a bit what is done in the cpu sure there's a lot of things that have to happen on the cpu from reading and interacting with metadata to sending messages across tcp performing optimization calculation calculations doing things like tracking how many chunks are coming into a kernel so when you look at like this uh this kind of graph right so we have awareness and are pulling these different kernels to know how much uh how many partitions are they processing how many bytes uh on average is each partition so there there there's all kinds of calculations that are happening to keep this graph operating as efficiently as it can and there's also um you know like like the network work and and things like that there i can say that there's no places where the cpu is being used to look at long buffers right so the cpu is not being used on like long buffers or to do big uh calculations so here we ran another query that um you know very similar kind of situation here we can see what the relational algebra looks like um and we we have our results that are in about 42 seconds and this is on you know a table that i think it's tvz is 75 percent of the data so it'd be 75 gigs it's a 75 gig table and we ran that on you know a single t4 um and caching on both the cpu and and disk from gpu um and here we can run our last query um this is uh one of the more simple queries it actually has uh it's it's going through line item making a uh taking a a year of information out of it and then um so it's basically doing a big filter and then just aggregating that right so this is kind of a nice query because it kind of gives you an idea of what this plan looks like right so this bindable table scan already comes with filters on it so it's as it's scanning data from the database it's running filters on that information and when that before that goes to the logical project a lot of that is going to get cached on uh on cpu until it's needed by this project to actually do the multiplication right um so uh yeah between all of these different steps right it's able to to cache the interim results on in in any of these different places um now i want to show you all a demo that uh i guess rod are there any questions so far about this that we've run so far that we need to address before i go to kind of showing what this looks like on on your instance real fast just a great follow-up question so following on on on that cpu question what configuration of cpu to gpu should should people look for in other words number of vc virtual cpus versus gpu cores um that's a very interesting question for someone to ask me right at this moment because we are refactoring um how our communication works a little bit to move it more uh more of it to basically moving it to ucx and our ucx version of the engine has a lot less thread pressure on the cpu side it basically needs much fewer threads to accomplish what it was doing before so that optimal ratio is kind of like difficult to to pin down um i would say somewhere between you know two to four physical cores per gpu i think is often a good trade-off um the more gpus that you have the more you can kind of fund on the number of cpu cores that you have because they can you know share that burden a little bit better if that makes any sense there's there's more overlap felipe maybe also um what we find in practice like graphistry is the memory also matter maybe you can can talk about uh cpu ram versus gpu ram and disk maybe even too sure so um i mean the biggest differences are really the the the throughput that you get between any of these different systems right so you're you're you're you're talking a wild um like i don't even know what it's at it's like 800 740 gigabytes per second always getting higher is the memory band that you have on the gpu right and on the cpu you you might have anywhere between you know an order magnitude less to maybe 4x less that that kind of bandwidth on the cpu itself depending on if you're on an intel 86 or or power 9 system or some other kind of you know fancy hpc uh hardware that i see out there that people get real creative with that stuff um and disc is obviously by by far right now uh the slowest though we're always you know um um there's always like new technologies and interconnects that are being developed that might be able to to increase the throughput between things like you know discs and gpus um especially specifically high performance like nvme drives um all right so i'm gonna go over to the last kind of demo here um just to kind of give you guys a little bit of an idea this is running on eight gpus and a dj x2 uh they're v100s uh we're going to be running this on a one terabyte data set uh the data set that we're using is called tpcxbv it is brought to you by the same people that brought you tpch um and it's exciting because it includes machine learning type stuff and and all all kinds of fancy algorithms that you uh that you have to implement it's it's been a wild ride um getting all of these queries to work uh in blazing with the combination of blazing and das qdf uh but that's uh that's what we're gonna look at so here you can't follow along because like i said this is running on a um on a much bigger cluster um that we've already set up um so i'll i'll just go ahead and kind of like you know run these queries so this is coming from felipe real fast why don't you explain how blazing sequel becomes distributed which you kind of have in that first cell yeah yeah all right so right here i'm uh i'm what i'm doing is normally i would use this cluster equals local cuda cluster but i think i need to submit a github issue because right now it's scheduling all the das workers on the same gpu um so what i did instead was um it's actually pretty simple let's see if i have the where did i put it right here we had a tiny little script which uh runs the dash workers manually and assigns a different cuda visible device to each one so that i have eight workers right um and so and i attached those workers to a dash scheduler this is just a url for the das scheduler right um down here somewhere is uh we're not going to look at the terminal that's not interesting to anybody but down there is a das scheduler which has this ip address it's on the local uh network interface we're doing this on one node and could just as well be multiple nodes just for this test we're doing it on eight gpus and a local node and um after i create my client kind of you know the way that you're you know however you make a dash distributed client right that's that's it i can pass that client to the blazing context and now whenever i run these queries what we're going to get back here um this result is going to be a dash data frame i i can show you how that works in a minute um and here i'm initializing a pool um saying what the pool size is going to be um right here i'm just you know putting in the the directory where all the where all this information lies this is something that's going to be used in the query and after i create my tables which are what is it workers what's wrong with workers does not have workers give me a second hopefully i didn't pose something here so i might have it's very easy to do just kidding oh i might have timed out from uh my session on the dgx um let me maybe stop sharing your screen for a second depending on whether or not you have passwords or anything nope oh all right never mind you're good no but i do need to restart my uh scheduler for sure there um so yeah you'll get to see how it is that we do these exciting things so i'll go into my condo environment and here i'm just going to do scheduler interface slow and [Music] i'm going to put some port that i don't think of 9005 all right and then over here just going to connect to the same place and i'm just going to run them again all right so there i've started up the workers great they're all there all right back to the demo sorry about that now we're reconnecting and we should be able to see you know a client exists making sure this all running it'll take a few moments to initialize there we go all right now our context is ready all right so we're going back in and now we're creating the tables uh based on these files does it pretty quickly because these are local it didn't have to go to s3 to find them all and all right so here we're running this first query that we're running is uh some nested select distincts with some other select listings around them that um this is basically a join between these two different things right and this just ran on a one terabyte data set using some of the the biggest table in tpc xb is this web click streams if i'm not mistaken and is able to run that in 13.2 seconds on that that gpu and if we look at results it's a dash data frame right wow look this is a normal dash data frame with eight partitions how exciting and um you know if we do dot uh head well i guess that's what the command below is you know we can actually see uh the information that that lies therein right um so uh i mean it's it's it's it's pretty easy you know to to just point yourself at these tables and you know run these very large out of core capable um queries on you know a one terabyte data set on in this case eight gpus um here um we have another query that's uh little bit more uh complicated there's you know some some nested selects um a little bit of aggregationing happening up here and some joins and there's just there's a bit more happening right um but you can see um we'll look here at they can still run this query very quickly and you know just to have a little bit of fun why don't we in seven seconds why don't we explain that query so we kind of see what's oh i need to print it if i wanted to have the slash ends there we go so it's actually a pretty complicated plan that it's able to execute here a lot of joins going on right as you can see um and and it's able to uh you know process that actually quite quite quite quickly um and be able to leverage you know other you know not just gpu memory for both uh for for the interim processing and for the inputs and the outputs which is to us very very very exciting um that i think is it for that demo i um kind of want to mention a few things real quick that you know this is uh all these efforts are community efforts like we we've built this together we've done this together um with a lot of different people uh people like leo who's here um you know the death project all the people at nvidia that work on rapids uh you know we've all been working on this stuff uh together and uh it's it's been a very exciting ride and we're always looking for more people to go file issues on github right please file issues file issues that's very useful to us contribute when you can you know you want to throw in a new file format have at it like we'll help you we'll teach you we'll tell you what needs to be done um so we're we're definitely looking to to for always to get more help in in building this project that's that's a a great way that you can support our efforts and um yeah and i kind of wanted to mention as freudian usually does you know you can see us on anaconda um if you want to uh uh on the uh blazing sql or the blazing sql nightly dash nightly channel um you can look at us on blazing sql notebooks by going to app.blazingsql.com and you can launch your own gpu data science workstation on the aws marketplace just look for blazing sql in the marketplace any any questions i