Ori Rafael knows the pain of trying to query data from enterprise data lakes for analysis. Time to value is slow when you can only access the data after engaging the engineering team and waiting for them to spend cycles pulling something together for you. That’s why he and his cofounder started Upsolver, which makes data immediately accessible to more than just the data engineers.
Ginette: I’m Ginette,
Curtis: and I’m Curtis,
Ginette: and you are listening to Data Crunch,
Curtis: a podcast about how applied data science, machine learning, and artificial intelligence are changing the world.
Ginette: Data Crunch is produced by the Data Crunch Corporation, an analytics training and consulting company.
We’d like to know more about you, our listeners, and what you’d like to hear on the show, so go to datacrunchpodcast.com/survey to take our easy six-question survey and let us know your thoughts.
Now onto the show. Today we get to hear from Ori Rafael, CEO and cofounder of Upsolver, a cloud data lake ETL platform.
Curtis: I want to start off with a brief synopsis of who you are and your history and why you are working on the problems that you’re working on.
Ori: The best way for me to define myself is a data guy. I started my career as a DBA working with both Oracle and SQL Server. I spent several years in Israeli intelligence, where I eventually managed all of their data integration platforms. That’s also where I met my cofounder, Yonnie, who was the CTO of, I think, the largest data science group in the Israeli army, a very strong data guy. We tried to solve a problem in advertising that had very high scale, and we needed to analyze the data in real time, and a data lake was our only option. As people who work well with databases, we were kind of frustrated with how long it took us to iterate over data in the data lake. Every step we wanted to take was not “let’s just query the data like we’re used to”; it was “let’s start a project or a sprint, give some data engineers a spec, and they will develop it for us.”
And only then would we get access to the data. And then we’d find out it’s not really what we want; we need something different. That’s another sprint. So the slow time to value of working with a data lake on that advertising dataset inspired Yonnie and me to create an easier way for people who understand databases. Let’s say I’m a DBA and I want to work with the data lake: it should be a familiar experience coming from databases, and it should be easy, SQL-based, the language of data, instead of so much engineering. That specific experience, our own personal pain, is what inspired us to build the Upsolver you know today.
Curtis: Interesting. And I want to get to Upsolver, but you mentioned a data lake, and I just want to dive into that, since that’s a term people hear a lot, but I want to make sure people understand what it is and how you define it. So talk to us a little bit about data lakes.
Ori: Sure. I think the best way, maybe the easiest way, to describe a data lake is that it’s a decoupled database. If I’m taking a regular database, there is the storage part, there is the metadata, and there’s all the compute to do the queries, and they’re all bundled into the same box. If I go with a data lake, I’m basically decoupling those three into different parts. And that creates, I think, two distinct advantages. One of them is cost. If I want to store data on Amazon S3, which is, like, Amazon’s data lake, then I’m going to pay $23 per terabyte that I store per month. If you’re looking at database pricing, you can sometimes get to $10,000 per terabyte for storage, because you’re also paying for the server and you’re paying for the software.
So creating a very cheap alternative for storage is the first thing you want to do, especially when you’re looking at event data. Let’s say I’m capturing all the ads, all the IoT data, all my logs. I don’t necessarily know how I want to analyze that information today, so I just want to store it in the easiest way possible and analyze it later. So cost would be number one. The second is the lock-in you usually have with a database vendor. I was an Oracle DBA. We used Oracle extensively, and you needed to get to a very extreme use case for us not to choose Oracle, because I would have needed to build replication from Oracle, where I have all my data, into a new database in order to use that database for another use case. So it was a big hurdle to get over every time you had a new use case, and that’s the reason I needed to solve it with Oracle even if Oracle wasn’t the best fit for that specific use case. But with a data lake, that’s not the case: since I’m using open storage and open metadata, basically any engine can start running and analyzing that data.
So you get the ability to choose the best tool, and usually you’re not just choosing one. If you look at the data stack of companies that work on top of lakes, you’ll see four or five tools, each one used for its own use case. You’re going to use Presto to do your research, maybe you’re going to use a warehouse to do some faster BI reporting, and maybe I want to use Elasticsearch for log analysis, so I’m going to ETL the data from the lake into Elasticsearch. And in addition, maybe I have some data science I want to do, so I also want to send the data to Spark. Accommodating so many different use cases on top of one cheap storage layer is the vision for a data lake, at least in my opinion.
Curtis: Got it. And we’re talking about distributed file systems, right? Or is there more to it than that?
Ori: Yeah, definitely. The storage piece of a data lake is distributed; it’s object storage. So it can be Amazon S3; traditionally it was HDFS when we’re talking Hadoop and more on-prem; and you have Google Cloud Storage, Azure Blob Storage, or Azure Data Lake Storage. Those are usually the main suspects.
Curtis: Got it. Okay. So that’s cloud data lakes, right? A lot of people are moving toward the cloud. And I think this is where Upsolver comes in. There’s a challenge, once everything is in the lake, to then get it out and actually do something with it, right?
Ori: Yeah, exactly. So I told you the positive side, where I decouple the database into three pieces, but there is a cost to decoupling those three pieces. Let’s say I have the storage part. With a database, I just did insert, update, delete; I didn’t think about how data is stored. The database had its own internal file system, and it optimizes that file system, because IO is always the enemy of the database and you want to minimize IO as much as you can. But when you’re using a data lake, you as the customer are in charge of how data is stored. You need to think about the file format. You need to think about the file size, the compression, how the data is going to be partitioned. So there’s a big list of best practices, and those best practices are basically low-level engineering that you need to do.
That’s what creates the complexity. I need to manage the storage. I need to manage the metadata, and I need to make sure the metadata aligns with the storage I have, because if they don’t, I just won’t be able to query the data. And then when I’m doing the compute, let’s say I want to transform the data, the way for me to transform it would be to use something like Spark or MapReduce. So I need to write code. I need to know either Java or Scala. I need to understand distributed systems. I need to do some DevOps. I need to think about scaling. I need to think about checkpointing and how I’m going to manage state. All of those are the reason there’s a dedicated person, the big data engineer, who takes care of all of those aspects. But while they’re doing that, the users can’t really work with the data. That’s the pain, the pain we had before starting Upsolver.
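To make those best practices concrete, here is a minimal sketch of the kind of table definition they imply, in Hive/Athena-style DDL; the bucket path, table, and column names are hypothetical examples for illustration, not something from the interview:

```sql
-- Hypothetical event table over S3, applying the best practices above:
-- columnar format (Parquet), compression, and time-based partitioning.
CREATE EXTERNAL TABLE ad_events (
  event_time  timestamp,
  campaign_id bigint,
  user_id     string,
  event_type  string                          -- e.g. 'impression' or 'click'
)
PARTITIONED BY (event_date date)              -- partition by time so queries can prune scans
STORED AS PARQUET                             -- columnar file format
LOCATION 's3://example-bucket/ad_events/'     -- open storage; any engine can read it
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');
```

Because the storage and metadata are open, the same table can then be queried by Presto, Spark, or any other engine that reads the catalog.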
Curtis: So tell me how you started building Upsolver and what it’s grown into.
Ori: When we started, I think the first problem we solved was the indexing issue. With a data lake, you’re kind of always doing full table scans: you partition the data by time, and then you scan a block of time, in many cases in parallel, and then you get to a result. But sometimes you want to do an operation that requires an index, like a join. I talked about advertising: say I have a stream of impressions and a stream of clicks, and I need to join them together to calculate the click-through rate. How do I do that? Indexing is really missing. So that, I think, is the first piece we developed: a low-level building block that allows us to do joins by key. Then we started adding more and more pieces until we supported all the building blocks of SQL, so we could support the SELECT and the FROM and the WHERE and the joins. Think of it as every word in SQL being a building block in Upsolver.
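As an illustration of the kind of query that needs that key-based building block, here is a hedged sketch of a click-through-rate join over two event streams; the table and column names are assumptions for the example, not Upsolver’s actual schema:

```sql
-- Hypothetical CTR query joining an impressions stream to a clicks stream.
-- Without an index on the join key, each side forces a scan of the lake.
SELECT
  i.campaign_id,
  COUNT(i.impression_id)                      AS impressions,
  COUNT(c.impression_id)                      AS clicks,
  CAST(COUNT(c.impression_id) AS double)
    / COUNT(i.impression_id)                  AS click_through_rate
FROM impressions i
LEFT JOIN clicks c
  ON  c.impression_id = i.impression_id       -- the join key the index serves
  AND c.event_date    = DATE '2021-01-01'
WHERE i.event_date = DATE '2021-01-01'        -- time-partition pruning
GROUP BY i.campaign_id;
```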
Curtis: So do you take the SQL that a DBA would write, and then are you, I dunno, maybe translating that into Spark code when it’s appropriate, or other code when it’s appropriate, to then fulfill those queries? How does that work?
Ori: We are actually not using Spark. Spark is a very broad platform, and usually I need to code in it. We built building blocks similar to a database’s, but we developed them on top of object storage instead of using a database. So Upsolver sits somewhere between being a data transformation platform and actually being a database, because we offer all the transformations a database would be able to offer.
Curtis: Got it. Okay. And how does the performance work with Upsolver? I’m curious: let’s say, in file storage, someone has just a massive file, or lots of massive files. How do you attack that from a performance perspective?
Ori: The biggest no-no is local state; you never use it. In order to really allow scaling, we only store data on object storage, and the compute we do is done only in memory. We never use the local disk. So every time you add another Upsolver instance, you get linear growth in processing performance. There’s one interesting article that was published by AWS covering a petabyte-scale data lake that Upsolver did with a company called ironSource, also in advertising. And there, I think, we are already seeing 2 million events per second; we just increased the size of the cluster. When I did databases, I never dreamed about doing 2 million events per second.
Curtis: Sure. Yeah, that’s crazy. That’s awesome. And so they’re seeing pretty fast queries with that, even at that volume?
Ori: Yeah, definitely. They’re using Apache Presto to query the data. And I think AWS, and not just AWS, has a set of best practices you need to follow in order to get good performance from Presto, exactly like the ones I mentioned about storage. The file format, for example: you need to use Apache Parquet, a columnar format. When you’re doing SELECT A, B, C instead of SELECT *, you only want to scan the data in columns A, B, and C; you don’t want to scan the entire dataset. So you use columnar formats. The other thing would be the file sizes you use. For object storage, working with very small files is considered an anti-pattern, and the difference between using very small files and big files can be a hundred X in the performance you’re going to get querying the data. So when Upsolver writes data to S3, we don’t just write it once. We keep scanning the different partitions, we see if there is potential to improve query performance, and then we can rewrite the data in the same partition with bigger file sizes. It’s an ongoing process that works out of the box, and the user doesn’t have to be aware of it. When you’re creating a table in Presto, you’re not thinking about how data is stored for Presto, just like you wouldn’t think about it with a database.
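Here is a hedged sketch of both points, reusing the hypothetical ad_events table from earlier; the compaction statement is Hive-style and manual, standing in for the rewrite Upsolver is described as doing automatically:

```sql
-- Column pruning: with Parquet, this reads only the column chunks it names;
-- SELECT * would read every column of every file scanned.
SELECT campaign_id, event_type
FROM ad_events
WHERE event_date = DATE '2021-01-01';

-- Small-file compaction, Hive-style: rewrite one partition's many small files
-- as fewer, larger ones by overwriting the partition with its own contents.
INSERT OVERWRITE TABLE ad_events PARTITION (event_date = '2021-01-01')
SELECT event_time, campaign_id, user_id, event_type
FROM ad_events
WHERE event_date = DATE '2021-01-01';
```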
Curtis: Right. That’s really cool. From a DBA perspective, from an analytics perspective, what we’re really talking about here is, like you said, removing the data engineering piece as much as possible, so that you can actually get into the data, do your analysis, and ask questions iteratively, and not have to worry about all of those other tricky aspects of load balancing and writing code that you’re not familiar with and all that kind of stuff. Is that fair?
Ori: Yeah, I think that’s the essence of the value proposition and what Upsolver is doing today that’s unique, because others will give you a simpler Spark distribution or managed Spark. But the fact that it’s managed or simplified Spark doesn’t mean it’s simple enough to give to all the people who understand SQL. The number of data engineers out there is maybe between 2% and 5% of all data practitioners. So data lakes are still, I would say, sitting on the side and not playing the same game as databases, unless you have a very big group that can support all of your requirements.
Curtis: Got it. Yeah. I mean, you’re making it accessible, right? Everyone speaks the language of SQL, and so all of a sudden, with just SQL, you can access your data lakes.
Ori: Yeah, and immediately. You don’t have to go and explain to someone else how you want to query the data; you can query it yourself.
Curtis: Yeah, that’s interesting. Now, you were telling me earlier about one of your customers, Sisense I think it was, that had a pretty good gain in efficiency and these kinds of things. Can you tell us that story: what the use case was, how they went about it, and what the gains were? So we can put this in a more concrete frame.
Ori: Sure. Yeah. So Sisense, and there’s also a public case study on this, is a BI company, and they launched a cloud product. It generates a lot of product logs, because they want to keep improving the way their product works. That cloud product generated 300 billion logs; it’s not something you can easily just push to a database. And eventually there was a product group that wanted to query the data and iterate on it. They wanted to use Presto to query it. They wanted to create dashboards in their own product, in Sisense. They wanted to start doing some predictive modeling as well. So it’s exactly what we said about data lakes: different use cases on top of the same data. They wanted to build a data lake, but they didn’t have a big data engineering team in house, and even if they did, the product group wanted direct access. So they used Upsolver; it took them a few weeks, and they already had it in production. Today the product group uses Upsolver directly. They don’t need to go to any additional teams.
Curtis: That’s awesome. Just in general, I’m curious about your perspective here. We’re talking about how, in the market, there are not enough data engineers for the way systems currently work, but of course some systems are coming online, like Upsolver, that help non–data engineers still get to the data and use it. Do you think technology is going to keep advancing here to where we don’t need as many data engineering skills? Where do you think the market’s going, for people who are thinking about what kind of skills to focus on? Should I just focus on SQL and analysis? Should I do data engineering? What should I do?
Ori: It’s a good question. It’s also a question of what you need. Let’s say you’re a good data engineer. That doesn’t mean that what you need to do is write ETLs; the organization can still use you for many things. Say I’m building a machine learning model that I want to run in real time, and I need to set up the entire engineering backbone to support it. I need to make sure training looks exactly like serving. Those are different types of challenges for data engineers. And by the way, data engineers should definitely, in my opinion, also learn data science, to be able to complete the loop. A data scientist who has engineering capabilities is a great employee to have in your company. But what we’re seeing today, data engineers spending so much of their time writing ETLs, I don’t think that’s going to stay the way it is. It’s just not sustainable. And I think looking at it from a numbers perspective makes that clear.
Data lakes are notoriously known as data swamps. When people want to make jokes about data lakes, they say those are data swamps, because no one can get the data out. And that joke started years ago, when data lakes were still on-prem. Even when data lakes were on-prem and only Hadoop, that’s what people were saying. So think about it: there were thousands of companies, not tens of thousands, thousands of companies using Hadoop. They had big data engineering groups, and still data lakes were data swamps. Now, if you look at the number of customers on Amazon S3, it’s more than a million, on just that one data lake. So you have a hundred times more data lakes, even more than that, but there aren’t a hundred times more data engineers. And even if there were, the situation wasn’t good even back then.
Curtis: When you put that into perspective, that’s a lot of data not getting used for anything. So do you also foresee, as tools like this come online, and maybe you’ve seen this in some of the clients you’ve worked with, you mentioned product people from Sisense wanting to access the data directly, so now we’re talking about people who aren’t necessarily data people picking up SQL and learning some principles of data analysis. Is that a trend you think will keep going? I mean, should people in product and other places in the organization learn SQL and learn how to work with data?
Ori: Definitely. I think the ability to make decisions with data and big data is a strategic capability for an organization, and it’s a strategic capability for people who are looking for work. All organizations, even the most traditional ones, are trying to make more decisions with data. And maybe I want to add something to the question you asked me before. If you’re an analyst, if you’re a person who makes the decisions, being hands-on with data is good for you, but you still need the data engineers to keep the organization in check when it comes to security, cost management, scalability, and durability. So some things you need the engineer for, but you just don’t want them in the way of a specific piece of data analysis or research, because that means the users can’t work on their own. This is the future I’m imagining.
Curtis: Got it. That’s good. So then, where are you going with Upsolver? I know there are a lot of companies that need better access to their data lakes, so there’s plenty of opportunity there, but are there other big problems in the data space you’re thinking about or working on? Where are you going to take things?
Ori: Two answers for that. One is what I think Upsolver will do. Upsolver is still a young company, and we’re focused on one thing: making the data lake accessible, according to everything we talked about today. And I think that’s a very big trend, because data preparation is 80% of the work you usually do. The second is how users will work in a data lake era compared to a database era, and here I’m going back to the decoupling of databases. Metadata is a big part of it. For example, every time I go work with a database, I kind of need to recreate the schema and the permissions and all of that. Today you have central metadata repositories, but it’s not the case that all databases work with them.
You have the new-age tools like Presto and Spark, and those are the ones that integrate with the central metadata catalogs, but many of the databases do not. Going forward, where all the data is stored in a central place, like object storage, I think it also makes sense that there will always be centralized metadata, and all the databases will work with that centralized metadata. If I’m a data engineer for an organization and I need to define permissions, I should define them once rather than having to do it in all those places, because security, permissions, and governance are some of the concerns organizations are raising, just because it’s different from the way they manage their databases.
Curtis: So better tooling around, when you say metadata, at sort of a high level, you might define that as a data catalog or something like that, where people can understand what the data is, where it is, and what kind of data you have. Is that fair?
Ori: It’s more about standards. We’ve had data catalogs for a while now, but it’s more about standardizing: seeing every possible tool that’s used to query the data work with the centralized metadata catalog, not just with a data catalog that you define.
Curtis: Right. I’ll sort of give you the last word here on anything else you wanted to chat about.
Ori: Upsolver is going very well, and we want more people to join our ride. We just closed a big round. We have some new investors who have known the space for the last 20 or 30 years, and we are hiring extensively, for all positions. If you’re interested, definitely touch base with us at Upsolver.com; there’s a “contact us” there. We’ll get those emails and try to get back to everyone. We definitely want to work with more people who are excited about this space and want to see the data lake–oriented future happening.
Ginette: A big thank you to Ori Rafael for being on the show. As always, go to datacrunchcorp.com/podcast for our transcript and attributions.
We’d like to know more about you, our listeners, and what you’d like to hear on the show, so go to datacrunchpodcast.com/survey to take our easy six-question survey and let us know your thoughts. Thank you!
Attributions
Music
“Loopster” Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License
http://creativecommons.org/licenses/by/3.0/