Davit Buniatyan: The problem is that you have all these big, like, traditional databases. You have data warehouses, you have data lakes, now they call them lakehouses. But the issue is that they’re all focused on processing structured or maybe semi-structured data, but you don’t have one for unstructured data, like images, video, audio, text, where deep learning capabilities gave birth to extracting more valuable business insights.
Ginette: I’m Ginette,
Curtis: and I’m Curtis,
Ginette: and you are listening to Data Crunch,
Curtis: a podcast about how applied data science, machine learning, and artificial intelligence are changing the world.
Ginette: Data Crunch is produced by the Data Crunch Corporation, an analytics, training, and consulting company.
Today we have the founder of Activeloop, Davit Buniatyan, chatting with us. He started his PhD in Princeton’s Neuroscience Lab, and he now runs Activeloop, the company behind the Database for AI. As we chat with him, you’ll learn more about how you can build your AI products faster with Activeloop. But keep in mind, they’re offering you, our listeners, two months free on their growth plan, which gives you the ability to visualize, version control, query, and explore AI datasets in a team of up to 10 data scientists. For this promo, go to app.activeloop.ai and use promo code DATACRUNCH, all caps, no spaces, at checkout.
Curtis: We’re really excited today. We have Davit Buniatyan here with us. He’s done really interesting research and has his own company now that’s doing some really interesting things in AI. So we’re excited to have him here to chat about all of that with us, and so I just want to turn the time over to him. Davit, if you could give us a brief introduction, we’ll take it from there.
Davit: Thanks, Curtis, for having me on this podcast. Super excited to be here. Yeah. I came to the US to start a PhD. When I got into the university, I got into a computer vision lab, but my advisor left the university and came to the Bay Area here to start a self-driving car company.
So I had to find another advisor, and I accidentally, or incidentally, got into the neuroscience department working on this field called connectomics. And for those who are not familiar with connectomics, it’s basically a new branch in neuroscience that tries to reconstruct the connectivity of neurons inside the brain. What we were doing was taking a one millimeter volume of a mouse brain, cutting it into very thin slices, and imaging each slice. Just that one millimeter was a petabyte-scale dataset sitting on top of AWS.
Curtis: So we’re talking one, one millimeter slice of a mouse brain is a petabyte of data.
Davit: Yes, like kind of a volumetric slice. So it’s like a one millimeter cube of a mouse brain.
Curtis: Okay, that’s crazy.
Davit: And if you want to get the whole mouse brain, you need to go a thousand times bigger, and that’s about an exabyte dataset. And if you want to get a human brain, that’s another thousand times bigger than a mouse brain. So we are talking about a zettabyte, essentially a scale that we’ll be able to process in an estimated 50 years.
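The scale-up described here is a factor of a thousand at each step. A quick back-of-the-envelope sketch in Python, assuming decimal (SI) byte units and taking the petabyte figure for the one-millimeter cube as given in the conversation:

```python
# Back-of-the-envelope scaling, assuming SI (decimal) byte units.
PETABYTE = 10**15
EXABYTE = 10**18
ZETTABYTE = 10**21

# Figure quoted in the interview for a one-millimeter cube of mouse brain.
mouse_cube_bytes = PETABYTE
# Whole mouse brain: roughly a thousand times the cube.
mouse_brain_bytes = mouse_cube_bytes * 1000
# Human brain: roughly another thousand times the mouse brain.
human_brain_bytes = mouse_brain_bytes * 1000

print(mouse_brain_bytes == EXABYTE)
print(human_brain_bytes == ZETTABYTE)
```

These are order-of-magnitude estimates only; the real sizes depend on imaging resolution and storage format, which the conversation does not specify.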
Curtis: Fifty years. Okay. What’s, ah, what’s the scale sort of today, if, just give everyone kind of your perspective on how do you even process a petabyte? You know, what does that mean in terms of compute power?
Davit: So just to maybe give you a brief idea of what a petabyte itself is: essentially, you have 20,000 images of a hundred thousand by a hundred thousand pixels sitting on top of object storage, and as you go to process this data, you have . . . So what we were doing, basically, was taking this volumetric image and then trying to separate the neurons, find the connections, and then build a graph using deep learning. So essentially you have a pipeline, a set of models that you have to apply on this data to be able to reconstruct the connectivity, or the graph of all the connections of neurons inside this volume.
And the main use case for this was for neuroscientists to be able to do research, to understand how the brain actually works. So we were sort of using artificial neural networks to reconstruct the real neural networks to get more insights into how the real networks work inside the brain. Like, what are the naturally inspired algos that the human brain, or mouse brain, actually uses for learning new patterns.
Curtis: How, how did that work? Were, were you guys able to get something that was similar or how did it . . .
Davit: One thing that you can actually prove: in neural networks there’s this famous algorithm used for training those models called backpropagation. Actually, backpropagation doesn’t exist inside the brain, at least in its current form. So I think that’s more like a validation of other research that had been done before, putting that theory up, and this way you can prove that it’s not the case. However, there’s other research as well, moving forward, showing that if you actually treat a single biological neuron as multiple neurons in the artificial space, then backpropagation might exist inside the brain, but that’s slightly cutting edge, still in the validation space.
Curtis: Still theoretical. We’re not sure about that yet. That’s interesting. So I’m just curious though, you know, the brain is interesting ’cause there’s a lot of talk about, you know, like Elon Musk has his Neuralink where you’re going to connect with the brain, and creating AI that thinks like humans and these kinds of things. I mean, the scale is so far beyond what we can do right now, but you know the research better than me. What would you say to that kind of thought track in terms of, like, having machines able to process something like a human brain?
Davit: Well, for processing a human brain, we are 50 years away. So we should not get, like, worried about even trying to simulate a full human brain. The goal of that research, or I’ll say the biggest bet on that research, is that while you are getting to this scale, along that pathway, you will learn and get more understanding of how the brain works. So you might not need to get the full brain to be able to understand, let’s say, how we make decisions. Essentially, you have neuroscience, which has traditionally been focused on understanding a single cell, and you have psychology, where humans are making decisions, and there’s a big gap between how a single cell is, like, making a call and how a human is making a decision on, let’s say, should I join this company or not?
But from the AI perspective, I think you now have all the deep learning networks, specifically transformers, that are getting into the landscape of, I’ll say, the number of parameters that the human brain has. They are not there yet, but I think they are, like, a couple of orders of magnitude smaller, and they’re achieving pretty remarkable results.
That’s not too much, um, inspired by the brain. I mean, there is some inspiration coming from how the attention model works, which sits right inside the brain, but that’s kind of a sidetrack as well. They are getting to the model scale that the human brain is capable of, at least assuming a connectionist theory, which basically means that we believe we can simulate a brain by having just the graph of all the neurons and their connections. But that might not be the case either, because you can go further, to a more biological, biochemical level, and then there are even further, more complex systems.
One thing I know is that a lot of computer scientists, especially data scientists, think they know how the brain works, but if you have been in a neuroscience lab, that’s far, far away from the truth, from what we as computer scientists might think. Yeah. And, um, I, as a computer scientist in that lab, was mostly focused on the infrastructure-related problems. So the issue with processing a petabyte-scale dataset on the cloud using data pipelines, first of all, was that it was costing us a lot of money, like on the order of a million dollars. And furthermore, the issue is that the tools that we had at the time, including, like, I dunno, Airflow, Kubernetes, and so on, were not scaling to the needs that we had. I’m talking about five years ago, basically. And we had to rebuild a lot of tools ourselves. We had to rethink how the data should be stored, how it should be streamed from the storage to the computing machines, whether we should use CPUs or GPUs, and what kind of models to use. And those kinds of insights helped me to start a company.
Curtis: Which is called Activeloop. Yes. Tell me how that even came about, right. I mean, you’re working in a neuroscience lab, and what you do at Activeloop is sort of solving the same problems, but running a company is much different than working in a lab, I assume. So how did you make that transition?
Davit: Yeah, exactly. So three years ago we applied to Y Combinator on the last day of the deadline, and we thought, yeah, why don’t we just try it out and see what happens. And apparently we got through the interview.
Curtis: Like with some friends? Or like . . .
Davit: Like other PhD folk from my lab and from another department doing financial engineering and so on. So we applied, and we got in, and we thought, oh yeah, this is going to be an internship, um, over the summer, and we’ll get back to continuing our research. And we came here, and apparently I stayed to continue building the company. We raised additional seed funding after Y Combinator. We went through a lot of steps and started working with early customers to help them be more efficient in terms of their machine learning efforts. One customer had, you know, about 18 million text files to process to train models. Another customer had petabytes of aerial imagery data, and while working with them as well, we figured out that, actually, the problem is that you have all these big, like, traditional databases. You have data warehouses, you have data lakes, now they call them lakehouses. But the issue is that they’re all focused on processing structured or maybe semi-structured data, but you don’t have one for unstructured data, like images, video, audio, text, where deep learning capabilities gave birth to extracting more valuable business insights.
Curtis: Yeah. Uh, so tell me when you were getting going, I’m assuming you, you kind of took the tech you, you developed in the lab, productionalized it for, for companies. Are the problem spaces you’re working in, are they similar to neuroscience or does this apply again to like lots of different companies, lots of different problems?
Like, what are some of the kind of things that you’re solving for companies right now?
Davit: So what we have seen in the lab is that not only our lab had the same exact issues, but also other labs. And furthermore, getting outside of the academic environment, as I mentioned, we have seen that, apparently, while we were on, I would say, the cutting-edge side of the technology, other companies are far, far behind in terms of the data infrastructure they have built on.
And they’re facing exactly the same issues that we were facing in the early days while we were thinking about solving this problem. By the way, a lot of the tools developed at the lab are open source, so, um, anyone could use them for their own research or their own company.
And we didn’t quite use them, though; what we got is those kinds of insights. So let’s say you take a vertical like biomedical image processing: apparently the tools that we developed there can easily be taken and used, for example, in self-driving cars or in aerial image processing or in large text data processing.
And that’s what we did at the company, where, let’s say, instead of focusing only on just four-dimensional data, you can now expand this to n-dimensional datasets. And that’s basically where the generalization comes from, from our lab versus all these different use cases you have in the industry.
Curtis: So focus on unstructured data, being able to process like immense amounts of that. How does this relate to maybe other technologies people are more familiar with like Spark, for example, those kinds of things. How does it compare and contrast with that kind of toolset?
Davit: So a lot of things happened, like, starting from 20 years ago with MapReduce, which was developed at Google, and then Yahoo took over and put it on Hadoop, and then you have Spark. I’m very fond of the technology and the folks at Databricks behind the scenes. The main difference between what we saw and what they were operating on is that they were operating mostly on tabular data or structured data. And the way MapReduce works is that whenever you have a cluster of machines, you have to preload the data to each of those machines, and only then does the computation start running on top of it. And our software goes slightly against that way of thinking: we can actually store the data in a centralized location in our format, and then very efficiently stream the data over the network to our computing machine, like a CPU or GPU, as if the data was local to the machine.
So now the workers don’t have to actually preload the data to those machines. They have a virtual view of a petabyte-scale dataset sitting inside their memory, which they can access at any time. But whenever they access it, the data gets streamed from the storage to the computing machines as if you’re watching a Netflix movie.
And that’s kind of the difference between, like, Blockbuster, the company where you had to go and buy the DVD or VHS beforehand, versus Netflix now, streaming the video material to your home.
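The "virtual view" idea can be illustrated with a minimal Python sketch. This is not Activeloop's implementation or storage format; `RemoteStore` and `StreamedDataset` are hypothetical names, the "remote storage" is an in-memory dictionary, and the chunk size and cache policy are assumptions chosen for illustration. The point is only that a sample is fetched when it is first accessed, not preloaded.

```python
from collections import OrderedDict


class RemoteStore:
    """Hypothetical stand-in for object storage such as S3; counts fetches."""

    def __init__(self, chunks):
        self._chunks = chunks
        self.fetch_count = 0

    def get(self, chunk_id):
        self.fetch_count += 1  # each call models one network round trip
        return self._chunks[chunk_id]


class StreamedDataset:
    """Virtual view over remote chunks: a chunk is fetched only on first access."""

    def __init__(self, store, num_samples, chunk_size, cache_chunks=2):
        self.store = store
        self.num_samples = num_samples
        self.chunk_size = chunk_size
        self.cache_chunks = cache_chunks
        self._cache = OrderedDict()  # least recently used chunk is evicted first

    def __len__(self):
        return self.num_samples

    def __getitem__(self, index):
        if not 0 <= index < self.num_samples:
            raise IndexError(index)
        chunk_id, offset = divmod(index, self.chunk_size)
        if chunk_id in self._cache:
            self._cache.move_to_end(chunk_id)  # mark as recently used
        else:
            self._cache[chunk_id] = self.store.get(chunk_id)
            if len(self._cache) > self.cache_chunks:
                self._cache.popitem(last=False)  # evict the oldest chunk
        return self._cache[chunk_id][offset]


# Twelve samples stored as three chunks of four samples each.
chunks = {i: list(range(i * 4, i * 4 + 4)) for i in range(3)}
store = RemoteStore(chunks)
ds = StreamedDataset(store, num_samples=12, chunk_size=4)

first = ds[0]    # fetches chunk 0
second = ds[1]   # served from the local cache, no new fetch
last = ds[11]    # fetches chunk 2; chunk 1 is never downloaded
print(first, second, last, store.fetch_count)
```

The worker sees a dataset of the full length, but only the chunks it actually touches ever cross the network.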
Curtis: So I’m assuming there’s like significant gains in cost of infrastructure as well as speed of processing time.
Like, what is that? Are we talking, like, an order of magnitude increase, or what are we talking about?
Davit: So the main difference here is that your compute power, which is the most expensive resource, since you pay for the hours of the GPU or CPU, should not wait while we are moving the data around. And with a small amount of data, and when I say small, it could still be terabytes, but comparatively small per worker, that’s fine. But when the data becomes huge . . . And we did a very basic example, which is not a huge dataset. It’s one of the iconic classical computer vision datasets, ImageNet, which is about 150 GB of data. Like, you have a million images. It has been kind of a standard, uh, considered to be a large dataset, but now it’s, like, super tiny.
And for that, for you to train, uh, one epoch of a model that does computer vision, that basically does the classification of the data, you have to wait three hours just to copy the data file by file from S3 storage to the computing machines. And in our case, there’s no copy time; we get the same performance as if the data was local to the machine.
So you still do your training locally on the machine, while the data is getting streamed to this machine efficiently behind the scenes. So for that particular use case, you have kind of a 4x speedup while you’re training the model. But the main optimization for that use case was not that we made everything faster in terms of the CPU or GPU time; we actually just solved the data bottleneck problem, where your computation was bottlenecked on the data transfer rather than on the GPU or CPU.
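One common way to remove this kind of data bottleneck is to load the next items in the background while the current one is being processed. The following is a generic Python sketch of that prefetching idea, not a description of how hub does it internally; `prefetch` and `slow_loader` are hypothetical names, and the sleep simply stands in for network latency.

```python
import queue
import threading
import time


def prefetch(source, buffer_size=4):
    """Yield items from `source`, loading up to `buffer_size` items ahead
    in a background thread so the consumer rarely waits on I/O."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for item in source:
                buffer.put(item)  # blocks while the buffer is full
        finally:
            # Always signal the end so the consumer does not hang.
            # Loader errors are not re-raised in this simplified sketch.
            buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


def slow_loader(n):
    """Hypothetical loader standing in for reads from remote object storage."""
    for i in range(n):
        time.sleep(0.01)  # simulated network latency per batch
        yield i


batches = list(prefetch(slow_loader(5)))
print(batches)
```

With the loader running in its own thread, fetch time overlaps with compute time, so the expensive GPU or CPU waits on the slower of the two rather than on their sum. A production version would also need to propagate loader errors and shut the thread down if the consumer stops early.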
Curtis: Got it. Okay. Yeah. So that’s huge then. And I think you said this is all Python libraries. This can be accessed via a Python library called hub? Is that right?
Davit: Yeah, you can just do pip install hub to get this Python library, which helps you to convert, let’s say, your million images and million labels into tensors, or n-dimensional arrays. And now, for you, deep learning simply becomes just learning a function from one tensor, like an image tensor, to a label tensor, and the dataset that you construct can be easily transformed with a one-liner into a PyTorch dataset or a TensorFlow dataset.
And you can start doing your training process as you were doing it before, but we do all the background jobs: going and fetching the data from the remote storage, putting it into the cache, decompressing the data, running the transformations, and then feeding it into the GPU without you being aware of it.
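The background steps listed here (decompress, transform, then hand image and label pairs to training) can be sketched with the Python standard library alone. This deliberately does not use hub's API; the function names are hypothetical, `zlib` stands in for a real image codec such as JPEG or PNG, and plain lists stand in for tensors.

```python
import zlib


def compress_image(pixels):
    """Pack a flat list of 0-255 pixel values into compressed bytes
    (zlib is a stand-in here for an image codec)."""
    return zlib.compress(bytes(pixels))


def decode(blob):
    """Decompress stored bytes back into a list of pixel values."""
    return list(zlib.decompress(blob))


def normalize(pixels):
    """Example transformation: scale pixel values into the range [0, 1]."""
    return [p / 255 for p in pixels]


def sample_stream(blobs, labels, transform=normalize):
    """Lazily yield (image, label) pairs, decoding and transforming
    each sample only when the training loop asks for it."""
    for blob, label in zip(blobs, labels):
        yield transform(decode(blob)), label


# Two tiny three-pixel "images" with integer class labels.
stored_images = [compress_image([0, 255, 51]), compress_image([255, 0, 102])]
stored_labels = [0, 1]

samples = list(sample_stream(stored_images, stored_labels))
print(samples)
```

A training loop would consume `sample_stream` one pair at a time, which is what lets the model be treated as a function from an image tensor to a label tensor without the caller handling storage or decoding.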
Curtis: Got it. Yeah. This is open source, like people can download it and
Davit: Yes, people can download it. People can join our community. People can create issues, help us solve some problems, and contribute to the open source. And we have a very vibrant community, yeah, on Slack. I think it’s called slack.activeloop.AI, so feel free to join us. And yeah, the GitHub repo is activeloopAI/hub.
Curtis: Yeah, that’s great. It sounds like the community’s growing, and they really liked the technology.
Davit: Last year, I think hub at some point was number two trending across all GitHub repositories, and it was number one in Python. And now we have about, I think, 800 community members and above 70 folks contributing to the open source repo.
Curtis: All right. That’s awesome. Sounds like a great addition to the opensource community. And then obviously you also then have a company like Activeloop. How does your, your Python library coincide with what you do commercially for companies, like where’s that transfer there?
Davit: Um, similar to Spark, our hub is like the open source version for dataset management.
And we are building Activeloop, which provides additional features on top of the open source for enterprises and for companies to better manage their data. And one of the key pillars, which we have been kind of proud of developing, is actually the visualization engine. So the same way as you stream the data from S3 to your model, you can actually stream it to the browser.
So we built this visualization engine using C++ running in the browser that can actually visualize your million images at the same time in your browser. And not only images; you can also, like, uh, watch movies. But the main point is that you can easily explore the datasets and find out the issues in the data, like mislabels. You have version control, so you can create branches. Go out to a specific lead by your collaborator. So there’s this collaboration aspect of things. You can run queries, like, give me all the images that contain X, and then we will subselect and visualize them. And we’re now doubling down on the integrations, um, to make it part of the whole MLOps cycle for companies to be able to quickly go from, let’s say, a labeling or annotation tool to storing their data and then training the models and then very rapidly being able to visualize this data. So that all goes into app.activeloop.ai, the platform that we are serving as a managed service on behalf of customers.
Curtis: Got it. Yeah. That’s awesome. And is that, um, is that in familiar languages, like someone gets access to that service and they can write Python, they can use it, or SQL or those kinds of things, or do you guys have proprietary stuff?
Davit: The Python library is essentially our open source tool that connects to our backend where the data is being stored. And more than that, customers can provide their own AWS or Google Cloud, so it can be deployed on their private cloud as well. And, um, everything is Python. Even the query engine that we built, which is like a very simple version of it, doesn’t get into the SQL world yet; it’s just a Python compiler that runs in the browser.
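A query of the "give me all the images that contain X" kind, written in plain Python rather than SQL, might look like the sketch below. This is only an illustration of the idea and not Activeloop's query engine or syntax; the `query` helper and the metadata layout are assumptions made up for the example.

```python
def query(samples, predicate):
    """Return the indices of samples whose metadata satisfies `predicate`."""
    return [i for i, sample in enumerate(samples) if predicate(sample)]


# Hypothetical metadata: each image carries a list of object labels.
dataset = [
    {"image": "img_0", "labels": ["car", "person"]},
    {"image": "img_1", "labels": ["tree"]},
    {"image": "img_2", "labels": ["person", "dog"]},
]

# "Give me all the images that contain a person."
matches = query(dataset, lambda sample: "person" in sample["labels"])
subset = [dataset[i]["image"] for i in matches]
print(matches, subset)
```

Returning indices rather than copies means the selected subset can be treated as a view over the original dataset, which is what makes it cheap to subselect and then visualize.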
Curtis: Yeah. That’s awesome. How long have you guys been running this, been in business with Activeloop?
Davit: For three years.
Curtis: I would assume your clientele typically are larger companies because they would have access to bigger datasets, but that may not be the case. There are smaller startups and things that are doing interesting work with open source datasets or generating their own datasets. Right. So I’m curious, kind of, what the breakdown is and who is accessing this technology? What kinds of things are they doing with it?
Davit: So when we started with the community, initially, there were a lot of researchers slash students from different parts of the world, in the universities, that were looking into it. And what we have seen recently is a major shift toward more machine learning engineers from enterprises, from large companies. We still see the small startups and medium-sized companies trying and experimenting and training their models using hub. And on the other end, we actually started seeing much larger organizations, some of them having a hundred thousand employees, um, using this in production, in their use cases, and on the platform side, like the managed version of it. And we are rolling this out. And one of the key, I’ll say, offerings that we do have is actually a startup pack that will give all the functionality that you have for the enterprise at a much, much cheaper price per month for startups to use. And for individuals, it will be free, obviously.
Curtis: What would you say is, I don’t know, maybe you can’t share this, but I’d be interested to know, like, the most interesting use case that you’ve seen from your clients? Like, what problem are they solving where they had really interesting results? Like, neuroscience is really cool, right? Are there any other things that you’re seeing that they’re using it for that are interesting?
Davit: So, um, in terms of use cases, we are focused almost primarily on, uh, image and video processing.
So that’s kind of our key strength, where we have seen a lot of customers using us. But I think if you slightly translate the question to what was the most surprising use case that we never thought somebody would use it for, I think the most surprising thing is actually coming from a research group at Oxford University, where they used hub for storing molecular structure, protein structure data. And that was something surprising; we had never actually thought that this could be useful for that kind of a use case. And we have seen them successfully using it and, furthermore, asking us, “Hey guys, can you also, like, help us with visualizing this data in your browser visualization engine?” So we haven’t yet done that part on the managed platform, but that was the most vibrant one, apart from our kind of main verticals that we are focused on, aerial imaging and video processing.
And I think some of the use cases are public there as well. The Oxford University case was pretty, pretty exciting, like, within the same bucket as the neuroscience one.
Curtis: So actually like, like molecular data, you said that they’re storing like
Davit: The protein, protein slash molecular.
Curtis: It’s like a biochem lab or something.
Davit: Yeah, but to be fair to everyone, we haven’t yet been focused on making it very efficient for that use case.
Curtis: Yeah, that’s a cool use case.
Davit: So, yeah. Um, but maybe another interesting use case to add to, um, the portfolio is that one of the customers we work with had about 18 million text documents. They were processing all the patents in the world, and they had to build a search engine.
So the problem for them was that training a model was taking them two months, and using our stuff actually helped reduce the training time to a week. So, yeah, that’s one of the, um, I think, early use cases that we have seen working pretty successfully. And that’s how it actually shaped us to build this.
Curtis: Where do you go from here, like whatever, like five, 10 years, like where do you see your company and what kind of things do you think you’ll be doing?
Davit: I think our core vision is focused on making this become the industry standard for storing data, uh, in terms of deep learning applications. We are not that much interested in becoming the annotation tool or becoming the training-as-a-service platform.
We are more focused on, okay, we see that deep learning is going to be huge. We see that, like, 90% of the data generated today is in an unstructured form. If you haven’t seen it yet, you can Google search the TikTok video, uh, not usage, the creation, and you’ll see the peak that happened over the COVID time, that so many videos have been created, and somebody has to analyze them.
And we want to be, like, the data storage there, helping those companies with more useful use cases, to make it efficient. And that includes agricultural use cases and biomedical use cases, and video processing and entertainment could be one of them as well.
Curtis: Do you ever regret leaving the lab or are you happy with, with the move to the commercial world?
Davit: That’s a good question. So I think, at least for me personally, finishing the PhD versus starting the company was one of those decisions where, whatever you choose, you’re going to regret it. So, yes, there definitely is a regret about not going through academia. And I had a lot of conversations with my advisor, and he was very, very supportive of the direction I took.
Basically the conclusion, like, what I got from Princeton overall, is that you have to pick one and you have to focus on that one to be able to be successful, and whatever you pick, there’s a sacrifice on the other side. And I felt that I could have higher impact, including inside the computer science world, in the academic world as well, if I pursued this direction and pushed a kind of new paradigm in data structure layout, like storing the data in arrays and so on. That could be, like, without maybe publishing a lot of papers and getting a lot of citations, but this is a way to have a huge impact. And I’m definitely not, like, super passionate about building mobile apps or building a, like, low-key SaaS app company. I feel like my passion is coming from being inside data infrastructure, solving super challenging problems, and then not just technically solving them inside the lab and hoping that someone will see and figure out how to use it, but also going and deploying this in front of the customers and making sure that you solve the actual problem they are facing today.
Curtis: How, how big are you guys now?
Davit: We are, like, 15 people, so the company is still small, mostly based here in Mountain View, California. But we have folks from Florida, New York, Boston, so it’s, like, all over the US, and more than that, from India and Armenia as well. So it’s, um, a diverse, hybrid, mostly remote team across all locations.
Curtis: Awesome. Well, that’s really cool. I know the technical work you’re describing, obviously it’s not easy and, and owning a business and running it is also not easy. So that’s two really challenging things that you guys are doing well, it seems. That’s awesome.
Davit: The nice thing about deep tech startups is, like, you have not only a business risk, but you also have a technology risk. And I’m also thankful to the super supportive investors, advisors, and our team members as well, who understand all the problems and challenges. Obviously, it’s a very bumpy road.
Obviously, we are nothing like the further, riskier end of the technology spectrum, like in quantum computing or quantum software or in fusion, where you are building a fusion reactor. We are building more grounded stuff here, but still, the challenges are there.
Curtis: Sure, I mean, it’s still cutting edge, right? And that’s, that brings a whole host of, I mean, it’s a research question, right? And whenever it’s a research question, there’s, there’s risks with that because there’s not a clear path to necessarily solving things, although you’ve, you’ve clearly breached some of that because you have something now that speeds things up, like nothing else does. So that’s great. Is there anything else that I’ve I’ve missed? I don’t know. I mean, you’re doing lots of stuff. Is there anything else that you wanted to discuss?
Davit: One of the key things that we wanted to offer folks who are listening to this podcast is that we also have a promo code for Data Crunch listeners. They can easily join, I think, if I’m not wrong, for free for two months to use our platform.
So that could be a small, tiny thing from us. Other than that, overall, we are super optimistic and excited about the future.
Curtis: All right. Well that, that is awesome. So the company is Activeloop, promo code DataCrunch, and that’s that’s two months to your commercial application. Is that right? So just go to your website . . .
Davit: The open source is there, free. You can touch it. You can use it, you can distribute it, contribute back, and so on.
Ginette: Thanks, Davit, for being on our show today, and as always, head to datacrunchcorp.com/podcast for our transcript and attributions.
“Loopster” Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License