Python versus R. It’s a heated debate. We won’t solve this raging controversy today, but we will peek into the history of Python, particularly the open source community surrounding it, and see how it came to be what it is today: a well-used and flexible programming language.
Travis Oliphant: Wes McKinney did a great job in creating Pandas . . . not just creating it but organizing a community around it, which are two independent steps and both necessary, by the way. A lot of people get confused by open source. They sometimes think they’re just kind of going to get people together and open source emerges from the foam, but what ends up happening, I’ve seen this now at least eight, nine different times, both with projects I’ve had a chance and privilege to interact with, but also other people’s projects. It really takes a core set of motivated people, usually not more than three.
Ginette: I’m Ginette.
Curtis: And I’m Curtis.
Ginette: And you are listening to Data Crunch.
Curtis: A podcast about how applied data science, machine learning, and artificial intelligence are changing the world.
Ginette: A Vault Analytics production.
Ginette: This episode of Data Crunch is supported by Lightpost Analytics, a company helping bridge the last mile of AI: making data and algorithms understandable and actionable for a non-technical person, like the CEO of your company.
Lightpost Analytics is offering a training academy to teach you Tableau, an industry-leading data visualization software. According to Indeed.com, the average salary for a Tableau Developer is above $50 per hour.
If done well, making data understandable can create breakthroughs in your company and lead to recognition and promotions in your job.
Go to lightpostanalytics.com/datacrunch to learn more and get some freebies.
Here at Data Crunch, we love playing with artificial intelligence, machine learning, and deep learning, so we started a fun new side project. We just launched a new podcast that tests the boundaries of what can be done with Google’s cutting-edge deep learning speech generation algorithms. We use surprisingly human-like voices to host the podcast that reads all the unusual Wikipedia articles you haven’t had a chance to read yet, like chicken hypnosis, the history of an amusing German conspiracy theory, strange trends in Russian politics, and much more to come. It’s worth listening to to hear what this tech sounds like and you’ll learn unique and bizarre trivia that you can share at your next dinner party. Search for a podcast called “Griswold the AI Reads Unusual Wikipedia Articles,” now found on all your favorite popular podcast platforms.
Curtis: There has been a heated, ongoing debate about which programming language is better when working with machine learning and data analytics: Python or R. While we won’t be wrestling with that particular question, we will give a bit of history for both and then dive into significant history behind one of these languages, Python, with a major contributor to the language, a man who significantly influenced the way that data scientists use Python today.
Ginette: As a very short historical background, Python came onto the scene in 1991 when Guido van Rossum developed it. His language has developed a reputation as easy to use because its syntax is simple, it’s versatile, and it has a shallow learning curve. It’s also a general-purpose language that is used beyond data analysis and is great for implementing algorithms for production use. As for R, it followed shortly after Python. In 1995, Ross Ihaka and Robert Gentleman created it as an easier way to do data analysis, statistics, and graphical models, and it was mainly used in academia and research until more recently. It’s specifically aimed at statistics, and it has extensive libraries and a solid community.
As a controversial side note, according to Gregory Piatetsky-Shapiro’s KDnuggets poll, late last year, Python overtook R in data science and machine learning. Gregory’s organization received some sharp criticism for this and defended the claim with a follow-up article. We’ll link to them in our show notes for you to read if you’re interested. Now let’s chat with Travis Oliphant, former CEO of Anaconda, current CEO of Quansight, and creator of NumPy and SciPy, two packages that were foundational in bringing Python to the forefront of machine learning.
Travis: I’m Travis Oliphant. I’ve been involved in scientific computing for a long time. I got my start really as a college student loving electromagnetism and applied mathematics. I was really driven by science and math, and then as I went to grad school and started doing medical imaging and studying the problems of how to get information from large data for medicine, I started to uncover the problem of just writing software systems to help make that easy, and that’s where I fell in love with Python. It was an early language back then, and they had a very nascent array object called Numeric, and I got involved as a graduate student. I was supposed to be pursuing my PhD, but I ended up delaying that . . . delayed my PhD to write a bunch of open source software, primarily for the love of writing the software and also for the joy of sharing it with people around the world and having them comment, kind of the community building that allowed for.
It really was an extension of the scientific process for me. It was very similar to the kind of thing I was doing writing papers and writing ideas about how to move medical imaging forward. This is building systems for helping large data science move forward. So I did that, ended up teaching at a university as well. I went back and taught at my alma mater, taught electrical and computer engineering, but even then, I found myself drawn more towards helping the software ecosystem grow, and so I started the SciPy project when I was a young graduate student. And that became . . . and then stayed with that. I grew that community, helped it grow until about 2009. About 2005, I realized I needed to write a better array object. There was some activity happening between Numarray and Numeric. A split was happening in the community. People were kind of building up ecosystems on different objects and different systems, and so I wrote NumPy to bring that back together and to have a common way to talk about data in Python, and then that, over the course of three to five years, started to just get adopted everywhere, and then people built this huge ecosystem on top of it.
I left academia when I realized I was kind of more suited for entrepreneurship. I really love the idea of helping people build companies and helping create companies. I myself wanted to explore what that was like. And so I end up leaving academia to start at a consulting company and then from there, spun out another company called Anaconda, and then that’s now growing, and now I’m back running another consultancy whose goal is to spin out other companies, so it’s been a fun journey, but I’ve been open source the whole way, and then driven by entrepreneurship along the way as well.
Ginette: The open-source space that Travis has heavily contributed to has lava-lamp-like fluidity. It’s constantly shifting and re-molding itself as new factors come into play.
Travis: Now what’s happened over the past three years is there’s been new systems built as well, so we’re kind of at an interesting time in the open source ecosystem where for a lot of years, NumPy was kind of at the center of all of these computations, and then as new, you know, huge work has been happening by Google and Facebook and Microsoft and Baidu and other companies, they’ve sort of grown this space significantly, and they’re not always using common APIs either. They’re sort of recreating some of the APIs. So it’s kind of interesting. It’s wonderful technology, but it’s also now not so consistent, and it’s becoming a little bit confusing for people to figure out where they go. Now a lot of those folks, they basically use the terminology of NumPy, you know, the same concepts, and so it’s interesting to watch, kind of, names you created, like just pulled out of your hat, now showing up in all these libraries all over the world to kind of describe the same thing, things like dtype, shape, to describe the underlying capabilities. So it’s fun to watch that, but I also have a growing concern about how do we keep the community together and keep APIs and specifications kind of going forward instead of stepping backward, and then standardization conversations start to come into play, and so it goes from just being about technology to being about communities, and so that’s been a fun, a fun transition.
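For readers who have not used NumPy, the two names Travis mentions are attributes on every NumPy array. The short sketch below shows only NumPy itself; the array contents (a blank image) are a made-up example and not something discussed in the episode.

```python
import numpy as np

# A hypothetical 480x640 RGB image filled with zeros.
image = np.zeros((480, 640, 3), dtype=np.uint8)

print(image.shape)  # (480, 640, 3): the size of each dimension
print(image.dtype)  # uint8: the type of every element

# Converting the element type changes dtype but leaves shape alone.
as_float = image.astype(np.float32) / 255.0
print(as_float.shape, as_float.dtype)
```

Together, `shape` and `dtype` describe how a block of memory should be interpreted, which is the "common way to talk about data" Travis refers to.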
I hesitate to say back a step because definitely there’s new capabilities that are available, and the old capabilities are still there too. But in terms of where we go from here, it’s a little more fractured, definitely more fractured, and so it’s kind of like people . . . you know, when we merged Numeric and Numarray into NumPy, for ten years there was this sort of, “Oh, yeah, this is what we do,” and it’s very clear, and everyone rallied around the same APIs and the same interfaces, and then as this other movement has emerged, it’s become, “Oh, well, what do we do now?” I’ve gotten to talk to people, for example, about the PyMC story. It’s a Monte Carlo method for doing probabilistic computing. The idea of probabilistic computing is you can describe your model in relatively high-level computations: I’m going to do this A * B + C, do a function call on the array, and you can describe your model, and then rather than compute it once for a particular set of numbers, you run a bunch of numbers through it, and you kind of build a probabilistic output. Rather than a single answer, you get a probability distribution of the answers. It can be a lot more realistic. You know, a lot of times when you’re trying to predict the future . . . I used to teach inverse problems. I used to teach prediction theory, and I’d always emphasize that it’s not always about the single number you get in the future. It’s about how certain you are about that number. Given the information you have, how much do you really know about the future? And these methods really help you do that.
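As a rough illustration of the idea Travis describes, the sketch below writes a model as A * B + C followed by a function call, then runs many sampled inputs through it instead of one set of numbers. It is plain forward sampling with NumPy, not PyMC’s API or its inference algorithms, and the choice of `tanh`, the normal distributions, and their parameters are all assumptions made up for the example.

```python
import numpy as np

def model(a, b, c):
    # The "model": array arithmetic followed by a function call.
    return np.tanh(a * b + c)

rng = np.random.default_rng(seed=0)
n_samples = 100_000

# Hypothetical uncertain inputs: each one is a distribution, not a single number.
a = rng.normal(loc=1.0, scale=0.1, size=n_samples)
b = rng.normal(loc=2.0, scale=0.2, size=n_samples)
c = rng.normal(loc=-1.5, scale=0.3, size=n_samples)

# Run "a bunch of numbers" through the same model in one vectorized call.
outputs = model(a, b, c)

# The single-number answer, computed once from the central input values.
point_estimate = model(1.0, 2.0, -1.5)

# The probabilistic answer: a mean plus an interval holding 95% of the samples.
mean = outputs.mean()
low, high = np.percentile(outputs, [2.5, 97.5])

print(f"point estimate: {point_estimate:.3f}")
print(f"sample mean:    {mean:.3f}")
print(f"95% interval:   [{low:.3f}, {high:.3f}]")
```

The width of the interval is the part a single number hides: it is one way of expressing how certain you are about the prediction given what you know about the inputs.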
And there’s one, PyMC3, that is built on top of NumPy, on top of a product called Theano, which also leveraged NumPy, and then they were basically saying, “What are we going to do with PyMC4? How are we going to do this? Are we going to build on top of NumPy again? Are we going to do TensorFlow? Torch?” All of a sudden, it was a big question. They didn’t know. And I believe they decided to sit on top of TensorFlow for now . . . that conversation definitely helped me . . . kind of, “We’re back. We’re back ten years ago to where people are just not knowing . . .” and that’s fine in the sense that it’s great code to use. We don’t really care if it’s TensorFlow or not. The trouble is that when the developers aren’t sharing common interfaces, then we’re not moving as quickly together forward. End users can still, hey, they used your library, it’s fine for them. But I think about the community of people building together the infrastructure, and we end up doing a lot of repeated cycles, a lot of wasted time together. And there’s always going to be some competition. It’s like as something becomes more popular, you have more approaches. I mean, computer languages themselves are this. If you think about it, there’s Python. There’s R. There’s Node. There’s Nim. There’s new languages all the time. And it’s not that there should only be one language, and it’s not like there should be only one way to do this computation, but for common things, you hate to see when you know exactly this is the way to do it, and we kind of have four implementations of exactly the same thing. That’s when it starts to get a little bit . . . Do we really have to do it this way? Can’t we cooperate a little better?
Curtis: Travis’s aim is to build on open source, and he’s figured out how to do that while also fulfilling the basic need of making an income.
Travis: There is definitely a tension between the open-source philanthropic community spirit and then the for-profit I got to . . . I need to sell something, right? And it’s not necessarily intrinsic tension. It’s just, you can end up focusing on different parts of things. A lot of people in for-profit end up so focused on the “How am I going to monetize this? How am I going to make money off this?” that it can be very easy to lose sight of the community, what the needs are in kind of actually building a community. For me, the way I combined those was the reason I wanted a for-profit company was to make more resources, to create more resources to help create more open source. So the whole reason to put effort into building a profitable company was to drive revenue that could be used to pay people to work on open source. That’s what drove me almost entirely.
Now that doesn’t necessarily drive everybody who’s building a for-profit company. But that was certainly in my mind, and I realized to build open source, you’ve got to pay people to work on open source. A lot of open source is done by volunteer time. You can get a lot done with volunteers, but in order to get the kind of work done to really accomplish the world’s goals, especially in machine learning and analytics, you’ve got to pay people, so how are we going to do this? How are we going to pay people while allowing them to work on open source? That’s been a question I’ve had for decades. So that’s what drove me to create Anaconda, and Anaconda’s purpose was to find a product that would be built on top of open source. We found one effectively by solving the packaging problem. By solving the packaging problem, we discovered an opportunity to sell to corporations basically the ability to manage their deployment story, and that’s become Anaconda Enterprise. So it took us a little while to figure that out. We basically started Continuum, and we didn’t have Anaconda as a product in mind. We actually kind of had an array server, kind of a NumPy server, as our idea, kind of like some way to do data online so people could do their computation in the cloud and then have distributed data across the Internet. We kind of had a broad scope of people doing high-level computation and having it distributed to the cloud, more of a data service, data computation service. We explored that for a little while, and it kind of led us to do some open source projects around that. But then we uncovered, almost accidentally, that people just needed the stuff installed. That was the biggest problem they were solving, so we ended up kind of . . . any entrepreneurship effort, I’m a big fan of the lean startup kind of a model. A business is a scientific experiment with a market. Your success is measured by whether people are paying you for something.
If you just let that drive your activities along with mixing this, you’re sort of experimenting. “Well, hmmm, I think this is what will help people experiment, and it either does or doesn’t.” So we had about five or six experiments going on at Continuum, and the challenge became that four of them were successful, just to different degrees of success, and then we ended up pursuing pretty much the one broadly successful company that became Anaconda.
Curtis: While Travis was building Anaconda, he was working on other side projects, such as NumFOCUS, a nonprofit organization that works to improve the health of the Python open-source community, and it funds certain open-source projects.
Travis: The idea of NumFOCUS is to be a central point where communities could trust that its agenda was only the community health. It had no other agenda but the community health. And I created NumFOCUS with a bunch of other . . . with some other board members. We created this at the same time I was starting Anaconda. And I did it basically because I knew I wanted to build a for-profit company (I’m really a big proponent of entrepreneurship and building productive companies that sell things to people), but then I also wanted at the same time just the tighter the stack. The things we were going to be building on are bigger than one company, and so I wanted there to be another organization that we could all participate with and kind of would feel, would have common trust with, and so we built both at the same time, and it was challenging. It’s hard to build one organization, so building two, certainly it took time and wasn’t as efficient as perhaps it could have been, but fortunately, due to the great efforts of a lot of other community members, NumFOCUS has really taken off over the past several years, and it is a great organization for collaborating around. Certainly this kind of activity . . . if we come up with libraries, they could be unifying libraries that could be fiscally sponsored by NumFOCUS. They could be seen as community libraries as opposed to something that was just going to benefit me personally or something.
Ginette: There’s an art to building open source projects. Travis explains here what patterns he’s seen as open source projects become popular.
Travis: Wes McKinney did a great job in creating Pandas, and not just creating it but organized a community around it, which are two independent steps and both necessary, by the way. A lot of people get confused by open source. They sometimes think they’re just kind of going to get people together and open source emerges from the foam, but what ends up happening, I’ve seen this now at least eight, nine different times, both with projects I’ve had a chance and privilege to interact with, but also other people’s projects. It really takes a core set of motivated people. Usually not more than three. Sometimes only one. You know one to three people, maybe five, but it’s sort of a small group of people that end up working closely together and building the core capability, right? And then if they’re on to something . . . if they build something a lot of people want, then the next step they have to be able to take is can they bring other people in? And can that one two three people become 7 to 9 and 10 and 15? And then can that take the next stage to 75 and then 80 and hundreds? And that’s a journey that definitely takes time. It doesn’t happen overnight. It might take you six months to a year. Usually about a year to do initial work on something significant, and then probably another three, two to three years after to build up a community. And that’s sort of best case scenarios. And then from there, it starts to grow. So you know, that sort of takes you to a different topic about communities, but I kind of wanted to give you a flavor, but definitely Pandas was a big part of why NumFOCUS became popular and also why Python became popular for data science.
The other two projects that were huge I think were Jupyter, IPython Notebook became Jupyter and Jupyter is a NumFOCUS-sponsored product as well, and then scikit-learn. scikit-learn was this venerable scikit. It’s interesting, the story of scikit-learn, because SciPy was started in 2001, really 1999, when I started writing the first modules of SciPy and then brought them together with Eric Jones and Pearu Peterson to bring SciPy to life in 2000, 2001 as a single distribution, and actually SciPy was like a distribution of Python masquerading as a library. The number one problem it was solving . . . how do you get this stuff, right? Because it had a bunch of different capability inside it, a bunch of different packages. We all kind of pulled it into one namespace, mistakenly, in retrospect. I mean, at the time, it was the right thing to do because it brought things together, made it easier to install, easier to work on. But now with packaging and distribution more solved, it really is better to split them out, and that became obvious to folks around 2004 when we built the scikits. In 2005 people said, “Scikits: you need to break this apart into different things,” and that’s the energy that created scikit-learn. scikit-learn became the first really successful scikit and really overshadowed SciPy and the tools it was built on, but fantastic. It really hit a core need for machine learning and prediction, and it was there at the right time.
Now, scikit-learn is a very distributed group. It emphasizes the nature of these. scikit-learn was built before NumFOCUS was created. It’s a very distributed group. International group. Lot of France. Lot of Asia. Lot of the United States. South America. And so scikit-learn is not yet a fiscally sponsored project of NumFOCUS. It’s sort of an affiliated project, but it really has to do with the scikit-learn community being so diverse, so broad, it’s so hard to even get people to agree as to what it should be.
Ginette: scikit-learn, being the very popular machine learning library it is, has a lot of functions to offer data scientists. In order to understand what’s going on with functions in libraries like this, it really helps to understand the concepts behind them. A great place for this is Brilliant.org. Their classes help you understand algorithms, machine learning concepts, computer science basics, probability, computer memory, and many other important concepts in data science. The nice thing about Brilliant.org is that you can learn in bite-sized pieces at your own pace. Their courses are entertaining, challenging, and educational, and they go beyond lectures to help you actively learn. It’s a great resource.
If you’d like to deeply understand machine learning and data science, give them a try by going to brilliant.org/DataCrunch. They were good enough to sponsor this episode, and using this link lets them know that you came from us, and you can sign up for free, preview courses, and start learning! Also, the first 200 people that go to that link will get 20% off the annual premium subscription. Once again, that’s brilliant.org/DataCrunch to understand machine learning!
Curtis: Now that we’ve heard from our sponsors, let’s find out what Travis is doing now. Travis has since moved on from being the CEO of Anaconda and has started a new company called Quansight.
Travis: I’m super excited by Quansight. I left Anaconda, in fact, at Anaconda, we hired some additional help, a new CEO, great leaders who could help carry the vision of Anaconda forward, and then in the process, realized hey, this general problem of just building companies that support open source is still there helping . . . and also realizing after having met a lot of great engineers, there are a lot of great engineers who want to figure out how to take their ideas to market, and I want to help them, and not that I know everything, but after having built a company that’s become successful and then having been involved in open source for so long, I feel like I have some ideas I can share with folks, and I just want to help them build their dreams, and I realized that about myself. That’s what I really wanted to do.
So I thought, well, let’s build a company where that’s what we do. It’s kind of my take on an incubation company. It’s very pragmatic. Our dreams are in open source. We can do anything in open source, and we can dream big there, and then for companies, we also dream big, but it’s also very pragmatic: How are we going to go to market? What product are we going to build? What’s the go-to-market playbook look like? What does the sales process look like? You have to be fairly pragmatic when it comes to getting people to buy things. So what I say to people specifically about Quansight is we grow talent, we build technology, and we discover products. Some people find that a little unusual: discover products? What do you mean by that? It’s really to differentiate building technology, which is about good ideas and the right infrastructure, from products, which are, What will people buy? What can you sell that people will buy that helps you build your company? To me, that’s a discovery process. I’m discovering from the market what its current desires are, how to solve its pain points with products.
Ginette: For you entrepreneurs out there, here’s some advice from someone steeped in domain knowledge on some great areas to innovate in and ways to get involved with open source.
Travis: I think my advice currently, and this is just my perspective, is right now the verticals really need a lot of help. I think there’s a lot of people out there trying to do platform plays, like horizontal plays. They’re certainly possible, but there’s going to be one out of a hundred successful platform plays, but you’re going to have a fifty percent, five out of ten of the vertical solution plays are going to be helpful because there are . . . if you’re in finance, you’re in oil and gas, you’re in consumer products, you’re in telemarketing, IoT, there’s a number of these industries where they have maybe aging software, they have solution providers providing them something, but it can be displaced by the newer approaches that machine learning enables.
Previously, you had to write a lot of code to build a good model. Now it’s about the data and using the data to construct a predictive model that is fairly . . . that system or predictive model is kind of generic . . . think of it as a pipeline of array computation. But the key of it is the weights, the key of it is the data that goes into that model, and really training on that data is the key aspect. So I’d be looking for verticals to help, and I’d be looking for data sources, like specific data sources that are going to help. Be accumulating data . . . I think of these companies that are trying to make mortgage prices better. They’re disrupting real estate by using data. I really would love to see entrepreneurship in the medical arena. I always caution people about that because it’s going to take more money to really make a difference in the medical arena because we have some outdated regulations. You know, regulations that were intended for a different era that are now weighing down and slowing down the process of innovation in medicine. I hope that that will change in the future, but in the meantime, what that translates to is you need to have more money in order to make a difference there, but verticalization is one of the big stories. I think another interesting story is this decentralization. I think there are going to be some really good startups. If you are set on a horizontal play, I think the decentralized horizontal play is a really interesting area over the next five to ten years where there’s going to be a lot of opportunity.
Curtis: We asked Travis what his best piece of advice is, and this is what he had to say.
Travis: My best advice is to be patient. Overnight successes are never overnight. You know, you’re seeing the last part of the journey, and every time, that journey took a long time typically. Be patient. Figure out something you really love, something that really excites you, because you may have to stick with it for five to ten years to make a big difference. Or in my case, twenty years, so you have to love it because you have to enjoy the process, enjoy the journey. I think jumping into open source projects is a really, really great way to contribute. Maybe you don’t know how to code or you’re concerned about your ability to code. It’s okay. Everyone starts somewhere. Nobody knows everything to begin with. Start with a documentation pull request. Start with a test: “Hey, here’s a test I did for this feature I saw was missing.” These are always welcome. I don’t know a project out there that wouldn’t welcome documentation. Now they might have to coach you on the documentation, the way it’s spelled. They may have to coach you on the test, but that’s a great way to get involved. Offer to help. There’s a bunch of ways to organize communities. There’s the PyData community. There’s the Jupyter communities. There’s places, and kind of meeting people in meetups is a great way to kind of connect and see what the issues are. And as you meet people and explore that, you’ll find opportunities. There’s so much that needs to be done. The minute you find something that needs to be done, connect, reach out, be a part of it, and then enjoy the journey.
Ginette: A huge thank you to Travis Oliphant from Quansight for chatting with us about his insights and perspectives on open source.
As always, you can find all links to attributions in our show notes at datacrunchpodcast.com.
Curtis: And as the final note, we would like to let everyone know that we are celebrating our second birthday this month. Our first episode dropped in October of 2016, so we’re super excited to still be doing these episodes. We’ve learned tons of interesting things, and we hope to continue bringing you interesting interviews and information.
If you are feeling particularly generous today as it is our birthday and want to give us a gift, we would very much appreciate a review on iTunes or anywhere where you listen to your podcasts. It really helps us to know that you are enjoying the show and helps other people find it. So thank you to everyone who has reviewed us, and we very much appreciate it.
Links
lightpostanalytics.com/datacrunch
Attributions
Articles
Music
“Loopster” Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License
http://creativecommons.org/licenses/by/3.0/