How did one boy’s stuffed yellow elephant permanently intertwine itself in history? What is a data scientist? Why is right now the golden age for data science? We take a crack at all three of these questions—the second two, with the help of Gregory Piatetsky-Shapiro and Ryan Henning.
Transcript
Ginette: “Over the past few years, we’ve seen these news flashes:
“An article in Harvard Business Review in 2014, titled: Data Scientist: the Sexiest Job of the 21st Century
“Mashable’s article in 2015: So You Wanna Be a Data Scientist? A Guide to 2015’s Hottest Profession
“Business Insider, 2016: Data Science was the #1 Profession as Rated by Glassdoor
“A data science industry observer, KDnuggets, 2017: Data Scientist: Best Job in America, Again, which cites the most recent Glassdoor report outlining the very top jobs in America:
“It turns out, four of the five top US jobs deal with data. In descending order, we find data scientist, devops engineer, data engineer, and analytics manager.”
Curtis: “With four out of five of these top jobs orbiting data, clearly something’s going on here.”
Ginette: “I’m Ginette.”
Curtis: “And I’m Curtis.”
Ginette: “And you are listening to Data Crunch.”
Curtis: “A podcast about how data and prediction shape our world.”
Ginette: “A Vault Analytics production.”
Ginette: “Today is a culmination of everything we’ve talked about in our series on the history of data science. This is where all the contributions of Florence Nightingale, William Playfair, Ronald Fisher, Ada Lovelace, and many others come together in one place. We’ll add a couple more people to this list to answer these two questions: ‘What is a data scientist? And why is right now the golden age of data science?’”
Curtis: “According to IBM, ‘everyday, we create 2.5 quintillion bytes of data.’ But what does a quintillion actually look like?
“Well, if you take one quintillion pennies, you could actually place them face up end to end and blanket the entire surface of the earth 1.5 times over. Or think about one quintillion ants. That would be like taking all of the ants that exist today on planet earth according to some estimates, and then you have to take that number and multiply it by 100. So, that ant pile in your front yard becomes 100 ant piles in your front yard. Basically ants take over the earth. And we make 2.5 quintillion bytes every single day!
“The next question is, how much information does that actually represent? It’s 250,000 times the amount of information that all the printed material in the Library of Congress contains. And we make that every single day.”
Ginette: “In 2013, SINTEF published this stat, quote: ‘90% of the world’s data has been created in the preceding two years.’ According to one Ph.D. technologist, this has been true for the last 30 years because every two years, we produce 10 times as much data.”
Curtis: “This exponential growth is insane. Just as an example of this type of growth rate, if you take a hypothetical scenario, and you take the world’s population, and say it starts growing as rapidly as data is growing now, it would look like this: Currently, the world’s population, 7 billion people, could fit in the size of Texas if they were living as densely as they do in New York City. Now, in two years’ time with this growth rate, you’d actually have to cover the entire United States and half of Canada with people living in New York City-like density. And if you extrapolate that out ten years keeping the growth rate the same, you’d have to cover the entire planet, including all of the oceans, with New York Cities, and then you’d have to do that with 100–150 additional earths to fit all of those people. That’s the kind of growth rate we’re talking about.”
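The arithmetic behind the last two statements can be checked with a short sketch. It assumes, as the technologist quoted above does, that each two-year period produces 10 times as much data as the one before. The function names are illustrative and do not come from the episode.

```python
def share_created_in_latest_period(growth_factor, periods):
    """Fraction of everything ever produced that comes from the latest period.

    Assumes each period produces growth_factor times as much as the previous one.
    """
    produced = [growth_factor ** k for k in range(periods)]
    return produced[-1] / sum(produced)


def growth_multiple(growth_factor, years, period_years=2):
    """How many times larger a quantity becomes after the given number of years."""
    return growth_factor ** (years / period_years)


# Share of all data that was created in the most recent two-year period,
# after 15 periods (30 years) of tenfold growth per period: about 90%.
print(f"{share_created_in_latest_period(10, 15):.1%}")

# The population thought experiment: 7 billion people growing tenfold
# every two years for ten years.
multiple = growth_multiple(10, 10)
print(f"growth multiple after 10 years: {multiple:,.0f}")
print(f"hypothetical population: {7e9 * multiple:.1e}")
```

Under that assumption, the most recent period always accounts for roughly nine tenths of the total, which is why the 90% statistic can stay true year after year. Ten years of the same growth is five tenfold steps, a factor of 100,000.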
Ginette: “With data collection on the rise, one report goes so far as to say that only the data literate will have the chops to be executives in the future, quote:
“‘[E]veryone should learn to love data, if only for the sake of their career. “Within 10 years, if you’re not a data geek, you can forget about being in the C-suite.”’”
Curtis: “If you’re thinking about how you can surf this data tsunami, you’re not the only one, because even if you’re not directly involved in the data industry, data is going to make its way to you. Never before in the history of our planet has data played such a central role. It’s permeating our technology, our social interactions, our purchasing behavior. It’s woven into society, and almost every industry is seeing change because of it, and this is just going to keep spreading. So becoming data literate should be one of your top priorities, not only for your job, which is going to morph as a result, but also so you can better navigate a society that’s molded by data.”
Ginette: “To gain insight into this field, we were lucky enough to interview one of the very first people to pioneer data science. This man was doing data science before it was cool. He even started the second website ever dedicated to data science, called KDnuggets. Back then, they didn’t call the budding field ‘data science.’ That term became popular around 2008. But what people did in the field was similar to what data scientists do today—wrangle data for insights.”
Gregory: “My name is Gregory Piatetsky-Shapiro. I worked for a while at GTE laboratories. It was a large telephone company, and we were thinking of interesting ways of applying intelligent algorithms to large databases.”
“I was involved in organizing KDD conferences for many years.”
Ginette: “KDD stands for ‘knowledge discovery in databases.’ This was at one point a contender for the field’s name. While it’s a little longer than ‘data science,’ it’s more descriptive.
“Now Gregory thought there needed to be a workshop on intelligent algorithms and databases, so he decided to organize a workshop in the field. One thing led to another, and soon he was writing a newsletter called KDnuggets, and eventually that newsletter grew into a website.”
Gregory: “KDnuggets website, I created it back in ’94. I think it was the second website in the world on data mining and data science.”
Curtis: “He then worked for several startups, consulted, and now describes himself as an industry ‘observer.’ His website has garnered a ton of interest, to the tune of 400,000 unique visits every month, with 150,000 subscribers and followers.”
Ginette: “But what was going on in the computer science industry when Gregory started his NYU master’s degree in computer science in 1977?
“We already know from our last episode that in 1976 the personal computer explodes across the United States. But there’s another invention that’s creeping up around this time—something that eventually plays a huge role in data science.
“It starts out like the ENIAC as a top secret Department of Defense project. This places us in 1963. In particular, it’s run by a group called ARPA, the Advanced Research Projects Agency. It’s now known as DARPA. The intellectual creator for this seminal technology is J.C.R. Licklider, known as the ‘Johnny Appleseed of computing’ for the huge growth he catalyzed in computing. While developing this technology, he names it the ‘Intergalactic Computer Network,’ which other people eventually develop into what was called the ARPANET. This, as you may have guessed or already know, is the prehistoric Internet.”
Curtis: “This technology connects four distant computers at four different universities to one network in 1969, making it essentially the very first Internet. As part of testing it, a professor from UCLA sends the very first message over this network, and in the words of that professor, the interaction went like this, quote:
‘We typed the L and we asked on the phone,’
‘“Do you see the L?”’
‘“Yes, we see the L,” came the response.’
‘We typed the O, and we asked, “Do you see the O.”’
‘“Yes, we see the O.”’
‘Then we typed the G, and the system crashed . . . Yet a revolution had begun.’”
Ginette: “But the public didn’t find out about this until a couple years later. In 1972, the world saw this new network technology for the first time at a conference. This is also the same year everyone learned about another ‘hot,’ new computer application. Electronic mail. This was the rage, and as a piece of very interesting knowledge, in 1976, Queen Elizabeth was the first head of state to send an e-mail. Her username was HME2 (Her Majesty, Elizabeth II).”
“It was the very beginnings of enormous growth for people-to-people traffic, and concepts from ARPANET continue to morph throughout the ’80s.”
Curtis: “The National Science Foundation starts funding a project to help connect education networks, and this helps the Internet grow into something that researchers, educators, and the government mainly use. Businesses mostly weren’t involved because their aims generally weren’t in line with government goals.
“But something starts to change that. The Defense Department decommissions the ARPANET in 1990, meaning the government’s restrictions on businesses using the Internet start to ease up.
“More important than the decommissioning of the ARPANET, it’s about this same time that Tim Berners-Lee, the inventor of HTML, invents the beginnings of the World Wide Web, which is actually a way to access information over the Internet.
“Then in 1993, the World Wide Web software becomes public domain, and companies start to jump on this new technology.
“And to put this in context, in 1994, Jeff Bezos founds Amazon and everything starts to change.”
Ginette: “While a lot of development is happening with the Internet, Gregory decides to get his Ph.D. from NYU in Computer Science, and he focuses his dissertation on applying interesting algorithms to databases.
“Early in his career, he starts working on projects that basically make it a lot easier for you to find what you’re looking for within massive amounts of numbers. Eventually, he becomes a principal member of the technical staff at one of Verizon’s predecessor companies, GTE Laboratories, a large telephone company. This is where he starts and leads the world’s first knowledge discovery in databases project from 1989–1997. Here, he basically finds ways to help businesses find critical messages in their data.
“Essentially, he was developing data science tools before anyone else.”
Curtis: “Other people also begin developing similar tools. But, eventually, they hit a limit with how much data these tools can actually process. They don’t have enough power or memory to work with the growing amount of data that exists. And as they begin to struggle with this, someone is able to crack the problem. It has to do with a yellow, plush child’s toy, which we’ll get to in a second.”
Ginette: “But before this breakthrough, an important social movement is growing. In the early Internet days, academics freely shared computer code with each other. The aim was simply to advance knowledge for knowledge’s sake. They felt everyone should have free access to code, and their work was for the community’s good. Free computer code love, harmony, and happiness ensued. Kumbaya for all. In pre-computer days, an equivalent might be a brownie recipe exchange: take the recipe, adapt it as you want, and share it with your friends—all of this free of charge simply for a delicious brownie.
“But then a brownie recipe controller shows up. As computer code becomes more complex to develop, computer companies realize that they can make money developing and selling complex code. But this rubs some people the wrong way, hacker types, and this starts a social movement for free and open software, where people can access code free of charge. Some people in this camp start to make code accessible to the public. One of the leaders in this social movement is Richard Stallman, who now seems to have an almost mythical reputation according to the Twitter handle @stallmanfacts, a bot that regularly spouts out Chuck-Norris–like pseudo facts about the man, like ‘Richard Stallman wasn’t born. He was compiled from source.’ ‘Richard Stallman’s DNA is in binary.’ You get the idea. Another group with a similar philosophy is formed in 1998 by Bruce Perens. This second one calls itself the Open Source Initiative, which coined the term ‘open source.’”
Curtis: “Some say, and I agree, that the backbone of today’s data science tools comes from these movements. Prime examples are programming languages like Python and R, which are absolutely essential to data science today. These languages are free and available to anyone, and people can develop code that everyone can download, modify, and use. And this gives data scientists the ability to form into an enormous worldwide community that anyone can participate in at virtually no cost.”
Ginette: “So this is where that yellow children’s toy we mentioned earlier comes into play. In 2006, we see the beginnings of a hugely important open source software project, now known as Hadoop, which explodes in popularity around 2013. The inventor, Doug Cutting, names the software after his son’s stuffed yellow elephant. It all starts with Doug and his graduate assistant, Mike Cafarella, collecting web pages to create an index of the Internet. But as the Internet grows rapidly, it becomes too big to collect and process on one computer. There isn’t enough storage space or enough power to process the data. So they have to get creative.”
“They find some Google research papers that help them develop a new technology to process vast amounts of data. The basic idea is this: Instead of having one computer process everything, their technology allows that computer to essentially distribute the work to other computers when the load is too big for it. Then all these computers work together to solve the problem. The more data you have, the more computers you recruit to help. It seems like a really simple concept, but it’s very technically complex. It took a lot of ingenuity and hard work from the inventors to create software to do this.
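The split-the-work-then-combine idea described above can be illustrated with a toy word count. This is a minimal sketch and not Hadoop's actual API: threads on a single machine stand in for the separate computers, and the function names and sample pages are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(pages, num_workers):
    """Deal the pages out into num_workers roughly equal piles."""
    return [pages[i::num_workers] for i in range(num_workers)]


def count_words(chunk):
    """Work done by one worker: count the words in its own pile of pages."""
    counts = Counter()
    for page in chunk:
        counts.update(page.lower().split())
    return counts


def distributed_word_count(pages, num_workers=3):
    """Hand each pile to a worker, then merge the partial results."""
    chunks = split_into_chunks(pages, num_workers)
    # Threads are a stand-in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # adds the counts together
    return total


# Hypothetical sample "web pages".
pages = ["yellow elephant", "elephant data", "data data"]
print(distributed_word_count(pages, num_workers=2))
```

The answer is the same no matter how many workers share the load, which is what lets this kind of system add computers as the data grows. The hard engineering the speakers mention, such as moving data between machines and recovering when one fails, is left out of this sketch.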
“With Gregory’s experience working in this field, we can believe him when he says this:”
Gregory: “Right now there is a golden age for data scientists because there are an amazing amount of tools, and it becomes easier and easier to do machine learning.”
Ginette: “Today, with practically endless amounts of data storage, tons of organizations are collecting their data at a rate we’ve never seen before. This collection frenzy creates a mind-boggling amount of information that’s completely meaningless unless someone mines it for nuggets of knowledge. This is where the data scientist comes in, and this is where we attempt to answer that 25 billion dollar question, ‘What is a data scientist?’”
Curtis: “At the core, a data scientist is someone who learns truth about the world using data and finds ways of applying those truths to benefit other people. What does this look like on a day-to-day basis? Their job always starts with important questions that either someone else asks or they ask themselves. Then they have to find relevant data that can help answer that question, or if it doesn’t exist, they have to come up with a system to collect the data. Then comes the difficult task of working with the data, which involves a lot of programming and interfacing with the data with technical tools. And then they draw insights from the data using statistics and visualizations. Finally, they have to present that data to decision makers in a way that’s persuasive and can be understood. This means they have to be really good communicators with an eye for what’s relevant to their audience, or their findings will basically fall on deaf ears. Very often, they rely on quick, powerful visuals to effectively tell their data’s story.”
Ginette: “More than all the technical skills, a successful data scientist needs to be insatiably curious and relentlessly exploratory. If they see an important trend in the data, they pursue it with unbounded curiosity and ask the right questions of it.”
Curtis: “Like Gregory said earlier, this is the golden age for data science. And there are four reasons why right now is the perfect storm: (1) our technology has advanced enough that it can support working with huge data sets; (2) this technology is widely available; (3) data is literally everywhere, and it grows every day. We’ll get to the fourth one in just a second, but before we do that, there’s an important problem we need to highlight first.
“Right now, the biggest thing that’s keeping data science from spreading more rapidly is the lack of people who actually know how to do it. There’s a large skills gap. These skills are in such high demand that universities can’t keep up with the needs of organizations, and because of the shortage, lots of online courses and bootcamps that help people rapidly acquire data science skills are popping up across the United States. We spoke with an instructor of one of the largest bootcamps in the United States.”
Ryan: “My name is Ryan Henning, I work for Galvanize in Austin. I teach the data science immersive course with my colleague, Scott Schwarts.”
Ginette: “Ryan gives us his take on Data Science:”
Ryan: “The field of data science is very unique in that it can span so many different industries. So even, even my students that have graduated from Austin, one of them works at an investment firm, one of them works at an insurance provider, another works for a construction management company, another works for like a call center optimization start-up, another works for an online marketing firm, another is going to soon be working for an oil and gas company in Houston. And that’s the beauty of data science is it spans so many industries, you can pursue whatever industry is interesting to you and be a data scientist in that industry.”
Curtis: “And this leads us to the fourth element of why right now is the golden age of data science. It’s because there’s huge demand from nearly every industry. They all have data, and they know it’s valuable, but they don’t have enough people to actually work with it.”
Ginette: “A brief look across some industries will show its huge potential.
“In the biotech industry, we’ve seen a lot of growth. Thanks to data science, we can now decode an entire genome in a matter of hours for around $1,000. In just 2007, it cost $10 million.
“In the energy sector, we can track exact power usage in our homes and businesses. Not only that, the industry is starting to use demand response initiatives, meaning you as a consumer can save money if you decide to use energy from the grid during lower energy consumption times.
“In the finance sector, we’re seeing the rise of roboinvestors, basically algorithms trained to invest your money at lower cost since there’s lower overhead. Data scientists are also harnessing data to take down fraudulent people.
“Let’s look at an industry that probably doesn’t pop into our minds first: Hospitality. It has a lot to gain from data science. Designing a great customer experience is critical for this industry, and data scientists can analyze mobile and web behavior, purchase histories, social media reviews, and can even help predict what temperature you’d like your room at when you arrive at the hotel.
“This list goes on and on and on. It’s hard to find an industry that won’t see huge benefits from data science.
“But let’s not be too hype-driven. Even though it is having massive effects in various industries and has a lot of potential, data science isn’t a panacea, and it isn’t necessarily easy (yet).”
Ryan: “In the last few years data science has become a very popular job title, partly because of the Harvard Business Review calling data science the sexiest job of the 21st century, and while that’s cool, I feel it’s kind of led people to believe that data science is like black magic. If you have a problem, then data science solves it, and that’s really not the case. Really what data science is, is just about taking raw data and just sort of squeezing information out from it. Of course the data has to have that information before you can squeeze it out, like if you have a dry sponge, you’re never going to get water no matter how hard you squeeze.”
“The role of data science is the ability to actually make data-driven decisions in a business, so really a data scientist in a business is going to have one foot in business operations, to sort of understand what the business does, and one foot in the engineering operations to be able to actually crunch the data and draw inference from the data, so in a lot of ways they’re a bridge in the business between the engineering and the business operations.”
Curtis: “Data scientists are hybrids of strategic business and technical coding, and the best of them will notice what is relevant and what matters to people, and they’ll also understand the technical aspects of how to deliver it.”
Ryan: “A lot of what data scientists actually do in a company is participate in part of the data acquisition process to understand the business but then understand what kind of data does the business need to actually help the business make decisions and be a part of that process of what data do we need to collect from the beginning.
“What we do here at Galvanize is we require that students come in already knowing a lot of technology. We require that they already know how to program, specifically that they know how to program in Python, and we require that they know statistics as well.
“But for our students who do come in with these important prerequisites—knowing programming, knowing statistics—in those three months we can really bring you from those very strong skills you already have and tweak them a little bit so that they become very applicable for being a data scientist.
“And the goal of Galvanize is that we want to transform lives. We want to actually provide an opportunity for these students to come in and learn the tools and the skills they need to actually, you know, change their career trajectory.”
Curtis: “So there is opportunity to be a part of this golden age, and the industry needs more people to gain the skills. Just be cautious, however, because golden ages don’t last forever. Gregory weighs in on this:”
Gregory: “I’ve recently run a poll about automation, and I think about half of the respondents were expecting that many machine learning tasks that data scientists do will be automated by 2025, and data scientists should enjoy the golden age now, but be aware that automation is coming, not only for simpler jobs but also jobs that deal with data that have well defined rules, that have clear criteria of what is better and what is not, are more likely to be automated, and there are some interesting studies by Tom Davenport that show professions most in danger of automation, and data scientist is one of them.
“So data scientists would be well served to focus on tasks that are harder to automate. For example, trying to understand how to best position the problem, what are the assumptions in this, in what they are doing, how to best present it, what the most convincing ways to tell stories. So kind of focus on softer skills of data science.”
Curtis: “The technical skills, including coding, are still very necessary and will be for several years to come, so if you want to be a contributing data scientist, you should definitely learn them. But be aware that change is coming, and you would benefit a lot from learning skills that aren’t talked about very much—like business and people skills. You need to understand the business problems in order to choose the data that’s most relevant to that problem, to communicate the insights to the people effectively, and to properly frame how to think about these kinds of problems. In the future, it’ll be more strategic and high-level thinking and less in-the-weeds coding.
“So, the skill set of a data scientist is going to shift. And in fact, more people who don’t have data science in their job titles will be doing data science-like work in their jobs—data and technology will continue to advance, making the technical side of data science easier and easier, opening the field up so more people can participate. People with a good understanding of data and what it’s capable of will rise to the top of their professions, because as we know, data is going to touch nearly every industry.”
Ginette: “In terms of data science in the future, you’re bounded only by your creativity, your ability to notice relevance, and your foundational understanding of data and statistics. So get excited for our future episodes where we cover how people are harnessing data to change the world.
“Next time, we’ll talk with one of the fastest growing startups in the UK to see how they’re using freely available data on the Internet to fix big problems in ways that might surprise you. In the words of our next guest, ‘A lot of the work we do really is finding a hard problem that no one’s really solved before and then use data science to crack it. They’re always quite interesting stories because, you know, they’re stories of a little bit of adventure, luck, and skill.’
“If you like what we’re doing, we’d love for you to leave us a review on iTunes. As always, you can find our sources in the show notes on our webpage at www.vaultanalytics.com/datacrunch. There we list links to some of the articles we’ve used as well as some of the music on this episode.
“And a special thanks to Gregory from KDnuggets and Ryan from Galvanize for giving us their time and perspectives.”
Some Sources
Music
“March of the Spoons” Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License
“Thinking Music” Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License
“Ave Marimba” Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License
“Wepa” Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License
“ZigZag” Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License
“Thief in the Night” Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License
“The Complex” Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License