Photo box full of old photos

Machine Learning, Big Data, and Your Family History

How can artificial intelligence, machine learning, and deep learning benefit your family? These technologies are moving into every field, industry, and hobby, including what some say is the United States’ second most popular hobby, family history. Today, it’s so much easier to trace your roots back to find out more about your progenitors. Tyler Folkman, senior manager at Ancestry, the leading family history company, describes to us how he and his team use convolutional neural networks, LSTMs, conditional random fields, and the like to more easily piece together the puzzle of your family tree.

Ginette: Today we peek into an area rich in data that has lots of interesting AI and machine learning problems.

Curtis: The second most popular hobby in the United States, some claim, is family history research. And whether that’s true or not, it has had a lot of growth recently. Personal DNA testing products have exploded in popularity over the past three years, but beyond this popular product, lots of people go a step further and start tracing their roots back to piece together the puzzle of their family tree. Today we’re going to dive into the data side of this hobby with the leading family history research company.

Ginette: I’m Ginette.

Curtis: And I’m Curtis.

Ginette: And you are listening to Data Crunch.

Curtis: A podcast about how data and prediction shape our world.

Ginette: A Vault Analytics production.

Ginette: This episode of Data Crunch is supported by Lightpost Analytics, a company helping bridge the last mile of AI: making data and algorithms understandable and actionable for a non-technical person, like the CEO of your company.

Lightpost Analytics is offering a training academy to teach you Tableau, an industry-leading data visualization software. According to Indeed.com, the average salary for a Tableau Developer is above $50 per hour.

If done well, making data understandable can create breakthroughs in your company and lead to recognition and promotions in your job.

Go to lightpostanalytics.com/datacrunch to learn more and get some freebies.

Tyler: My name’s Tyler Folkman.

Curtis: Who is a senior manager of data science at Ancestry.

Tyler: As I look across Ancestry and family history, we almost have, like, every kind of machine learning problem you might want, I mean, probably not every kind, but we have genetically based machine learning problems on the DNA science side. We have search optimization because people need to search our databases. We have recommendation problems because we want to hint the best resources out to people or provide them. For example, if we have a hundred things we think might be relevant to a person, what order do we show them? So we use recommendation algorithms for that. We have a lot of computer vision problems because people upload pictures and a lot of our documents, if they’re not, like, digitized yet, meaning that they’ve extracted the text, they might just be raw photos, or even just the things that are pictures uploaded, we want to understand what’s in them, so is this a picture of a graveyard? Is it a family portrait? Is it an old photo? And so tons of computer vision stuff, natural language processing. On the business side, we have marketing problems just like any other business, like how do you optimize marketing spend? How do you optimize customer experience, customer flow? And so it’s really a cool place because you really can get exposed to almost any type of problem you might be interested in.

Curtis: So back in the 80s, before you could go easily find information on the Internet, genealogists had to spend a ton of time trekking around to libraries to try to find information on their ancestors. Ancestry saw a business opportunity and started selling floppy disks, and eventually CDs, full of genealogical resources for genealogists to easily access in their home.

Tyler: And then they grew up through the Internet age and moved out online, and so they started digitizing a lot of the records, like the Census records, birth records, marriage records. A lot of people could build trees on their sites, so one of the core products they offer is you can come in, tell us who you are, and tell us about your ancestry, so like who are your parents? Who are your grandparents? And as you start telling us stuff about you, we try to provide you with the resources to learn more, so maybe you don’t know who your great grandparents are, but we can give you a marriage record for them that will point you in the right direction to make some connections to find out who those people are, and that’s the main offering Ancestry had and then recently, as DNA has become a thing, and I guess recently is relative; it’s been around for a while now, they’ve offered this ability to know yourself from a genetic side around ethnicity, so you can take a DNA test, and it will tell you your ethnic breakdown from your DNA, and you can find out some really other cool things like communities is a project we have, or a product, that allows people to see they fit into these different kind of genetic communities over time.

Ginette: There are a lot of incredible stories out there coming from DNA products like Ancestry’s, like discovering genetic communities you belong to that you didn’t know about. I’ve also heard some crazy stories, like people finding out their fathers aren’t their fathers or discovering surprising new details about grandparents’ lives as a result of getting to know newly identified relatives. So the DNA test is one particular aspect of Ancestry’s offerings, but what else is the company working on right now?

Tyler: Two big things we’re working on right now: one is around this idea of how do we use all the data in our databases. So for example, when you build a tree, you enter in information about you and your family, and then you might also have your mom, who uses Ancestry too, or a relative, and she enters in information and there’s clearly overlap because you’re related.

Ginette: One really nice thing about Ancestry, unlike some other models out there, is they keep your work separate from anyone else’s so no one can change the information you’ve put on your tree, but this approach often means there will be lots of duplicate records of the same person, introducing a problem that machine learning can help solve.

Tyler: What we like to do is connect all of these pieces of information, which is this problem of entity resolution, into clusters of true people, so we have an algorithm using machine learning that looks at all of our billions of records, our billions of records and trees, I should say, and tries to sort out when a record or a tree is referring to the same person in real life and then connect them or link them, and that allows our users to make really interesting discoveries that they don’t have in their own trees because we’ve linked things for them. One example is what we call a new person hint, which is a pretty recent thing we’re testing right now. If you’re in one of our beta tests as a user, you’d get new person hints, which, yeah, we would actually recommend to you a new person in your tree based on the inferences we made from connecting everyone.

We call those hints. If you’ve used the product, there’s a little green leaf that pops up and kind of shakes, like for example when I put my mom’s node in, I put her name in, but I didn’t put in the marriage date because I didn’t know it. About 10 seconds after I put her node in, I got a hint for her marriage record. Someone else had found it, and we, using our machine learning on the back end, had been able to make that connection for people and hint that up to them.
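The entity resolution idea Tyler describes, linking records and tree entries that refer to the same real person into clusters, can be illustrated with a very small sketch. This is only a toy under stated assumptions: the record fields, the `is_match` rule, and the sample data are hypothetical, and a simple union-find structure stands in for whatever Ancestry's production system actually does.

```python
# Toy entity resolution: link records that pairwise match, then read off clusters.
# All fields, data, and the matching rule are illustrative assumptions.

def is_match(a, b):
    """Hypothetical rule: same normalized name and birth years within one year."""
    same_name = a["name"].strip().lower() == b["name"].strip().lower()
    return same_name and abs(a["birth_year"] - b["birth_year"]) <= 1


def find(parent, i):
    """Return the cluster representative for record i, compressing the path."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def cluster_records(records):
    """Group record indices into clusters that appear to be one real person."""
    parent = list(range(len(records)))
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if is_match(records[i], records[j]):
                parent[find(parent, i)] = find(parent, j)
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())


records = [
    {"source": "tree A", "name": "Mary Jones", "birth_year": 1901},
    {"source": "tree B", "name": "mary jones ", "birth_year": 1902},
    {"source": "census", "name": "John Smith", "birth_year": 1875},
]
clusters = cluster_records(records)
print(clusters)
```

Here the two Mary Jones entries end up in one cluster and John Smith stays alone. Once records share a cluster, a fact attached to one of them (such as a marriage date) could be suggested to the owner of the other, which is the spirit of the hints described above.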

Curtis: This is a great example of a clever UI element that makes a powerful machine learning algorithm useful to an end user. They went the full mile from algorithm development all the way to the end-user experience, which is often something that’s overlooked. Now, with all the duplicate information in Ancestry’s backend, it takes a solid system to handle this and some sophisticated methods. So how exactly do they approach it?

Tyler: Much smarter engineers than I built a pretty impressive system that at real time processes data, so as you make a change on Ancestry, we real-time try to find where you fit with that new information or any changes that might happen to you. It’s a pretty large system; I frankly don’t know all the details of it, but it involves a lot of optimizations around the graphs, so our in-house graph database, which is very fast, optimizations on how to do what we call blocking, which is a pretty standard entity resolution thing for machine learning, which is kind of this idea where we have billions of things we need to compare to and a billion squared is a really, really large number, so you can’t do that in real time, so what we do is we do what’s called blocking, which is we try to find just the pieces of information that are relevant that might match, so if there’s no overlapping things, a very simple blocking algorithm might say, “you have no overlapping information, thus I don’t even need to compare you,” and so it just moves on, right? And we use some pretty sophisticated blocking methods to quickly narrow down the set of candidates that we have to compare with to keep it fast.
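A minimal sketch of the simplest kind of blocking Tyler mentions, where records with no overlapping information are never compared. The blocking key (first letter of the surname plus birth decade) and the sample records are assumptions made up for illustration; they are not Ancestry's actual keys.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical key: first letter of the surname plus the birth decade."""
    return (record["surname"][0].upper(), record["birth_year"] // 10)


def candidate_pairs(records):
    """Return index pairs that share a blocking key; all other pairs are skipped."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[blocking_key(record)].append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs


records = [
    {"surname": "Folkman", "birth_year": 1901},
    {"surname": "Folkmann", "birth_year": 1903},
    {"surname": "Smith", "birth_year": 1901},
    {"surname": "Smyth", "birth_year": 1852},
]
all_pairs = len(records) * (len(records) - 1) // 2
pairs = candidate_pairs(records)
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

With four records the saving is small, but the number of all-pairs comparisons grows with the square of the record count, while a blocked comparison only grows with the sizes of the individual blocks. A key this crude would also miss true matches that straddle a decade boundary, which is why the choice of blocking method matters.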

Ginette: Tyler has a great way of explaining blocking in simpler terms using a popular cartoon book known as Where’s Waldo? or Where’s Wally?, depending on which country you live in. It has pages packed full of hundreds of cartoon characters doing a lot of different activities, and your task is to find the tiny man in red and white on a visually rich and often chaotic two-page display.

Tyler: If you look at a big Where’s Waldo, it’s really hard to find Waldo, right? You look around and you’re trying to find it, but if someone came in and drew a box around Waldo that was not exactly on Waldo but within some radius of Waldo, if it were a circle, that makes it a lot easier, right? It narrows the scope you have to search in. That’s what blocking does for us, and we use some genetic algorithms to find the best circle to draw around Waldo basically, where our Waldo is a set of people we think might be relevant to this person or records.

There’s a lot of interesting literature on blocking. Traditionally, it’s been about information overlap so for example you might implement some heuristics around blocking which says like, “if you are born 100 years plus or minus this thing, you should not be compared because that’s just not possible.” What we have done, we actually use some genetic algorithms to tune our search mechanism for blocking to be able to find the appropriate things for high recall, meaning that blocking you really want to be performant, so within some performance constraint, you want to have really high recall, meaning it doesn’t miss things it should get, and then you want your record-linking algorithm to be high precision, so the blocking thing gives you all the things that could possibly be right within some time constraint, and then the compare algorithm or the entity linker is really, really good at getting precise decisions on the things that the blocking gives it.
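The trade-off Tyler describes, blocking tuned for high recall under a cost constraint, can be sketched with a birth-year window heuristic. Everything here is an assumption for illustration: the records are made up, `person_id` stands in for hand-labeled ground truth, and a plain search over a few window sizes replaces the genetic algorithms he mentions.

```python
from itertools import combinations


def blocked_pairs(records, max_year_gap):
    """Keep only pairs whose birth years are within max_year_gap of each other."""
    return {
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if abs(records[i]["birth_year"] - records[j]["birth_year"]) <= max_year_gap
    }


def blocking_recall(candidates, true_matches):
    """Fraction of known true matches that survive blocking."""
    return len(candidates & true_matches) / len(true_matches)


# Made-up records; person_id plays the role of labeled ground truth.
records = [
    {"person_id": "p1", "birth_year": 1901},
    {"person_id": "p1", "birth_year": 1903},
    {"person_id": "p2", "birth_year": 1850},
    {"person_id": "p2", "birth_year": 1850},
    {"person_id": "p3", "birth_year": 1990},
]
true_matches = {
    (i, j)
    for i, j in combinations(range(len(records)), 2)
    if records[i]["person_id"] == records[j]["person_id"]
}

# Simple search as a stand-in for the tuning step: pick the smallest window
# (fewest comparisons) that still keeps every true match.
best_gap = None
for gap in (0, 1, 2, 5, 10):
    candidates = blocked_pairs(records, gap)
    if blocking_recall(candidates, true_matches) == 1.0:
        best_gap = gap
        break

print(best_gap, sorted(candidates))
```

The blocking stage only has to avoid losing true matches; deciding which of the surviving candidate pairs really are the same person is left to the separate, high-precision comparison step.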

Curtis: Ancestry has some intricate custom machine learning models at play here, and I wanted to get Tyler’s perspective on a recent trend towards using more turnkey models, as opposed to building something out yourself, such as Google’s AutoML or other tools that are coming out on the market that make processing data easier. Now depending on what type of machine learning you’re doing or the type of model you have, this might make sense or it might not. From Tyler’s perspective, there could be some great turnkey tools out there, but there are some really important questions you have to ask yourself before deciding to use them.

Tyler: I think that the modeling side, I don’t want to say it’s easy, but for some problems, it’s a little bit more standardized, so for example, if you’re trying to do like a customer churn type of model and you have a lot of the same data that other people have on customer churn and then you just kind of want to plug it into a model, I think it’s very easy these days for an engineer or even maybe a nonengineer to kind of go find these tools, plug in the data, and get something up. I think where it’s a little bit trickier is understanding how to debug that thing. So for example, maybe you get the answer you wanted. It’s really, really good, but do you understand? Do you know where it’s not working well? Do you know where it might bite you? Do you know what assumptions you made were valid or not valid so that you can kind of know, when you go to production, is this going to work how it worked in my test offline? I think a lot of that takes some understanding of how that algorithm is working, and I know there’s work on trying to explain machine learning too, but I think a lot of that’s hard to know without actually knowing how the models are working on some level. Do you need to be able to write that from scratch yourself? Probably not, but you need to know enough to understand how it’s working and where it’s working, and if it’s not working, what do you do next? I mean that’s one of the challenges when I work with deep learning is you might run a perfectly valid model, and this happens to me all the time when I play with reinforcement learning. Like, “oh, this thing should work. I think this is exactly the thing that they made,” but for some reason it doesn’t. And so what do you do? Like how do you fix it? It’s, like, not trivial, and even before you get to the modeling, how do you get your data in a state that it’s going to be modelable, like removing outliers, getting the right features.

I think those are sometimes the harder problems that we think about that really influence the modeling side that we don’t have very good automated ways for, so like for example, for obituaries, the size of the text can matter for extracting names, because sometimes the name is the first word, and it’s bigger, and so one of the things we looked at was including the size of the name as a feature in our model, which isn’t a standard thing because most people don’t have that data, right? They just take the raw, the raw NLP data, the raw text, and so just a turnkey NLP thing would never take that into account, but we knew enough about how it works to plug it in.
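A minimal sketch of the kind of feature Tyler describes, where the printed size of a word is fed to the model alongside the usual text features. The `token_features` function, the feature names, and the font sizes are all hypothetical; the source only says that text size was considered as a feature, not how it was encoded.

```python
def token_features(tokens, font_sizes, i):
    """Build a feature dict for token i; font_sizes is assumed layout data."""
    token = tokens[i]
    median_size = sorted(font_sizes)[len(font_sizes) // 2]
    return {
        "lower": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "is_first_token": i == 0,
        # The non-standard feature: how large this word is printed
        # relative to a typical word in the same obituary.
        "relative_font_size": font_sizes[i] / median_size,
    }


tokens = ["SMITH", "John", "passed", "away", "Tuesday"]
font_sizes = [14.0, 9.0, 9.0, 9.0, 9.0]  # made-up point sizes from page layout
features = [token_features(tokens, font_sizes, i) for i in range(len(tokens))]
print(features[0])
```

A text-only pipeline never sees `font_sizes`, so it cannot learn that an oversized first word is likely to be a name. Dividing by the median keeps the feature comparable across newspapers that print at different base sizes.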

Ginette: Tyler brings up a key point here: understanding, at least on a basic level, machine learning algorithms and data science processes is really, really important in order to successfully create, deploy, and adapt models, and often the best way to understand them is through active learning. Many of you have reached out to us saying that you’d love to start working with machine learning even though it isn’t what you originally studied. One place you can look is Brilliant.org where you can learn in bite-sized pieces at your own pace. Their courses are entertaining, challenging, and educational, and they go beyond lectures to help you actively learn. It’s a great resource. They offer courses in computer-science algorithms, artificial neural networks, and machine learning that will help you better understand algorithms.

If you’d like to deeply understand machine learning and data science, give them a try by going to brilliant.org/DataCrunch. They were good enough to sponsor this episode, and using this link lets them know that you came from us, and you can sign up for free, preview courses, and start learning! Also, the first 200 people that go to that link will get 20% off the annual premium subscription. Once again, that’s brilliant.org/DataCrunch to understand machine learning!


Ginette: Now let’s talk about another interesting machine learning project called “Newspapers.” Ancestry has the largest historical newspaper collection available online.

Tyler: So that was the major thing I did when I first started. As I’ve started taking on more of a lead role, someone else on my team, Drew Pearson, has kind of taken that on with someone named Maria Morley (Fabiano), and then there’s another set of people who I work closely with, I guess if I’m saying names, Carol Anderson and Phil Crone who are full-time here are working on newspapers. So newspapers is exactly what it sounds like. It’s a collection of newspapers in the US. I’m pretty sure there’s over hundreds of millions of pages of newspapers, and what they are working on and doing a great job with, is trying to use natural language processing and deep learning to understand, at this point, obituaries, so even to this day people write obituaries in newspapers when someone passes away.

And you can find them online a lot now on, like, the online version of newspapers, and they’re a really, really great source for genealogists because they’re often written by family members, so they’re usually very good information, like the death date is going to be right. You can usually trust it. Information on people’s names are usually going to be right. The problem with newspapers’ data is that they are unstructured text, right? Like no one’s gone through and said “oh, here’s the birthday or the death day. Here’s the name. Here’s the spouse’s name,” etc., and so what we’ve been working on is how do we understand using machine learning a raw obituary and extract all the entities. In machine learning, this would be named entity recognition is what it would most commonly be called, but for us it’s named entity recognition specifically for obituaries, and so we’ve leveraged deep learning specifically algorithms around like convolutional neural networks, LSTMs, conditional random fields, to basically take in all this raw data and tag the appropriate people and places and dates with pretty high accuracy, so we can say this is this person who died. This is the day they died. This is the place they died. And once we can do that, we can then start connecting you in to our entity linker to link you to the right person so now that person gets relevant information from newspapers, which they might not otherwise get.
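To make the output of named entity recognition concrete, here is a small sketch of turning per-token tags into structured fields. It assumes the common BIO tagging convention (B- begins an entity, I- continues it, O is outside any entity); the label names such as `DECEASED` and `DEATH_DATE` are invented for this example, and the tags are written by hand rather than produced by a model.

```python
def decode_entities(tokens, tags):
    """Turn per-token BIO tags into a list of (label, text) spans."""
    entities = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that was still open.
            if current_label:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            if current_label:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label:
        entities.append((current_label, " ".join(current_tokens)))
    return entities


tokens = ["Mary", "Jones", "died", "March", "3", "in", "Provo"]
tags = ["B-DECEASED", "I-DECEASED", "O",
        "B-DEATH_DATE", "I-DEATH_DATE", "O", "B-DEATH_PLACE"]
entities = decode_entities(tokens, tags)
print(entities)
```

The resulting name, date, and place fields are the kind of structured record that could then be handed to an entity linker, as Tyler describes.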

Curtis: To prepare for a project like this, it takes a training set, which often requires people to do labeling by hand so the computer can learn what is correct, and this is a problem that many companies come across and often leads them to build out an in-house team or outsource the work to a third party. Building out that team can take explaining and convincing the higher-ups that developing a labeling team is necessary and valuable.

Tyler: This project spawned an in-house labeling team, so we now have a team, which looks like will exist for the near future, whose sole job, or not sole job but a lot of their job is labeling machine learning data, so what we actually did is we used this tool that’s open source called Brat, and Brat what it will do is you can give it a sentence and it will make it easy for people to tag things, so it’s basically a labeling tool, and we spun up a team here, trained them, which isn’t trivial, a lot of iteration back and forth on figuring out, like, how exactly do we want these things labeled, what are the special cases, and they went in and labeled all the entities, the places, the relationship connections so that we can learn from it, which was super important and really unlocked a lot of value for us, because otherwise getting off the ground would have been extremely hard, and it’s been something we’ve leveraged moving forward as we’ve had other projects that need labels, cause we already have this team that has that capability, we’ve been able to move a lot of other projects forward a lot faster. Getting the training data was definitely challenging, especially since we didn’t have a team, like getting the team spun up and convincing that to happen took some time.

Ginette: Tyler also points out the prominence of open source tools for machine learning. Gone are the days when you would have to spend a lot of money on software to do modeling and understand data. Today, the state of the art is in open source, making it available to anyone, and in fact we’ll have one of the top thinkers in open-source Python coming up on a future episode.

Tyler: Interesting thing about machine learning is a lot of the like, especially in deep learning, a lot of the open source stuff is the state-of-the-art. Deep learning especially is an interesting field because there’s so many different hyperparameters or architectures you could try and if each attempt at a model say takes 10 hours on one GPU, it’s not cheap to try different architectures, right? So what you’ll see often in industry is that people leverage the great open source community, where Google, Facebook, and companies that have the resources to run these types of things, OpenAI, will make available things that they found that work, and so we started from what they’ve done, so like a pretty standard model is to use LSTMs with text data, right? If you have sequence models, you might drop a conditional random field on the end which helps order things out at the end, and starting from there, it actually got us pretty far, and then from there, we can make the tweaks necessary to kind of adapt it to our problem and take into account things that are different for us, but really the great open source community around machine learning right now, it really helps us get to a pretty good starting place pretty quickly, even tools like scikit-learn for traditional machine learning are just great ways to get a baseline type model in like a day.
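The remark about dropping a conditional random field on the end of a sequence model can be illustrated with the decoding step such a layer typically performs. This is a plain-Python sketch of Viterbi decoding, assuming per-token scores from some upstream model (an LSTM, for instance) and a table of tag-to-tag transition scores. All the numbers and tag names are made up, and nothing here is trained; it only shows how transition scores can "order things out" when the per-token scores alone would give an invalid tag sequence.

```python
def viterbi(emissions, transitions):
    """Find the highest-scoring tag sequence.

    emissions[t][k]: score for tag k at position t (e.g. from an LSTM).
    transitions[j][k]: score for moving from tag j to tag k (the CRF part).
    """
    num_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for step in emissions[1:]:
        new_scores, pointers = [], []
        for k in range(num_tags):
            best_prev = max(range(num_tags),
                            key=lambda j: scores[j] + transitions[j][k])
            new_scores.append(scores[best_prev] + transitions[best_prev][k] + step[k])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Walk back from the best final tag to recover the full path.
    best = max(range(num_tags), key=lambda k: scores[k])
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


TAGS = ["O", "B-NAME", "I-NAME"]
# Made-up scores. The large negative entry forbids jumping from O to I-NAME.
transitions = [
    [0.0, 0.0, -100.0],  # from O
    [0.0, 0.0, 1.0],     # from B-NAME
    [0.0, 0.0, 1.0],     # from I-NAME
]
emissions = [
    [2.0, 0.5, 0.0],  # "died": clearly O
    [0.0, 1.0, 1.2],  # "Mary": per-token scores slightly prefer I-NAME
    [0.0, 0.2, 1.0],  # "Jones"
]
best_tags = [TAGS[k] for k in viterbi(emissions, transitions)]
print(best_tags)
```

Picking the best tag for each token independently would label "Mary" as I-NAME right after an O, which is not a valid BIO sequence. Scoring whole sequences lets the transition table rule that out, so the decoded path starts the name with B-NAME instead.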

Curtis: Ancestry and many other companies are benefiting a lot from the open source community and all the great tools that have been developed there. And this is only the tip of the iceberg for Ancestry. We talked to Tyler a little bit about what machine learning might bring to genealogy in the future.

Tyler: Machine learning is really going to help leverage a few core things for genealogists. I think (1) it’s going to help the expert, make their job easier, like how do we help the expert genealogist, who a lot of our users are, find those next discoveries that aren’t trivial. Like, I’m not an expert in my genealogy, so my discoveries are a little bit easier to serve up to me, but how do you really use machine learning to dive deep in that data and find that needle in the haystack that they’re looking for in a way that they could probably do by themselves but that we could help them do faster, and the other side is how do we use the data we have and hopefully machine learning to make the experience better for a novice or a potentially interested user. Like we have a lot of people who come to our site and try things out, and expert genealogy is a lot harder than maybe hobbyist genealogy, meaning you want to know a little bit more about your family and understand some things but you’re maybe not committing tons and tons of time to really digging deep into the search system, right? And so how do we get them to things that they need to get the answers they’re looking for about their family in a way that’s easily consumable in a way that’s really nice for a user that’s not going to spend hours and hours every day working on that kind of stuff, and I think in general that’s an interesting field around how do you merge machine learning and user experience in a way that’s just nice, and I think we’re still trying to figure that out and a lot of companies are looking into this, and how do we provide algorithms’ suggestions in a way that people can use them and also give us feedback cuz I think that’s an interesting UX problem.

Ginette: And a huge thank you to Ancestry and Tyler Folkman for being on our podcast. He also shared with us a lot of great advice on how to break into the machine learning industry, so if you are thinking about getting into the space, stay tuned for a future episode.

And as mentioned in our podcast, if you want some freebies from Lightpost Analytics, a sponsor of today’s episode, go to lightpostanalytics.com/datacrunch. They’ll teach you how to find insights and share them effectively, creating improvements for your company and greater success for you.

And as always, for the transcript and links for this podcast, you can go to datacrunchpodcast.com, and you’ll find the links at the bottom of the show transcript. If you like what you’re learning here with us, please share our podcast with your coworkers and friends and go to iTunes or your favorite podcast playing platform and leave us a review.

Links

lightpostanalytics.com/datacrunch

brilliant.org/DataCrunch

Attributions

Music

“Loopster” Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License
http://creativecommons.org/licenses/by/3.0/

Picture

Photo by Roman Kraft on Unsplash