
How to Predict World Events with Predata

There have been some spectacular fails when it comes to looking at Internet traffic, think Google Flu Trends; however, Predata, a company that helps people understand global events and market moves by interpreting signals in Internet traffic, has honed human-in-the-loop machine learning to get to the bottom of geopolitical risk and price movement.

Predata uncovers predictive behavior by applying machine learning techniques to online activity. The company has built the most comprehensive predictive analytics platform for geopolitical risk, enabling customers to discover, quantify and act on dynamic shifts in online behavior. The Predata platform provides users with quantitative measurements of digital concern and predictive indicators for different types of risk events for any given country or topic.

Dakota Killpack: Over the past few years, we’ve collected a very large annotated data set about human judgment for how relevant many, many pieces of web content are to various tasks.

Ginette Methot: I’m Ginette,

Curtis Seare: and I’m Curtis,

Ginette: and you are listening to Data Crunch,

Curtis: a podcast about how applied data science, machine learning, and artificial intelligence are changing the world.

Ginette: Data Crunch is produced by the Data Crunch Corporation, an analytics training and consulting company.

Let’s jump into our episode today with the director of Machine Learning at Predata.

Dakota: My name is Dakota Killpack, and I’m the director of machine learning at Predata. Predata is a company that uses machine learning to look at the spectrum of human behavior online and organize it into useful signals about people’s attention, and we use those to influence how people make decisions by giving them a factor of what people are paying attention to. Because attention is a scarce cognitive resource, people tend to pay attention only to very important things. If they’re about to act in a way that might cause problems for our potential clients, they’ll spend a lot of time online doing research, making preparations, and by unlocking this attention dimension to web traffic, we’re able to give some unique insights to our clients.

Curtis: Can we jump into maybe a concrete use case into what you’re talking about just to frame and put some details around how someone might use that service?

Dakota: Absolutely. So one example that I find particularly useful for revealing how attention works online is looking at what soybean farmers did in response to tariffs earlier this year. Knowing that they weren’t going to get a very good price on soybeans at that particular moment, a lot of them were looking up online how to store their grain and purchasing these very long grain storage bags, purchasing some obscure scientific equipment needed to insert big needles into the bags to get a sample for testing the soybeans, and moisture testing devices to make sure they wouldn’t grow mold. And all of these webpages are things that tend to get very little traffic. When we see an increase in traffic to all of them at the same time, we know that a very influential group of individuals, namely farmers, is paying attention to this topic. Using that, we’re able to give early warning to our clients.
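To make the idea of a simultaneous rise across normally quiet pages concrete, here is a minimal sketch in Python. It is not Predata's method, which the episode does not describe in detail. It assumes daily page-view counts per page, scores the latest day against a trailing baseline with a z-score, and flags the basket only when most pages spike together. The function names, page names, thresholds, and traffic numbers are all hypothetical.

```python
from statistics import mean, stdev


def zscore_latest(series, baseline_days=28):
    """Z-score of the latest value against the preceding baseline window."""
    baseline = series[-(baseline_days + 1):-1]
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Flat baseline: any rise at all counts as an extreme deviation.
        return float("inf") if series[-1] > mu else 0.0
    return (series[-1] - mu) / sigma


def basket_alert(page_views, z_threshold=3.0, min_fraction=0.75):
    """Flag a basket of pages when most of them spike on the same day.

    page_views maps a page name to its daily view counts, oldest first.
    Returns (alert, scores) where scores holds each page's z-score.
    """
    scores = {page: zscore_latest(views) for page, views in page_views.items()}
    spiking = [page for page, z in scores.items() if z >= z_threshold]
    return len(spiking) / len(scores) >= min_fraction, scores


# Hypothetical traffic for three low-traffic pages; the last value is "today".
storage_basket = {
    "grain-storage-bags": [10, 12, 11, 9, 10, 11, 12, 10, 40],
    "grain-probe-sampler": [4, 5, 3, 4, 6, 5, 4, 5, 20],
    "moisture-tester": [7, 8, 6, 7, 9, 8, 7, 8, 30],
}
alert, scores = basket_alert(storage_basket)
print(alert)
```

Requiring most of the basket to move together is the point of the design: a single page spiking could be noise, while a coordinated rise across related pages is harder to explain by chance.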

Curtis: Sounds like looking for needles in a haystack of data. Right? So how do you determine what is a useful bit of information in the context of what your clients are looking for? Do they kind of have an idea of what you’re looking for and then you’d go out and search for that or, or does your algorithm find anomalies in the data and then characterize those anomalies so that you can then report that back? How does it work?

Dakota: It’s a mix of both. Because the Internet is such a rich and complex domain, it’s very dangerous to just look for anomalies at scale. There have been some high-profile failures, most notably the Google Flu Trends experiment, where people have tried to link arbitrary online activity to real-life human behavior. To get around some of the problems, we use human-in-the-loop machine learning. It’s very much a process where human expertise about the world is the foundation of all of our models. So we have a team of analysts that knows how people might act both in real life and online, and they’re able to constrain the space of things that we look at to make sure we’re not finding spurious patterns. And we know that any anomalies that surface in the set of things that we’re monitoring are going to be more likely than not relevant. Once those are surfaced, we pass them back to a human before finally alerting. That gives us one more final check to make sure we’re not sending out spurious alerts, as well as giving us even more training data using the whole human-in-the-loop system.
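The loop described here (analysts constrain what is monitored, a human reviews each surfaced anomaly before an alert goes out, and every review decision becomes training data) can be sketched roughly as follows. This is an illustrative Python outline under those stated assumptions, not Predata's implementation; the `ReviewLoop` class, its fields, and the page names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewLoop:
    # Pages that analysts have judged relevant (hypothetical watchlist).
    watchlist: set
    # (page, approved) pairs retained as labeled training data.
    labels: list = field(default_factory=list)

    def candidates(self, anomalies):
        """Keep only anomalies on analyst-curated pages, preserving order."""
        return [page for page in anomalies if page in self.watchlist]

    def review(self, anomalies, approve):
        """Ask a human reviewer about each candidate; alert only on approval.

        approve is a callable standing in for the human decision.
        """
        alerts = []
        for page in self.candidates(anomalies):
            approved = bool(approve(page))
            self.labels.append((page, approved))  # every decision is a label
            if approved:
                alerts.append(page)
        return alerts


loop = ReviewLoop(watchlist={"grain-storage-bags", "moisture-tester"})
detected = ["grain-storage-bags", "celebrity-gossip", "moisture-tester"]
alerts = loop.review(detected, approve=lambda page: page == "grain-storage-bags")
print(alerts)
```

In this sketch the off-watchlist anomaly never reaches the reviewer, and the rejected candidate still produces a label, which mirrors the two checks and the training-data side effect mentioned above.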

Curtis: This is sort of the setup where you have humans defining the problem space or the areas to look in the first place. The algorithm then parses that, comes back with results, and then a human checks those and adds context for the client. Is that correct?

Dakota: Right. And using that over the past few years, we’ve collected a very large annotated data set about human judgment for how relevant many, many pieces of web content are to various tasks, whether it be in the realm of predicting geopolitical risk or predicting price movement for some currencies or anything that our experts cover. We know exactly what web content is and isn’t relevant, and we’re using that to constantly improve our system.

Curtis: So you guys are actually finding, and I think this is true of a lot of cases, although it’s not apparent a lot of times to people who are starting out in this field, that looking at the data and making sure the data that you’re bringing in is the right set of data and that it’s clean and actionable, that’s where you guys see the biggest lift, not necessarily in cutting-edge algorithms.

Dakota: Right. If you have a good enough set of predictors, then the classical stuff works extremely well. Where the classical stuff does not work is trying to capture human intuition about what the various factors for something might be, and that’s where we find that having the human and the machine work in conjunction is best. A lot of these tasks we’re trying to solve, geopolitics and finance, or even more broadly the problem of what humans are paying attention to, when, how that changes over time, and what the important parts of that are, those are problems that even humans themselves can’t solve particularly well. So we need the best of both human and machine to succeed at that time.

Curtis: Sure. And now you mentioned geopolitical risk and you mentioned finance. Is there an area that you guys tend to focus on? Is it pretty even across those two, or maybe there are more use cases as well that you guys are finding that this data is good at solving? Can you talk a little bit about the use cases?

Dakota: Yeah, anything where attention is an important component, I’d say, is what connects our use cases. So in geopolitics you often have many, many actors, large state actors, and each one has its own viewpoint about the world, and they’re viewing each other in very particular ways. So being able to track what does the attention of a large organization towards another one look like? What does the public’s attention to these organizations look like? Using that kind of behavioral profiling and following the attention at every step is key to solving these problems, both in geopolitics as well as in finance, where we might try to track what are institutional investors paying attention to today? How does that differ from retail traders? How does that differ from the general public doing research into a company because they saw it in the news today?

Curtis: And again, this may be you know too deep into your secret sauce and if so, we don’t have to go down this road, but I am really curious about how you even start to curate a dataset that tells you those things. Is that something you can comment on?

Dakota: I’d say it starts with the human. A lot of our analyst team worked in the professions that we’re trying to create an index of online, so they’re able to make very deep judgment calls as to whether any piece of web content would have been relevant to them in the past phase of their life.

Curtis: And how long you guys been at this now? You said you’ve, you’ve done at least enough iterations of this that you have a pretty robust set of training data and context for what you’re doing. How long you guys been at it?

Dakota: I’ve been with the company a little over three years and it had been going on experimentally for around a year before that point.

Curtis: Okay, got it. And how long would you say it took you guys to kinda hit your stride, where you said, “okay, we have enough, we’ve done this enough where we now have something that is useful and it works”? So what I’m trying to get to is, you know, how long was that curve to really get to something that was workable?

Dakota: I’d say it took a few years of building up enough of an internal culture of how to understand these things since it’s really a unique way of thinking about the world. Most people don’t have data about everyone’s attention at their fingertips. Fundamentally, it’s something that our team has been able to wrap their heads around, but it did take a few tries.

Curtis: So even how you decided to collect the data, the sources you were looking at, it sounds like that was a long process to really nail that and get that right.

Dakota: Right. We’ve been constantly adjusting which data sources we think are our most valuable and the main takeaway is anything where people spend more time and effort actually interacting with or accessing a piece of web content tends to make it more predictive.

Curtis: Yeah, that’s interesting. Maybe give us an example of how one of your clients might use some of these predictions and things. Being able to know something in advance and then take an action on it. What’s some of the value that people have been able to extract from these predictions?

Dakota: In the geopolitical realm, we’ve had Predata used in various missions around the world, and they’re finding that it gives them much, much greater lead time than a lot of things they’re currently looking at. I can’t really say too much about those national security applications, but a consistent theme is that it helps them prioritize. Having a dashboard of the world’s attention to look at tells them which of their classified resources they ought to direct somewhere.

Curtis: So they actually then engage with the dashboard, and I’m assuming they can punch in criteria, things that they kind of want to look for, and that helps them with their overall strategy.

Dakota: Right. They’re avid users of the platform. They’re building their own signals with our customizable tools. So they’re conducting their own research on the Internet and feeding that into our platform and

Curtis: Oh, that’s interesting. So your users can actually then bring in their own data sets and, and kind of mix it with what you guys are doing to, to enhance it.

Dakota: Right. It’s a fully extensible platform.

Curtis: That’s really cool. What would you say is the, you said it took a couple of years to really get this right. What would you say was maybe one or two of the biggest challenges you guys had in making this thing work and how’d you overcome those?

Dakota: On the human side, it was realizing that attention is actually the most valuable thing for making predictions. And not only did that improve performance, but it also made things much more interpretable when it finally got to the client, which made it much easier for them to start incorporating it into their decision-making process.

Curtis: So you’ve found that even just some simple visualizations or even just simple notes that say this is happening, this is where people’s attention or focus is, that that’s sufficient to help your clients make decisions and get value from this.

Dakota: Right. And that comes back to my earlier point where using simpler classical models as the final stage makes a lot more sense because it’s much easier to provide interpretability around those models. And that’s, that’s something that our clients love, the fact that the final models are a glass box rather than a black box.
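One way to picture the “glass box” idea is a linear model whose prediction can be broken down into per-feature contributions. The sketch below is a generic logistic-style scorer in Python, offered only as an illustration; the episode does not say which classical models Predata uses, and the feature names, weights, and bias here are hypothetical.

```python
import math


def explain_linear_score(weights, bias, features):
    """Return a probability and the per-feature contributions behind it.

    Each contribution is weight * feature value, so the explanation adds up
    exactly to the model's logit (minus the bias).
    """
    contributions = {name: weights[name] * features[name] for name in weights}
    logit = bias + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    # Largest absolute contribution first, so the main drivers are on top.
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return probability, ranked


# Hypothetical attention signals, already standardized.
weights = {"storage_attention": 2.0, "news_attention": 0.5}
features = {"storage_attention": 1.0, "news_attention": -2.0}
probability, ranked = explain_linear_score(weights, bias=-1.0, features=features)
for name, contribution in ranked:
    print(f"{name}: {contribution:+.2f}")
print(f"probability: {probability:.2f}")
```

Because every term in the score is visible, a client can see which attention signal pushed the prediction up or down, which is much harder to offer with an opaque model.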

Curtis: That’s awesome. I love that because you don’t often hear about examples where the innovation is getting the right dataset, right? And that’s really what’s driving things, as opposed to the newest fancy algorithm. That’s really interesting. Did you guys come at this problem originally with that idea that, “hey, we’re going to use some really powerful algorithms” and then you were surprised when you found, “oh, like it’s just the data and classical algorithms work well enough?” Or did you kind of have that notion?

Dakota: It was a lot of back and forth, trying out lots of different things from many different academic worlds. Originally we tried various signal processing techniques, everything under the machine learning umbrella, even some techniques from statistical chemistry, but the results we were getting weren’t good until we really took time to work things through from first principles. What are people doing online and why does it matter? When we directed all the powerful machine learning at letting us answer that question, then we were able to get a data set that we could use for predictive tasks.

Curtis: You also mentioned the financial side of things. I’m curious if you have a story or something that you could share from a use case in the financial world that might be interesting for people to hear about.

Dakota: Sure, so the soybean example is definitely one. Anything in the commodities space tends to be good for us, since we’re able to break it down in terms of supply and demand factors, and for commodities, the people responsible for them tend to be in very particular industries with web browsing that doesn’t overlap that much with the general public. A lot of us rarely view web pages about industrial techniques.

Curtis: And I’m assuming that’s what differentiates you guys, would you say, or maybe there are some other differentiators that you could comment on, but what makes Predata better than some other firms that are maybe trying to do the same thing? I think there are a couple of other people in a similar space.

Dakota: Right. And other firms in the alt-data space, they tend to have a good human behavioral link, but they’re basing it on rudimentary forms of data or something where you don’t need a great leap of creativity or methodology to really extract value. Something where simple machine learning can work. Things like trying to track foot traffic to certain stores by looking at mobile phone data. There’s a simple one-to-one relationship. Or using satellites to estimate economic development in an area, or to look at the water level that certain tankers are floating at to get an idea of how much is in them, or other things with that one-to-one relationship. What we do is unlock the power of data for use cases about human behavior in a much more complicated way. I’d say our differentiator is that we’re able to describe the full spectrum of what humans do online, rather than being limited to things with that one-to-one relationship, because we approach structuring the data in this very behavior-first way.

Curtis: I’m curious just, ah, ’cause that’s not a term that a lot of people probably come across, alternative data or alt data. Can you just define that for us as someone who is in this space?

Dakota: Sure, so that term refers to data that’s not market data. So data that often tends to be generated by people going about their day-to-day life. So credit card transaction data or satellite data, mobile phone geolocation data. And that’s all used to extract an edge about how a certain company’s performing or get ahead of economic releases.

Attributions

Music

“Loopster” Kevin MacLeod (incompetech.com)

Licensed under Creative Commons: By Attribution 3.0 License

http://creativecommons.org/licenses/by/3.0/