Thoughts and Reflections on the Process
We’ve spent our spare time in the last six weeks participating in the 538 Academy Awards Prediction Challenge. On Sunday, we’ll find out how we did. But even though we expect to crash and burn on the acting awards and are probably no better than 1-3 in a very close movie race, we ended up quite satisfied with our unique process and the model that emerged. You can get a full and deep description of our culture-matching model, with its combination of linguistic analysis and machine learning, in this previous post.
What I love about projects like this is that they give people a glimpse into how analytics actually works. Analysis doesn’t get made at all the way people think. In most cases there is far more human intuition and direction than people realize, or than anyone reading screeds on big data and predictive analytics would believe. Our culture-matching analysis pushes the envelope more than most we do in the for-pay world, so it’s probably an exaggerated case. But think about the places where this analysis relied on human judgment:
- Deciding on the overall approach: Obviously, the approach was pretty much created from whole cloth. What’s more, we lacked any data to show that culture matching might be an effective technique for predicting the Oscars. We may have used some machine learning, but this approach didn’t come, and wouldn’t have come, from throwing a lot of data into a machine learning system.
- Choosing potentially relevant corpora for Hollywood and each movie: This process was wholly subjective in the initial selection of possible corpora, was partly driven by practical concerns (ease of access to archival stories), and was largely subjective in the analyst review stage. In addition to selecting our sources, we further rejected categories like “local”, “crime” and “sports”. Might we have chosen otherwise? Certainly. In some cases, we tuned the corpora by running the full analysis and judging whether the themes were interesting. That may be circular, but it’s not wrong. Nearly every complex analysis has elements of circularity.
- Tuning themes: Our corpora had both obvious and subtle biases. To get crisp themes, we had to eliminate words we thought were too common or were used in different senses. I’m pretty confident we missed lots of these. I hope we caught most. Maybe we eliminated something important. Likely, we’ll never know.
- Choosing our model: If you only build one model, you don’t have this issue. But when you have multiple models, it’s not always easy to tell which one is better. With more time and more data, we could try each approach against past years. But lots of analytic techniques don’t even generate predictions (clustering, for example). The analyst has to decide which clustering scheme looks better, and the answer isn’t always obvious. Even within a single approach (text analytics/linguistics), we generated two predictions based on which direction we used to match themes. Which one was better? That was a topic of considerable internal debate with no “right” answer except to test against the real world (which in this case will be a very long test).
- Deciding on Black-Box Validity: This one is surprisingly hard. When you have a black-box system, you generally rely on being able to measure its predictions against a set of fairly well-known decisions before you apply it to the real world. We didn’t have that, and it was HARD to decide how and whether our brute-force machine-learning system was working at all. But even in cases where external measurement comparisons exist, it’s the unexpected predictions that cause political problems with analytics adoption. If you’ve ever tried to convince a skeptical organization that a black-box result is right, you know how hard this is.
- Explaining the model: There’s an old saying in philosophy (from James) that a difference that makes no difference is no difference. If a model has an interesting result but nobody believes it, does it matter? A big part of how interesting, important and valid we think a model is comes from how well it’s explained.
This long litany is why, in the end, the quality of your analysis is always about the quality of your people. We had access to some great tools (Sysomos, Boilerpipe, Java, SPSS, R and Crimson Hexagon), but interesting approaches and interesting results don’t come from tools.
That being said, I can’t resist special call-outs to Boilerpipe, which did a really nice job of text extraction, and SPSS Text Analytics, which did a great job facilitating our thematic analysis and matching.
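The earlier point about getting two predictions from the direction of the theme match can be made concrete with a small sketch. This is not our actual model: the theme weights, the names `movie_a` and `movie_b`, and the `coverage` scoring function are all invented for illustration. It only shows how the same themes can rank two movies differently depending on which side you start from.

```python
# Hypothetical theme weights; none of these numbers come from the real analysis.
hollywood = {"race": 3, "inequality": 2, "wealth": 2, "media": 3}
movies = {
    "movie_a": {"wealth": 4, "inequality": 4, "science": 2},
    "movie_b": {"race": 2, "media": 2, "wealth": 1, "courage": 5},
}


def coverage(source, target):
    """Share of the source's total theme weight whose themes also appear in target."""
    total = sum(source.values())
    if total == 0:
        return 0.0
    shared = sum(weight for theme, weight in source.items() if theme in target)
    return shared / total


def rank(movies, culture, movie_to_culture=True):
    """Order movie names from best to worst match in the chosen direction."""
    scores = {}
    for name, themes in movies.items():
        if movie_to_culture:
            # How much of the movie's themes does the culture share?
            scores[name] = coverage(themes, culture)
        else:
            # How much of the culture's themes does the movie share?
            scores[name] = coverage(culture, themes)
    return sorted(scores, key=scores.get, reverse=True)


forward = rank(movies, hollywood, movie_to_culture=True)
backward = rank(movies, hollywood, movie_to_culture=False)
print("movie -> culture:", forward)
print("culture -> movie:", backward)
```

With these made-up numbers, matching from the movie to the culture puts `movie_a` first, while matching from the culture to the movie puts `movie_b` first. Neither direction is obviously “right”, which is exactly why the analyst has to choose.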
Thoughts on the Method and Results
So is culture matching a good way to predict the Oscars?
It might be a useful variable, but I’m sure it’s not a complete prediction system. That’s really no different than we hoped going into this exercise. And we’ll learn a little (but not much) more on Awards night. It would be better if we got the full vote to see how close our rank ordering was.
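If the full vote were ever available, one simple way to check how close our rank ordering was would be a rank correlation. The sketch below uses Spearman’s formula for rankings without ties. The nominee labels and both orderings are placeholders, not real predictions or results, and `spearman_rho` is just a name chosen for this example.

```python
def spearman_rho(predicted, actual):
    """Spearman rank correlation between two orderings of the same items (no ties)."""
    n = len(predicted)
    if n < 2:
        raise ValueError("need at least two items to compare rankings")
    if len(set(predicted)) != n or len(actual) != n or set(predicted) != set(actual):
        raise ValueError("both rankings must list the same items exactly once")
    # Position of each item in the actual ordering (0 = first place).
    actual_rank = {item: position for position, item in enumerate(actual)}
    # Sum of squared differences between predicted and actual positions.
    d_squared = sum(
        (position - actual_rank[item]) ** 2
        for position, item in enumerate(predicted)
    )
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Placeholder orderings, best to worst; not real data.
predicted_order = ["A", "B", "C", "D", "E"]
actual_order = ["B", "A", "C", "E", "D"]
rho = spearman_rho(predicted_order, actual_order)
print(f"Spearman rho: {rho:.2f}")
```

A value near 1 means the two orderings largely agree, a value near 0 means little relationship, and a value near -1 means they are close to reversed. With only a handful of nominees the statistic is noisy, so it would be a sanity check rather than proof.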
Either way, the culture-matching approach is promising as a technique. Looking through the results, I’m confident that it passes the analyst sniff test – there’s something real here. There are a number of extensions to the system we haven’t tried (and probably won’t) – at least for this little challenge. We’d like to incorporate sentiment around themes, not just matching. We generated a number of analyst-driven cultural dimensions for machine training that we haven’t used. We’d like to try some different machine-learning techniques that might be better suited to our source material. There is a great deal of taxonomic tuning around themes that might drive better results. It’s rare that an ambitious analytics project is ever really finished, though the world often says otherwise.
In this case, I was pleased with the themes we were able to extract by movie. A little less with the themes in our Hollywood corpus. Why? I suspect because long-form movie reviews are unusually rich in elaborating the types of cultural themes we were interested in. In addition, a lot of the themes that we pulled out of the culture corpus are topical. It’s (kind of) interesting to know that terrorism or the presidential campaign were hot topics this last year, but that isn’t the type of theme we’re looking for. I’m particularly interested in whether and how successful we can be in deepening themes beyond the obvious ones. Themes around race, inequality and wealth are fairly easy to pick out. But if The Martian scores poorly because Hollywood isn’t much about engineering and science (and I’m pretty sure that’s true), what about its human themes around exploration, courage and loneliness? Those topics emerged as key themes from the movie reviews, but they are hard to discover in the Hollywood corpus. That might be because they aren’t very important in the culture – that’s certainly plausible – but it also seems possible that our analysis wasn’t rich enough to find their implicit representations.
Regardless, I’m happy with the outcome. It seems clear to me that this type of culture matching can be successful and brings analytic rigor to a topic that is otherwise mostly hot air. What’s more, it can be successful in a reasonable timeframe and for a reasonable amount of money (which is critical for non-academic use cases). From start to finish, we spent about four weeks on this problem – and while we had a large team, it was all part-timers.
This was definitely a problem to fall in love with, and we’d kill to do more, expand the method, and prove it out on more substantial and testable data. If you have a potential use for culture matching, give us a call. We probably can’t do it for free, but we will do it for less than cost. And, of course, if you just need an incredible team of analysts who can dream up a creative solution to a hard, real-world problem, pull data from almost anything, bring to bear world-class tools across traditional stats, machine learning and text analytics, and deliver interesting and useful results…well, that’s fine too.