Have you ever wondered what makes some TED talks more popular than others?
Well, I’ve analyzed a dataset of 2,550 TED talks to get some answers to this question. I explored which of the available variables of a given talk (such as the number of comments, the number of languages it was translated into, the duration of the talk, the number of tags, or the day it was published online) are strong predictors of its popularity, measured in number of views.
The Bottom Line
After analyzing and regressing views over the other available variables in the dataset, some interesting associations were revealed.
What are the features of a popular talk with a high number of views?
- High number of comments (naturally).
- Translations in many languages (also, naturally).
- The combination of many comments and many translation languages yields a much higher view count than either of them predicts alone.
- It shouldn’t be too short. Duration had little effect overall, and what effect it had was slightly positive for longer talks. The most popular talks were between 8–18 minutes.
- Higher number of tags, ideally between 3–8.
- It would be uploaded on a weekday, preferably a Friday!
- You may see some funky occupations yielding much higher than average views for their talks, such as: neuroanatomist, quiet revolutionary, lie detector, model, beatboxer, vulnerability researcher, or Zen priest. This isn’t representative, but these occupations did yield the highest view counts combined (which is an unfair game, but hey).
Let’s dig into the data
First, a word of caution: don’t take this (or almost any other conclusion from observational data) as causal inference. This wasn’t an experiment, and it isn’t rigorous enough to prove causation. Observations aren’t matched between control and treatment groups or between subgroups of the data, and even after the regression, the variables’ explanatory power isn’t conclusive. The numerical parameters I had in hand are not sufficient for that kind of conclusion: I couldn’t match on the content that really matters to compare apples to apples, and even when controlling with multiple regression, not all else is equal (the ceteris paribus assumption is still not met). However, I was able to build a decent predictor and understand which variables are most strongly associated with higher view counts.
What does the data look like?
The dataset includes the name, title, description, and URL of each of the 2,550 talks; the name and occupation of the main speaker; the number of speakers; the duration of the talk; the TED event it was filmed at and the filming date; the date it was published online; and the numbers of comments, translated languages, and views. It also includes the associated tags, ratings, and related talks, but these are stored in a compressed array form and need transformation before they can be used. For a full list, see the dataset page on Kaggle.
First, let’s see how each variable is distributed using histograms.
A histogram represents the distribution of the values a variable takes; intuitively, it answers: “how many talks were there of 1 minute? 2 minutes? 3 minutes? etc.” Formally, a histogram is a “diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval”.
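Counting talks per fixed-width bin is all a histogram does under the hood. Here is a minimal pure-Python sketch of that counting step, on made-up durations rather than the real dataset:

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many values fall into each fixed-width bin.

    Returns a dict mapping each bin's lower edge to its frequency;
    e.g. with bin_width=5, the bin labeled 10 covers values in [10, 15).
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Toy talk durations in minutes (made up, not the TED data)
durations = [6, 9, 11, 12, 12, 13, 14, 15, 17, 18, 21, 45]
print(histogram_counts(durations, bin_width=5))
# {5: 2, 10: 5, 15: 3, 20: 1, 45: 1}
```

A plotting library then just draws one rectangle per bin, with height equal to that frequency.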
See how our variables are distributed below.
The duration of talks is close to a normal distribution, with a mean of 14 minutes and a median of 12 minutes, but with a wide right tail of a few longer talks. Almost all talks range between 1–18 minutes (18 minutes being the maximum length of a standard TED talk).
The number of comments follows a Poisson-like distribution (visually resembling an exponential distribution, though not technically one, since comments come in discrete counts), strongly concentrated near the minimum of 0 comments for the unpopular videos.
The number of views also follows a Poisson-like distribution with most of its mass at the low end and a long right tail, with a median of 1,124,524 views and a mean (average) of 1,698,297 views.
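That gap between the mean and the median is the skew itself: a handful of viral talks drags the average far above the typical talk. A tiny Python illustration on made-up view counts:

```python
import statistics

# Made-up, right-skewed view counts: one viral talk at the end
views = [200_000, 500_000, 800_000, 1_100_000, 1_200_000, 9_000_000]

print(statistics.median(views))  # 950000.0 (the "typical" talk)
print(statistics.mean(views))    # ~2.13M, dragged up by the outlier
```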
Finally, an overall picture of our variables’ distributions:
From these distributions we see that:
- The number of tags also follows a Poisson-like distribution with a long right tail, peaking between 4–7 tags.
- The number of translation languages has a peak at 0 for unpopular talks, but mostly falls between 20–40 languages.
- While relatively few talks were filmed before 2012, the publication dates are spread at a much more uniform rate over time.
What are the most common occupations?
Apparently, writer is the most common occupation for a TED speaker, followed by other creative occupations. Here are the counts of speakers per occupation:
- Writer: 45
- Artist: 34, Designer: 34
- Journalist: 33
- Entrepreneur: 31
- Architect: 30
- Inventor: 27 (Apparently that’s an occupation!)
- Psychologist: 26
- Photographer: 25
- Filmmaker: 21
But what is the occupation leading to the most popular talks on average?
These occupations are very different! They are heavily skewed by the outliers of the most popular TED talks (for example, “Vulnerability researcher” refers solely to Brené Brown), and are thus not a representative sample. Still, there are some surprising findings here:
- Life coach; expert in leadership psychology
- Model (okay, the nature of man doesn’t change)
- Vulnerability researcher
- Career analyst
- Quiet revolutionary (great occupation)
- Lie detector
- Psychiatrist, psychoanalyst and Zen priest
- Director of research, Samsung Research America
- Illusionist, endurance artist (really?)
- Gentleman thief
- Health psychologist
- Comedian and writer
- Leadership expert
- Social activist
- Relationship therapist
- Vocalist, beatboxer, comedian
- Clinical psychologist
- Comedian and writer
2: Correlations between parameters
Let’s look at the overall scatter-plot matrix of correlations between each pair of numerical variables.
To complement this, see a correlation matrix with colors representing the intensity of the correlation, from 0 (white) to dark blue (+1) or dark striped red (a strong negative correlation, -1), and with asterisks (***) signifying significance by p-value.
Most of the parameters don’t have strong correlations.
- Naturally, there was a very high correlation between published date and filmed date. Filmed date seemed less associated with views, both numerically and logically, since the audience is more affected by the date a talk is released than by whether it was recorded a month or a year earlier.
- There is a relatively high positive correlation between the number of comments and views, which makes sense (more audience, more comments).
- There is some positive correlation between the number of translation languages and both the number of views (0.38) and the number of comments (0.32).
- There is a small negative correlation between duration and the number of languages: the shorter the talk, the more translated languages there are, probably because it is easier to translate.
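As a reminder of what these numbers mean, here is a minimal pure-Python sketch of the Pearson correlation coefficient the matrix is built from. The data below is a toy example, not the TED dataset:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up numbers: more translated languages, more views
languages = [0, 10, 20, 30, 40]
views = [50_000, 400_000, 900_000, 1_500_000, 2_100_000]
print(round(pearson_r(languages, views), 3))  # 0.995
```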
3: Now for the interesting part: how is each of these parameters correlated with the number of views?
How do these variables correlate with views? Is the relationship linear, nonlinear, or absent? This is important to understand so we know how (and whether) to enter them into the regression.
Let’s start with the day of the week the talk was published on.
So, day of week does seem to have some association with average (mean) views! TED talks published on weekends get far fewer views, with Saturday being the lowest, while Friday is the most popular publication day.
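The comparison above boils down to grouping talks by publication day and averaging views per group. A minimal Python sketch on made-up numbers (not the real dataset):

```python
from collections import defaultdict

# Made-up (publication_day, views) pairs, not the real dataset
talks = [
    ("Fri", 2_400_000), ("Fri", 1_900_000),
    ("Mon", 1_500_000), ("Wed", 1_600_000),
    ("Sat", 700_000), ("Sun", 800_000),
]

totals = defaultdict(lambda: [0, 0])  # day -> [view_sum, talk_count]
for day, views in talks:
    totals[day][0] += views
    totals[day][1] += 1

mean_views = {day: s / n for day, (s, n) in totals.items()}
print(max(mean_views, key=mean_views.get))  # Fri
```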
Below are scatter-plots with a flexible LOESS regression in white and a linear regression in pink, to see how different a linear fit looks from a flexible moving average. This shows that the linear model describes the relationship reasonably well, except in the tails of these distributions, where there are only a couple of outlying data points.
Let’s start with duration and number of tags:
Surprisingly, duration had almost no consistent correlation with the number of views, except that most popular talks were closer to 8–18 minutes. The number of tags seemed to be optimal between 3–8, which is also the most common number of tags, with some outliers of particularly popular talks.
What about numbers of Comments and Translation Languages?
Unsurprisingly, the number of comments is very well correlated with the number of views, and so is the number of languages: both follow from having many viewers. Thus, it is not entirely “fair” to predict views from these factors, and in the real world we couldn’t use them for prediction ahead of time. They are not pure causes of more views but also a result of many views, in a reinforcing feedback loop: the more comments, the more engaged the community around the talk and the likelier it is to spread; the more languages, the more viewers can watch; and the more viewers, the larger the audience available to comment and translate. The remaining variables had a small but fairly consistent linear effect.
The pink lines are linear regressions, while the white lines are LOESS, which is arbitrarily flexible and would reveal any clear non-linear shape. It seems that none of these lose much information with a linear fit versus a LOESS fit. While some of them do show nonlinear shapes, a closer look reveals this only in the tails, where data is scarce and the curve is biased by a few data points and outliers (as in the comments plot). Therefore, entering each regressor as a linear term should be sufficiently explanatory.
Models and Results
I first regressed the number of views on each variable of interest separately, which made it possible to see which variables have explanatory power and how strong it is. I then added the better-explaining ones, one by one, into a multiple regression, checking that each addition actually improved performance. The results are in the table below, where each column is a different model, for models (1) through (8).
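The first step, regressing views on each candidate separately, can be sketched for the one-variable case in pure Python: fit a least-squares line and report its R-squared, the “explanatory power” used to decide what enters the multiple regression. The data below is a toy illustration, not the TED dataset:

```python
def r_squared(xs, ys):
    """R-squared of a simple least-squares line y = a + b*x fit to (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

views = [100, 300, 500, 700, 1100]
comments = [1, 3, 5, 7, 11]    # perfectly linear with views in this toy
duration = [12, 9, 15, 8, 14]  # unrelated to views in this toy

print(round(r_squared(comments, views), 3))  # 1.0   -> strong candidate
print(round(r_squared(duration, views), 3))  # 0.053 -> weak candidate
```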
The chosen model is the last one, since it had the best explanatory power in terms of R-squared, adjusted R-squared, p-value, and F-statistic, although it offered only marginal improvements over model (5), which used only comments and languages translated.
For prediction purposes, I would choose model 8 with all variables. For explanatory purposes, I would choose model 5, since comments and languages are by far the most correlated with views and explain most of the captured variance.
Model 5 suggests that every additional comment is associated with 4,044 more views (p-value under 0.01) and that every additional translated language is associated with 60,650 more views (p-value under 0.01). The intercept is negative (-733 views), which makes no sense on its own, but that comes with the limitations of a linear model. Together, these explained 0.33 of the variance (both R-squared and adjusted R-squared).
Y(views) = -733 + 4044*comments + 60650*languages
However, adding all the other variables in model 8 slightly improved the R-squared to 0.336 and the adjusted R-squared to 0.334. So, if we are after predictive accuracy, I would use this last, full model:
Y(views) = -1455238 + 3931*comments + 68222*languages + 408*duration + 26625*num_tags - 41407*is_weekend
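To make the two fitted equations concrete, here is a small Python sketch that simply transcribes their coefficients. The example talk (500 comments, 30 languages, 14 minutes, 5 tags, published on a weekday) is hypothetical:

```python
def predict_views_model5(comments, languages):
    """Model 5: comments and translated languages only."""
    return -733 + 4044 * comments + 60650 * languages

def predict_views_model8(comments, languages, duration_sec, num_tags, is_weekend):
    """Model 8: all variables; duration in seconds, is_weekend is 0 or 1."""
    return (-1455238 + 3931 * comments + 68222 * languages
            + 408 * duration_sec + 26625 * num_tags - 41407 * is_weekend)

# A hypothetical talk: 500 comments, 30 languages, 14 minutes, 5 tags, weekday
print(predict_views_model5(500, 30))                 # 3840767
print(predict_views_model8(500, 30, 14 * 60, 5, 0))  # 3032767
```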
The results, and particularly model 8, show overall significance. Most variables are significant; weekend is not, but adding it still improved the explanatory power slightly, so I kept it. The F-statistic is relatively lower, and the R-squared and adjusted R-squared are not great at 0.336 and 0.334 respectively, but this is the best performance out of this set of models. The intercept decreased substantially, giving the variables more room to raise the predicted view count. The coefficient (estimated effect) of comments decreased from 4,044 to 3,931, redistributed into a higher coefficient for the number of languages and into the newly added variables: 408 more views for every additional second of duration and 26,625 more views for every additional tag, compensated by 41,407 fewer predicted views if the talk was published on a weekend.
To finish up, I ran a final regression with polynomials of num_tags and duration up to the third degree, plus an interaction term between the number of comments and the number of languages. The interaction made a significant contribution, improving accuracy to an R-squared of 0.436 and an adjusted R-squared of 0.434. Its positive coefficient suggests that talks with both many comments and many translated languages yield even more views than talks with just a high number of one of them. Yet this model is much less convenient for drawing general conclusions, may be less generalizable, and is more prone to over-fitting. Not all polynomial terms show statistical significance, but they still improved results compared to dropping either of them. Therefore, I will not lean heavily on its implications here, but it would be a slightly better prediction model for similar data. Here is how it performed (and how to run it):
lm(formula = views ~ comments + languages + num_tags + I(num_tags^2) +
    I(num_tags^3) + weekend + duration + I(duration^2) + I(duration^3) +
    comments * languages, data = ted)

Residuals:
      Min        1Q    Median        3Q       Max
-26821919   -682956   -290905    234527  25305423

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)        -4.548e+05  3.394e+05  -1.340  0.18035
comments           -5.796e+03  4.838e+02 -11.979  < 2e-16 ***
languages           3.522e+04  4.930e+03   7.145 1.17e-12 ***
num_tags           -9.210e+04  7.291e+04  -1.263  0.20668
I(num_tags^2)       9.235e+03  6.441e+03   1.434  0.15174
I(num_tags^3)      -2.236e+02  1.666e+02  -1.343  0.17950
weekend             1.342e+05  1.905e+05   0.704  0.48120
duration            2.038e+03  4.664e+02   4.369 1.30e-05 ***
I(duration^2)      -9.895e-01  3.355e-01  -2.949  0.00321 **
I(duration^3)       1.367e-04  5.555e-05   2.462  0.01389 *
comments:languages  2.464e+02  1.171e+01  21.052  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1879000 on 2539 degrees of freedom
Multiple R-squared:  0.4369, Adjusted R-squared:  0.4346
F-statistic:   197 on 10 and 2539 DF,  p-value: < 2.2e-16
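To see why the interaction term dominates this model, here is a quick Python back-of-the-envelope that transcribes three coefficients from the summary above and splits a hypothetical talk’s prediction into per-term contributions (the 500 comments and 30 languages are made-up numbers):

```python
def interaction_contributions(comments, languages):
    """Per-term contributions using the comments and languages main effects
    and their interaction, with coefficients copied from the summary above."""
    return {
        "comments main effect": -5.796e3 * comments,
        "languages main effect": 3.522e4 * languages,
        "comments:languages interaction": 2.464e2 * comments * languages,
    }

for term, value in interaction_contributions(500, 30).items():
    print(term, round(value))
# The interaction alone contributes ~3.7M views, swamping the
# negative main effect of comments (~ -2.9M).
```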
This limited model conveys correlation, not causation. It does not capture causal relationships well, because the fundamental problem of causal inference is not addressed with these variables: the predictors are not independent of the outcome, and they are highly correlated with it (mostly comments and the number of languages, which are, naturally, the best predictors). I don’t believe that with these available numerical predictors we could have reached a causal inference. Next attempts might use the transcript of the talk to analyze the content, the audio to analyze the level of clapping, or the visuals of the talk and the clothing of the speaker, to better predict using the content of the talk.
Recognizing that this is not causality, these correlations might still help you to (1) better predict the number of views for a talk given these parameters, and (2) hypothesize about possible directions of causation and test them further.
So, what would be the features of a popular talk?
- High number of comments. It is likely that the more views, the more comments you’d get, but it is also likely that causality goes both ways here: the more comments → even more views. Therefore, if you want to increase your view count, it might help to get all of your friends to comment and get a discussion going. This might spur a viral spread for your video, increase views, and potentially increase the probability of TED featuring it. Experiment and let me know!
- Translations into many languages. This also probably has two-sided causality: the more views → the more translations → the more people can watch → even more views. So if you want to start that cycle, it might help to get your friends or freelancers to translate and advertise the talk. If a talk has both many comments and many translation languages, that’s when you’d know to bet on a much higher number of views.
- It shouldn’t be too short. Duration had a very slight positive correlation, with the most popular talks being between 8–18 minutes. This might indicate that talks that are too short are usually not deep enough to be inspiring. However, if your talk is naturally short, I don’t believe it’s a good idea to stretch it and risk it being boring.
- Higher number of tags, ideally between 3–8! The regression suggested a positive effect for more tags, which makes sense: more tags make a talk broad enough to be suggested from more topics and related talks. The most popular talks had between 3–8 tags, though, so it’s probably not a good idea for a talk to be too broad or unfocused either.
- It would be uploaded on a weekday, preferably a Friday! This one was a little surprising, but apparently, uploading on a weekend had a negative effect (except in the last model, which is probably confounded by the many other forms of the other variables). Undeniably, though, the correlation between day of week and average views showed that talks uploaded on weekdays, and particularly on Fridays, were significantly more popular than those uploaded on weekends. Are people that much on top of new TED releases while at work? Or, more specifically, while slacking off at work, which is most common on a Friday?
Finally, now you may go on with your betting-on-view-counts TED binge-watching party with your friends and take some more educated guesses! You may also send me some token of appreciation if you win big-time.
Have a great week, and make sure to check out a new TED talk that has 3–8 tags, is 8–18 minutes long, has many comments and languages, and was uploaded on a Friday!
Gurupriyan is a Software Engineer and a technology enthusiast; he’s been working in the field for the last 6 years, currently focusing on mobile app development and IoT.