This page details the methods and visualisations techniques used for the Polandball article for INC Longform

The approach within the article is predominantly qualitative. A big inspiration are the invaluable insights provided by five moderators or r/polandball. Their ansers were incorporated in the text. Some quantitative methods were used as well in order to give empirical evidence to the claims made. More specifically, r/polandball’s post and comment history was filtered through Google BigQuery. This data was visualised through Excel graphs, RankFlow and Gephi.

OC Numbers according to link flair were derived using Google BigQuery with the following SQL code:

SELECT link_flair_text, COUNT(link_flair_text) as count
FROM (SELECT * FROM
[fh-bigquery:reddit_posts.2016_01],
[fh-bigquery:reddit_posts.2016_02],
[fh-bigquery:reddit_posts.2016_03],
[fh-bigquery:reddit_posts.2016_04],
[fh-bigquery:reddit_posts.2016_05],
[fh-bigquery:reddit_posts.2016_06],
[fh-bigquery:reddit_posts.2016_07],
[fh-bigquery:reddit_posts.2016_08],
[fh-bigquery:reddit_posts.2016_09],
[fh-bigquery:reddit_posts.2016_10],
[fh-bigquery:reddit_posts.2016_11],
[fh-bigquery:reddit_posts.2016_12])
WHERE subreddit="polandball"
GROUP BY link_flair_text
ORDER BY count DESC

Providing the table:

Row link_flair_text count
1 redditormade 3426
2 repost 305
3 collaboration 54
4 meta 6
5 berndmade 1
6 DickRhinomade 1
7 null 0

Figure 5: Numbers derived using Google BigQuery with the following SQL code:

SELECT day, subreddit, AVG(score), COUNT(author), COUNT(DISTINCT(author)), COUNT(id)
FROM (
SELECT score,subreddit, author, id, LEFT(STRING(SEC_TO_TIMESTAMP(created_utc)), 10) day
FROM (select * from [fh-bigquery:reddit_comments.2016_06], [fh-bigquery:reddit_comments.2016_07])
)
WHERE subreddit="polandball"
AND author NOT IN (SELECT author FROM [fh-bigquery:reddit_comments.bots_201505])
GROUP EACH BY 1,2
ORDER BY 1,3 DESC

The results were visualised with the graph function in Excel.

Figure 6: Numbers derived using Google BigQuery with the following SQL code, made by Felipe Hoffa:

SELECT word, s_count, s_ratio, g_count, g_ratio, s_to_g_ratio, weight FROM (
SELECT s.word word, s.c s_count, ROUND(s.ratio,4) s_ratio, g.c g_count, ROUND(g.ratio,4) g_ratio,ROUND(s.ratio/g.ratio,2) s_to_g_ratio, ROUND(s.ratio/g.ratio,2) * s.c weight FROM
(SELECT c, word, ssum, (c/ssum)*100 ratio FROM
(SELECT c,word, sum(c) over () as ssum FROM
(SELECT count(*) c, word FROM
(SELECT REGEXP_EXTRACT(word,r'([a-z0-9\-\']*)') word FROM
(SELECT SPLIT(lower(body),' ') word FROM [fh-bigquery:reddit_comments.2016_04] WHERE lower(subreddit)="polandball"))
GROUP BY word))) s
JOIN EACH
(SELECT c, word, gsum, (c/gsum)*100 ratio FROM
(SELECT c,word, sum(c) over () as gsum FROM
(SELECT count(*) c, word FROM
(SELECT REGEXP_EXTRACT(word,r'([a-z0-9\-\']*)') word FROM
(SELECT SPLIT(lower(body),' ') word FROM [fh-bigquery:reddit_comments.2016_04]))
GROUP BY word))) g
ON g.word = s.word
WHERE s.word NOT IN ('gt','lt','amp')
)
WHERE s_to_g_ratio > 5
ORDER BY s_ratio DESC;

This code was then adjusted to get data from each month in 2016. The tables from all months were appended in an Excel sheet, and the resulting datasheet was visualised with RankFlow, a visualisation tool developed by Bernhard Rieder.

Figure 9: Identical to the method used for figure 5, but with post data appended, using the table bigquery:reddit_posts.2016_11.

Figure 10: Numbers derived using Google BigQuery with the following SQL code, originally written by Felipe Hoffa and adjusted slightly to work with new subreddits:

SELECT sub_a, sub_b, percent, sub_ac, sub_bc
FROM (
SELECT sub_a, sub_b, percent, COUNT(*) OVER(PARTITION BY sub_a) sub_ac, sub_bc
FROM(
SELECT a.subreddit sub_a, b.subreddit sub_b, INTEGER(100*COUNT(*)/FIRST(authors)) percent, COUNT(*) OVER(PARTITION BY sub_b) sub_bc
FROM (
SELECT author, subreddit, authors
FROM FLATTEN((
SELECT UNIQUE(author) author, a.subreddit subreddit, FIRST(authors) authors
FROM [fh-bigquery:reddit_comments.2017_05] a
JOIN [heatmap-new-medi-1475496764277:subreddits_table.subreddit_ranks2017_05] b
ON a.subreddit=b.subreddit
WHERE rank_authors>0 and rank_authors<1500
GROUP EACH BY 2
),author)
) a
JOIN EACH (
SELECT author, subreddit
FROM FLATTEN((
SELECT UNIQUE(author) author, subreddit
FROM [fh-bigquery:reddit_comments.2017_05]
WHERE subreddit IN (SELECT subreddit FROM [heatmap-new-medi-1475496764277:subreddits_table.subreddit_ranks2017_05]
WHERE rank_authors>0 and rank_authors<1500
)
GROUP BY 2
),author)
) b
ON a.author=b.author
WHERE a.subreddit!=b.subreddit
AND (a.subreddit='polandball' OR b.subreddit='polandball')
GROUP EACH BY 1,2
HAVING percent>10
)
)
ORDER BY 2,4 DESC

The resulting code was then visualised with Arbor.js.