Generating Fake Reddit Comments with Markov Chains

Articles in Scripting and Automation | By August R. Garcia

Published Thu, 28 Feb 2019 16:31:18 -0800

It's like 2013 or something.

Do you need to write a spam, er, legitimate bot? This would probably even work.

Getting Seed Data

To create some Markov chains, we need some seed text that the script can work off of. Conveniently, I always keep a list of generic Reddit comments on my hard drive, as well as a list of generic English phrases, which have been merged here:

10/10 post
Absolutely not.
Are we almost there?
Are you coming with me?
Are you sure?
As soon as possible.
Be careful driving.
Be careful.
Believe me.
Buy it!
Call me tomorrow.
Can you speak slowly?
Can you translate this for me?
Chicago is very different from Boston.
Come with me.
Cucks stumpted
Dear god, why...
Do it right!
Do you mean it?
Do you see him often?
Do you understand?
Do you want it?
Do you want something?
Don't worry.
Don't do it.
Don't exaggerate.
Don't tell me that.
Everyone knows it.
Everything is ready.
From time to time.
Give me a hand.
Go right ahead.
God bless this thread.
Good idea.
Guys, I may be a freshman undergrad, but I think I have an incredibly simple solution to a massively complex geopolitical problem of which I understand practically nothing yet think I understand completely. Let me offer this shitty allegory to help you understand.
Have a good trip.
Have a nice day.
Have another one.
Have you finished?
He doesn't have time.
He is on his way.
He likes it very much.
He's coming soon.
He's right.
He's very annoying.
He's very famous.
How are you doing?
How are you?
How long are you staying?
How much?
How's work going?
I am crazy about her.
I am wasting my time.
I ate already.
I can do it.
I can't hear you.
I can't believe it.
I can't wait.
I don't know how to use it.
I don't like him.
I don't like it.
I don't speak very well.
I don't understand.
I don't want it.
I don't want that.
I don't want to bother you.
I don't have time.
I don't know anybody.
I don't like it.
I don't think so.
I feel good.
I feel much better.
I found it.
I get off of work at 6.
I hate you!
I have a headache.
I hope so.
I hope you and your wife have a nice trip.
I knew it.
I know.
I like her.
I lost my watch.
I love you.
I need to change clothes.
I need to go home.
I noticed that.
I only want a snack.
I see.
I think it tastes good.
I think it's very good.
I think so.
I thought the clothes were cheaper.
I want to speak with him.
I was about to leave the restaurant when my friends arrived.
I won.
I would like a cup of coffee, please.
I'd like to go for a walk.
I'll call you when I leave.
I'll come back later.
I'll pay.
I'll take it.
I'll take you to the bus stop.
I'm an American.
I'm cleaning my room.
I'm cold.
I'm coming to pick you up.
I'm going to leave.
I'm good, and you?
I'm happy.
I'm hungry.
I'm married.
I'm not busy.
I'm not married.
I'm not ready yet.
I'm not sure.
I'm sorry, we're sold out.
I'm thirsty.
I'm very busy. I don't have time now.
I've been here for two days.
I've heard Texas is a beautiful place.
I've never seen that before.
I'll miss you.
I'll try.
I'm bored.
I'm busy.
I'm having fun.
I'm hungry.
I'm leaving.
I'm ready.
I'm sorry.
I'm used to it.
I've got it.
If you need my help, please let me know.
Is it far?
Is Mr. Smith an American?
Is that enough?
It doesn't matter.
It smells good.
It's longer than 2 miles.
It's about time.
It's all right.
It's different.
It's easy.
It's funny.
It's good.
It's impossible.
It's incredible!
It's near here.
It's not bad.
It's not difficult.
It's not worth it.
It's nothing.
It's obvious.
It's the same thing.
It's time to go.
It's your turn.
Jesus Christ
Just a little.
Just a moment.
Let me check.
Let me think about it.
Let's go have a look.
Let's practice English.
May I speak to Mrs. Smith please?
Me too.
More than that.
Never mind.
Next time.
No, thank you.
Not recently.
Not yet.
Nothing else.
Of course.
Please fill out this form.
Please take me to this address.
Please write it down.
Right here.
Right there.
See you later.
See you tomorrow.
See you tonight.
She is my best friend.
She is so smart.
She's pretty.
Slow down!
Sorry to bother you.
It's almost like Reddit is thousands of different people with thousands of different opinions.
Take a chance.
Take it outside.
Tell me.
Thank you miss.
Thank you sir.
Thank you very much.
Thank you.
Thanks for everything.
Thanks for your help.
That happens.
That looks great.
That smells bad.
That's alright.
That's enough.
That's fine.
That's it.
That's not fair.
That's not right.
That's right.
That's too bad.
That's too many.
That's too much.
That's enough.
That's interesting.
That's right.
That's true.
The book is under the table.
There are too many people here.
They like each other.
They'll be right back.
They're the same.
They're very busy.
Think about it.
This doesn't work.
This is a quality thread.
This is very difficult.
This is very important.
Too bad!
Try it.
Very good, thanks.
Wait for me.
We like it very much.
What did you say?
What do you think?
What is he talking about?
What terrible weather!
What's going on/ happening / the problem?
What's the date today?
Where are you going?
Where is he?
Would you take a message please?
Yes, really.
You are impatient.
You look tired.
You surprise me.
You're beautiful.
You're very nice.
You're very smart.
You're always right.
You're crazy.
You're in a bad mood.
You're lying.
You're welcome.
You're wrong.
Your things are all here.
10/10 Post
Cucks stumped
I laughed way harder than I should have
Came here to post this
Username checks out
They had one job
Chopping onions in here
Spaghetti falling out of pockets left and right.
This should be higher
Underrated comment
Someone x-post this to /r/circlejerk
Literally spilled coffee all over my screen
*Tips Fedora*
/r/HailCorporate much?
An upvote for you, good sir
Manly tears were shed
Mind = Blown
What did I just read?
Da fuq?
That escalated quickly
I can't fap to this
I laughed way harder than I should have
Get out of here with your logic
Plot twist of the century
Directions unclear; dick stuck, etc. etc.
Step one: be attractive. Step two: don't be unattractive.
About tree fiddy
This guy fucks
Would not bang
Are you me?
You... I like you...
Right in the feels
Impeach Trump
Faith in humanity restored
I know that feel, bro
You magnificent bastard
Puppers doggos puppers doggos puppers heckin doggers. uuuuuuuuuuuuugh.
This place is an echo chamber
Good bot
Something something broken arms
As a lawyer/engineer/professional shovel maker, this post is cancer.
god I hate reddit
[Rick and Morty Reference]
Hey it's me, ur generic Reddit comment
I read this in Morgan Freeman's voice!
Only Siths deal in absolutes
Hero we need, not the hero we deserve
You da real mvp
Can confirm; am a wizard.
I feel personally targeted by this. 😂😂
To be fair...
Would make a good band name, though.
Why doesn't this have more upvotes?
Build the wall
This is why they need to build the wall
Can confirm, am redditor.
You forgot the part about fucking a coconut.
/r/circlejerk is leaking
/r/politics is leaking
/r/the_donald is leaking
/r/redacted is leaking
No idea why youre being downvoted
Well alright then.
Everyone on Reddit is a bot except you
This made me smile :)
You are now moderator of r/Pyongyang
Not all heros wear capes
Are you a presidential speechwriter?
"Unpatriotic" is too many syllables for Trump.
Please tell me you explained it to them!
It's already happening
Ron Paul 2012
Trump tweet in 3... 2... 1...
If you think that's a cock-up, just wait until you see the GOP's tax plan
Two genders; two scoops; two terms.
Trump 2020
Clinton 2020
Barron 2020
Zuckerberg 2020
Cuckerberg 2020
Trayvon 2020
Tom Hanks 2020
Hitler-Stalin 2020
Putin 2020
You could put that on a T-shirt
In real life, trolls are just called assholes.
Maybe he can upgrade his two scoops to five Scoops.
Ugh. I want the government to do it's job.
Showing some real love... What a heart...
At this point, who isn't?
Just baffled by the stupidity and greed.
Negative, Ghostrider.
Stupid assholes.
Thank you.
More people need to see this post.
What the h*ck
It's black magic.
Wow this got really dark...
Kind of a shitpost, but I'm upvoting this anyway.
Well, fuck me then.
No way this is for real.
Stranger than fiction.
Tough, but fair.
Meanwhile, in America...
Meanwhile, in Europe...
Meanwhile, in Africa...
Meanwhile, in Mexico
This is why they need to build the wall.
I feel like this is a repost.
George Washington is rolling in his grave.
Fuck it. I'm out.
Every time...
Time to burn down everything.
Who would have guessed?
Liar in Chief
I fucking love the God Emperor
God bless Donald J. Trump
Muy bueno.
Who upvoted this shitpost?
To be fair, you have to have a very high IQ to understand Rick and Morty...
This looks staged.
Well, at least there was an attempt
👏 Self improvement 👏
this is fake
Don't know why but feels staged to me
Also 9-11 was an inside job.
Y'all going overboard now...
*Before God created light, he made a redditpost called ''First!''*
Actions speak louder than words.
Barking up the wrong tree here.
Ball is in the court... We'll see what happens, I guess.
Beating around the bush 24/7/365. Get to the damn point.
Best thing since sliced bread.
I need more coffee to deal with this shit.
Welp, back to the drawing board...
Eh. Give the benefit of the doubt before jumping to any conclusions.
God d*ng last straw.
Not a hint of decency...
Mother fuckers need to get on the ball.
Go hard or go home.
Wouldn't be caught dead associated with this cancer.
I got cancer from this post.
Yeah, well,  a million dollars is no small chunk of change.
Understatement of the century
Go go gadget shitpost
Shariablue get out reeeeeeee
Kind of a paradox
How do you uninstall reddit
Dial S for Shitpost
Mic status: Dropped
No words...
Someone please kill me.
I literally want to die.

You can also pull arbitrary comments from Reddit fairly easily using Reddit's .json API. Add .json to the end of any URL to get that page as JSON, which can then be easily parsed.
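A rough sketch of pulling the comment text out of that JSON, in Python. It assumes a thread's .json response is a two-element list whose second element is a listing of comments, with each comment's text under data.body and nested replies under data.replies. That is the shape I would expect, but treat it as an assumption and check a real response. The function name and the sample data are made up for illustration, and no network request is made here.

```python
import json

def extract_comment_bodies(listing):
    """Recursively collect comment text from a Reddit-style listing dict."""
    bodies = []
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") != "t1":   # assumed: "t1" marks a comment
            continue
        data = child.get("data", {})
        if data.get("body"):
            bodies.append(data["body"])
        replies = data.get("replies")
        if isinstance(replies, dict):   # assumed: an empty string when there are no replies
            bodies.extend(extract_comment_bodies(replies))
    return bodies

# Hypothetical, hand-written stand-in for a thread's .json response
sample_thread = json.loads("""
[
  {"kind": "Listing", "data": {"children": []}},
  {"kind": "Listing", "data": {"children": [
    {"kind": "t1", "data": {"body": "Username checks out", "replies": {
      "kind": "Listing", "data": {"children": [
        {"kind": "t1", "data": {"body": "Good bot", "replies": ""}}
      ]}}}},
    {"kind": "more", "data": {"children": ["abc123"]}}
  ]}}
]
""")

comments = extract_comment_bodies(sample_thread[1])
print("\n".join(comments))
```

Writing one comment per line to a text file would give the Markov script below something to chew on.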

Generating Markov Chains

Work is hard, so let’s cheat to get started and steal this quality Markov chain generator from jcsongor of GitHub. Here's the original code:

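#!/bin/bash
if [ ${#} -lt 2 ];then
	echo "$(basename "$0") [file] [words] - shell script to generate markov chain texts 

		file	Text file to be used to generate random text
		words	Number of words to generate"
	exit 0
fi
if [ ! -f "${1}" ];then
	echo "File not found: ${1}"
	exit 1
fi
# Flatten the file to one line, strip punctuation, and lowercase everything
words=`tr '\n' ' ' < $1|sed 's/[^[:alnum:][:space:]]\+//g'| tr '[:upper:]' '[:lower:]'`
# Pick a random two-word starting pattern
index=$((2 + RANDOM % `echo $words|wc -w`))
pattern=`echo $words|cut -f$index-$((index+1)) -d' '`
echo -n $pattern
for i in `seq $2`;do
	candidates=''
	word1=''
	word2=''
	# Collect every word that follows the current two-word pattern
	for word in $words;do
		if [ -n "${word2}" -a -n "${word1}" -a "${word2} ${word1}" = "${pattern}" ];then
			candidates="${candidates} ${word}"
		fi
		word2=$word1
		word1=$word
	done
	# Choose one candidate at random and slide the pattern forward by one word
	nextword=`shuf -e -n1 $candidates`
	pattern=`echo $pattern|sed 's/.* //g'`" ${nextword}"
	echo -n " ${nextword}"
done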

This can be run from the Linux terminal as follows:

  • ./markov.sh words-for-seed-data.txt num_words
  • ./markov.sh generic-reddit-comments.md 10

And if you want a linebreak after that:

  • ./markov.sh generic-reddit-comments.md 10 && echo ""

To loop through repeatedly, generating multiple comments (in this case, 10):

  • for i in `seq 1 10`; do ./markov.sh generic-reddit-comments.md 12 && echo ""; done

Running the above gets us draft #1:

  • alright then everyone on reddit is a repost george washington is rolling in his
  • bless donald j trump muy bueno who upvoted this shitpost to be fair you
  • have get out reeeeeeee kind of a paradox oxymoron how do you uninstall reddit
  • want the government to do its job showing some real love what a heart
  • a paradox oxymoron how do you uninstall reddit dial s for shitpost mic status
  • updooted uproned build the wall i feel personally targeted by this to rcirclejerk literally
  • build the wall this is fake dont know why but feels staged to me
  • negative ghostrider stupid assholes thank you more people need to build the wall i
  • im upvoting this anyway well fuck me then hahahahah no way this is a
  • wow this got really dark kind of a paradox oxymoron how do you uninstall

Result: Somehow more plausible than I expected.
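Part of why this reads so plausibly: the script keys on the last two words, and in a small corpus most two-word patterns have exactly one possible successor, so the output tends to copy long runs of the seed text verbatim. Here is a minimal Python sketch of the same two-word idea, not a port of the script (it skips the lowercasing and punctuation stripping). The function names and the toy corpus are assumptions for illustration.

```python
import random
from collections import defaultdict

def build_bigram_table(words):
    """Map each pair of consecutive words to the words seen right after it."""
    table = defaultdict(list)
    for first, second, following in zip(words, words[1:], words[2:]):
        table[(first, second)].append(following)
    return table

def generate(words, n_words, rng=random):
    """Start from a random two-word pattern and append up to n_words more."""
    table = build_bigram_table(words)
    pattern = rng.choice(list(table))
    output = list(pattern)
    for _ in range(n_words):
        candidates = table.get(pattern)
        if not candidates:              # this pattern only occurs at the very end
            break
        next_word = rng.choice(candidates)
        output.append(next_word)
        pattern = (pattern[1], next_word)
    return " ".join(output)

toy = "this is why they need to build the wall this is fake".split()
table = build_bigram_table(toy)

# Patterns with more than one distinct successor are the only places
# where the chain can do anything other than copy the source text.
branching = {k: v for k, v in table.items() if len(set(v)) > 1}
print(branching)
print(generate(toy, 8))
```

In this toy corpus only "this is" has two possible successors; every other pattern just replays the source.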

Touch Ups

First off, let's randomize the sentence length. One way to get a random number from a shell script (in this case, from 1-25) is:

  • VAR=$(shuf -i 1-25 -n 1)

And we'll also mix some generic English phrases into a master comments file, then run the script against that:

  • for i in `seq 1 10`; do VAR=$(shuf -i 1-15 -n 1) && ./markov.sh generic-comments-master.md $VAR && echo ""; done

Running this gets:

  • likes it very much hes coming
  • time now ive been
  • to go home wouldnt be caught
  • see him often do you think what is he talking about
  • checks out they had one job chopping onions in here spaghetti falling
  • rolling in his grave fuck it im out alecbaldwinmp4 every time time to
  • im bored im busy im not ready yet im not ready yet im not married im
  • ready im sorry
  • screen tips fedora who did this rhailcorporate much an upvote for you good sir manly
  • you tomorrow see you later see you later see you tomorrow see you tomorrow see

Everything's still lowercase, so rather than modifying the source code of the Bash script, let's write this script in Python instead (of course cheating by starting from an existing script) and then make some modifications:

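import numpy as np
import io
import sys
import random

# System Arguments
fn        =     sys.argv[1]
min_words = int(sys.argv[2])
max_words = int(sys.argv[3])
num_loops = int(sys.argv[4])

# Read in the source file
words = io.open(fn, encoding='utf8').read()

# Get the words from the file
corpus = words.split()

def make_pairs(corpus):
    # Yield each word together with the word that follows it
    for i in range(len(corpus)-1):
        yield (corpus[i], corpus[i+1])

# Generate a Markov chain of the specified length
def main(min_words, max_words):
    pairs = make_pairs(corpus)

    # Map each word to the list of words that follow it in the corpus
    word_dict = {}

    for word_1, word_2 in pairs:
        if word_1 in word_dict.keys():
            word_dict[word_1].append(word_2)
        else:
            word_dict[word_1] = [word_2]

    # Start on a word that is not all lowercase (usually a sentence start)
    first_word = np.random.choice(corpus)

    while first_word.islower():
        first_word = np.random.choice(corpus)

    chain = [first_word]

    n_words = random.randint(min_words, max_words)

    # The first word is already in the chain, so add n_words - 1 more
    for i in range(n_words - 1):
        followers = word_dict.get(chain[-1])
        if not followers:
            # The last word of the corpus may have no follower
            break
        chain.append(np.random.choice(followers))

    print(' '.join(chain))

# Main Loop
for x in range(0, num_loops):
    main(min_words, max_words)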

This can be run from the terminal like this to get ten chains of 1-30 words each:

  • python markov.py generic-comments-master.md 1 30 10

Gets these results:

  • I hope you me? Chicago is
  • this is so smart. She's pretty.
  • You forgot the century Go go for a good sir Manly tears were cheaper. I don’t like you... Right there. See
  • They like Reddit is he talking about? What did you sir. Thank you. Stop! It's already happening / the bush 24/7/365. Get out this post this to help
  • Of course. Okay. Please fill out They
  • You surprise me. You're very important. Too bad! Try it. I understand Rick
  • It’s funny. It’s near here. It’s good. I have Came here Spaghetti falling out of decency... Mother fuckers
  • Maybe. IDK LOL. Absolutely not. Are we deserve You could put that feel, bro You are you want something? Don't worry. Don’t
  • Europe... Meanwhile, in absolutes Hero we deserve You surprise me. You're very difficult. This looks great. That
  • Someone please kill

Which is roughly the quality of average Reddit posts.
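The reason these hop between sentences so freely is that the Python version only looks at one previous word. A small sketch of the lookup table it builds, using three phrases from the seed list as a toy corpus; the function name is an assumption and this is not the script itself.

```python
from collections import defaultdict

def build_transitions(corpus):
    """Map each word to every word that follows it, duplicates included."""
    transitions = defaultdict(list)
    for current_word, next_word in zip(corpus, corpus[1:]):
        transitions[current_word].append(next_word)
    return dict(transitions)

toy_corpus = "I don't like it. I don't think so. I think so.".split()
transitions = build_transitions(toy_corpus)

for word, followers in transitions.items():
    print(word, "->", followers)
```

Duplicates are kept on purpose: picking uniformly from a list that holds "don't" twice and "think" once makes "don't" twice as likely after "I", so the choices end up weighted by how often each pair appears in the seed text.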

You can get finer-tuned results by scraping comments from a specific thread by adding .json to the end of its URL. Or use an XPath query; this one (basically) works on old.reddit.com threads:

  • //form/div/div/p
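As a sketch of what that query selects, here it is run with Python's standard-library ElementTree against a tiny invented snippet. The markup below is an assumption, not Reddit's actual HTML, and ElementTree only parses well-formed XML, so a real page would need a proper HTML parser first. ElementTree also wants the path written relative to the root, hence the leading dot.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for a comment page
sample_page = """
<html><body>
  <form>
    <div>
      <div><p>Came here to post this</p></div>
    </div>
    <div>
      <div><p>This should be higher</p></div>
    </div>
  </form>
  <div><p>Sidebar text that should be ignored</p></div>
</body></html>
"""

root = ET.fromstring(sample_page)

# Same shape as //form/div/div/p: any form, then div, div, p as direct children
comments = ["".join(p.itertext()) for p in root.findall(".//form/div/div/p")]
print(comments)
```

Only paragraphs nested exactly two divs deep inside a form match, which is why the sidebar paragraph is skipped.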

After cramming some more data into the sheet, this is an example run of the script:

  • *Tips Fedora*
  • First, let's make it after all this to his dick.
  • Bots. Why doesn't this on here even think the Horse kept on the most handsome man wishes to the old people of pockets left arm still
  • DID THIS /r/HailCorporate much? How's work and the table.
  • Country for me. Q: Why are you tomorrow. See you
  • E M E R S I see. I found it I've never had
  • Dolly Parton.
  • Have you brought up to talk. \proceeds to see you get a semicolon. They're so much better. My friend who are in a...gaming...subreddit?...ugh. ANY one of a wizard. TFW no redeeming
  • 4: What do you cross a loophole to the flag's a pirate's favorite jokes about time. I was about Jose and
  • Turns out, it
Sir, I can do you a nice SEO.

August R. Garcia, Site Owner

Mexico and $25 million – nobody else scattered. You know, Trace came up on television. I negotiate. It’s not signing books – I can solve the economy in Nevada, I don’t care. I analyze it.

And, they don’t think so. I guess the costs. Because she was sleeping. I’m going to talk about it. Number one didn’t even 55 years you’re rich. I don’t know if he going on. Remember that. Do you don’t know. Look, I’m going to many millions of you we have been paying it with me. There were issued visas. Large numbers of us cars coming probably suits me like some of it. We'll be free rides any money. Guys that heat? For instance, we have created that is great…

They’re not really good things... And they call up here in the only because we started, I think of the United States permanently admits more coming along with me today. Speeches all come in, you going to pick a real good health care.



Very, very nice. Curious, do you think it would be possible to generate a full article if:

  1. You scraped a bunch of articles on the topic
  2. You gave it an input of keywords to include

For example, scraping the latest news on North Korea, and then trying to do a Markov chain that somehow connects ['Donald Trump', 'Kim Jong Un', 'poor nation', 'starving', 'nuclear program'], let's say?

Basically, if it could start off with "Donald Trump", then do a Markov chain based on the scraped articles that leads to the next word, then to the next one, and so on. Could also make sure that each item from the input word array is contained in the scraped content.

Hope that makes sense.
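One naive way to sketch that idea in Python: check that each keyword actually occurs in the scraped text, then emit each keyword in order and let a one-word Markov chain run for a few words after it. This does not find a smooth path from one keyword to the next; it only guarantees every keyword shows up. All function names are hypothetical, and the "scraped" text is a stand-in built from lines in the seed list above.

```python
import random
from collections import defaultdict

def build_transitions(words):
    """Map each word to the list of words that follow it."""
    transitions = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        transitions[current_word].append(next_word)
    return transitions

def missing_keywords(keywords, text):
    """Return the keywords that never appear in the scraped text."""
    return [keyword for keyword in keywords if keyword not in text]

def keyword_seeded_text(text, keywords, words_per_keyword, rng=random):
    """Emit each keyword in order, following it with a short chain."""
    transitions = build_transitions(text.split())
    segments = []
    for keyword in keywords:
        chain = keyword.split()         # start the segment with the keyword itself
        for _ in range(words_per_keyword):
            followers = transitions.get(chain[-1])
            if not followers:           # dead end: nothing ever followed this word
                break
            chain.append(rng.choice(followers))
        segments.append(" ".join(chain))
    return " ".join(segments)

# Stand-in for scraped articles: a few lines from the seed list
scraped_text = (
    "This is why they need to build the wall. "
    "George Washington is rolling in his grave. "
    "I feel like this is a repost."
)
keywords = ["George Washington", "build the wall."]

absent = missing_keywords(keywords + ["Kim Jong Un"], scraped_text)
print("Not in the scraped text:", absent)
print(keyword_seeded_text(scraped_text, keywords, 6))
```

Keywords reported as absent could be dropped or used as a signal to scrape more articles before generating anything.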

