Author: Paul Schulze
Installation | How it works | Initialization | Filter | Search | Sort | Print | Activity Plot
Data analytics toolbox and data crawler in Python for Discourse.
The discourseAnalytics API makes it easy to sort, filter and search through posts and display or plot the data.
selenium downloads the user data and public posts as html files. BeautifulSoup4 parses the html files and converts the data into Python objects. The data is cached as json files and converted into a custom dataset format that is accessible via the discourseAnalytics API.
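As a rough illustration of this pipeline, here is a minimal sketch of the same download-parse-cache pattern. This is not the library's actual internals; the URL, browser driver and file names are placeholders:

# Minimal sketch of the download -> parse -> cache pattern described above.
# Not discourseAnalytics' actual code; URL and file names are placeholders.
import json
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("https://discourse.website.com/u/JohnSmith/activity")
html = driver.page_source                    # selenium fetches the rendered page
driver.quit()

soup = BeautifulSoup(html, "html.parser")    # BeautifulSoup4 parses the html
page_title = soup.title.string if soup.title else ""

with open("profile.json", "w") as f:         # cache the parsed data as json
    json.dump({"title": page_title}, f)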
Download the latest release here
Navigate to the downloaded discourseAnalytics folder and install the needed dependencies with
pip install --upgrade --user -r requirements.txt
discourseAnalytics works best in a Jupyter notebook environment, but also works from the command line.
Import discourseAnalytics and initialize the dataset
from DiscourseAnalytics import DiscourseDataset
dataset = DiscourseDataset("discourse.website.com")discourseAnalytics downloads the user profiles and post histories into the folder ./datasets/Discourse/html_files.
The downloaded files are scraped and packaged into one json file that contains all available user information for every user.
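As an illustration only, one entry in that dataset might look roughly like this; the field names are assumptions inferred from the filter arguments documented below, not a confirmed schema:

# hypothetical shape of a single post entry; field names are assumptions
post = {
    "username": "JohnSmith",
    "full_name": "John Smith",
    "topic": "Hi I am John",
    "category": "Introductions",
    "text": "Hello everyone!",
}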
Click on any of the headers to see more information:
Download folder
You can change the download folder to a custom folder with the argument dataset_folder.
dataset = DiscourseDataset(discourse_website, dataset_folder='/home/user/Data/DiscourseDataset')

Download speed
The download process can take some time. To speed things up, you can use the argument sleep_time. This changes the number of seconds that selenium waits for more content to load after scrolling to the bottom of a page.
dataset = DiscourseDataset(discourse_website, sleep_time=1)

Redownload data
If the source files get corrupted, the data can be redownloaded with the arguments overwrite_html=True and overwrite_json=True.
dataset = DiscourseDataset(discourse_website, overwrite_html=True, overwrite_json=True)

Detailed output
More detailed output while downloading and scraping can be printed with the argument supress_output=False.
dataset = DiscourseDataset(discourse_website, supress_output=False)

Filter

Calling the DiscourseDataset with a filter argument such as username returns a new instance of DiscourseDataset with the respective subset of the posts.
# all posts by user "JohnSmith"
posts_by_john = dataset(username="JohnSmith")
# all posts in topic "Hi I am John"
posts_in_hi = dataset(topic="Hi I am John")
# posts by user "JohnSmith" in the topic "Hi I am John"
posts_by_John_in_hi = dataset(username="JohnSmith", topic="Hi I am John")

Posts can be filtered by username, full_name, topic and category.
Posts can also be filtered by time with the arguments post_before, post_after, join_before, join_after, last_post_before and last_post_after, or a combination of the above. These filter according to the post time, the join time and the last time a user posted something. For this purpose, pass a datetime object to the call method of the DiscourseDataset class.
# import the datetime class
from datetime import datetime
# create a datetime object
end_of_2007 = datetime(2007, 12, 31)
# pass datetime object and filter posts
posts_after_2007 = dataset(post_after=end_of_2007)

Search

Posts can be searched with a keyword by calling .search(keyword). All posts that contain the keyword in their text, topic, category or username are returned.
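For example (the keyword here is arbitrary):

# return all posts that mention "python" in text, topic, category or username
python_posts = dataset.search("python")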
Print

Pretty print the dataset with

print(dataset)

Display posts with hyperlinks to topics and users

dataset.display()

Write html file of posts with hyperlinks to topics and users
filename = 'my_posts'
dataset.write(filename)

Activity Plot

A histogram of post times sorted by month can be plotted with .plot(). The histogram can be given a title with .plot(title="Title of Plot").
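For example (the title string is arbitrary):

# plot a histogram of post times per month with a custom title
dataset.plot(title="Posts per Month")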