
I have an interest in robustness in NLP, and therefore also in a variety of different types of language. Although Wikipedia is commonly used in NLP research, the topically oriented Fandom wikis can be an interesting source for transfer learning. Fandoms are online wikis that each focus on a certain topic (usually games, entertainment or culture), where fans can collect and find information.

From https://about.fandom.com/about: “Fandom encompasses over 40 million content pages in over 80 languages on 250,000 wikis about every fictional universe ever created.”

The license of Fandoms is CC BY-SA, so you have to use the same license when redistributing the data.

I have open-sourced the script that I use for downloading a specific Fandom. It contains clear instructions for each step, and some steps (tokenization, deduplication) can be skipped depending on your needs. The script focuses on the main text of the wiki and does not include its history, comments, etc. The script can be found at: https://bitbucket.org/robvanderg/fandom_download/src/master/
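To give an idea of what the optional tokenization and deduplication steps could look like, here is a minimal Python sketch. It is not taken from the script itself: the function names `tokenize` and `dedup_lines`, the regex-based tokenizer, and the choice to deduplicate on exact line matches are all assumptions made for illustration.

```python
import re


def tokenize(line):
    """Split a line into word tokens and single punctuation marks.

    Illustrative regex tokenizer (an assumption, not the tokenizer
    used by the actual script).
    """
    return re.findall(r"\w+|[^\w\s]", line)


def dedup_lines(lines):
    """Drop empty lines and exact duplicates, keeping first occurrences in order."""
    seen = set()
    unique = []
    for line in lines:
        key = line.strip()
        if not key or key in seen:
            continue
        seen.add(key)
        unique.append(key)
    return unique


if __name__ == "__main__":
    # Hypothetical example lines standing in for extracted wiki text.
    raw = ["The hero returns.", "The hero returns.", "", "A new quest begins!"]
    for line in dedup_lines(raw):
        print(" ".join(tokenize(line)))
```

Exact-match deduplication is the simplest choice and removes repeated template text such as navigation lines. Near-duplicate detection would need a different approach.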
