This project aims to visualize the representation learned by a word-level language model trained on a collection of literary classics. The aim is purely aesthetic, not scientific: what is presented here should be interpreted with caution (or not interpreted at all).
This project was inspired by What do numbers look like.
Features
- Automated data preparation: from PDF to numpy arrays.
- Integrated hyper-parameter tuning for the language model.
- Optional grid search over UMAP hyper-parameters.
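As a rough illustration of what a grid search over UMAP hyper-parameters involves, the sketch below enumerates every combination of two parameters. `n_neighbors` and `min_dist` are real umap-learn parameters, but the candidate values and the `build_grid` helper are assumptions made for illustration, not this project's actual configuration.

```python
from itertools import product


def build_grid(param_values):
    """Return one dict per combination of the given hyper-parameter values."""
    names = list(param_values)
    combos = product(*(param_values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Hypothetical search space; the values used by this project are not documented here.
grid = build_grid({"n_neighbors": [5, 15, 50], "min_dist": [0.0, 0.1, 0.5]})

for params in grid:
    # With umap-learn installed, each candidate could be fitted with:
    #   umap.UMAP(**params).fit_transform(word_vectors)
    print(params)
```

Each resulting projection would then be inspected visually, since the goal here is aesthetic rather than a score to optimize.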
Language Model and Encoder
The artificial neural network used for this project has the following architecture:
To extract the representation learned by the model, we constructed an encoder composed of all the transformations performed by the model in its first portion:
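The idea of reusing the first portion of a model as an encoder can be sketched with a toy example. Everything below is an assumption for illustration: the layer types, the sizes, and the random weights are hypothetical and do not reproduce the architecture used in this project. The sketch only shows that the encoder shares its transformations with the full model and stops before the prediction head.

```python
import math
import random

random.seed(0)

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 6, 4, 3  # hypothetical sizes


def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Toy weights standing in for a trained model.
embedding = random_matrix(VOCAB_SIZE, EMBED_DIM)  # word id -> vector
w_hidden = random_matrix(EMBED_DIM, HIDDEN_DIM)   # first portion of the model
w_output = random_matrix(HIDDEN_DIM, VOCAB_SIZE)  # prediction head


def matvec(vector, matrix):
    """Multiply a row vector by a matrix."""
    return [
        sum(v * row[j] for v, row in zip(vector, matrix))
        for j in range(len(matrix[0]))
    ]


def encoder(word_id):
    """Apply only the first portion of the model: embedding, then hidden layer."""
    return [math.tanh(x) for x in matvec(embedding[word_id], w_hidden)]


def language_model(word_id):
    """Full model: the encoder followed by a softmax over the vocabulary."""
    logits = matvec(encoder(word_id), w_output)
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


representation = encoder(2)          # the vector that would be visualized
next_word_probs = language_model(2)  # the output the model is trained on
```

Because `language_model` calls `encoder` internally, training the full model is what shapes the representation returned by the encoder.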
How to Use
- Create a folder in data/raw named your_project_name.
- Populate the your_project_name folder with the books you want to embed, in PDF format.
- In data/jsons, create your_project_name.json, mapping the title of each book to a valid matplotlib colormap:

    {
        "Dracula": "Reds",
        "The Picture of Dorian Gray": "plasma",
        "Strange Case of Dr Jekyll and Mr Hyde": "viridis",
        "King Solomon's Mines": "autumn",
        "Twenty Thousand Leagues Under the Sea": "winter",
        "The Invisible Man": "summer"
    }

- From the terminal, launch run_pipeline.py and specify your_project_name when prompted to do so. Alternatively, each script in run_pipeline.py can be launched separately (in case a specific step needs to be executed in isolation).
- When the script is done (this can take quite some time), use the notebook generate_visuals.ipynb to obtain the visuals.
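The folder and JSON setup described in the steps above could also be scripted. The following is a minimal sketch, not part of the repository: `prepare_project` and `invalid_colormaps` are hypothetical helpers, and a temporary directory stands in for the repository root so the example has no side effects.

```python
import json
import tempfile
from pathlib import Path


def prepare_project(root, project_name, colormaps):
    """Create data/raw/<project_name> and data/jsons/<project_name>.json under root."""
    raw_dir = Path(root) / "data" / "raw" / project_name
    raw_dir.mkdir(parents=True, exist_ok=True)

    json_dir = Path(root) / "data" / "jsons"
    json_dir.mkdir(parents=True, exist_ok=True)
    json_path = json_dir / f"{project_name}.json"
    json_path.write_text(json.dumps(colormaps, indent=4), encoding="utf-8")
    return raw_dir, json_path


def invalid_colormaps(colormaps, valid_names):
    """Return the titles whose colormap is not in valid_names."""
    return [title for title, cmap in colormaps.items() if cmap not in valid_names]


colormaps = {"Dracula": "Reds", "The Invisible Man": "summer"}

# A temporary directory stands in for the repository root in this sketch.
with tempfile.TemporaryDirectory() as root:
    raw_dir, json_path = prepare_project(root, "your_project_name", colormaps)
    saved = json.loads(json_path.read_text(encoding="utf-8"))
    raw_dir_exists = raw_dir.is_dir()

# Small hand-written subset of names; with matplotlib installed,
# matplotlib.pyplot.colormaps() returns the full list of registered names.
bad_titles = invalid_colormaps(colormaps, {"Reds", "summer", "winter", "viridis"})
```

Checking the colormap names before launching the pipeline avoids discovering a misspelled name only at the visualization step.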
Example
The League of Extraordinary Gentlemen
In this example, we attempted to visualize the novels linked to some of the characters in Alan Moore and Kevin O’Neill’s graphic novel “The League of Extraordinary Gentlemen”:
- Mina Harker
- Allan Quatermain
- Hawley Griffin
- Dorian Gray
- Edward Hyde
- Captain Nemo
In the following visualizations, each novel is assigned a specific sequential color palette (see the “How to Use” section). All the sentences constituting the novels are represented as sequences of colored dots. Each dot represents a word in a sentence, while its shade within the sequential palette indicates the position of that specific word inside the sentence.
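The mapping from word position to palette position can be sketched as follows. This is an illustrative assumption about how such a mapping might be computed, not the code used in generate_visuals.ipynb; `word_positions` is a hypothetical helper, and splitting on whitespace is a simplification.

```python
def word_positions(sentence):
    """Return (word, position) pairs with position scaled to [0, 1]."""
    words = sentence.split()
    last = max(len(words) - 1, 1)  # avoid dividing by zero for one-word sentences
    return [(word, index / last) for index, word in enumerate(words)]


pairs = word_positions("I am Dracula")

# Each position could then be passed to a sequential matplotlib colormap,
# e.g. matplotlib.pyplot.get_cmap("Reds")(position), to obtain an RGBA color.
for word, position in pairs:
    print(word, position)
```

With this scaling, the first word of every sentence takes the lightest end of the palette and the last word the darkest (or the reverse, depending on the colormap), regardless of sentence length.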
Word embedding for the entire collection of books
Word embedding for specific books
Sequential visualization of word embedding for a specific book
Credits
- The core idea for this project comes from What do numbers look like.
- The PDF files used for producing the examples in this repository come from Project Gutenberg.
- The selection of novels was inspired by The League of Extraordinary Gentlemen co-created by Alan Moore and Kevin O’Neill.
- The selected novels were written by:
- Dracula by Bram Stoker.
- King Solomon’s Mines by Sir Henry Rider Haggard.
- The Invisible Man by Herbert George Wells.
- The Picture of Dorian Gray by Oscar Fingal O’Flahertie Wills Wilde.
- The Strange Case of Dr Jekyll and Mr Hyde by Robert Louis Stevenson.
- Twenty Thousand Leagues Under the Sea by Jules Gabriel Verne.
- The music is Lacrimosa from the Requiem in D minor by Wolfgang Amadeus Mozart.
License
The code produced for this project is released under the MIT License.