# Module 4 Exercises

There are many different tools and approaches you could use to visualize your data, both as a preliminary pass to spot the holes and also for more formal analysis. For this module, then, I would like you to select the two of these exercises that seem most germane to your own research project. You are welcome to work through more of them, of course, but I want the exercises to move your own research forward. Some of these I wrote; some are adapted from The Macroscope; others are adapted or used holus-bolus from scholars like Miriam Posner, Fred Gibbs, and Heather Froehlich (and I'm grateful that they shared their materials!).

NB: If you start working with the R exercises below, I would suggest you read the introductory bits of Lincoln Mullen's book-in-progress, DH Methods in R, especially the 'setup' part under 'getting started' (pay attention to the bit on installing packages and dependencies). If you spot any exercises in Mullen's book that seem relevant to your project, you may do those as an alternative to the ones here. Alternatively, go to Swirl and learn the basics of R within R itself (it's an interactive tutorial - try it out).

| Texts | Networks | Maps | Charts |
|-------|----------|------|--------|
| Topic Modeling Tool | Network analysis and visualization | Simple mapping & georectifying | Quick charts using RAW |
| Topic Modeling in R | Converting 2-mode to 1-mode | QGIS (tutorials by Fred Gibbs) | |
| Text Analysis with Overview | Graphing the Net | Geoparsing with Python | |
| Corpus Linguistics with AntConc | Choose your own adventure | Palladio with Posner | |
| Text Analysis with Voyant | | | |
| Text Analysis in R | | | |

----

## exercise 1
### Network Visualization

In exercise 1, you will transform your Texan Correspondence data into a network, which you will then visualize with Gephi. The detailed instructions are here. I would recommend that you also take a long look at Scott Weingart's series, Networks Demystified. Finally, heed our warning.
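If you later want to experiment beyond Gephi's point-and-click interface, here is a minimal sketch in R using the igraph package. The filename and the 'source'/'target' column names are my assumptions - match them to your own correspondence data:

```
# a sketch, not the exercise's official method: build a network from an
# edge list of letter writers and recipients, then export a file Gephi can open
library(igraph)

# hypothetical file: one row per letter, columns 'source' and 'target'
edges <- read.csv("texan-correspondence.csv", stringsAsFactors = FALSE)

g <- graph_from_data_frame(edges, directed = TRUE)
degree(g)   # quick look: who sends or receives the most letters?

# .graphml files open directly in Gephi
write_graph(g, "correspondence.graphml", format = "graphml")
```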


----

## exercise 2
### Topic Modeling Tool

In exercise 2, you will use the 'Topic Modeling Tool' to create a simple topic model and a webpage that allows you to browse the results.

  1. Download the tool. (The site for the tool is https://code.google.com/p/topic-modeling-tool/.)
  2. Make sure you have the Colonial Newspaper Database handy on your machine. (You can grab my copy from here).
  3. Double-click on the file you downloaded in step 1. This will open a Java-based graphical user interface for one of the most common topic-modeling approaches, 'Latent Dirichlet Allocation'.
  4. Set the input to be the Colonial Newspaper Database.
  5. Set the output to be somewhere neat and tidy on your computer.
  6. Set the number of topics you'd like to model.
  7. Click 'train topics' to run the algorithm.
  8. When it finishes, go to the folder you selected for output, and find the file 'all_topics.html' in the 'output_html' folder. Click on that, and you now have a browser-based way of navigating your topics and documents. In the output_csv folder created, you will find the same information as csv, which you could then input into a spreadsheet for other kinds of visualizations (which we'll talk about in class.)

Make a note in your open notebook about your process and your observations. How does reading the material in this way change, challenge, or focus your understanding of it?

**Going Further** Remember when we did the Canadiana API and WGET exercises in Module 2? Somewhere on your machine you have a collection of those materials. You can load those materials into the Topic Modeling Tool if you have all of the txt files in a single folder. In the case of the slavery documents, that was something like 7500 items - that's a lot of drag-and-drop. Instead, you can 'flatten' the folder structure so that all of the documents in your subfolders are put into a single folder. If you are on a Mac, try these instructions; on a PC, try this one (there are scripts you can use, but for the time being this is probably simplest). Then, you can point your topic modeling tool at your flattened folder, and boom: you have a topic model fitted to your collection.
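If you're comfortable in R, you can also script the flattening; here's a minimal sketch (the folder names are placeholders - substitute your own):

```
# copy every txt file from a nested folder structure into one flat folder
# ('my-documents' and 'flattened' are placeholder names)
files <- list.files("my-documents", pattern = "\\.txt$",
                    recursive = TRUE, full.names = TRUE)
dir.create("flattened")
file.copy(files, "flattened")
```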


----

## exercise 3
### Topic Modeling in R

Exercise 2 was quite a simple way to do topic modeling. There are more powerful ways, and one of these uses RStudio, an interface for the R statistical programming language. R is a powerful language for exploring, visualizing, and manipulating all kinds of data, including text - but it is not the most intuitive of environments to work in. That is where RStudio comes in. Download the free version and install it on your machine. (Note that you need to have R downloaded and installed first!) Then, go to http://tryr.codeschool.com/ and work your way through some of that tutorial. The tutorial mimics working right in the R console - remember working in git bash or the terminal in Module 3? It's somewhat similar to that, but just for R. A handy pdf that explains a bit more about working within the RStudio environment can be had here. In essence, you put your code in the script window and execute each line; the output appears in the console, or in the plots window. This handout will guide you around the interface.

In this exercise, we're going to grab the Colonial Newspaper Database from my github page, do some exploratory visualizations, and then create a topic model whose output can be visualized further in other platforms (including as a network in Gephi). The walkthrough can be found here. Each gray block is something to copy and paste into your script window in RStudio. Put the cursor at the start of the first line, and hit ctrl+enter to get RStudio to execute each line. When you get to another gray block in the walkthrough, just copy and paste it into your script window after the earlier block, and work your way through. The walkthrough gives you an indication of what the output should look like as you move through it. (The walkthrough was written inside R, and then turned into HTML using an R package called 'knitr'. You can see that this has implications for open research! For reference, here's the original Rmd (R markdown) file that generated the walkthrough.)

By the way: when you run the line `topic.model$train(1000)`, your console will fill up with data as it iterates 1000 times over the entire corpus, fitting a topic model to it. This is as it should be!

In this way, you'll build up an entire script for topic modeling materials you find on the web. You can then save your script and upload it to your open notebook. In the future, you'd be able to make just a few changes here and there in order to grab and explore different data.
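To give you a sense of the shape of that script, here is a minimal sketch of the core calls, using the 'mallet' package. The csv column names ('id', 'text') and the stoplist file are assumptions on my part - follow the walkthrough itself for the working version:

```
library(mallet)

# assumes the CND csv has an 'id' column and a 'text' column - check yours
documents <- read.csv("CND.csv", stringsAsFactors = FALSE)

mallet.instances <- mallet.import(as.character(documents$id),
                                  documents$text,
                                  "en.txt",   # a stopword list file
                                  token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

topic.model <- MalletLDA(num.topics = 30)
topic.model$loadDocuments(mallet.instances)
topic.model$train(1000)   # the console fills with output - that's normal

# the ten words most strongly associated with topic 1
mallet.top.words(topic.model, mallet.topic.words(topic.model)[1, ], 10)
```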

Make a note in your open notebook about your process and your observations.

**Going further** If you wanted to use that script on the materials you collected in module 2, you would have to tell R to load those materials from a directory, rather than by reading a csv file. Take a look at my script for topic modeling the Ferguson Grand Jury documents, especially this line:

```
documents <- mallet.read.dir("originaldocs/1000chunks/")
```

You feed it the path to your documents. If you are on a Windows machine, the path would look a bit different, for instance:

```
# a plausible Windows-style path (an example; R also accepts forward slashes on Windows):
documents <- mallet.read.dir("C:/originaldocs/1000chunks/")
```
----

## exercise 4
### Text Analysis with Overview

In exercise 4, we're going to look at the Colonial Newspaper Database again, but this time using a tool called 'Overview'. Overview uses a different approach than the topic models we've been discussing. In essence, it looks at word frequencies and their distributions within a document, and within a corpus, to organize the documents into folders of progressively similar word use.

You can download Overview to run on your own machine, but for our purposes, the hosted version at [https://www.overviewdocs.com/](https://www.overviewdocs.com/) is sufficient. Go to that page, watch the video, create an account, and then log in. (More help about how Overview works [may be found on their blog](https://blog.overviewdocs.com/), including helpful videos.)

Once you're inside, click 'import from a CSV file', and upload the CND.csv (which you can download and save to your own machine from [here](https://raw.githubusercontent.com/shawngraham/exercise/gh-pages/CND.csv) - right-click and save as).

On the 'UPLOAD A CSV FILE' page in Overview, click 'browse' and select CND.csv. It will give you a preview. There are a number of options here - you can tell it which words to ignore, and which to give added importance to. What will you select? Make a note in your notebook. Then hit 'upload'. A new page appears, called 'YOUR DOCUMENT SETS'. Click on the one you just uploaded. A file folder tree showing documents of progressively greater similarity will open; on the right hand side will be the list of documents within each folder (the document in question is greyed out when you click on it, so you know where you are). You can search for documents and see where they sit in the tree, and tag documents that you find interesting. The system allows you to jump between the distant, macroscopic view and the close, document-level view. Jump back and forth, and see what you find. For suggestions about how to use Overview effectively, try [their blog](https://blog.overviewdocs.com/).

Make notes on your process and what you observe. Also, if you export your tagged set from Overview, you could visualize the patterns of your tagging in a spreadsheet (for instance).

*Going further* Could you upload the materials you collected during Module 2?

----

## exercise 5
### Corpus Linguistics with AntConc

Heather Froehlich has put together an excellent step-by-step tutorial on using AntConc for exploring textual patterns within, and across, corpora of texts. Work your way through her [tutorial](http://hfroehli.ch/workshops/getting-started-with-antconc/). Can you get our example materials (from the Colonial Newspaper Database) into AntConc? [This might help you](http://www.themacroscope.org/?page_id=418) split the csv into individual txt files. Alternatively, do you have any materials of your own, already collected? Feed them into AntConc. What do you see? What happens if you compare them against other texts? FYI, [here is a collection of corpora to explore](http://www.helsinki.fi/varieng/CoRD/index.html).

-----

## exercise 6
### Text Analysis with Voyant

In an earlier module, you'll recall, we worked on transforming XML with stylesheets. Melodee Beals used a [stylesheet](https://github.com/mhbeals/Colonial-Newspaper-Database/master/Transformers) to transform her database; for the exercises above, the transformer was set up to make a single csv file. In this exercise, we're going to use [Voyant Tools](http://voyant-tools.org) to explore word use in the database. Voyant can read either the csv *or* individual text files; one advantage of uploading individual files is that, if they are named in chronological order, Voyant's default visualizations are also arranged in that order and can thus show change over time.

Go to [http://voyant-tools.org](http://voyant-tools.org). Paste in the URL for the CND database: [https://raw.githubusercontent.com/shawngraham/exercise/gh-pages/CND.csv](https://raw.githubusercontent.com/shawngraham/exercise/gh-pages/CND.csv). Now, open a new browser window and go to [http://voyant-tools.org/?corpus=colonial-newspapers&stopList=stop.en.taporware.txt](http://voyant-tools.org/?corpus=colonial-newspapers&stopList=stop.en.taporware.txt). What's the difference? In the latter, the articles have been uploaded individually, and are treated as being in chronological order.

Explore the corpus: compare terms over time, look at keywords in context, and use the RezoViz tool to create a graph in which people, places, and organizations that appear in the same document (and across documents) are connected (you will find 'rezoviz' under the cogwheel icon at the top of the panel). You can embed the tools in blogs by hitting 'save' and getting the iframe or embed code. You can apply 'stopwords' by clicking the cogwheel icon in the different tools and selecting a stopword list; apply stopwords globally, and you'll only have to do it once! What do the tools highlight? Which ones are useful? What strikes you as interesting? Note it all down. Then upload your own materials and explore them.

----

## exercise 7
### Quick Charts Using RAW

The ability to make a quick chart is a handy thing to have. Google spreadsheets, Microsoft Excel, and a host of other programs make charts quickly with their wizard functions; never hesitate to turn to these. However, they are not always good with non-numeric data. In module 3, we used NER to extract place names from a text. After some further munging with regex, we ended up with data that looks like [this](https://raw.githubusercontent.com/hist3907b-winter2015/module4-holes/master/texas.csv). How might we visualize this information? One useful tool is [RAW](http://raw.densitydesign.org). Open it in a new window. Copy the table of data of places mentioned in the Texan correspondence, and paste it into the data input screen.

#### We need to munge first

You should get an error message, to the effect that you need to check 'line 2'. What's gone wrong? RAW checked the values in that row and compared the number of columns against the first row (which contains the column names); it sees that the two don't match. We need to add a null value to the empty cells. So, go to [Google Sheets](https://www.google.ca/sheets/), hit the 'go to google sheets' button, then the big green plus sign to start a new sheet. Type the following into the top-left cell (cell A1): `=IMPORTDATA("https://raw.githubusercontent.com/hist3907b-winter2015/module4-holes/master/texas.csv")`

Pretty neat, eh? But here's the thing: even though the sheet _looks_ filled with information, it isn't (at least, as far as the script we're about to run is concerned). That is to say, the sheet doesn't hold the data itself; it's grabbing the info from elsewhere on the web and dynamically filling the sheet, and the script we're going to use works on static data (more or less). Put your cursor in cell B1. On a Mac, hit `shift+cmnd+downarrow`; on a Windows machine, `shift+ctrl+downarrow`. Then, on a Mac, `shift+cmnd+rightarrow`; on Windows, `shift+ctrl+rightarrow`. Copy the selection (`cmnd+c` or `ctrl+c`). Then, under 'Edit', select 'paste special' -> 'paste VALUES only'.

The formula you put in cell A1 now says `#REF!`. You can delete it now. This mucking about is necessary so that the add-on script we are about to run will work.

We now need to fill those empty values. In the toolbar, click `Add-ons` -> `Get add-ons`. Search for `blanks`; you want to add `Blank Detector`.

Now, click somewhere in your data. On a Mac, hit `cmnd+a`; on Windows, hit `ctrl+a`. This highlights all of your data. Click `Add-ons` -> `Blank Detector` -> `detect cells`. A dialogue panel will open on the right hand side of your screen. Click the button beside `set value` and type in `null`. Hit `run`. All of the blank cells will fill with the word `null`. Delete column A (which formerly had record numbers, but is now just filled with the word `null`; we don't need it). **If you get the error 'run exceeded maximum time'**, just hit the run button again. This script might take a few minutes.

You can now copy and paste your table of data into the data input box in RAW, and you should get the green thumbs up saying x records have been successfully parsed!
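By the way: if you're getting comfortable with R, the same munging can be done there. Here's a sketch - my suggestion, not part of the original workflow:

```
# fill blank cells with 'null' and drop the record-number column
texas <- read.csv("https://raw.githubusercontent.com/hist3907b-winter2015/module4-holes/master/texas.csv",
                  stringsAsFactors = FALSE)
texas[is.na(texas)] <- "null"   # empty cells read in as NA...
texas[texas == ""] <- "null"    # ...or as empty strings
texas <- texas[, -1]            # drop column A (record numbers)
write.csv(texas, "texas-clean.csv", row.names = FALSE)
# open texas-clean.csv and copy-paste the table into RAW
```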

#### Playing with RAW
RAW takes your data, and depending on your choices, passes it into chart templates built on the d3.js code library. D3.js is a powerful library for making all sorts of charts (including interactive ones). If this sort of thing interests you, you can follow the tutorials in [Elijah Meeks' excellent new book](http://manning.com/meeks/).

With your data pasted in, you can now experiment with a number of different visualizations. Try the 'alluvial' diagram. Pick place1 and place2 as your dimensions - you click and drag the green boxes under 'map your data' into the 'steps' box. Leave the 'size' box empty. Under 'customize your visualization' you can click inside the 'width' box to make the diagram wider and more legible.

Does anything jump out? Try place3 and place4. Try place1, place2, place3, and place4 in a single alluvial diagram. When we look at the original letters, we see that the writer often identified the town in which he was writing, and the town of the addressee. Why look at the third and fourth places, then? Perhaps it makes sense, for a given research question, to assume that with the pleasantries out of the way the writers will discuss the places important to their message. Experiment! This is one of the joys of working with data: experimenting to see how you can deform your materials to see them in a new light.

You can export your visualization under the 'download' box at the bottom of the RAW page - your choices are as a simple raster image (png), a vector image (svg) or a data representation (json).

-----
## exercise 8
### Simple Mapping and Georectifying

In this exercise, you will find a historical map online, upload a copy to a mapwarper service, georectify it, and then display the map online, via a hosted service like CartoDB, and also through a map you will build yourself using leaflet.js. Finally, we will also convert csv to geojson using http://togeojson.com/, and we'll map that as a GitHub gist. We'll also grab a geojson file hosted on a GitHub gist and import it into CartoDB.
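To give you a taste of what the csv-to-geojson conversion involves, here is a minimal sketch in base R. The places.csv file and its columns (name, lat, lon) are hypothetical - http://togeojson.com/ will do all of this for you:

```
# write a bare-bones geojson FeatureCollection from a table of points
# (places.csv with columns name, lat, lon is a hypothetical example)
places <- read.csv("places.csv", stringsAsFactors = FALSE)

features <- sapply(seq_len(nrow(places)), function(i) {
  sprintf('{"type":"Feature","properties":{"name":"%s"},"geometry":{"type":"Point","coordinates":[%f,%f]}}',
          places$name[i], places$lon[i], places$lat[i])   # geojson wants longitude first
})

writeLines(paste0('{"type":"FeatureCollection","features":[',
                  paste(features, collapse = ","), ']}'),
           "places.geojson")
```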

#### Georectifying
Georectifying is the process of taking an image (whether of a historical map, chart, airphoto, or whatever) and manipulating its geometry so that it matches a geographic projection. Think of it like this: you take your hand-drawn map, and use pushpins to pin known locations on your map down to a globe. As you pin, your image stretches and warps. Traditionally, this has not been an easy thing to do if you are new to GIS, but in recent years the learning curve has flattened significantly. In this exercise, we'll grab an image, upload it to the Harvard Library MapWarper service, and then export it as a tileset which can be used in other mapping programs.

1. Get a historical map. I like the Fire Insurance plans from the [Gatineau Valley Historical Society](http://www.gvhs.ca/research/maps-fire-insurance.html); I'm sure you can find others to suit your interests.
2. Right-click, save as.... grab a copy. Save it somewhere handy.
3. Go to [Harvard World MapWarp](http://warp.worldmap.harvard.edu/) and sign up for an account. Then login.
4. Go to the upload screen: 
![Imgur](http://i.imgur.com/bmNCzg6.png)
5. Fill in as much of the metadata as you can. Then select your map from your computer, and upload it.
6. On the next page, click 'rectify'.
![Imgur](http://i.imgur.com/yULDRQR.jpg)
7. Pan and zoom both maps until you're sure you're looking at the same area in both. Double-click in a map, select the pencil icon, and click on a point (location) you are sure you can match in the other window. Then click on the other map window, select the pencil, and then click on the same point. The 'add control point' button below and between both maps will light up. Click on this to confirm that this is a control point you want. Do this at least three times; the more times you can do it, the better the map warp.
8. Having selected your control points, click on 'warp image'.
9. You can now click on the 'export' panel, and get the URL for your georectified image in a few different formats. If you clicked on the KML option, a google map window will open [like so](https://maps.google.com/maps?q=http://warp.worldmap.harvard.edu/maps/4152.kml&output=classic&dg=feature). For many webmapping applications, the 'Tiles (Google/OSM scheme): Tiles Based URL' is what you want. You'll get a URL like this: `http://warp.worldmap.harvard.edu/maps/tile/4152/z/x/y.png`. Save that info; you'll need it later.

You have now georectified a map. Let's use that map as a base layer in [Palladio](http://palladio.designhumanities.org/#/). We need some place data for Palladio. Here's what I'm using:
![Imgur](http://i.imgur.com/vTEiRxh.png)
Note how I've formatted this data. I'll be copying and pasting it into Palladio. (For more on how to input geographic data into Palladio, see [this tutorial](http://hdlab.stanford.edu/doc/scenario-point-to-point.pdf)). Basically, you want something like this:
| Place      | Coordinates            |
|------------|------------------------|
| Mexico     | 23.634501,-102.552784  |
| California | 36.778261,-119.4179324 |
| Brazos     | 32.661389,-98.121667   |

...etc: that is, a tab between 'Place' and 'Coordinates' in the first line, a tab between 'Mexico' and the latitude, and a comma between latitude and longitude.

2. Go to [Palladio](http://palladio.designhumanities.org/). Hit 'start' then 'upload spreadsheet or csv'. In the box, paste in your data. **You can progress to the next step without having any real data: just paste or type something in - see the video below.** Obviously, you won't have any points on your map, but if you were having trouble with that step, this allows you to bypass it and continue on with the tutorial.
3. Click on 'map'. Under 'places', select 'coordinates'. Then, click 'add new layer'. In the popup, beside 'Choose one of Palladio default layers or create a new one.', select 'custom'. This is where you're going to paste in that tiles based URL from the map warper. Paste it in, but **replace** the `/z/x/y` part with `{z}/{x}/{y}`. Click add.

Here is a video walkthrough; places where you might have got into trouble include getting past the initial data entry box on Palladio, and finding where exactly to paste in your georectified map URL.

Congratulations! You've georectified a map, and used it as a base layer for a visualization of some point data. Here are some [notes on using a georectified map with the CartoDB service](https://gist.github.com/shawngraham/a49a9834984ae0792345).

![Imgur](http://i.imgur.com/0gCjh5X.jpg)

----

## exercise 9
### Text Analysis in R

I would suggest, before you try this, that you look at the walkthrough for exercise 3, and that you become familiar with R. Then, you can try [this tutorial](http://onepager.togaware.com/TextMiningO.pdf), starting at page 3. On that page, the author tells you to create a folder called /corpus/text, and to fill it with the text files you'd like to analyse. So why not grab some of the materials you collected in module 2? The problem is: where is this folder supposed to go? In RStudio, find out where your working directory is by typing

```
getwd()
```

in the console. Then, you can create the /corpus/text folder & subfolder at that location. Alternatively, you can set the working directory to wherever you like, like so:

```
setwd("C://my-working-folder//")   # on a pc
setwd("~/my-working-folder/")      # on a mac
```

Then, to get going, you'd need:

```
install.packages("tm")
library(tm)
```

You can then work through the entire pdf, or jump ahead to page 37 to see what the completed script would look like (here's my version, using the CND again). Make notes of what you find. Google any error messages you encounter to try to figure out a solution.
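To give you a sense of where the tutorial goes, here is a minimal sketch of the tm workflow, assuming you've filled corpus/text with txt files as described above:

```
library(tm)

# build a corpus from every file in corpus/text
docs <- Corpus(DirSource("corpus/text"))

# standard clean-up passes
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))

# document-term matrix: rows are documents, columns are words
dtm <- DocumentTermMatrix(docs)
findFreqTerms(dtm, lowfreq = 100)   # words appearing at least 100 times
```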

----

## exercise 10
### QGIS

There are many excellent tutorials around concerning how to get started with GIS. Our own library, in the MADGIC centre, has tremendous resources, and I would encourage you to speak with the map librarians before embarking on any serious mapping projects. In the short term, the historian Fred Gibbs has an excellent series on using the open-source GIS platform QGIS to make and map historical data.

For this exercise, I would recommend you try Gibbs' first tutorial,

'Making a map with QGIS'

...and then, try georectifying a historical map and adding it to your GIS:

'Using historical maps with QGIS'

#### Going Further

There are many tutorials at The Programming Historian that are appropriate here. Try some under the 'data manipulation' or 'distant reading' headings.

If you're into social media as a data source, you might try Twarc.