Gather round, friends, for I am about to relay a saga.
Melodramatic? Perhaps. The title of this post, after all, boils down to “how to install algorithms”, words that instantly put 90% of the adult human population to sleep. (Fellow graduate students in desperate need of some shut-eye, take note.)
This particular installation adventure, however, took multiple computers, multiple operating systems, 44 emails, and also a month. So.
The Short Version
I’m working with some sensitive data that, per IRB restrictions, can only be kept on a computer that is not and has never been connected to the internet. I need to analyze these data using R, Python, and spaCy algorithms, which are useful for things like tagging parts of speech. This requires putting a lot of stuff usually installed and configured via the internet on a laptop without the internet. Chaos ensues.
Why I Am Sharing This Very Specific Problem
While it is unlikely that you will find yourself working with this particular type of data using this particular configuration of software in this particularly irksome situation, there is a profound humility in recognizing how much we rely on the internet in our day-to-day analyses. To install an R package, which I imagine most of us do as needed without much thought, we connect to a web-based repository. To configure Python, we connect to the internet. To get Python to work with natural language processing (NLP) algorithms, we rely on web access to all of the relevant models, dependencies, and so on. So much of modern coding and data analysis assumes an internet connection, and we forget about all of the selection happening under the hood. It’s worth dissecting how that works.
Assumptions Before We Get Started
I assume facility with R in what follows. Most how-tos for spaCy, Python, and the like further assume some familiarity with bash or the command line. If I’ve already lost you, fear not. There’s no getting around this stuff, but it’s good to learn it!
tl;dr: I assume you could figure out how to do all of this on a computer with the internet if you needed to. Andrew Heiss has a very good tutorial on that.
I work on a Mac and so will assume Mac operating systems, binaries, etc. in my code. If you work on Windows or Linux, I will highlight where you’ll need to make changes.
To Be Clear, What We’re Doing
- We’re installing R and Python on a laptop that’s not connected to the internet.
- We’re setting up R packages on that laptop.
- We’re installing the spaCy natural language processing (NLP) software.
- We’re setting up the relevant language model so that we can actually do our analysis.
- We’re trying very hard not to pull our hair out.
NLP Meta: Why spaCy?
(I don’t care; skip this part.)
As mentioned, I’m using spaCy to identify parts of speech in a text corpus and count them. This helps us understand language usage across native and non-native speakers, as well as evaluate assumptions about how rhetorical constructions are linked to complexity of speech. There are other NLP algorithms besides spaCy that can do this, such as the Natural Language Toolkit (NLTK) or CoreNLP. So why use spaCy here?
A few reasons. First, spaCy is broadly useful. The algorithm is trained on a wide range of texts, rather than just newspaper texts as is common, and so we can have more confidence that it will correctly identify parts of speech (for example) in a variety of different types of texts. Second, it excels at word tokenization and parts of speech tagging, which were my interests here. (It’s less good at sentence tokenization; you’ll want NLTK for that.) Third, spaCy is object-oriented, so it plays more nicely with Python than string-handling systems like NLTK.
The unsexy reason, to be quite honest, is path dependence: I learned how to use spaCy first and stuck with it. In this case, that turned out to be an inadvertently wise choice.
Step 1: Setting up R and Python
The first thing you’ll need to do is download the installation file for R (and RStudio, if you use it) on an internet-connected machine, put it on a flash drive, and transfer it over to the non-connected laptop. If this seems straightforward and you’re wondering why this post is already 700 words long, well, buckle up.
Then, download the installation file for Python and transfer that over as well. This can potentially get complicated if you are working on a Mac, especially one that hasn’t been updated in a while and that you can’t update because you can’t connect it to the internet. Older versions of the Mac OS come with Python 2.7 pre-installed, which is a problem for numerous reasons:
- Python 2.7 will no longer be supported once we hit 2020.
- Installing a newer version of Python and forcing your Mac to run it via an alias is doable, but the OS will attempt to thwart you at every turn, periodically switch back to the old version without informing you until you try to do something and get an error message, etc. If you want to try it, run alias python='python3' (or 'python2', or whatever) in bash, with straight quotes and no spaces around the equals sign. (An alias only lasts for the current Terminal session unless you add it to ~/.bash_profile, which is one reason the old version keeps coming back.)
- Python 2.7 includes a number of packages that were installed via distutils. What those are doesn’t matter; what does is that if you try to uninstall distutils-installed packages so that you can run newer versions, you may break your entire OS. Best to leave them alone. More on dealing with distutils later.
To check what version of Python is installed (or whether you have one installed in the first place, if you’re not sure), open Terminal on a Mac and run python --version. (If you’ve never worked in Terminal before, congratulations: you’re learning bash! You run commands by hitting “return”.)
tl;dr: install Python, and be aware of what version you’re using; this will come up again in a bit.
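Since the version matters so much later on, it can help to list every interpreter the machine can see in one go. Here is a minimal sketch, assuming the usual command names python, python2, and python3; list_pythons is a name made up for illustration, not a standard tool.

```shell
# list_pythons is a hypothetical helper: it reports each common Python
# command name and the version behind it, or notes that it is missing.
list_pythons() {
    local candidate
    for candidate in python python2 python3; do
        if command -v "$candidate" >/dev/null 2>&1; then
            # Python 2 prints its version to stderr, hence the 2>&1.
            printf '%s -> %s\n' "$candidate" "$("$candidate" --version 2>&1)"
        else
            printf '%s -> not found\n' "$candidate"
        fi
    done
}

list_pythons
```

If more than one line comes back with a version, make a note of which command maps to which version; the wheels you download later have to match it.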
Step 2: Setting Up R Packages
To get R to talk to spaCy, you’ll need three packages: tidytext, cleanNLP, and reticulate. All of these have dependencies—other packages that they rely on in order to work properly. When you install an R package on an internet-connected machine, dependencies download automatically. When you don’t have an internet connection, however, you’ll need to download all of these manually. Ughhhhh.
Getting the package you actually want: You’ll need to download an archive of the package from the CRAN website. Here, for example, is the page for tidytext. For Mac, you’ll want the .tgz binary archive (the .tar.gz is the source version, which needs type="source" and may need compiling); for Windows, you’ll want a .zip archive. Do not unzip the archive! Transfer it over via flash drive and install like this (change to win.binary if you’re on Windows, and note that you might need to specify a more complete file path):
install.packages("tidytext_0.2.0.tgz", repos=NULL, type="mac.binary")
Then try to load the package with library(). If the package has dependencies which you haven’t installed yet, this will throw an error.
Getting dependencies: You can either keep installing dependencies one by one until you stop getting errors, OR (and please, please do this instead), you can use miniCRAN. miniCRAN is a package that will create a local repository for a package and all of its dependencies, which you can then transfer to your non-networked computer via flash drive. Here’s an extended tutorial on how to do this.
The most useful feature of miniCRAN, for me, is the ability to quickly see just how many dependencies a package has (as well as all of the dependencies of its dependencies). If a package only has a couple, it’s probably more efficient to download the archives by hand rather than wait for miniCRAN to compile them into a repository. If, however, you work in the tidyverse, and you were not previously aware that that package has 85 dependencies…well, now you know.
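For the curious, the miniCRAN workflow can be sketched roughly as follows. This is an illustration rather than the exact code used for this project: the mirror URL, the script name build_minicran_repo.R, and the folder name miniCRAN are all assumptions, and the R code is wrapped in a shell snippet only so that it can be saved as a file and run with Rscript on the internet-connected machine.

```shell
# Write a small R script (the file name is arbitrary) that builds a local
# repository holding the three packages plus everything they depend on.
cat > build_minicran_repo.R <<'EOF'
library(miniCRAN)

cran     <- c(CRAN = "https://cloud.r-project.org")  # any CRAN mirror works
pkgs     <- c("tidytext", "cleanNLP", "reticulate")
pkg_type <- "mac.binary"   # "win.binary" on Windows, "source" on Linux
repo_dir <- "miniCRAN"     # folder to copy onto the flash drive

# Resolve the full dependency tree and report how large it is.
all_pkgs <- pkgDep(pkgs, repos = cran, type = pkg_type, suggests = FALSE)
cat(length(all_pkgs), "packages needed\n")

# Download every archive into a CRAN-style folder structure.
dir.create(repo_dir, showWarnings = FALSE)
makeRepo(all_pkgs, path = repo_dir, repos = cran, type = pkg_type)
EOF

# On the internet-connected machine (needs R with miniCRAN installed):
# Rscript build_minicran_repo.R
```

Once the miniCRAN folder is on the non-connected laptop, something like install.packages(c("tidytext", "cleanNLP", "reticulate"), repos = paste0("file://", normalizePath("miniCRAN")), type = "mac.binary") should install everything from the local copy. Binary packages are built for a specific R version, so it is safest to run the script under the same version of R as the one on the non-connected machine.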
Step 3: Setting up spaCy
Are you sitting down? Okay, good. (This shaved years off of my life; you stop being dramatic.)
Before we talk about getting spaCy configured properly, we need to talk about how Python packages work. In order to do that, we need to talk about bash. (Skip this part if you already know how to install Python packages with pip in bash.)
If you’ve ever looked over someone’s shoulder while they’re coding and seen them working in a vaguely scary-looking, plain-text-ish window, that’s probably a version of bash. Bash (or the bash shell) is a command language that lets you issue commands directly to your operating system. It’s analogous to the command line in Windows (although you can now install the bash shell on Windows machines as well) and, on a Mac, works through Terminal (which you can access from the “Other” menu in the Launchpad). You’ll need either Terminal or the command line on a Windows machine for what we’re going to do next. I’ll refer to these as “bash” from now on for simplicity’s sake.
Python packages are installed via bash through a package management system. If you’ve used Python before, you’ve probably used pip, which comes with Python. Conda is an alternative, but we’ll work with pip from here on out.
The basic command in bash for installing a package on an internet-connected machine is (for spaCy; note that pip, unlike R, ignores case in package names, although bash itself is case-sensitive):
pip install spacy
This won’t work sans internet, however. Much like we needed .tgz or .zip archives to install R packages, we need the actual installation files for Python packages as well. These are called wheels, end in .whl, and are available for most packages on PyPI, the Python analogue to CRAN.
So, in a fantasy world where spaCy had no dependencies, this would work like so. The file name here is copied from PyPI and is for Python 2.7 on a Mac. Important to note here is that different wheels exist for different versions of Python. Download accordingly!
pip install spacy-2.0.18-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
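That monster of a file name is not random: wheel names follow a fixed pattern (package, version, Python tag, ABI tag, platform tag), and reading the tags is the quickest way to tell whether a wheel matches your machine. A small sketch of how to pull the tags apart; describe_wheel is a hypothetical helper written for this post, not part of pip, and it ignores the optional build tag that a few wheels carry.

```shell
# describe_wheel FILE: split a wheel file name into its tags by peeling
# hyphen-separated fields off the right-hand side.
describe_wheel() {
    local name="${1%.whl}"          # drop the .whl extension
    local platform="${name##*-}"    # last field: OS / architecture
    name="${name%-*}"
    local abi="${name##*-}"         # next field: ABI tag
    name="${name%-*}"
    local python="${name##*-}"      # next field: Python version tag
    name="${name%-*}"               # what remains: package and version
    printf 'name-version: %s\n' "$name"
    printf 'python tag:   %s\n' "$python"
    printf 'abi tag:      %s\n' "$abi"
    printf 'platform tag: %s\n' "$platform"
}

describe_wheel regex-2018.01.10-cp27-cp27m-macosx_10_11_intel.whl
# name-version: regex-2018.01.10
# python tag:   cp27
# abi tag:      cp27m
# platform tag: macosx_10_11_intel
```

Here cp27 means CPython 2.7, and macosx_10_11_intel means the wheel was built for OS X 10.11 or later. Both need to fit the non-connected machine.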
If you try to download a spaCy wheel, transfer it to a non-networked laptop, and install it via pip, the installation will fail because you haven’t installed spaCy’s dependencies yet. There are a lot! You’ll need wheels for the following:
- atomicwrites
- attrs
- certifi
- chardet
- cymem
- Cython
- cytoolz
- dill
- funcsigs
- idna
- mock
- more_itertools
- msgpack_numpy
- msgpack
- murmurhash
- nose
- numpy
- pathlib
- pathlib2
- pbr
- plac
- pluggy
- preshed
- py
- pytest
- regex
- requests
- scandir
- setuptools
- six
- thinc
- toolz
- tqdm
- ujson
- urllib3
- wrapt
Even more annoyingly, you will need the correct versions of all of these wheels for your particular operating system and version of Python. If your non-connected computer is running Python 3.7 and an updated OS, you’re probably in the clear. But if not, you’re going to need to generate some wheels yourself.
Yes indeed: if you have the internet, you can create your own installation files for Python packages. This is pretty cool. It becomes measurably less cool once you realize that wheels you make yourself will be specific to whatever version of Windows, Mac OS X, or Linux that you’re running. So, for instance, I can create a wheel for regex, a Python package you need in order to run spaCy, in bash like this. (I knew to specifically ask for the 2018.01.10 version of regex because the error message I got when I tried to install regex on the non-connected computer told me so. pip is helpful like that.) If I were running multiple versions of Python and wanted to create a wheel specifically for Python 2.7, I would run the same command but with pip2.
pip wheel regex==2018.01.10
However, because I am running OS X Mojave, this wheel will only work on computers running Mojave or any future versions of OS X. If, entirely hypothetically, my non-connected computer is running, say, OS X Sierra, which is several versions out of date now, my Mojave-generated wheel won’t install properly. And if, entirely hypothetically, this happens to you and you want to throw your laptop out the window, you have my tacit support.
Actual solutions:
- Create a VirtualBox environment that will allow you to temporarily run multiple versions of the same operating system on an internet-connected computer, then generate the wheels you need in the correct OS.
- If you’re familiar with TravisCI, you can use it to generate a wheel and specify osx_image=whateverversionofXCoderunsontheOSthatyouneed (for Mac).
- If all of that makes you want to leave academia and go live in a hut in the woods, you can do what I did: find a friend who can either do these things or has an old laptop lying around, and bribe them with cookies to make you wheels.
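One more route that may be worth trying before building anything: pip can download prebuilt wheels for a machine other than the one it is running on. This is not what was done for this project. It is a sketch that assumes a reasonably recent pip on the internet-connected machine (one with the download subcommand and its --platform, --python-version, --implementation, and --abi options) and assumes that every dependency actually has a matching wheel on PyPI, which is not guaranteed for older operating systems. fetch_wheels and the ./wheels folder are hypothetical names.

```shell
# fetch_wheels DEST PLATFORM PYVER ABI  (hypothetical helper)
# Downloads spaCy 2.0.18 and its dependencies as wheels built for the
# target machine rather than for the machine running the command.
fetch_wheels() {
    pip download "spacy==2.0.18" \
        --dest "$1" \
        --platform "$2" \
        --python-version "$3" \
        --implementation cp \
        --abi "$4" \
        --only-binary=:all:
}

# Example call for CPython 2.7 on an older Mac (needs internet, so it is
# left commented out here):
# fetch_wheels ./wheels macosx_10_6_intel 27 cp27m
```

The --only-binary=:all: flag is required whenever a target platform is specified, so pip will stop with an error if some dependency has no suitable wheel. Those are the packages you would still need to build by one of the routes above.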
Installing the wheels, once you have them, will take some trial and error in terms of figuring out which are dependent on each other and so must be installed first, etc.
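Some of that trial and error can likely be handed to pip itself. If all of the wheels sit in one folder, pointing pip at that folder lets it work out the install order on its own. A minimal sketch, assuming the wheels were copied into a folder called ./wheels; both the folder name and install_offline are made up for illustration.

```shell
# install_offline DIR PACKAGE  (hypothetical helper)
#   --no-index    never try to contact PyPI
#   --find-links  look for wheels in DIR instead
install_offline() {
    pip install --no-index --find-links "$1" "$2"
}

# Example call on the non-connected laptop (commented out here):
# install_offline ./wheels spacy
```

If a wheel is missing or built for the wrong Python version, pip stops and names the requirement it could not satisfy, which at least tells you what to go fetch next.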
An aside: Superuser privileges
You may need “superuser” privileges in order to install some (or all) of these packages. To invoke these, add “sudo” in front of any command you run; for example:
sudo pip install spacy-2.0.18-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
This will prompt you to enter your password. I ended up running all “pip install” commands with “sudo” in front just to be safe.
An aside: Distutils on Macs
As mentioned above, some Python packages were installed via distutils: they come pre-installed on older Macs and so may not be the version you need. Uninstalling them is a very, very bad idea: because Python is part of your operating system on these older Macs, trying to remove distutils packages could break your entire OS. Luckily, for the first time in this entire tutorial, something is easy: tell pip to install the new version without uninstalling the old one.
sudo pip install regex-2018.01.10-cp27-cp27m-macosx_10_11_intel.whl --ignore-installed
That’s it. Well, not really, because if you plan on running Python in bash in the future (and you might need to, in order to finish this tutorial, if you run into trouble), you’ll want to initialize Python when you do so like this. Otherwise, Python will prioritize older versions of packages.
PYTHONPATH=/Library/Python/2.7/site-packages python
If you’re using a Mac, things might get worse: or, what to do if you’re getting errors that pip isn’t installed
So, here’s a story. Let’s say you’re running an older version of OS X. Let’s say it happens to be Sierra, and let’s say Sierra happens to come without pip installed. It’s supposed to have an alternative, called easy_install, but let’s suppose your computer is a little bitch (because computers tend to be little bitches) and thinks that easy_install has, in fact, not lived up to its name and is not installed.
Fortunately, this story has a happy if multi-part ending. You’ll need three .zip or .tar.gz archives: one for setuptools, one for wheel (that’s a package; I did not forget an article), and one for pip. For ease of access, the Mac versions are, respectively, here, here, and here.
Once you’ve downloaded and transferred those over, run the following in bash:
unzip setuptools-40.7.0.zip
cd setuptools-40.7.0
python ./setup.py install
cd ..
tar -xzf wheel-0.32.3.tar.gz
cd wheel-0.32.3
python ./setup.py install
cd ..
tar -xzf pip-19.0.1.tar.gz
cd pip-19.0.1
python ./setup.py install
THEN, finally, you should be able to use pip to install packages.
Step 4: Setting Up the En Model
Once you have spaCy installed, you’re ready for the final step before we can return to the blessedly straightforward land of R: installing the correct language model.
spaCy works in a number of languages. When you run spaCy on an internet-connected machine, you install the relevant language model in bash. To use spaCy to analyze English-language texts, this looks like so. You can find a list of all spaCy models here.
python -m spacy download en
Of course, this is all more complicated sans internet. Like Python packages, language models have wheels. They also have dependencies. All of these need to be compatible with whatever version of spaCy you’re running. This is deeply irritating.
Finding language model wheels online is not straightforward, almost as though no one thought you would need to employ an internet-based algorithm without the internet. For basic English-language parts of speech tagging, I needed the en_core_web_sm model. Here’s a GitHub repository. Alternatively, here’s a Dropbox link with the en_core_web_sm model and dependencies. Install these as you would any Python wheels.
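Concretely, a model archive installs with the same kind of pip command as any other package, and a one-line load is a quick way to confirm that spaCy can actually see it. A sketch, assuming the archive and any dependency wheels have been copied onto the laptop; the helper names are made up, and the archive name in the example is a placeholder for whichever release matches your version of spaCy.

```shell
# install_model ARCHIVE [WHEEL_DIR]  (hypothetical helper)
# Installs a downloaded model archive without contacting PyPI, looking
# in WHEEL_DIR (default: current folder) for anything else it needs.
install_model() {
    pip install --no-index --find-links "${2:-.}" "$1"
}

# check_model NAME  (hypothetical helper)
# Loads the model in Python and prints the version from its metadata.
check_model() {
    python -c "import spacy; nlp = spacy.load('$1'); print(nlp.meta['version'])"
}

# Example calls (commented out; the archive name is a placeholder):
# install_model en_core_web_sm-X.Y.Z.tar.gz ./wheels
# check_model en_core_web_sm
```

If check_model prints a version number, the model is installed and compatible enough to load. If it raises an error, the message usually says whether the model is missing or was built for a different spaCy version.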
Last Step: Getting spaCy to talk to R
You’ve set up spaCy. You’ve installed all of the R packages you need. Now, the final challenge is to get R and spaCy to communicate.
Recall the three packages you’ll need to do this: tidytext, cleanNLP, and reticulate. Load them now, and run the following code. This will tell R where spaCy’s installed and which language model to use. (Note that if you were doing this on an internet-connected machine, you could leave out the model_name argument, and R would find this automatically.)
library(tidytext)
library(cleanNLP)
library(reticulate)
# This points R toward Python 3.7; required=FALSE lets reticulate fall back on another version if it can't find 3.7.
use_python("/usr/local/bin/python3", required=FALSE)
cnlp_init_spacy(model_name="en_core_web_sm")
And that should do it! R is all ready to go for parts of speech tagging and anything else you might want to do with spaCy. (I linked Andrew Heiss’ tutorial above; here it is again.)
Takeaways
Now that we’ve celebrated our coding prowess and taken a few shots of whiskey (please lament your life choices responsibly), a few lessons learned, both practical and more meta.
The practical: Without the internet, compatibility between one’s OS, software versions, and package versions constitutes a surmountable but challenging hurdle. I came into this project in the middle as the data monkey and so had no input in the IRB protocols or in what non-networked laptop would be purchased—but should you anticipate working with sensitive data under similar circumstances, the following might help you from the get-go:
- See if you can set everything up on a computer before sensitive data get anywhere near it. The first time I used spaCy on my personal laptop, I was much newer to Python and pretty unsure of what I was doing, and it still only took a couple of hours to get everything set up. Not a month. Ergo.
- Contingent upon funding, try to get a non-connected computer running the same OS as your personal computer. That way, if you end up needing to generate wheels yourself, you can do so knowing they’ll be compatible with your non-connected machine.
The meta: It’s difficult to overstate the degree to which social scientists who do quantitative and/or computational work rely on the internet. Software is built with a Wi-Fi connection in mind—a fair assumption in almost all scenarios, and one I did not really solve here in that I still needed an internet-connected laptop to obtain all of the software that I had to put on the non-connected machine. This certainly streamlines our workflow—see above comparison of two hours vs. a month to do the same task—but it also means we aren’t forced to reckon, on a day-to-day basis, with just how complex the technologies with which we’re engaging have become. This goes beyond knowing what modeling decisions are being made for us under the hood to the actual architecture of our software: how its parts speak to each other and how they come together to produce a package or algorithm that we then have to decide how to employ. From a purely appreciative perspective, it’s worth it, in my mind, to occasionally look up how many dependencies a package has, or to learn how to install a model by hand. And it’s worth it to consider how these structural features both enable us to pursue more powerful computations and constrain the ways in which we are able to do so.
My thanks to Alex Tahk for working through all of this with me.