yobiduck

Tuesday, May 11, 2021

Reviews of Ancient Systems Papers: "Distributed Operating Systems" by Tanenbaum and Van Renesse, 1985.

I've been reading some of the papers that were assigned reading in MIT class 6.033 Computer Systems Engineering, both when I took it in about 1984 and when I helped teach it, as a teaching assistant, in about 1992. This class is aimed at MIT seniors, and involves a large writing component. I think there is a lot of value in those old papers. Some of it hasn't aged very well, however. Here I'll review some of those old papers, 37 years after I originally read most of them.

I'll start this series (note the optimism! A series! So far, containing one element) by reviewing item number 28 1991 6.033 Reading List is Tanembaum and Van Renese, "Distributed Operating Systems"Tanenbaum and Van Renesse, "Distributed Operating Systems", Computing Surveys,17:4, December 1985.

The authors are interested in "distributed operating systems". On such a system, the users can be largely unaware "of the fact that multiple machines are being used." As far as I know, such such operating system has gained traction. We are still using "network operating systems" today. I suspect that's for good reason: failures in distributed systems are too hard to hide.

Keep in mind that this paper was written in 1985, before Paxos (or Viewstamp replication) was invented. So the problem of coping with errors was wide open. No one knew how to make a distributed transaction that wouldn't get stuck. In 1988, Lynch, Dwork and Stockmeyer showed that consensus was possible, in theory, in distributed system. In 1988 Oki and Liskov described Viewstamp replication, and in 1989, Lamport described Paxos, both of which can be used to handle errors. In some sense, those consensus algorithms, which I'll call "Paxos" here, are the beginning of modern distributed systems: I might even argue that every correct distributed algorithm is built starting with Paxos. Feel free to prove me wrong with examples!

This paper, in Section 2.1.3, "error handling", alludes to the problem. It points out that "in a centralized system, a system crash means that the client, server, and communication channel are all completely destroyed, and no attempt is made to revive them. In a distributed system, matters are more complex." They go on to write about how if one machine fails, the other may end up waiting forever, and then if it has a timeout, it may timeout just when the failing machine comes back online. "Simply waiting until the server is rebooted and trying again sometimes works and sometimes does not". They conclude that doing an operation exactly once "is probably impossible". Today we know better.

Their section on scheduling feels out-of-date. These days we know how much more about scheduling.

They spend a lot of time on the details of remote procedure calls, naming, and protection. As far as I can tell, these issues haven't changed much in 37 years.

When this paper was written, Sun's NFS was brand new. NFS doesn't handle failures very well, and sadly it's almost still the state of the art for filesystems. I don't know any distributed filesystem protocols that actually work in the face of failures. (There are plenty of distributed filesystems built on top of Paxos, such as Amazon's EFS and Oracle's File Storage Service. But they all talk NFS to the clients, and it's really tough to write applications that cope with NFS failures. I find the paper's discussion of how to build distributed filesystems to be a painful: they didn't really get at the difficult issues of distributed filesystems. The problem isn't the filesystem (at least not when you have Paxos), the problem is that the programming interface that everyone is using just doesn't work well in a distributed environment. For example, Unix open(), read(), write(), fsync(), and close() functionality is an unsolved problem. What's needed is an API that both permits caching (so that the filesystem will be fast enough) and understands failures (so the application can cope with things like network partitions).

They conclude that distributed operating systems "will be an interesting and fruitful area of research for a number of years to come". It seems like that may still be true.

Sunday, October 11, 2020

While cleaning up I found this manual for Illustrate, which was implemented by Dawn Clarke and Dinarte Morais on Lisp machines, and modeled on the Xerox Alto draw program. The user interface was nice compared to most modern tools. One could do many operations with the keyboard or if you did need the mouse, you didn't need many clicks, so you could draw things very quickly.

Dawn M. Clarke and Dinarte R. Morais. Illustrate Reference Manual, Tanglewood Design Note 12, 10 January 1985. MIT Lab for Computer Science.

Tuesday, February 25, 2014

Trying to build gcc cilk from sources following the instructions at https://www.cilkplus.org/download. This step repeatedly fails since the gnu.org git server is slow or something:

git clone http://gcc.gnu.org/git/gcc.git cilkplus-gcc

Here's how to incrementally do a giant git clone from a slow server.

git init

git add .
git remote add cilkplus-gcc http://gcc.gnu.org/git/gcc.git
git pull http://gcc.gnu.org/git/gcc.git

It's painful, but with enough git pull operations, I seem to be able to get everything. Thanks to http://superuser.com/questions/154647/how-to-continue-cloning-a-git-repository-that-the-download-was-stopped

Note: the git add . gave an error, but I don't know if that meant it was a no-op.

Thursday, October 17, 2013

I am general chair for Charles E. Leiserson's 60th-Birthday Symposium and Party. It's a symposium (for the techs) and it's a party (for everyone). Hope to see you there.

Monday, September 17, 2012

XLDB Tutorial

Michael Bender and I presented a tutorial on Data Structures and Algorithms for Big Databases at XLDB 2012.

Thursday, May 31, 2012

How to digitally sign a pdf file using free software in GNU/Linux

I recently needed to sign a pdf document for legal purposes. The other party was willing to accept a digital signature. That does not mean copying an picture onto a pdf document. It's using public key crypto to sign the pdf document. Acrobat can do it, but I use free software, so, after about 30 minutes I figured out how to do it.

I extracted my csail certificate from my browser. In firefox
- edit->preferences->advanced->encryption->view certificates
- select the certificate
- then hit backup
- save it as a pkcs12 file
- you'll be asked for a password for the backup. You'll use this later.
I got jsignpdf-1.3.0 and installed it
- $ unzip JSignPdf-1.3.0.zip
- $ cd jsignpdf-1.3.0
I ran it: $ java -jar JSignPdf.jar
- A simple window popped up with some forms to fill in.
  - keystore type: PKCS12
  - keystore file: choose the backup made of the certificate
  - keystore password: the password you used
  - input file: the pdf file to sign
  - output file: I chose another name to avoid overwriting something useful
  - I clicked the checkbox on "visible signature"
  - Went to the "settings" box next to 'visible signature"
  - I chose the page where the signature needed to be placed, and I chose coordinates. I found the coordinates by running gv input.pdf since ghostview shows the coordinates in the upper left corner. For example, I put my signature on page 7 and put signature at coordinates
    - 147
    - 482
    - 369
    - 519
  - I chose Display: "Signature name and decsription"
  - Hit "close"
- Then "sign it" on the original jpdfsign window.

Monday, December 6, 2010

my webpages are mostly back

After being down for nearly a year my web content at bradley.csail.mit.edu is mostly back up at http://people.csail.mit.edu/bradley/.

There remain a few things missing...