Thursday, 23 July 2020

Game of Thrones

Beyond the pandemic, this has been a period of national dialogue about bias.  In that context, I thought it might be appropriate to reflect on assessment and bias in computer science.  The topic is a large one, but blog postings need to be reasonably short, so for today I want to highlight a narrower question.

Assessment is fundamental to a research career: work is constantly assessed in various ways.  We participate in the process, and are ourselves assessed, and careers thrive or fall on it.  As a PhD student I helped review some papers for a few conferences, but as a new faculty member I was surprised to receive a blizzard of requests: paper reviews, proposal reviews, tenure and other promotion reviews, award reviews, reviews of entire departments and research centers.  Over my career, I've led reviews of portions of the National Science Foundation and other equally large agencies.

The very idea of ranking the quality of work, or of individuals, or of departments poses fundamental questions.  One can assess the expected mileage for a car without too much debate.  But the subjective nature of assessment in computer research is the puzzle: suppose that you have an idea that promises big payoffs but departs from standard ways of building cloud computing platforms.  I'm asked to assess you for a promotion, or your group for research funding, or your department as a whole.  Should I applaud your audacity and risk?  Or should I criticize you for going way out on a limb?  Will the cloud computing vendors adopt your methods, and if they do, will you get explicit recognition?  

How impactful is your work?  Universities rarely pursue patents, because a series of court cases led to a recognition that software patents only protect specific instances of programs: the mathematical content cannot be patented.  As a result, when we invent ideas in academic settings, and then publish papers on them and release them as open source software, we often abandon the possibility of seeking a patent.  Companies are free to read those papers and may be influenced (they might even use the software), yet they are under no obligation to cite that prior work in any public way.  In fact, because companies do protect their products with patents, and patent law treats prior-work citations according to a different model than we use in research, ideas can be highly impactful and yet the inventor may never get proper recognition. 

Rather than focusing on promotion and tenure assessments, or even program committees for conferences, let's drill down on the question in a less emotionally charged context: rankings of entire departments or universities.  

One can actually trace the concept of ranking departments to colleagues of mine here at Cornell: early in the development of computing as a field, people like Juris Hartmanis and David Gries were heavily involved in the creation of organizations like the Association for Computing Machinery and the Computing Research Association (CRA).  The CRA was explicitly founded to aggregate data on topics like salaries and workloads, and over time, this expanded to include an early role in assessing quality and ranking departments.

This was "early days" for the field, and the real goal of these rankings was to promote the idea that national research funding should be spread over a broader and more diverse set of institutions.  The need for self-advocacy drove the process as a whole: unlike established disciplines such as physics or mechanical and aerospace engineering, computer science had few voices in Washington back then, and was not winning much of the national research investment funding stream.  One can trace this to the evolution of computer science from mathematics on the one hand, and electrical engineering on the other.  Thus, we saw support for theoretical computer science that emerged as an outgrowth of the mathematics arms of organizations like NSF, AFOSR and ONR, all of which had long track records of supporting the mathematical and theoretical sciences, side by side with support for computer engineering from corporate partnerships and the military.  My own work is in computer systems: an engineering discipline.  For my area, the defining event occurred back in 1978 when DARPA emerged as a funding leader, soon followed by AFRL and ARO.  In contrast to the theory side of the field, all of these focused on concrete "deliverables": useful artifacts, like the original BSD Unix extensions, or the Internet.

Even these early steps were consequential.  When I joined Cornell, the department was widely seen as a leader in theory but weak in systems: precisely the message conveyed by those national assessments I just listed.  As a young researcher, I applied for NSF funding from a program that nominally supported work of all kinds.  Foolish me: the NSF program officer called for a quiet chat to explain that although my proposal was ranked high, the NSF program only wanted to support research with a predominantly theoretical focus.  We negotiated a strange compromise: NSF would fund the theoretical aspects of my work, and didn't wish to be cited on papers that were entirely focused on software systems!  Obviously, this was a quirky situation centered in many ways on that one program officer, but I find it quite telling that as a new researcher, I was already being warned that my work would be judged in one of two very distinct ways.  Much later, I was asked to chair an external assessment of NSF itself, and was somehow not surprised to discover that this same bias towards mathematics was still very evident, although I didn't hear any stories of phone calls like the one I had received!

Meanwhile, though, Jim Gray did me a favor.  Shortly after that call, I was contacted by someone at DARPA who was creating a new program in software fault-tolerance, and had spoken to Jim.  They were excited about the idea of software tools for fault-tolerance and requested a short proposal focused on software deliverables.  The program director emphasized that DARPA didn't want to see incremental work: he was ready to spend a lot but wanted to "buy" a disruptive new idea, the kind of advance that could create entire new industries.   And he explicitly said that Jim had suggested a few people, but he was limiting himself to the very top departments.

When I took this job, it impressed me that Cornell was ranked among the top five departments (at that time my colleagues griped endlessly that we should be even higher ranked!), but I didn't really appreciate the degree to which this would be beneficial.  Systems research is expensive, and without that funding, it would have been very hard to succeed.  In retrospect, it is absolutely certain that Cornell's broader reputation was a key reason that people invested in my vision.

Assessments and rankings play a central role at every step of a research career.  The particular NSF program officer I spoke with made it clear that he was only prepared to support my work because of Cornell's high repute in theory: he viewed it as a bet that "for a change" systems research might have a robust theoretical basis if I engaged in a meaningful way with my colleagues.  As it turned out, this worked well because it created an obligation: in effect, I agreed to set aside a part of my effort to really pursue the theory underpinning state machine replication and fault-tolerant consistency, and this in turn shaped the software systems I built.  

In effect, NSF made a bet on my group, and DARPA did too, based very much on reputation -- but of course at the start, I myself had no real reputation at all.  This was a bet on Jim Gray's reputation, and on Cornell's reputation.  And they did so with very distinct goals: for DARPA, this was about creating a substantial software group and impacting a real and important user community.  Meanwhile, NSF was keen to see papers in top conferences and journals, and wanted to see that those papers were being cited and triggering follow-on work by a broad crowd: the NSF officer wanted the distributed systems theory community to think more about relevant impact and to see examples of theoretically rigorous work that had real-world value.  Thus both were betting on a form of disruption... but in different ways, and both were basing that bet heavily on Cornell's ranking: they both believed that Cornell offered a high enough podium to amplify the potential disruptive impact.  That it happened to be me who benefited from this was somewhat incidental to both organizations, although it obviously worked in my favor!

As this played out, DARPA kept pushing to move faster.  They urged me to spin off companies (I ultimately did so a few times... one was quite successful, the others, less so).  There was even some discussion of them investing venture capital, although in the end holding stock in a startup was deemed to be outside of DARPA's authorized scope.   But even that one successful company opened doors: my group was invited to design the core consistency and resiliency mechanisms of the French air traffic control system, the New York Stock Exchange, the US Navy AEGIS, and many other real systems.  How many PhD students can say that as their PhD topic, they redesigned the floor trading system of the NYSE?  One of mine did!  And those successes in turn brought contacts with development teams at many companies facing a next set of challenges, which became follow-on research projects.  Along the way, I was also promoted to tenure, which wasn't such an easy thing for a systems researcher at Cornell back then (less of an issue today).  So you can look at this and recognize that, in a direct way, assessments of the department, and the expectations I imposed on myself when defining "milestones," shaped and guided my career.

Thus, that original CRA ranking actually had a huge impact that really shaped my career.  The CRA ranking was just one of many.  US News and World Report quickly emerged through its special issues that rank colleges and universities, including research specializations such as computing.   Back in 2010 there was a National Research Council ranking (produced by a group within the National Academy of Sciences).  The NRC set out to put rankings on a firmer statistical footing, but then seems to have left those rankings untouched since releasing them a decade ago.   Of late, one sees a new commercial ranking site more and more often: the "QS ranking of top research universities".  

Should we trust a ranking such as the one by US News and World Report, or the QS one?  As it happens, US News and World Report is relatively transparent about its process, or at least was transparent about it when I inquired back in the late 1990's.  At that time, their ranking of computer science PhD research programs had emerged as the winner; people need to cite some sort of consensus ranking in various situations, and this was the ranking most of us would point to.  Relatively few, however, realized that the entire US News and World Report ranking centers on a single number provided by each of two people in each institution they track.  (QS seems to be similar, but doesn't disclose details of the way they arrive at their ranking, so we won't say more about that one).

The way it works is this: every three years (the next cycle should start soon, with the results published in spring 2021), the company sends a questionnaire that collects various data about research expenditures, program sizes, and a few other metrics, but also includes a little fill-in-the-bubble sheet, listing 500 programs that grant PhD or MS/MEng degrees.  500 is a large number, particularly because all of these programs are located in the United States (perhaps also Canada).  As a result, the survey also reaches out to a great many schools that don't do research and do not grant PhD degrees.  The relevance of this will be clear in a moment.

All of this data is "used" but not all of it contributes to the ranking.  US News and World Report does summarize statistics, but the actual ranking depends only on this second bubble-sheet and the 1000 or so people who are asked to complete it.  The bubble-form is sent to two people at each graduate-degree granting school, one copy to the department chair or dean, and one to the director of the graduate program.  These two people rate each program by its research strength: from exceptional to poor, five options.  Presumably, many don't respond, but the publication takes the forms they do receive, computes average scores for each department, and voila!  An instant ranking.

There is an aspect of this approach that I have to admire for its frankness.  As you might expect, it is terribly hard to devise a formula that would properly weight research funding, papers, impact, program size, coverage of hot areas, etc.  Even so, CRA believed this could be done, and the early CRA ranking centered on a somewhat secretive formula that was created by a few national research leaders, who in effect locked in their own intuition about the proper ranking.  Later, this mysterious formula came under scrutiny, and when the NRC decided to create a new and "better" ranking, they concluded that no single formula could properly address the diversity of students and student goals.  Accordingly, they collected all sorts of metrics, but their web site asked each visitor to provide their own weights for each element.  This makes a lot of sense: there is no obvious way to come up with standardized weights that would somehow reflect all of our different value models and objectives.  Yet it leads to a very unstable ranking: the CRA ranking was quite stable for decades (a topic that led to criticism by schools not in the top few!), but when using the NRC version, even tiny changes in weights led to very different rankings.  Moreover, it is quite easy to end up with a ranking that produces nonsensical results!  For example, it seems natural that funding levels should be a factor in rankings, yet this ignores the huge research centers that some schools have attracted in specialty areas -- centers that they are permitted to include in their funding summaries.  Thus any weight at all on funding will hugely promote the standings of those schools.  Yet this overlooks the actual substantive quality and nature of those centers, some of which do classified work or are very narrow in other senses.  If a center doesn't contribute to the academic profile of a department, reflecting it into a funding metric is a questionable choice.  
And again, this is just one of many examples I could give.  The trouble with metrics is that they are often biased in ways that one might not have anticipated.
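The weight-sensitivity problem described above is easy to demonstrate with a toy calculation.  The departments, metrics, and numbers below are entirely invented (this is not NRC data); the point is only that a modest shift in weights can reverse the order of two departments with different profiles.

```python
# Toy illustration: a metrics-based ranking flips under a small weight change.
# Each department has three normalized (0-1) metrics: funding, papers, citations.
# "Dept X" has a huge funded center; "Dept Y" is stronger on papers and citations.

metrics = {
    "Dept X": (0.9, 0.5, 0.5),
    "Dept Y": (0.3, 0.8, 0.8),
}

def score(m, weights):
    """Weighted sum of a department's metrics."""
    return sum(v * w for v, w in zip(m, weights))

# Two nearby weight vectors: (funding, papers, citations)
for weights in [(0.30, 0.35, 0.35), (0.40, 0.30, 0.30)]:
    ranking = sorted(metrics, key=lambda d: score(metrics[d], weights),
                     reverse=True)
    print(weights, "->", ranking)
```

Shifting just a tenth of the weight onto funding moves Dept X from second place to first, which is exactly the instability the NRC's choose-your-own-weights site exhibited, and exactly how an included research center can dominate the outcome.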

Yet even this statement expresses a form of biased expectations: like any researcher, I have a pretty good idea of the top programs and groups in my own research area, and hence a strong sense of what the ranking should look like both in systems and more broadly, for computer science departments as a whole.  For example, Cornell is quite low in the most recent (2018) ranking of systems research departments in US News and World Report.  This really bugs me because we have a superb group, and I know perfectly well that we should be one of the top five or six in any sane ranking.  Of course, I can remind myself that reducing the ranking to a 1-5 bubble-score vote polled from 1000 or so computer science academics is terribly simplistic.  For me, the ranking that matters would be one you could get by polling my peers, who are concentrated at a much smaller set of schools: perhaps 25 here in the United States, and another 25 globally.  US News and World Report polled department chairs and directors, and included 475 schools that don't conduct research in systems at all.  Yet it still rankles!

On the other hand, there is also an argument to be made that US News and World Report has hit on a brilliant and "robust" approach, at least for overall rankings of departments as a whole.  Consider this: by asking the chair of a computer science department and the director of a graduate program to distill their impressions into a single number, the US News and World Report treats these individuals as a kind of biological neural network, which their system "samples."  Granted, the basis of their individual responses may be hard to quantify, yet if one accepts that as professionals in the field those impressions have meaning, then one has to accept that the resulting ranking is based on... something.  

But what, in fact, is this basis?  The survey doesn't include any guidance.  Respondents can, of course, visit department web sites -- but how many do so?  My guess is that most just zip down the page filling in bubbles based on vague impressions, skipping the schools they have never heard of.

For example, what is the first thing that comes to mind when you think about the University of Illinois (UIUC)?  Presumably, you'll note that the department is renowned for its work on high performance computing hardware, but of course that really isn't a computer science topic.  Then you might point to the Mosaic web browser: back when CERN came up with the idea of turning the Internet into a World Wide Web, UIUC created the first tools to make this a reality.  Ask 1000 professional academics about the history of the web, and I would bet that quite a few of them do know this aspect of history -- enough to bias the ranking substantially.  It isn't an accident that UIUC soared in the subsequent US News and World Report ranking, and stayed there.  

Game of Thrones focused on individuals, grudges, and nasty back-stabbing behavior.  Yet I've written almost entirely about assessment through a crowd-sourcing approach that really eliminates most opportunities for individual bias to torque the outcome.  Ironically, I was very critical of the US News and World Report rankings for years: to me, they rank but do not "assess" in any meaningful sense.  But what I see today is that any other approach to ranking would almost certainly be worse, and much more at risk of manipulation.  As for assessment... I honestly have no idea whether what we do is valid, even after participating in the system for nearly 40 years.

This thought invites the next: to ask whether computer systems has endured its own Game of Thrones, played out through paper reviews and promotion letters.  The area certainly has had its share of colorful disputes, and anger doesn't lend itself to fairness.  But what has helped, a lot, is that more than two decades ago the systems area leadership began to take concrete actions aimed at supporting existing researchers while also educating newcomers.  We have a workshop specifically focused on diversity, there are more and more anti-bias mechanisms in place, and we talk openly about the harm bias can cause and the importance of active measures, not just lip-service.  To me it is obvious that this has helped.  The field is far more diverse today, more welcoming, and even paper reviews have evolved to focus more on constructive feedback and less on destructive negativity.  We have a long path to follow, but we're moving in a good direction.

Let me end by sharing a story about Jim Gray, who passed away in 2012.  Jim used to go out of his way to engage with every new systems person he could find.  I mentioned one example of how he helped me above, but in fact this was just one of many times he helped in my career.  In fact, I thought of him as an advisor and mentor from the very first time I met him, not long after I started my graduate studies at Berkeley: he urged me to work on fault-tolerance and consistency questions "outside of database settings", and that advice ultimately shaped everything!  The only thing he ever asked in return was that those of us he helped "pay it forward," by helping others when the opportunity arose.  

How did Jim act on this philosophy?  He was constantly reaching out, offering to read papers, and without fail he would find something to get excited about, something to encourage.  He wrote wonderful tenure and promotion letters, taking the time to think hard about each element of a candidate's work, listening hard to their personal research and teaching vision statements, and invariably looking for the positive story.  He was behind that first phone call I got from DARPA, and later, when I launched companies, he helped introduce me to potential customers who might have been very, very nervous without his endorsement.  For Jim, strengths always outweighed weaknesses, risk was a virtue, and even a project that ultimately failed often yielded insights we could build upon.  Where others saw high-risk work that rejected things they had worked on, Jim loved debating and took actual pleasure in finding weaknesses in his own past approaches.  He invariably viewed edgy work as having promise.  He could be blunt when something didn't work, but he didn't view failure as an individual weakness.  Instead he would ask why this happened, what assumption had we made that turned out to be wrong, and what intuition we needed to learn to question.  There was always a next step, built on insights from the first step.  

This was a style of assessment that we might all consider emulating.