jon pipitone

Notes from my graduate studies at the University of Toronto in the Department of Computer Science.

The Munk Debates: Climate Change

Thursday, November 12, 2009

Be it resolved that climate change is mankind's defining crisis, and demands a commensurate response.
On December 1st, the Munk Debates feature a debate on Climate Change: George Monbiot and Elizabeth May take on Bjørn Lomborg and Lord Nigel Lawson. Tickets to attend the event live are sold out, but you can sign up for seats at an overflow showing or sign up to watch the debate as a webcast.

Scientific software quality: what would it take to convince software engineers?

Monday, November 9, 2009

As I mentioned, last week at the CSER conference I presented a poster summarising my study. I got some great feedback about the coherence and integrity of my study, as well as a great question: what would it take to convince the software engineering community of anything about the quality of scientific software?

I had about eight people stop by and actually engage in conversation with me. For most people (mostly other students), this was their first time thinking about scientific software as a type of software worth studying on its own. I had to walk slowly through the first slide to explain how computational scientists use software models to do science. I think this motivated the research questions well, since after the first slide most people understood that climate models are being built and used in an entirely different context from, say, accounting software: there isn't the same certainty about what correct output should look like.

I then explained my approach and findings. Almost everyone had one, or both, of the following reactions:
  • I'm not convinced.
Yup, just the one. I knew this was their reaction because either people would say so, or they would stare at the poster uneasily until I asked them what they felt about the study. Some felt that whilst the different parts of the study were reasonable in their own right, the study didn't "gel" in a way that either built up (or eroded) their confidence in the quality of climate modeling software. They felt just as unsure about it as before I had explained my study and results.

Here's one example. With one fellow, a professor of software engineering (let's call him Mr. B), I pointed to the defect density chart. This one here:



And I asked whether it gave him any insight or feel for the software quality of the models. He said that yes, this chart led him to believe that Models B and C were of good quality, and Model A was not of great quality. When I asked whether it was the absolute defect rates or the comparison between the different models that led him to believe this, Mr. B said it was the relative rates. But then he also agreed that all of the models had defect densities that were "good". Hmmm... wait, what?

After going through the rest of the study, I asked him what he thought of the software quality of climate models and he said that they were "not very good". He maintained this even when I pointed to the defect density chart earlier on. I asked why he felt that, and he said that because there are so many ways for the climate scientists to interpret and represent their results that they may, even unknowingly, present their models as being more accurate than they really are. He felt that the models could be full of bugs and the scientists may never notice.

To be clear, I'm not convinced either way about the software quality of climate models. I'm trying to leave my opinions for later, once I've collected more interview data at least (and I've expressed before how poor I think measures like defect density are at gauging quality). But one thing I've come to understand about the climate modelers is that they care very deeply about the correctness of their code -- they most certainly do not take it for granted. I pointed this out to Mr. B, and I explained some of the ways I've learnt that the climate modelers verify and validate their models.

Mr. B was still unconvinced. And so, as I say, neither were many others.

I understand this. I'm not sure that what I've done convinces me that climate modeling software is "good" or "bad" either. But maybe good/bad and high/low isn't a meaningful way to think about quality. Still, I have this urge to distill what I've learnt about scientific software quality into a quantity. I dearly want some objective measuring stick or benchmark to be able to judge/compare/assess software quality. I think this is what the folks at the poster session wanted too, and didn't get from my poster.

Near the end of the conversation, I asked everyone this final question: what would convince you, as a software engineer, that a climate model is of good software quality or not? I asked this question at the CASCON workshop as well. No one had an answer. In fact, most people just dismissed the question with a laugh. Is it that silly of a question? I think it's a great one (but sure, maybe a touch rhetorical).

I've asked a few climate scientists the same question in earnest: what convinces you that climate model software is of good quality or not? The answers have been quite varied. Knowing the history of the model, or the development team, the state of the documentation, whether they've seen the model code or not, and generally how open the development is, are some of the things that factor into their assessment. Defect densities do not.

What I appreciate about the climate scientists' answers is that they focus more on the internals of the development process than they do on metrics. My interpretation of this is that knowing that the modeling group is following the right processes for building the model, and being able to verify that yourself, is a better indicator of the software quality than any defect metric. (I'm of the mind that this is true in general of software quality assessment... but this is quite a new thought for me and I haven't spent much time thinking through the implications.)

But anyhow, what this leads me to is the idea that maybe a more satisfying way of assessing software quality is some sort of "maturity" assessment, like what is done as part of the CMMI process. "Best practices" for developing and verifying climate models could be established by the community, against which each modeling centre could be assessed. From my interviews so far I can already suggest some of the factors that would go into such a standard (check the last two slides).

I'd guess that this sort of assessment scheme would be acceptable to climate scientists, but would it convince the software engineering community? If not, what would need to be included in the assessment? Or, is it just a matter of educating the software engineering community as to the nature of scientific software?

Addendum: As Jorge suggested, maybe I received such suspicion at the poster session simply because of the venue. Poster sessions rank at the bottom of the academic credibility scale, so maybe everyone at the poster session is bound to be suspicious.

Vegetables

Saturday, November 7, 2009


Yep, this post is not about research. It's about vegetables. Specifically, vegetables I just received from two of my friends, Tarrah and Nathan. Just two years ago they bought a farm up near Neustadt, Ontario. They are making a go of starting up an organic, mixed vegetable and livestock farm: Green Being Farm (website under construction).

If you know me, you know that I often affectionately talk about working there last summer. And if you know me, you may have also had the opportunity to eat some of the produce grown at their farm last year (those of you in the lab may remember the potato fairy delivering piles of potatoes to your desk last summer), or you may have been one of the lucky folks to get in on their amazing pastured pork, chicken or turkey.

To eat good, local organic food is a real treat. To eat good, local, organic food that friends of mine grew is amazing. I'm so proud of them and what they are doing, and I'm so proud to be eating their produce. I regularly have doubts about the benefit of my research to the world, but I never doubt the benefit of what people like Tarrah and Nathan are up to.

Modeling the solutions to climate change, part I complete.

Friday, November 6, 2009

A couple of weeks ago I mentioned a side project a few of us in the software engineering group are undertaking. In short, the purpose is to explore the idea of modeling (graphically, not in code) the various solutions to climate change proposed in a couple of recently published books. The objective is not only to model the solutions, but to model them in such a way as to make it easy to explore the differences and similarities between the different solutions. We have restricted ourselves to just looking at wind power, as a way to make this experiment do-able in a shortish period of time.

Our plan has been to look at each book in turn and complete several different types of model for each book: an ER model, a goal model, and maybe a systems dynamics model. Once we've completed those, we'll explore how to relate them to one another.

We began by looking at David MacKay's book. Last time I posted the ER diagrams and the start to the goal models. Well, we're done the goal models now, so here they are.

This first goal model covers the first chapter of the book. This has very little to do with wind power, but gives us the actors and some context from which to build out the wind power model.




And here's the wind power model:



Even if you don't understand i* syntax you should be able to follow these diagrams with some success. Again, I should note that these models represent the view of MacKay himself, as interpreted by our motley group.

Having completed these models marks the end of the first stage of this experiment. The next stage is to make the same types of models for another book in the series.

I have a few observations to make on the modeling process and this project. This is the first time I've ever been involved in formally modeling anything. The process was, on the whole, frustrating and unsatisfying. I would leave each session feeling like, whilst we had put up more boxes and arrows on a canvas, we were somehow missing the essence of the text. When we made the entity-relationship diagrams we often seemed to be forcing syntax into the labels on the relationships. The goal models seemed equally inadequate, albeit in a different way: there were concepts we all wanted to represent (e.g. facts like "typical wind speed is 6 m/s") but there was no way to say this in i*. Jen, our modeling guru who ran most of our sessions, would often tell us that what we were trying to say wasn't easily expressed in a goal model or ER diagram and that this frustration was something she was familiar with from her own work. It felt like putting on a one-armed sweater and fishing around with one hand for the other, non-existent, arm hole. This concern of mine is really just me saying I don't think that the modeling languages we are using are rich enough to really capture what is important. Or also that they don't seem to capture it in "the right way"; that they don't "carve nature at its joints" (Plato).

I'm open to this one possibility though: maybe this dissatisfaction points to what is actually a helpful and normal aspect of using several different modeling approaches. Each approach on its own may not be able to represent the full meaning of the text, but taken together they may. And also, by constraining the modeling to specific concerns (entities in the ER diagrams, and goals and actors in the i* models) we are actually providing a more clear picture of the text, and in the future, a more clear picture of the differences between different texts. For instance, it may be easier (at the very least in terms of visual clutter) to show the differences between the basic concepts used in the texts by somehow showing differences in the ER diagrams, without having to, at the same time, navigate actors and goals and process concepts.

This is all just speculation, of course. I'm just trying to say that my uneasiness may just be because I'm unfamiliar with thinking about something in parts like this.

There's another thing I'm unsatisfied with. If our goal is to show the differences between several types of texts, I fear we may be going about doing this in an unhelpful way. I have nothing to base this on other than just a sinking feeling I get. To me, the real work in this challenge is about how to represent and navigate differences in large and complex structures. In some sense, doing the modeling is easy, or at least known. Trying to explain how one model is different from another in a way that is actually helpful and useful... I have no idea. What we don't have is a picture of what we want this model "diff" to look like.

One thought I had: we start by writing up our own summary of the differences between two of the texts. That is, we read two of the books on the list and then collectively write up an essay that compares and contrasts the two authors' solutions in as much detail as we think is relevant. Then we model that document. Doing this will give us the goal posts and probably teach us heaps about what and how to make a visual comparison useful.

Validity and soundness in scientific software

Wednesday, November 4, 2009

In today's workshop on Software Engineering for Science we spent quite a bit of time discussing the different levels of correctness of scientific software. I was surprised since I had thought some of this was pretty basic stuff. After a bit of reflection I wonder if it isn't because we don't have common terms for these ideas.

To be clear, I'm referring to verification and validation. These activities are summed up by the questions, "Are we building the right thing?" (validation) and "Are we building the thing right?" (verification). Another way of looking at this is that verification is the act of checking that software meets its specifications, whereas validation is checking that software meets its requirements.

This comes up when you talk about scientific software since in many cases the software is supposed to enact a theory or mathematical model. Validation checks that the mathematical model is accurate, whereas verification checks that the software implements the mathematical model accurately.

Clearly we have words for "verification" and "validation", though I don't remember these words being used much today, or at all. The fact that they aren't commonly used and that we needed to discuss the distinction between these activities is curious to me.

But more so, whilst we have the words to discuss the activities we don't seem to have adjectives to refer to the software itself. (Do we? Tell me if we do.) I suppose we could use the terms "verified software" and "validated software". "Verified" is overloaded though. I immediately want to ask "by whom?", as if the term refers to software inspected and given a stamp of approval by an outside agency. "Validated software" seems okay though.

Borrowing from formal logic, could we refer to the "soundness" and "validity" of software?

Privilege

This deserves a much more in-depth discussion which I'm not going to go into here. But I wanted to just take a moment to publicly recognise how privileged I feel, and am, in school. Of course it's not just in being at school that I'm privileged.. it's the country I live in, the socio-economic class I am part of, the people I know, my ethnicity, and so on. And school is a whole other level of privilege.

Today leaving the CASCON conference with two of my colleagues I thought again about how damned lucky I am to be a student here. This is truly a luxurious life. I spent today sitting around a table in a warm room talking with other students and professors about whatever the hell interested us at the moment. We talked while we ate our free lunch. (I repeat, we had a free lunch!) After that we went into another room and talked some more. Again, we talked about whatever interested us. At some point we paused to have tea and stretch. Then we returned to talking until we had had enough. A few of us went home together and spent the entire trip discussing ideas for tomorrow. It was a day of ideas.

And that was a day of work. Ah-mazing. When I'm not at a conference I get to spend an entire day at a sunny desk, spending my day as I please, reading, talking to people, making notes to myself, and generally working on projects as I please.

I feel so so lucky and grateful to be here. It's a fullness of feeling which I'm not sure I can explain all that well. The flip side is that I also feel upset at myself for the times when I take this life for granted. I find it easy to do. Take it for granted, I mean. There are times when, to the exclusion of other feelings, I feel worried about my future, or about a deadline, or how my research project might turn out, etc... But, peanuts! I am a king!

I'm not sure why, but I feel compelled to acknowledge and mention this right now. Maybe just as a reminder for myself. But I'd appreciate hearing any thoughts you have on this topic; so use the comments.

CSER poster session

Monday, November 2, 2009

This week I attended the poster session at the CSER gathering. This was a great thing to do for a few reasons. Just creating the poster helped me pull together some of my thoughts and results so far. In the same vein, just having to pitch my study and explain what I've been up to helped to clarify my thoughts or bring up new questions. Then, of course, there's the feedback and criticism I get from the attendees, and the new questions they raise (intentionally or otherwise). It's also just fun and validating to have people listen to what I've been up to and engage in a discussion about it... makes me feel like I'm doing something worth talking about.

My poster was in the form of nine "slides". Here they are, with a bit of explanation about each of the slides.





My study is, as you know, still underway. What I'm presenting here are the method and some preliminary results. I wanted to present at CSER because I wanted to hear what other people would say about some of my findings so far, and whether anyone would have suggestions of where to go next.



As a motivation for my study, consider how the computational scientist qua climatologist goes about trying to learn about the climate. In order to test their theories of the climate, they would like to run experiments. Since they cannot run experiments on the climate they instead build a computer simulation of the climate (a climate model) according to their theories and then run their experiments on the model.

At every step there are approximations and errors introduced. Moreover, the experiments that they run cannot all be replicated in the real world, so there is no "oracle" they can use to check their results against. (I've talked about this before.) All of this might lead you to ask... Why do climate modelers trust their models? Or...


... for us as software researchers, we might ask: why do they trust their software? That is, irrespective of the validity of their theories, why do they trust their implementation of those theories in software?

The second question should actually read "What does software quality mean to climate modelers*?"

As I see it, you can try to answer the trust question by looking at the code or development practices, deciding if they are satisfactory and, if they are, concluding that the scientists trust their software because they are building it well and it is, in some objective sense, of high quality.

Or you can answer this question by asking the scientists themselves why they trust their software -- what plays into their judgment of good quality software. In this case the emphasis in the question is slightly different: "Why do climate modelers trust their software?"

The second, and to some extent third, research questions are aimed here.

* Note how I alternate between using "climate scientist" and "climate modeler" to reference the same group of people.


My approach to answering these questions is to do a defect density analysis (I'm not sure why I called it "repository analysis" on my slides. Ignore that) of several climate models. Defect density is an intuitive and standard software engineering measure of software quality.

The standard way to compute defect density is to count the number of reported defects for a release per thousand lines of code in that release. There are lots of problems with this measure, but one is that it is subject to how good the developers are at finding and reporting bugs. A more objective measure of quality may be their static fault density. So I did this type of analysis as well.
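As a minimal sketch of that calculation (the function name and the numbers below are made up for illustration and are not taken from the study), defect density is just a ratio of reported defects to thousands of lines of code:

```python
def defect_density(defect_count, lines_of_code):
    """Return reported defects per thousand lines of code (KLOC)."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return defect_count / (lines_of_code / 1000)


# Hypothetical release: 120 reported defects in 400,000 lines of code.
example_density = defect_density(defect_count=120, lines_of_code=400_000)
print(f"{example_density:.2f} defects per KLOC")
```

Note that the result depends entirely on what gets counted as a defect and as a line of code, which is exactly the weakness described above.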

Finally, I interviewed modelers to gather their stories of finding and fixing bugs as a way to understand their view and decision-making around software quality.

There are five different modeling centres participating in various aspects of this study.




A very general definition of a defect is: anything worth fixing. Deciding what is worth fixing is left up to the people working with the model, so we can be sure we are only counting relevant defects.

Many of the modeling centres I've been in contact with use some sort of bug tracking system. That makes counting defects easy enough (the assumption being that if there is a ticket written up about an issue, and the ticket is resolved, then it was worth fixing and we'll call it a defect).

Another way to identify defects is to look through the check-ins of the version control repository and decide if the check-in was a fix for a defect simply by looking at the comment itself. Sure, it's not perfect, but it might be a more reliable measure across modeling centres.
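To give a feel for what judging a check-in by its comment might look like if automated, here is a rough sketch. The keyword list and the function name are my own assumptions for illustration; they are not the criteria actually used in the study, where the judgment may well have been made by reading the comments.

```python
import re

# Hypothetical keyword list: an assumption for illustration only.
DEFECT_PATTERN = re.compile(
    r"\b(bug|bugfix|fix(es|ed)?|defect|error|crash)\b",
    re.IGNORECASE,
)


def looks_like_defect_fix(comment):
    """Guess whether a check-in comment describes a defect fix."""
    return DEFECT_PATTERN.search(comment) is not None


comments = [
    "Fixed array bounds error in radiation scheme",
    "Add new aerosol diagnostics",
    "Rename prefix variable for clarity",
]
flagged = [c for c in comments if looks_like_defect_fix(c)]
print(flagged)
```

A keyword match like this will produce both false positives and false negatives, which is the "not perfect" part; its appeal is that it can be applied the same way to every repository.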


Presented here is the defect density for an arbitrary version of the model from each of the modeling centres. For perspective, along the x-axis of the chart I've labeled two ranges "good" and "average" according to Norman Fenton's online book on software metrics. I've included a third bar, the middle one, that shows the defect density when you consider only those check-in comments which can be associated with tickets (i.e. there is a reference in the comment to a ticket marked as a defect).

The top, "all defects", bar is the count of check in comments that look like defect fixes. I have included in the count all of the comments made 6 months before and after the release date. You can see that bar is divided into two parts. The left represents the pre-release defects, and the right represents the post-release defects.
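The windowing described here can be sketched as follows. This assumes "6 months" means roughly 183 days and that a fix checked in on the release date itself counts as post-release; both choices, and the names used, are assumptions for illustration rather than details from the study.

```python
from datetime import date, timedelta

# Roughly six months; the exact window length is an assumption.
WINDOW = timedelta(days=183)


def split_pre_post(fix_dates, release):
    """Count defect-fix check-ins in the window before and after a release."""
    pre = sum(1 for d in fix_dates if release - WINDOW <= d < release)
    post = sum(1 for d in fix_dates if release <= d <= release + WINDOW)
    return pre, post


# Hypothetical check-in dates for comments that look like defect fixes.
release_date = date(2009, 6, 1)
fix_dates = [
    date(2008, 10, 1),  # too early, outside the window
    date(2009, 1, 15),  # pre-release
    date(2009, 6, 1),   # release day, counted as post-release here
    date(2009, 9, 1),   # post-release
    date(2010, 3, 1),   # too late, outside the window
]
pre_count, post_count = split_pre_post(fix_dates, release_date)
print(pre_count, post_count)
```

The two counts correspond to the left and right parts of the "all defects" bar, before dividing by the size of the release.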

As yet, the main observation I have is that all of the models have a "low" defect density however you count defects (tickets, or check-in comments).

It's also apparent that the modeling centres use their ticketing systems to varying degrees, and that they have different habits about referencing tickets in their check-in comments.





I ran the FLINT tool over a single configuration of (currently only two) climate models. The major faults I've found are about implicit type conversion and declaration. As well, there is a significant (but small) portion of faults that suggest dead code. Of course, because I'm analysing only a single configuration of the model, I can't be sure that this code is really dead. I've inspected the code where some of these faults occur and I've found instances of both genuinely dead code and of code that is live in other configurations.

One example of dead code I found came from a module that had a collection of functions to perform analysis on different array types. The analysis was similar for each function, with a few changes to the function to handle the particularity of the array. The dead code found in this module was variables that were declared and set but never referenced. My guess from looking at the regularities in the code is that because the functions were so similar, the developers just wrote one function and then copied it several times and tweaked it for each array type. In the process they forgot to remove code that didn't apply.
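To make the "set but never referenced" pattern concrete, here is a toy illustration in Python. It is only an analogy: it is not how FLINT works and the sample functions are invented, not taken from any climate model. It walks each function's syntax tree and reports names that are assigned but never read.

```python
import ast


def unused_assignments(source):
    """Map each function name to the variables it assigns but never reads."""
    result = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            stored, loaded = set(), set()
            for sub in ast.walk(node):
                if isinstance(sub, ast.Name):
                    if isinstance(sub.ctx, ast.Store):
                        stored.add(sub.id)
                    elif isinstance(sub.ctx, ast.Load):
                        loaded.add(sub.id)
            result[node.name] = stored - loaded
    return result


# Invented sample: mean_1d looks like a tweaked copy of mean_2d
# that kept a variable it no longer needs.
SAMPLE = '''
def mean_2d(rows):
    n_cols = len(rows[0])
    total = sum(sum(r) for r in rows)
    return total / (len(rows) * n_cols)

def mean_1d(values):
    n_cols = 0
    total = sum(values)
    return total / len(values)
'''

report = unused_assignments(SAMPLE)
print(report)
```

In this sample the leftover `n_cols` in `mean_1d` is the kind of copy-and-tweak residue described above.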


Unfortunately, I have as yet only been able to interview a couple of modellers specifically about defects they have found and fixed. I have done a dozen or so interviews with modelers and other computational scientists to talk about how they do their development and about software quality in general. So this part of the study is still a little lightweight, and very preliminary.

In any case, when I've done the interviews I ask the modelers to go through a couple of bugs that they've found and fixed. I roughly asked them these questions.

Everyone I've talked to is quite aware that their models have bugs. This they accept as a fact of life. Partly this is a comment on the nature of a theory being an approximation, but they also include software bugs here too. Interestingly, they still believe that, depending on the bug, they can extract useful science from the model. One interviewee described how in the past, when computer time was more costly, if scientists found bugs part way through a 6-month model run they might let the run continue and publish the results, but include a note about the bug they found and an analysis of its effect.



The other observation I have is connected to the last statement on the previous slide, as well as this slide.

Once the code has reached a certain level of stability, but before the code is frozen for a release of the model, scientists in the group will begin to run in-depth analysis on it. Both bug fixes and feature additions are code changes that have the potential to change the behaviour of the model, and so invalidate the analysis that has already been done on the model. This is why I say that some bugs can be treated as "features" of a sort: just an idiosyncrasy of the model. Similarly, a new feature might be rejected as a "bug" if it's introduced too late in the game.

In general, the criticality of a defect is in part judged on when it is found (like any other software project I suppose). I've identified several factors that I've heard the modellers talk about when they consider how important a defect is. I've roughly categorised these factors into three groups: concerns that depend on the project timeline (momentum), concerns arising from high-level design and funding goals (design/funding), and the more immediate day-to-day concerns of running the model (operational). Very generally, these concerns have more weight at different stages in the development cycle which I tried to represent on the chart.

Describing these concerns in detail probably involves a separate blog post.
