Notes from my graduate studies at the University of Toronto in the Department of Computer Science.

Scientific software quality: what would it take to convince software engineers?

Monday, 9 November, 2009

As I mentioned, last week at the CSER conference I presented a poster summarising my study. I got some great feedback about the coherence and integrity of my study, as well as a great question: what would it take to convince the software engineering community of anything about the quality of scientific software?

I had about eight people stop by and actually engage in conversation with me. For most people (mostly other students), this was their first time thinking about scientific software as a type of software worth studying on its own. I had to walk slowly through the first slide to explain how computational scientists use software models to do science. I think this motivated the research questions well, since after the first slide most people understood that climate models are being built and used in an entirely different context from, say, accounting software: there isn't the same certainty about what correct output should look like.

I then explained my approach and findings. Almost everyone had one, or both, of the following reactions:
  • I'm not convinced.
Yup, just the one. I knew this was their reaction because either people would say so, or they would stare at the poster uneasily until I asked them questions about what they felt about the study. Some felt that whilst the different parts of the study were reasonable in their own right, the study didn't "gel" in a way that either built up (or eroded) their confidence in the quality of climate modeling software. They felt just as unsure about it as before I had explained my study and results.

Here's one example. With one fellow, a professor of software engineering (let's call him Mr. B), I pointed to the defect density chart. This one here:



And I asked whether it gave him any insight or feel for the software quality of the models. He said that yes, this chart led him to believe that Models B and C were of good quality, and Model A was not of great quality. When I asked whether it was the absolute defect rates or the comparison between the different models that led him to believe this, Mr. B said it was the relative rates. But then he also agreed that all of the models had defect densities that were "good". Hmmm... wait, what?

After going through the rest of the study, I asked him what he thought of the software quality of climate models and he said that they were "not very good". He maintained this even when I pointed to the defect density chart earlier on. I asked why he felt that, and he said that because there are so many ways for the climate scientists to interpret and represent their results that they may, even unknowingly, present their models as being more accurate than they really are. He felt that the models could be full of bugs and the scientists may never notice.

To be clear, I'm not convinced either way about the software quality of climate models. I'm trying to leave my opinions for later, once I've collected more interview data at least (and I've expressed before how poor I think measures like defect density are at gauging quality). But, one thing I've come to understand about the climate modelers is that they care very deeply about the correctness of their code -- they most certainly do not take it for granted. I pointed this out to Mr. B, and I explained some of the ways I've learnt that the climate modelers verify and validate their models.

Mr. B was still unconvinced. And so, as I say, neither were many others.

I understand this. I'm not sure that what I've done convinces me that climate modeling software is "good" or "bad" either. But maybe good/bad and high/low isn't a meaningful way to think about quality. Still, I have this urge to distill what I've learnt about scientific software quality into a quantity. I dearly want some objective measuring stick or benchmark to be able to judge/compare/assess software quality. I think this is what the folks at the poster session wanted too, and didn't get from my poster.

Near the end of the conversation, I asked everyone this final question: what would convince you, as a software engineer, that a climate model is of good software quality or not? I asked this question at the CASCON workshop as well. No one had an answer. In fact, most people just dismissed the question with a laugh. Is it that silly of a question? I think it's a great one (but sure, maybe a touch rhetorical).

I've asked a few climate scientists the same question in earnest: what convinces you that climate model software is of good quality or not? The answers have been quite varied. Knowing the history of the model, or the development team, the state of the documentation, whether they've seen the model code or not, and generally how open the development is, are some of the things that factor into their assessment. Defect densities do not.

What I appreciate about the climate scientists' answers is that they focus more on the internals of the development process than they do on metrics. My interpretation of this is that knowing that the modeling group is following the right processes for building the model, and being able to verify that yourself, is a better indicator of the software quality than any defect metric. (I'm of the mind that this is true in general of software quality assessment... but this is quite a new thought for me and I haven't spent much time thinking through the implications.)

But anyhow, what this leads me to is the idea that maybe a more satisfying way of assessing software quality is some sort of "maturity" assessment, like what is done as part of the CMMI process. "Best practices" for developing and verifying climate models could be established by the community, against which each modeling centre could be assessed. From my interviews so far I can already suggest some of the factors that would go into such a standard (check the last two slides).

I'd guess that this sort of assessment scheme would be acceptable to climate scientists, but would it convince the software engineering community? If not, what would need to be included in the assessment? Or, is it just a matter of educating the software engineering community as to the nature of scientific software?

Addendum: As Jorge suggested, maybe I received such suspicion at the poster session simply because of the venue. Poster sessions rank at the bottom of the academic credibility scale, so maybe everyone at the poster session is bound to be suspicious.

22 comments:

gmcrews said...

Hi Jon,

What would it take to convince software engineers of the quality of scientific software? IMHO, absolutely convincing would be a successful independent audit of the SQA (software quality assurance) "objective evidence" and other documentation for the scientific software.

What is the state-of-the-art of SQA? It is pretty subjective and as far as I know something like the following.

The first step in determining an appropriate SQA level-of-effort for a scientific software development effort would be to categorize the potential consequences of software defects. The more dire the software defect consequences, the more effort that could be justified in avoiding, correcting, mitigating, or compensating for defects. Or the less the consequences, the less the effort needed.

Once an appropriate SQA category has been determined, the criteria against which these quality assurance activities are to be judged/audited must be established and graded. (Grading refers to the rigor and emphasis given to the criteria.) The toughest scientific/engineering software quality assurance criteria I know of are those for nuclear facilities. See 10 CFR Part 830, Subpart A. These criteria could be appropriately tailored for most any type of scientific software.

The software quality assurance plan that embodies these tailored/graded criteria and actually defines the level-of-effort would then be written and executed.

If a scientific software program went through all this effort, I don't see how its quality could seriously be doubted.

Ernie said...

I think you are asking a question that makes people uncomfortable. How do we know *any* software is high quality? It's easy to devise metrics and assessments, but do any of them establish quality?

What about comparing your domain to other domains, for example, space software? How does NASA establish quality? What about consumer software? How does MSFT establish quality? What does quality even mean to them?

Presumably in scientific software, the software is the theory, so quality is measured in the way a theory is measured. One approach might be falsifiability - is there something the model does that can be shown wrong?

As an aside, are you sure that a 'by-the-numbers' survey is the best way to do this? Would a case study tell us more about the basic constructs in the domain you are studying? (subject to M.Sc. time constraints!)

gmcrews said...

@Ernie

I think it is the subjectivity of quality that is making people uncomfortable. IMHO, the climate computer models can contain significant bugs and shortcomings and the scientists (and the public) can still remain very confident that the models are useful. Whereas for the command and control software for a NASA space probe or for a nuclear power reactor, even the remote possibility of a bug would make the engineers (and the public) very nervous.

The nature, rigor, and emphasis given to the SQA criteria depend on the results of a risk assessment of the consequences of using the software. In the real world, all this is subjective and usually informal.

On your comment that "quality is measured in the way a theory is measured." I agree, but would make the statement more complete. Not only must the theory be correct, but it must be embodied in code correctly. The theory can be beautiful, but the software full of bugs. (This gets back to Jon's discussion of validation and verification.)

For example, for the climate models, it could very well be that "the science is settled." However, that does not automatically mean that the software engineering activities that were performed to develop, modify, maintain, and use (interpret) the climate computer models are of high quality. The climate computer models are engineering products, not science products. Settled theory is not the goal, at least not the goal the public is concerned with. We want climate model predictions we can be confident in. We must be assured of the quality of the *code* before the science can become useful.

Robin Norwood said...

Have three different teams do closed-door development based upon the same specification. If two teams produce software models that agree closely, they probably did it right. :-)

regebro said...

Easy. Run the climate model. If it "predicts" known historical data reasonably accurately, it's a good candidate. If it can predict next year's weather reasonably accurately too, I'm convinced it's reasonably accurate.

What I lack is websites that show climate model predictions and how they turned out. I have no clue how accurate current climate models are, because this data simply does not seem to be available even with quite a lot of reading and searching on the web.

Michael Tobis said...

regebro asks what looks like a reasonable question, but it's based on a fundamental misunderstanding, one that comes down to the difference between weather and climate. We have very little skill predicting one year out, even assuming no volcanoes and such. Most of the predictability of the detailed state of the atmosphere vanishes in three weeks or so.

But at a multidecadal time scale the problem changes character. Technically, we are no longer dealing with an initial value problem but with a boundary value problem, even though the underlying dynamics are the same.

We are not in that case looking for details in any specific year, but for the statistics over an extended period. In a mathematical sense that is an "easier" problem; it is more constrained by energy balances than by nonlinear fluid dynamics. The messy stuff basically averages out and the residual is what we try to predict.

In fact, maybe I'll use that as a definition of climate. It's "what you can say about the system after the messy unpredictable part gets averaged out".

That's the basis for climate change modeling.
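The distinction between the unpredictable detail and the stable statistics can be illustrated with a toy chaotic system. The sketch below is only an analogy, not a climate model: it uses the logistic map (with the parameter r = 3.9, an arbitrary choice in the chaotic regime) and made-up starting values. Two runs that differ by a tiny perturbation soon disagree completely step by step, yet their long-run averages stay close.

```python
def logistic_trajectory(x0, steps, r=3.9):
    """Iterate the logistic map x -> r * x * (1 - x) and return all states."""
    xs = []
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
        xs.append(x)
    return xs


def compare_runs(x0=0.2, perturbation=1e-9, steps=200_000):
    """Run two nearly identical initial conditions and compare them.

    Returns (early_gap, mean_gap):
      early_gap: largest step-by-step difference in the first 200 steps
                 (the "weather"-like detail).
      mean_gap:  difference between the two long-run averages
                 (the "climate"-like statistic).
    """
    a = logistic_trajectory(x0, steps)
    b = logistic_trajectory(x0 + perturbation, steps)
    early_gap = max(abs(p - q) for p, q in zip(a[:200], b[:200]))
    mean_gap = abs(sum(a) / steps - sum(b) / steps)
    return early_gap, mean_gap


if __name__ == "__main__":
    early_gap, mean_gap = compare_runs()
    print("largest early step-by-step gap:", early_gap)
    print("gap between long-run means:   ", mean_gap)
```

The step-by-step gap plays the role of the initial value problem and the gap between the means plays the role of the averaged-out statistics; the analogy says nothing about how well any real model captures those statistics.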

As for Jon's question, the models are wretched pieces of engineering. No commercial shop would release anything nearly as balky, hard to deploy, or prone to failure. It makes open source look good. And that has almost no bearing on whether they are suitable for the purpose.

In fact, sometimes models are used well and sometimes they are used badly. This in turn is a scientific, not a software question.

All this said, I desperately wish the software were better, and I think we could address many more scientifically meaningful problems much more effectively if it were.

Finally, if you think the question is "global warming, yes or no" the large models in question are much less relevant than many people would have you believe. The answer to that question is yes, to the extent of about 3 degrees per CO2 doubling.

The idea that such a conclusion comes from complex models is wrong.

manuelg said...

Mr. Tobis's comment cannot be improved upon, by myself.

I will forgive Mr. Tobis for the dig against open source software ;) I maintain the difference in quality between the very best and the very worst engineered open source software projects makes it very difficult to say anything sensible about the totality.

Also, metrics in software development cannot predict the quality of output of any particular group working on a particular problem - too many confounding issues. For example, the developer with the highest bug count tends to be the best developer on the team - nobody else is trusted to tackle the hardest coding issues.

Another issue is making scientific computations reproducible. Even more basic than whether a particular computation is correct is making sure the computation can be reproduced by another group.

Matthias Schwab, Martin Karrenbach, and Jon Claerbout. "Making scientific computations reproducible." Computing in Science and Engineering, Volume 2, Issue 6 (November 2000), pages 61-67. ISSN: 1521-9615.
http://portal.acm.org/citation.cfm?id=369555

gmcrews said...

@Michael Tobis

I found your initial-value/boundary-value comment very interesting Dr. Tobis and would like the chance to ask you further questions about it. But decorum requires I try and stay on topic.

And Jon's topic deals with what it would take to convince software engineers of the quality of scientific software, particularly the climate models. I see from your own blog's author profile that you perform software engineering on climate models, so your opinion must be relevant. In your opinion, what would it take to make these "wretched pieces of engineering" better so they can be used to "address many more scientifically meaningful problems much more effectively?" (Not to mention the other problems of climate change.) Any low hanging fruit? An idea for a possible new approach to SQA for scientific software? Thanks in advance.

@regebro

Are you familiar with Climateprediction.net? From their website: "Climateprediction.net is a distributed computing project to produce predictions of the Earth's climate up to 2080 and to test the accuracy of climate models."

Robin Norwood said...

@regebro

The problem with your approach is that it tests both the model and the code. If the predictions turn out to be inaccurate, you have no way of knowing if the model is flawed or the code. Correct code could produce inaccurate predictions if it implements a flawed model. It's even conceivable (though very unlikely) that flawed code (i.e., code that does not do what the model says) could produce accurate predictions!

Consider NASA, another case where software quality is of the utmost importance.

NASA ensures quality by following extremely conservative software engineering practices (to match their conservative engineering practices). This makes sense when flawed software means the loss of millions of dollars of equipment, many man-years of effort, and possibly even death. Even with all of their effort, they sometimes produce errors, such as with the Mars Climate Orbiter. The 'good' news there is that if there is a critical failure in the software, the results are obvious. If there's a non-critical failure, either the result is so small that it doesn't affect the mission objectives, or it is noticed and worked around.

With something like climate modeling, though, the failure may not be obvious. There is no way to check the results if the software is the only thing capable of producing them. Producing non-trivial amounts of code that is provably correct is (so far) an insurmountable task.

Another problem with the NASA model for something like climate modeling is that the added overhead makes development very slow - I imagine that a quick turn-around is pretty important when developing modeling software.

So, I really wasn't kidding in my previous comment. Have more than one team produce software to the same specification. They are unlikely to introduce the same bugs, so if the results agree, I'd be fairly confident that they are correct.
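As a rough illustration of the idea, here is a minimal sketch in which two "teams" implement the same made-up specification (an area-weighted mean temperature, weighting each latitude band by the cosine of its latitude) and a small harness checks whether the results agree. The specification, function names, and input values are all hypothetical. Agreement only raises confidence; both versions could still share the same misreading of the spec.

```python
import math


def global_mean_loop(temps, lats_deg):
    """Team A: explicit loop accumulating the weighted sum and total weight."""
    total = 0.0
    weight_sum = 0.0
    for temp, lat in zip(temps, lats_deg):
        weight = math.cos(math.radians(lat))
        total += weight * temp
        weight_sum += weight
    return total / weight_sum


def global_mean_normalised(temps, lats_deg):
    """Team B: same spec, but the weights are normalised before summing."""
    weights = [math.cos(math.radians(lat)) for lat in lats_deg]
    norm = math.fsum(weights)
    return math.fsum(temp * weight / norm for temp, weight in zip(temps, weights))


def implementations_agree(temps, lats_deg, rel_tol=1e-9):
    """Return True if the two independent implementations agree closely."""
    a = global_mean_loop(temps, lats_deg)
    b = global_mean_normalised(temps, lats_deg)
    return math.isclose(a, b, rel_tol=rel_tol)


if __name__ == "__main__":
    # Made-up latitude bands (degrees) and temperatures (kelvin).
    lats = [-60.0, -30.0, 0.0, 30.0, 60.0]
    temps = [270.0, 285.0, 300.0, 285.0, 270.0]
    print("implementations agree:", implementations_agree(temps, lats))
```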

jon said...

Well now, this is some great discussion. Thank you everyone. I will reply in parts.

gmcrews: You suggest an independent audit of the software quality assurance (the practices used to ensure software quality).

I think this is where I was heading with my comment about some sort of maturity benchmark. If we (as software engineers) and the climate modelers can come up with a set of development guidelines and, as you say, criteria for quality, then we could grade or benchmark the models against this. As I say, I like the idea of looking at the modelers' development practices as indicators of quality, not just defect rates.

So, the question becomes, what development practices/criteria would you need in the SQA assessment to convince you that the modelers are on track?

Also, it's interesting to hear you state that "the climate computer models can contain significant bugs and shortcomings and the scientists (and the public) can still remain very confident that the models are useful". I've heard this from the climate modelers themselves and it's one of the most counterintuitive things I've come across so far. For some reason I expected that most bugs would cause numerical errors that would balloon out and lead to massive climate drift in the simulation (a la the "butterfly effect"). Clearly this just isn't so.

ernie: You suggest that asking this question is uncomfortable because no one is really sure what quality means or how to measure it in any domain. You also wondered if a case study might be a better approach to researching software quality, rather than a by-the-numbers look at defect density and such.

To your first point: Sure, ultimately I believe software quality is not a well-defined concept either. So part of the reason I asked the question though was to elicit just the sort of discussion we're having now. I don't expect a complete answer, but surely there are some indicators (or set of indicators) that we can use to get (even a partial) handle on software quality? Or is quality just one of those "I know it when I see it" situations? I don't think so.

The by-the-numbers part of my study is, yes, deeply unsatisfying to me too. That's why I've complemented it with interviews. A case study would probably be ideal, but infeasible for my masters (just getting ahold of the code took ages). Of course, Steve did a case study of Hadley, so we have some data (find his paper here: www.cs.utoronto.ca/~sme).

jon said...

robin norwood: You suggest the modeling teams work independently off of the same spec and compare how well their software matches.

Clearly this wouldn't work for assessing the existing climate models, but your idea brings up a few interesting points: One, whilst they don't compare code, the major climate modeling centres participate in massive comparison projects where they compare the output of their models under various scenarios. They do this for the IPCC reports of course, but they also do this as part of the various Model Inter-comparison Projects (see www-pcmdi.llnl.gov/projects/cmip). Not exactly what you're talking about, but I'm told that scientists "learn a lot" from these projects.

Secondly, in one interview I was told that there is a worry in the climate modeling community about the climate models becoming too similar in construction (by using the same architecture, or by sharing bits of code, or even entire modules). There is a sense of confidence garnered from the various different implementations showing similar behaviour.

regebro: You suggest that looking at the model output is enough to gauge the software quality. If the model produces accurate output then the code must be accurate.

This is a really compelling idea, isn't it? It seems hard to argue that if the model behaves in the way the climate does, then the model must be built correctly. Unfortunately, this won't work. Michael Tobis and Robin have done a great job at explaining the problem with this approach. I might just summarise by referring you to an earlier blog post (http://skoolr.blogspot.com/2009/05/on-quality-in-scientific-software.html) where I discuss Hook's characterisation of the fundamental problems with evaluating a model by its output.

The last thing to point out about climate models is that they aren't used for prediction! Not by the scientists at least. They are used as lab equipment on which to run experiments in order to do science. The scientists are often running them in configurations for which there is no real world data to compare. This is what Daniel means by the Oracle problem.

jon said...

Michael Tobis: You state that the models "are wretched pieces of engineering" that "no commercial shop would release".

On what basis do you make that claim? I've heard this before, of course; this is partly what got me interested in this line of research. So, what about them makes them so wretched? Is it poor commenting, spaghetti code, poor modularity, ... ? If so, in what way is it poor?

Also, comparing to commercial code may not be the best way to judge quality since the context and development cycle for scientific code is completely different. As I understand it, in general, the scientific code is under constant revision as new bits of science are added and changed.

Manuelg: You bring up several points, which I think boil down to something like what ernie was saying about how "quality" is ill-defined, as well as raising the issue of how good any metric will be at indicating quality. As well, you suggest that reproducibility may be a fundamental piece to the quality puzzle.

Reproducibility is an interesting one. I don't know much about this. But, ignoring for the moment that much of the code and configuration files aren't publicly available, my understanding is that inside the modeling centres there is great care taken to make sure experiments can be reproduced and restarted. There are also several projects to track the data and metadata used in various climate models and runs, partly as an aid to reproducibility (see METAFOR: metaforclimate.eu, and CURATOR: earthsystemcurator.org).

Michael Tobis said...

No commercial shop would release code that would typically expect the end user to spend weeks of work to get running. That has no bearing on the validity of the results.

As an example of what I mean by the models' design constituting a limitation:

If it were not so difficult to get the codes running, scientists would habitually ask the same questions of several models, and thereby better constrain the uncertainty of their results as they apply to the real world.

In no case does this mean that the models are or are not adequate as simulations. That is a separate question, and one which needs to be better formulated before an answer is expected. I will happily state unequivocally that some CGCMs are extremely useful for some purposes. That doesn't mean that they are a pleasure to operate, like, say, Google Maps or Aquamacs or iTunes etc...

But once you get your half-baked Ogg Vorbis player to work, you can enjoy the music every bit as much as if it were coming from iTunes, even though the latter is a brilliant piece of engineering and design and your player is not.

Michael Tobis said...

Regarding reproducibility, I am associated with Sergey Fomel and as such indirectly with Jon Claerbout. You can consider me a member of the reproducibility community. I think there is a very long way to go before the advantages of Claerbout's approach are understood, never mind broadly implemented.

The status of reproducibility within climate modeling is not good. The seismic community has some advantages, in that they are only beginning to migrate to really big iron. I hope they manage the transition better than the climate community did. We have suffered from being early adopters.

gmcrews said...

Hi Jon,

In May of 2006, a National Science Foundation (NSF) Blue Ribbon Panel issued a report on its findings and recommendations for Simulation-Based Engineering Science. Since the climate models are simulations, their findings may be relevant to your interests.

Section 3.2 of the document talks about the verification, validation, and uncertainty quantification of computer-based simulations. The section addresses the question: "What level of confidence can one assign [to] a predicted [simulation] outcome in light of what may be known about the physical system and the model used to describe it?"

To quote from the Panel's findings:

While verification and validation and uncertainty quantification have been subjects of concern for many years, their further development will have a profound impact on the reliability and utility of simulation methods in the future. New theory and methods are needed for handling stochastic models and for developing meaningful and efficient approaches to the quantification of uncertainties. As they stand now, verification, validation, and uncertainty quantification are challenging and necessary research areas that must be actively pursued.

So I think they would agree that your research is important.

About verification and validation (V&V), the report stated:

The entire field of V&V is in the early stage of development. Basic definitions and principles have been the subject of much debate in recent years, and many aspects of the V&V remain in the gray area between the philosophy of science, subjective decision theory, and hard mathematics and physics.

On the subject of validation, the report states:

The twentieth century philosopher of science Karl Popper asserted that a scientific theory could not be validated; it could only be invalidated. Inasmuch as the mathematical model of a physical event is an expression of a theory, such models can never actually be validated in the strictest sense; they can only be invalidated. To some degree, therefore, all validation processes rely on prescribed acceptance criteria and metrics. Accordingly, the analyst judges whether the model is invalid in light of physical observations, experiments, and criteria based on experience and judgement.

And about verification the report states:

Verification processes, on the other hand, are mathematical and computational enterprises. They involve software engineering protocols, bug detection and control, scientific programming methods, and, importantly, a posteriori error estimation.

A more recent consensus is the 2009 WTEC Report which also has a section on validation, verification, and uncertainty quantification. The WTEC Report contains a lot more detail than the NSF Blue Ribbon Panel Report. However, not much has changed. This later Report notes that:

There are currently no funded U.S. national initiatives for fostering collaboration between researchers who work on new mathematical algorithms for V&V/UQ frameworks and design guidelines for stochastic systems.

Michael Tobis said...

V & V is a crucial question, and the way it shakes out in climate simulation is interesting.

My guess is that commercial software quality metrics have relatively little bearing on simulation V&V, save as symptoms of good or bad training in software engineering. The type of bug that would break a simulated climate is different from the type of bug that would cause problems at Amazon, and given the way climate simulation models are built, it's hard to imagine automatic flagging of such bugs.

It is possible to imagine better ways to develop the models, though.

DixieCupUA said...

Hi Jon, your post reminded me of a test of software quality in a related field, ecological niche modeling, which uses a lot of the same data as the climate models you are discussing. A few years ago, as lots of new software was being developed, the National Ecological Synthesis Center had a bake-off to compare the strengths and weaknesses of each piece of software: "Novel methods improve prediction of species’ distributions from occurrence data", Elith et al. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.137.940&rep=rep1&type=pdf

Does accuracy in testable predictions count in assessing software quality?

Jed said...

I saw this post on reddit and replied there on the challenges of defining quality and also software design as opposed to correctness. I'm interested to hear what others here think on this subject so I'll watch this thread as well.

Mitch said...

Defect densities speak to the code implementation, but really have nothing to do with the software design. That’s one of the biggest issues in our industry, or at least in my experience as a software development professional in this crazy biz for 20 years. My claim is that we do not know how to design software. Whether your implementation is buggy or not is totally irrelevant if the design is wrong in the first place.

You might be interested in the following two links:

Software Abstractions

The Lost Art of Software Engineering

jstults said...

The department of energy has sponsored lots of work on verifying and validating physics codes. See the work by PJ Roache on the 'method of manufactured solutions' for formally verifying the correctness of PDE solvers. Validation just requires comparison of model predictions to empirical observation, and deciding that a code is 'validated' is then just a statistical inference.

The steps are straightforward:
1.) Unit tests with good coverage (this isn't mandatory, just good 'software carpentry')
2.) Verification with the method of manufactured solutions
3.) Validation with comparison to measurements and appropriate statistical tests
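
To make step 2 concrete, here is a minimal sketch of the method of manufactured solutions on a toy problem of my choosing, not on anything from a real physics code: the 1D equation -u'' = f on (0, 1) with u(0) = u(1) = 0. We pick the solution u(x) = sin(pi x) in advance, derive the source term f(x) = pi^2 sin(pi x) that makes it exact, solve with a second-order finite-difference scheme on two grids, and check that the error shrinks at the expected rate. The function names and grid sizes are illustrative assumptions.

```python
import math


def solve_poisson(n, source):
    """Solve -u'' = source(x) on (0, 1) with u(0) = u(1) = 0, using
    second-order central differences on n interior grid points."""
    h = 1.0 / (n + 1)
    xs = [(i + 1) * h for i in range(n)]
    # Discrete system: -u[i-1] + 2*u[i] - u[i+1] = h^2 * source(x[i])
    diag = [2.0] * n
    rhs = [h * h * source(x) for x in xs]
    # Thomas algorithm, forward sweep (both off-diagonals are -1).
    for i in range(1, n):
        factor = 1.0 / diag[i - 1]
        diag[i] -= factor
        rhs[i] += factor * rhs[i - 1]
    # Back substitution.
    u = [0.0] * n
    u[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        u[i] = (rhs[i] + u[i + 1]) / diag[i]
    return xs, u


def manufactured_solution(x):
    """The solution we decide on in advance."""
    return math.sin(math.pi * x)


def manufactured_source(x):
    """Source term chosen so that -u'' = source holds for u = sin(pi x)."""
    return math.pi ** 2 * math.sin(math.pi * x)


def max_error(n):
    """Largest pointwise error against the manufactured solution."""
    xs, u = solve_poisson(n, manufactured_source)
    return max(abs(ui - manufactured_solution(x)) for x, ui in zip(xs, u))


def observed_order(n_coarse=15, n_fine=31):
    """Estimate the convergence order from the errors on two grids."""
    h_coarse = 1.0 / (n_coarse + 1)
    h_fine = 1.0 / (n_fine + 1)
    ratio = max_error(n_coarse) / max_error(n_fine)
    return math.log(ratio) / math.log(h_coarse / h_fine)


if __name__ == "__main__":
    print("observed order of accuracy:", observed_order())
```

For a correctly coded second-order scheme the observed order should come out close to 2; a coding error in the stencil or the solver would typically show up as a lower order or as an error that fails to shrink.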

jstults said...

Jon said:
Michael Tobis: You state that the models "are wretched pieces of engineering" that "no commercial shop would release".

On what basis do you make that claim? I've heard this before, of course; this is partly what got me interested in this line of research. So, what about them makes them so wretched? Is it poor commenting, spaghetti code, poor modularity, ... ? If so, in what way is it poor?


What gets produced by individual investigators or even small research teams tends to be what you software guys might consider throwaway code, but we don't throw it away...
