Friday, November 26, 2010

Anecdotal evidence and the Problem of inferring Cause

I'm finally back running after taking a nearly 5 week hiatus due to a self-diagnosed metatarsal stress fracture. The fracture came on quite suddenly during a run at Twin Brook. I completed the run, hoping it would go away. It didn't and I took a break, which was easy enough since all my major fall races were over.

I'm quite confident that the cause of the stress fracture was racing the 5K XC race 6 days after my first marathon and then the PT8k a week after that. The body is very good at adapting to stresses and the stress that I was giving my body all summer and fall was running at a steady, marathon (6:40-6:50) pace. Other than the few fun "sprints" (really about 400-800 meters) at Twin Brook, I hadn't run faster than 6:30 for three months, and faster than about 6:05 since Spring. My average pace at the 5K XC race was about 6:15 and the flat and especially downhill sections were significantly faster. My average pace at the 8K was 5:58. My feet weren't adapted for this pounding and the result was the stress fracture.

Wait, scratch that paragraph and read this instead:

I'm quite confident that the cause of the stress fracture was due to low bone density because of Vitamin D deficiency, as I've unintentionally cut back on the amount of milk that I drink at dinner (replaced largely with wine). Vitamin D deficiency is very, very common and Vitamin D fortified milk is a great source. Vitamin D increases Calcium absorption in the gut. With lower gut absorption, my low blood Calcium levels triggered Parathyroid hormone release, which activates the cells (osteoclasts) that remove the bony matrix from bones, releasing calcium to the other tissues.

Wait, scratch that and read this instead:

I'm quite confident that the cause of the stress fracture was due to running a road marathon two weeks before the bone finally fractured. I run mostly trails. Indeed, in training for the marathon I only ran two long runs (one 16-miler, one 18-miler) on the roads. All my other long runs were on trail. I did run a weekly 10 mile marathon-pace run on the roads. But that was it. Our body is very good at adapting to the stresses we give it and I made the mistake of running a road marathon without training enough on the roads. While I made it through the marathon without incident, I clearly stressed my metatarsals enough that all it took was a little more running to give me the stress fracture.

Wait, scratch that and read this instead.

I'm quite confident that the cause of the stress fracture was due to changing my form only three weeks before running my first marathon. I was watching a good running video and reading some literature on efficient running kinematics and decided that I didn't let my trailing leg extend enough before toe off (actually it's extension at the hip). I worked on this new form both before and during the marathon and two post-marathon races. The more extended hip places additional stresses on the more dorsiflexed foot. Our body is very good at adapting to the stresses we give it and I made the mistake of racing the marathon and two post-marathon races using my new form without giving my body enough time to adapt.

Wait, scratch that and read this instead.

I'm quite confident that the cause of the stress fracture was due to trying out a new, minimalist shoe that I had received for review. I typically run in racing flats of various sorts, such as the NB 790 and 100 on the trail and the Asics Piranha and Hyperspeed on the road. Prior to this year, I also typically did 1-2 barefoot cooldowns per week but because of a bruised left forefoot (from landing on a rock while running downhill) I had not done any barefoot runs at all this summer or fall. The new minimalist shoes have zero cushion - they are simply a Vibram sole with some cloth that wraps the foot. After receiving the shoes, I immediately went out and did some 5-8 mile runs. Our body is very good at adapting to the stresses we give it but I made the mistake of running in the cushionless, minimalist shoes without properly building up to the novel stresses that these placed on my foot.

Wait, scratch that and read this instead.

I'm quite confident that the cause of the stress fracture was running with tight calf muscles following my first marathon. This calf tightness was new to me, so I was surprised to have it persist for over two weeks - up until the day that I stopped running in fact. I have no idea what the connection between tight calves and metatarsal stress fracture is, but I've repeatedly read on the web that tight calves are a risk factor for metatarsal stress fracture.

OK, ok. What is the point of all of this? Quite simple - all of these explanations make plausible stories. And importantly, that's what humans do, we create stories to explain events. The actual cause of my stress fracture may be one of these that I've listed, or some combination, or none. I simply don't know. Running injuries have complex causes and there are always numerous, plausibly causal antecedents. Why do people tend to assign cause to a single antecedent? I would argue it's the antecedent most consistent with the person's world view. So we often hear: "this hip muscle imbalance gave me this knee injury" or "this drink made me recover faster" or "this workout made me run faster." Quite remarkably, world view often trumps plausibility.

I don't really care that individuals make up stories to explain events in their life. We all do this. It's part of being human. The problem is when these anecdotal stories determine how professionals in the health sciences - broadly and liberally defined - practice health care.

Monday, November 22, 2010

Is high fructose corn syrup (HFCS) eviler than sugar?

This is a long critique of Bocarsly et al 2010 “High-fructose corn syrup causes characteristics of obesity in rats: Increased body weight, body fat and triglyceride levels” published in Pharmacology, Biochemistry, and Behavior, which is edited by G.F. Koobs (full paper here).

I am moved to write this critique not because I have a dog in this fight but because I teach Human Anatomy & Physiology to students who come in with very rigid misconceptions about health and disease and who will communicate these misconceptions to their patients/clients/athletes unless the misconceptions are addressed in the classroom and on the web.

The title of the paper should not raise an eyebrow among physiologists. Indeed, the role of carbohydrates and especially sugars regulating appetite and satiation would seem to be a fascinating and important field of study. Even the abstract is not at all controversial. But the paper doesn’t simply claim that access to HFCS makes rats fat but that this effect is found only in HFCS and not sucrose. And it is this claim that was promoted in the first sentence of the Princeton University press release “A Princeton University research team has demonstrated that all sweeteners are not equal when it comes to weight gain: Rats with access to high-fructose corn syrup gained significantly more weight than those with access to table sugar, even when their overall caloric intake was the same” and emphasized by the PI, Bart Hoebel (from the same release): "Some people have claimed that high-fructose corn syrup is no different than other sweeteners when it comes to weight gain and obesity, but our results make it clear that this just isn't true, at least under the conditions of our tests," said psychology professor Bart Hoebel.

The claim that the metabolic consequences of HFCS and sucrose differ radically should raise eyebrows among physiologists. While I’d be skeptical of any results that find a difference, I would find good evidence for differences fascinating and it would make a great conversation in my A&P class. I’m not a carbohydrate physiologist but I can, nevertheless, come up with three hypotheses that might explain differences between HFCS and sucrose. Differences between sucrose and HFCS on oral receptors or the small differences in the proportions of glucose to fructose in the stomach and small intestine have (1) large effects on carbohydrate and lipid metabolism or (2) the hormonal regulation of appetite or (3) the hormonal regulation of activity metabolism. Hypotheses (1) and (2) seem unlikely given that the authors also claim that the total caloric intake did not differ between rats given access to HFCS and sucrose. So if I were a metabolic physiologist, I might be tempted to look into how HFCS makes rats less active.

Except that I wouldn’t because after reading the paper, I would note that the experimental design, statistical analysis, and interpretation of the results in Bocarsly et al 2010 are deeply flawed. Indeed, the elementary flaws in the statistical analysis and the absence of any discussion of the glaring paradoxes in the results, combined with a much higher rigor of analysis and interpretation in the senior authors’ previous papers, raise several ethical questions that will be addressed at the end.

Finally, I am not taking any time to better organize this critique because it has taken far too long to simply comment on all of the errors. There are also numerous small errors in presentation, analysis, and interpretation that I do not raise here because of time and space. It's also written as if I were asked to peer review it for the journal. Which is a good thing because while the ms may have been peer read and okayed, it certainly wasn't peer reviewed.

Flaws in the design and statistical analysis of Experiment 1

Experimental Design. The experiment lacks a 24-h Sucrose treatment and thus any interpretation of the 24-h HFCS treatment confounds two potential factors, time (12-h v. 24-h) and sugar (HFCS v. sucrose).
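The confound can be made concrete by laying out the sugar treatments as a two-factor grid. This is only a minimal sketch: the variable names are mine, and the three sugar cells are simply the sugar treatment levels of Experiment 1 as described in this critique (the chow-only control sits outside the grid).

```python
from itertools import product

# Two factors that a clean design would cross with each other.
sugars = ["HFCS", "sucrose"]
access_hours = [12, 24]

# Sugar treatment levels actually run in Experiment 1 (chow-only control excluded).
observed = {("HFCS", 12), ("HFCS", 24), ("sucrose", 12)}

# Any cell of the full 2 x 2 grid that was never run.
missing = [cell for cell in product(sugars, access_hours) if cell not in observed]
print(missing)
```

With the 24-h sucrose cell empty, the 24-h HFCS group differs from the only sucrose group in both sugar and access time, so any difference between them cannot be assigned to either factor alone.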

Statistical Analysis
Result 1. “Animals with 12-h 8% HFCS access gained significantly more weight in 8 weeks than animals with 12-h 10% sucrose access (F(2,25) = 3.42; P less than 0.05).” The authors do not give test results for any of the other comparisons. The authors do not give either the effect size or the raw or percent increase in weight in any of the treatment levels. The authors do not provide a chart showing the response by treatment level. The authors do give a table of the final weight of the four treatment levels, which is worth looking at here, but will be discussed in more detail below.
We are not given what the errors are (Given the one reported F statistic, I think it’s safe to assume that these are one standard error of the mean). Of course, to analyze or interpret this table, we would need either the initial mean weights or make the assumption that the initial mean weights were exactly the same. The authors do state that the initial groups were weight matched, so we’ll have to assume that the match was close enough to not affect the statistics. The asterisk for the 12-h HFCS treatment level indicates P less than 0.05 for the comparison with the chow only treatment level. Note also that the asterisk signifies a different comparison than the significant result reported in the text (and quoted above).

For the one comparison with statistics given (12-h HFCS v. 12-h Sucrose), the actual P value for F(2,25) is 0.049, which is, indeed, less than the traditional acceptable type I error rate (0.05) but doesn’t give me much confidence in the conclusions given the claim that 55% fructose has a metabolic effect while 45% fructose does not. Given that table one is the raw endpoint weight, I also have less than perfect confidence that the statistics were computed on the actual weight gain (or percent weight gain) and not simply on the endpoints. Were the P-value really low, this wouldn’t matter. Given that the P-value is 0.049, my concern matters.

Regardless, there are two egregious errors with the F statistic and the interpretation of the associated P-value. First, the F statistic is compared to a distribution using 2 and 25 degrees of freedom (df) but given two means and 10 animals per group the correct df should be 1 and 18. Assuming F = 3.42, the correct P-value, with 1 and 18 df, is 0.081, which fails to reject the null hypothesis if we follow the traditional frequentist interpretation. Given the authors’ report of 2 and 25 df, I’m not confident in the value of the F statistic (3.42) itself as the df suggests that some other combination of individuals might have been used. Here’s one interpretation: the F statistic is based on an ANOVA with both HFCS treatments and the Sucrose treatment in the model and there is missing data. If this is what was done, then the P value says nothing about which one of the treatment levels differs, only that there is a difference somewhere. This is what posthoc tests are for. This is textbook Stats 101. But maybe I’ve guessed incorrectly how they came up with an F test with 2 and 25 df.
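For readers who want to check the two P-values themselves, here is a minimal sketch that uses only the Python standard library. The function names are my own, and the incomplete beta function is evaluated with a textbook continued-fraction expansion rather than whatever software the authors used, so treat it as an illustration of the df argument and not a reconstruction of their analysis.

```python
import math

def _beta_cf(a, b, x, max_iter=200, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b

def f_upper_tail(f, df1, df2):
    """P(F > f) for an F distribution with df1 and df2 degrees of freedom."""
    return reg_inc_beta(df2 / 2.0, df1 / 2.0, df2 / (df2 + df1 * f))

p_reported = f_upper_tail(3.42, 2, 25)   # df as printed in the paper
p_two_group = f_upper_tail(3.42, 1, 18)  # df for two groups of 10 animals
print(round(p_reported, 3), round(p_two_group, 3))
```

The same F statistic lands on opposite sides of 0.05 depending on which pair of df is used, which is why the unexplained 2 and 25 matters.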

Second, and most importantly, because this is relevant to all of their results, the authors either fail to control for type I error or make no mention of doing so, using something like a Tukey-HSD test, a test the authors have used previously. Given four means, there are six possible pairwise comparisons, all of which are orthogonal and all of which are of interest. The type I error rate given six orthogonal comparisons is 26.5%, well above the 5% assumed by the authors.
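The 26.5% figure follows from the usual formula 1 - (1 - alpha)^k for k tests, which treats the tests as independent. A minimal sketch of that arithmetic (the function name is mine):

```python
from math import comb

def familywise_error_rate(n_groups, alpha=0.05):
    """Number of pairwise comparisons among n_groups means, and the chance of
    at least one false positive if each is tested at alpha and the tests are
    treated as independent."""
    n_comparisons = comb(n_groups, 2)
    return n_comparisons, 1.0 - (1.0 - alpha) ** n_comparisons

n_comparisons, fwer = familywise_error_rate(4)
print(n_comparisons, round(fwer, 3))
```

With four treatment levels this gives six comparisons and a familywise rate a little above one in four, which is the inflation that Tukey's HSD and similar procedures are designed to remove.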


From table 1, I have computed the Tukey’s HSD for each comparison, as well as the more powerful Ryan’s Q, the Games & Howell modification of Tukey HSD for heterogenous variances, and a simple (naive) t-test with the SE and df computed for that comparison only. These computations assumed a sample size of 10 rats per group, even though there are several statistics that the authors report that suggest that they have missing data (not 10 rats per group), so my P-values will be liberal.
I’ll discuss more of this table later but I want to focus on the single comparison reported by the authors, which was 12h-HFCS v. 12h-Sucrose. My simple t-test P-value is slightly less than that computed by the authors, although as discussed above, I’m not sure how they computed F or why they used 2 and 25 df. Regardless, this P-value is less than 0.05, the traditionally acceptable type I error rate. Importantly, the Tukey HSD test and the more powerful Ryan’s Q test fail to reject the hypothesis of no difference between 12h-HFCS and 12h-Sucrose.

At best, we can conclude from experiment 1 that there is some evidence that the rats with 12 hour access to HFCS gained more weight than rats in the chow only treatment level. We cannot make this conclusion for the 24h-HFCS rats nor can we conclude that there was a difference in gained weight between the 12h-HFCS and 12h-Sucrose rats. I discuss various interpretations of these results below. While the source of the mistake in the df of the F statistic is a mystery, the failure to account for multiple post-hoc tests is disturbing, especially given the prior use of this adjustment in previous papers by the first author (a graduate student), the 3rd author (a post-doc), and the last author (the PI of the lab). The inflation of type I error with multiple tests and the methods to control this error are not esoteric minutiae only known to the statistical cognoscenti but are covered, often in great detail, in all undergraduate textbooks. The authors cannot claim ignorance given their prior use of the Tukey-HSD. I’ll leave it to the authors to explain why they ignored the Tukey-HSD in this paper.

Result 2. “There was no overall difference in total caloric intake (sugar plus chow) among the sucrose group and two HFCS groups.” Remarkably, the authors do not give any statistics to support this. There is no P value, or test statistic, or effect size, or confidence limits, or group means, or a chart showing total caloric intake as a function of treatment level. No statistics are reported for any of the other sorts of comparisons. The statement is simply baldly asserted with no evidence. Given the authors’ assertion that the HFCS rats gained more weight than the sucrose or chow only rats, the statement that the total caloric intake did not differ is a really extraordinary result (with emphasis on the extra). In his reply to Marion Nestle’s concern with the study, the PI and last author states “As commonly done, we did not present the overall caloric intake since there was no difference between groups.” For some uncontroversial result, we might state something like (statistics not given) but the authors are arguing that the 12-hr HFCS rats gained more weight but consumed the same number of calories as the 12-hr Sucrose rats, which would be a remarkable result and suggest interesting differences in energetics. This statement begs for evidence, but the authors gave us no data to evaluate the claim. The reason why editors and reviewers (except apparently in this paper) ask for test statistics and sample size and effect sizes and standard errors is because careful readers need these to evaluate the results! Otherwise, we can say whatever fits our view of the way the world works. Was this paper even reviewed?

Result 3. “Even though the 12-h HFCS group gained significantly more body weight, they were ingesting fewer calories from HFCS than the sucrose group was ingesting from sucrose (21.3 ± 2.0 kcal HFCS vs. 31.3 ± 0.3 kcal sucrose; F(1,16) = 12.14; P less than 0.01).” Again no attempt to control for Type I error rate. If this was the only comparison they were interested in, why do the other treatment levels? The F statistic here also suggests there are missing data, which supports my interpretation of the F(2,25) statistic above. The authors state that there were 10 rats per treatment level so there should be 18 df in the F test, not 16.
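The df bookkeeping behind this complaint is simple enough to sketch. The helpers below are my own and assume a one-way ANOVA with one observation per animal, which is my guess at what the authors ran.

```python
def anova_df(group_sizes):
    """Numerator and denominator df for a one-way ANOVA with the given group sizes."""
    n_groups = len(group_sizes)
    n_total = sum(group_sizes)
    return n_groups - 1, n_total - n_groups

def implied_design(df_between, df_within):
    """Number of groups and total animals implied by a reported F(df_between, df_within)."""
    n_groups = df_between + 1
    return n_groups, df_within + n_groups

expected_df = anova_df([10, 10])      # two groups of 10 rats, as stated in the paper
result_3 = implied_design(1, 16)      # the F(1,16) reported for Result 3
result_1 = implied_design(2, 25)      # the F(2,25) reported for Result 1
print(expected_df, result_3, result_1)
```

Under that assumption, F(1,16) implies 18 animals across two groups rather than 20, and F(2,25) implies 28 animals across three groups rather than 30, which is what I mean by missing data.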

Missing Results.

The authors give almost no statistics about the other pairwise comparisons between the treatment levels in experiment 1 (that is other than the 12h-HFCS v. 12h Sucrose levels). They do give table one with endpoint weights and standard errors and the one asterisk for the 12h-HFCS vs. Control comparison, so I guess we can infer that the other sugar treatment levels do not differ from the Control but what about 24-h HFCS v. 12h Sucrose or 24-h HFCS v. 12-h HFCS? Where are these statistics? Again, there is no attempt to control for Type I error. Above, I give a table of all pairwise comparisons for endpoint weight, but the authors don’t give the data for me to do something similar for sugar calories or total calories. The relevant statistics are simply not given. Importantly, the authors discuss some of the pairwise comparisons in the discussion but the discussion is not based on any statistical evidence presented in the results!

It is important to point out that the measured HFCS effect seems to occur only in the 12-h HFCS rats and not the 24-h HFCS rats, which, if real, would raise some interesting physiological issues. It's also important to point out that the opposite pattern is found in the long term experiment. The authors fail to report either of these paradoxes/contradictions, even though they jump out of Table 1. Marion Nestle also raises this concern. Hoebel, the PI, responds to Nestle with the statement “Actually, it is well known that limited access to foods potentiates intake. There have been several studies showing that when rats are offered a palatable food on a limited basis, they consume as much or more of it than rats offered the same diet ad libitum, and in some cases this can produce an increase in body weight. So, it is incorrect to expect that just because the rats have a food available ad libitum, they should gain more weight than rats with food available on a limited basis.” First, why didn't the authors discuss this in the paper, since they have an obviously paradoxical pattern? Second, if it's so well known, why wasn't this a prediction or hypothesis from the beginning? Third, Hoebel doesn't address the contradictory, opposite patterns between the short term and long term results (which unfortunately also is confounded by sex, since we cannot compare the long term Sucrose males since this treatment was ignored).

Flaws in the design and statistical analysis of Experiment 2

Design of the experiment.
Flaw 1. The authors do not have a 12-h sucrose treatment level in the males “since we did not see effects of sucrose on body weight in Experiment 1 with males.” Uh, there was also no treatment effect for the 24-h HFCS male rats but these were included in experiment two. And presumably there is a reason for conducting a longer term experiment, such as finding a long term effect in 12-h Sucrose males, but this possibility is precluded without including the level in the design. And, since the goal of the paper is to show that HFCS has different consequences than Sucrose, and no Sucrose treatment was done in Experiment two male rats, the whole experiment (on males at least) becomes irrelevant to the paper.

Flaw 2. In the female experiment two, there are four treatment levels but the access to Chow differs between these levels (in two, rats have 24-h access to chow but in the other two, rats have only 12-h access to chow). So both sugar and chow are being manipulated but within a single factor design.

Statistical Analysis

In General. Again, the authors fail to control for type I error using Tukey HSD or similar method. I’ve re-analyzed the endpoint weight data from Table 1 (table above) using Tukey HSD, the more powerful Ryan’s Q, and the Games-Howell modification of Tukey to account for variance heterogeneity. I also add a simple t-test calculated assuming no other groups other than the two under comparison.

Experiment 2 males.
Result 1. The authors report a difference in both HFCS treatment levels and the Control treatment. Interestingly, they only give one statistic to support this (F(1,14) = 5.07; P less than 0.05), so I’m not sure which comparison this applies to. The df are correct and suggest there are no missing data. Note also that in contrast to Experiment one, the 24-h HFCS rats gain more weight than the 12-h HFCS rats (this is not statistically significant).

Result 2. “The difference in body weight was significant by week 3 (F(2,21) = 4.44; P less than 0.05).” Again the authors report a single statistic for what should be two comparisons. The F statistic with 2 and 21 df suggests that this was based on an ANOVA with all three groups in the model (in which case the df are correct). Again, this P-value does not tell us which of the groups differ from which.

Result 3. “As an indication of obesity, the rats with 24-h or 12-h HFCS had significantly heavier fat pads than control rats (F(4,35) = 13.01; P less than 0.01; Fig. 4).” Although all fat pads were heavier, this effect was most pronounced in the abdominal region (F(4,35) = 8.36; P less than 0.05; Fig. 2). It is not clear to me what test the authors did to get F with 4 and 35 df. Again, no post-hoc tests are done.

Experiment 2 females

Result 1. “female rats with 24-h access to HFCS for 7 months gained more body weight than chow- and sucrose-fed controls (F(1,14) = 8.74, P less than 0.01).” Again there are two comparisons but one statistic. For the 24-h HFCS v. Sucrose, I did not find a statistically significant difference even using a simple t-test that doesn’t account for inflated Type I error (my table above). Using Tukey’s HSD, neither comparison is significant (my table above).

Result 2. The authors do not report on other comparisons of weight and, importantly, ignore the result that the 12-h HFCS rats gained less weight than both Control and 12-h Sucrose (not statistically significant using any method). This pattern is the opposite of that in Experiment 1, as discussed above.

Result 3. For the comparison of fat pad weight, I still do not know what test they used to get F with 4 and 35 df.

Repeated misrepresentation of results in the Discussion.

The authors state in the introduction to the Discussion that the experiment 1 “male rats with access to HFCS drank less total volume and ingested fewer calories in the form of HFCS (mean = 18.0 kcal) than the animals with identical access to a sucrose solution (mean = 27.3 kcal), but the HFCS rats, never the less, became overweight.” Actually the HFCS rats didn’t ingest fewer calories in the form of HFCS than the Sucrose rats because the latter had none, by design. If the authors meant the HFCS rats ingested fewer calories from the sugar-water, they should have said this. More importantly, the take home message is that the HFCS rats became overweight but the Sucrose rats did not. If we fail to account for multiple tests, this is true only for the 12-h HFCS rats; indeed the 24-h HFCS rats gained less weight on average than the 12-h Sucrose rats. If we account for multiple tests, then none of the treatment levels had a statistically significant effect. But the authors are not simply cherry picking but are explicitly misrepresenting their results: “In these [experiment 1] males, both 24-h and 12-h access to HFCS led to increased body weight.” Actually, no. Even the authors’ own table 1 shows this is not true for the 24-h HFCS treatment level. More disturbingly, this practice of cherry picking and misrepresentation is a consistent pattern throughout the discussion.

“In Experiment 2 (long-term study, 6–7 months), HFCS caused obesity greater than that of chow in both male and female rats.” This is true only for the male rats. In fact, the 12-h HFCS female rats gained statistically insignificantly less weight than the Control (chow only). Again, the authors select only the results that are consistent with one conclusion. Interestingly the authors do not compare the HFCS treatments to the sucrose treatment here. Except that they did in the originally published version, which had the same sentence but with “sucrose” instead of “chow”. Yes, in the originally published paper, the authors stated that the long-term HFCS male rats gained more weight than the long-term Sucrose rats, even though the latter weren’t even in the experimental design. The substitution of “chow” for “sucrose” avoids the embarrassing misrepresentation but still is empirically incorrect.

“Rats with HFCS access gain more weight than sucrose-consuming rats, even when ingesting fewer calories from their respective sugars”. Again more selective cherry picking of the results.

Closing thoughts

This paper has an unusually rich number of errors in statistical design and analysis, selective picking of results that match what can only be a preferred outcome, and outright misrepresentation of the design and results. The senior authors, the editor handling the paper, the editor-in-chief, the reviewers, the Princeton University press release team, and any science blogs and journalists that uncritically parroted the press release should simply be ashamed.