Wednesday, October 18, 2006
Iraqi Death Survey Wrap-Up
1. The construction of the survey
2. The conduct of the survey
3. The analysis of the results
(I also add, here, that Iraq Body Count's criticism of the results of the study, based on what I would describe as face validity, seems to me to be very compelling. I won't address those issues, as I'm in no way competent to offer an analysis that could compete with that of Iraq Body Count.)
1. The construction of the survey:
A) Steven Moore makes much of the fact that only 47 clusters were used, arguing that this is far too few, given the extremely non-uniform distribution of violent deaths in Iraq. He may well be right; he’s certainly more experienced in surveying techniques than am I. Several opposing voices have pointed out that Mr. Moore himself has used only 75 clusters in similar situations, and others have used 150, or whatever. This sort of analysis quickly gets out of the realm of statistics and into polemics and name-calling. My problem with the number of clusters has to do with the assumed stratification of the population.
This survey was, at the top level, a stratified survey. Iraq was divided into its Governorates, and the number of clusters chosen per Governorate was decided by population. Evidently, the authors had reason to believe that there might be significant differences in death rates between Governorates (which was confirmed by their own results.) Unfortunately, in all but two Governorates, three or fewer clusters were selected from that Governorate. In several cases, only a single cluster was selected. How can one possibly control for the possibility of getting a very unrepresentative cluster, when a sample of a single cluster is used? The authors say that they did comparisons between clusters and within clusters. Within clusters, I’ll grant you. But between clusters? Evidently between clusters from different strata. This makes no sense. If you stratify a population, it is because you are assuming, a priori, that there may be significant differences between strata (Governorates.) You can’t then turn around and compare between strata to attempt to identify, or compensate for, a single cluster as being representative or unrepresentative of that particular stratum. In the famous words of Kwame Nkrumah upon his removal by coup as the first President of Ghana, “You can’t compare one thing.”
When a stratified sampling is used, it is common practice to use a large enough sample to get several draws (even if each draw consists of a cluster of individual samples) from each stratum.
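The "you can't compare one thing" point can be put in a few lines of Python. This is only a sketch: the stratum names and death rates below are invented for illustration and are not taken from the study. It shows that the spread between clusters inside a stratum is simply undefined when that stratum contributes one cluster.

```python
import statistics

# Hypothetical death rates (per 1,000 per year) for the clusters drawn
# from each stratum. These figures are invented, not from the survey.
clusters_by_stratum = {
    "A": [4.0, 9.0, 6.5],  # three clusters: spread can be estimated
    "B": [12.0],           # a single cluster: nothing to compare it with
}

def within_stratum_variance(rates):
    """Sample variance across clusters, or None when it is undefined."""
    if len(rates) < 2:
        return None  # n - 1 = 0 degrees of freedom
    return statistics.variance(rates)

variances = {
    name: within_stratum_variance(rates)
    for name, rates in clusters_by_stratum.items()
}

for name, var in variances.items():
    print(name, var)
```

With a single draw there is no within-stratum variance to estimate, so there is no internal way to tell whether that one cluster is typical of its Governorate.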
B) The method of selecting named main streets, followed by named cross streets is certainly not random, and quite possibly not representative.
I don’t know much about what proportion or which streets are named in Iraq. But I have had several experiences which lead me to question whether the distribution of officially named and recorded main streets and cross streets is uniform enough to use as a basis for a random selection procedure.
How many streets are named? In rural America, where I now live, the answer is “almost all.” But even ten years ago, the answer was “most in some places, none in others.” The change came about due to the 911 emergency calling system. Here in my town of 300, the locals laugh that a UPS guy can find your house from the address, but none of the citizens could. My official address is XYZ 3rd St. (has been officially so named for about 8 years), but if I want to tell anyone where I live, I have to say “the old Hoffman house.” I would suspect that much of Iraq still doesn’t have streets (main or cross) that would be listed in an official directory.
Back when I lived in Liberia, we conducted health and demographic surveys in conjunction with the national vaccination campaigns. I was just a foot soldier, working the villages of Grand Cape Mount County, and have no idea how the cluster selection process was done, but I can guarantee that it wasn’t done by street name. Outside of Robertsport, there wasn’t a named street in all of the county. At that time, I would guess that fewer than 50 communities in the entire nation had named streets, and that, even in the most advanced places, like Monrovia, fewer than 50% of the population lived on a named street, and those who did, were distinctly non-representative on many levels. On the urban side, the most densely populated part of Monrovia was an area called West Point. I lived there for about two months. I estimated the population at the time at about 30,000 – and there wasn’t a single street, named or otherwise, in the entire slum. I’m thinking Sadr City, Baghdad looks a lot like West Point, Monrovia in that respect.
My guess is that the systematic selection of only named cross streets to named main streets as listed in an official directory will systematically exclude broad segments of the Iraqi population, namely the rural, the urban poor, and the internally-displaced refugees. Whether this systematic bias skews the survey results up or down or is neutral, I don’t know, nor does anyone else (if they did, then why the hell would anyone be doing the survey?) The fact is, that it’s bad statistical methodology to use a systematic selection method that consistently biases against particular large demographics.
Several posters to Tim Blair’s blog (see link below) brought up a further problem with the selection process. Although it’s a bit unclear from the description in the articles, it appears that all of the clusters were chosen from a named cross street, at a distance fairly close to a main street. The report says that “a house was chosen at random,” but does not specify how that random starting place was selected, or what the maximum distance from the main street the starting point would be. They (the blog posters) suggest that violence might be concentrated nearer the main streets and that therefore the incidence of violent deaths would be higher close to a main street. Again, there’s no way to know that, but, again, systematic bias is bad practice, and can lead to results far outside of the “margin of error” (which, of course, is constructed assuming an absolute absence of non-sampling, or “methodological,” error.)
2. The conduct of the survey:
Here’s where my analysis gets a bit dicey, but I’ve got to point out what I see, and here’s where my experience of many years as a statistics teacher, supervising and grading student projects, leads me to grave doubts about what the survey teams actually did and didn’t do.
A) The response rates reported by the teams, in terms of their success at finding a head of household or spouse at home and willing to participate are just amazingly, extremely, insanely, unbelievably high, especially given the fact that the teams never once paid a second visit to a household, due to the dangers they were facing, working in a war zone, and apparently worked throughout the day, rather than confining their visits to times when respondents would be likely to be home (see the time constraint concern, below.)
The authors report that in 99%+ of the households, someone was at home. They also claim that in only one cluster were any empty households found among the 40 adjacent households surveyed. They phrase it so as to insinuate that they are minimizing the death estimates:
“Households where all members were dead or had gone away were reported in only one cluster in Ninewa and these deaths are not included in this report.”
This quote has become a favorite among some blogs as showing that, in fact, the real numbers must be much higher than those in the survey, since any annihilated households have been discounted. It's definitely true that the phrase “were dead or had gone away” followed by the “these deaths” clearly implies that the researchers have reason to believe that the former occupants of the households in question were all dead (without explicitly stating so.) But what bothers me is the implication that vacant houses were supposedly encountered in only one cluster in all of Iraq, and, by insinuation, none vacated by emigration. With estimates of over a million recent emigrants from Iraq entirely, and up to that many again internally displaced persons (together pushing 7% of the total population of Iraq), one would have expected to see more, and more widely distributed, vacant houses – even if no entire households had been annihilated. The next question becomes, what’s the difference between a ‘household where all members were dead or had gone away’ and the ‘16 (0.9%) dwellings [where] residents were absent.’ The latter evidently includes households where the surveyors believed that someone was still living, but no one was home when they knocked (or so I’m guessing.)
In any event, the fact that 7% or more of Iraqis have vacated their homes (for other parts of Iraq, or other parts of the world, or Heaven,) and yet that less than 1% of the households surveyed found no one at home, is very suspicious to me. If nothing else, it calls into question the representativeness of the sampling. The <1% "not at home," even absent the contributing concerns, raises all kinds of red flags for me.
B) Among those that were at home, only “15 (0.8%) households refused to participate.” Given the purported methodology of the survey, this must also include any households where some members were home, but not the head of household or spouse, since we’re guaranteed that the head of household or spouse were the only ones questioned. So we’re left (combining A and B) with the astounding result that in more than 98% of the attempted contacts, the head of household or spouse was at home and willing to participate in the survey, and this, without a single call-back, since attempting a second contact with a household was deemed too dangerous for the survey teams. What makes these results doubly surprising is that the surveys must have been conducted throughout the day, in order to accomplish 40 surveys per day (see below.) So, somehow, a total of 15 or fewer “Dad’s at work (or the Mosque or wherever), Mom’s at the market” responses in over 1700 attempts.
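The "more than 98%" figure can be checked with a short calculation. The two percentages are the ones quoted from the article; whether the 0.8% refusal figure is a share of all households or only of those at home is not clear from the quotes, so this sketch works it out both ways.

```python
# Shares quoted from the article.
absent_share = 0.009   # "16 (0.9%) dwellings [where] residents were absent"
refused_share = 0.008  # "15 (0.8%) households refused to participate"

# Reading 1: refusals are a share of the households found at home.
rate_if_refusals_among_home = (1 - absent_share) * (1 - refused_share)

# Reading 2: both percentages are shares of all households approached.
rate_if_both_of_total = 1 - absent_share - refused_share

print(f"{rate_if_refusals_among_home:.3f}")
print(f"{rate_if_both_of_total:.3f}")
```

Under either reading the combined success rate comes out a little above 98%.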
It’s quite possible that 15+ years of teaching Introductory Statistics and similar courses has left me a bit jaded, but I know that I’d be calling these survey teams into my office, with some serious questions about what they actually did or did not do, before accepting any of their results.
C) The time spent per survey belies the notion that great care was taken to ensure the interviewees' comprehension of the questions and the interviewers' assurances of accuracy in the answers. According to the article, the survey teams “could typically complete a cluster of 40 households in 1 day.” The survey teams reportedly consisted of 4 members each, two male and two female. It is not stated how or whether the teams split up in conducting the interviews. From what I’ve heard about Islamic culture, it would seem likely that they would not have gone out individually, given that some women might be reluctant to speak to a single man, and vice versa. So, if we assume that they split up into two pairs of one male, one female doctor each, then each pair was interviewing 20 households in a day. Even assuming 8 hours per day for fieldwork, this leaves less than 24 minutes per interview (“less than” because it takes some time to walk from house to house.) Based on my own experiences with face-to-face interviewing, this would be maybe 15 minutes for the actual survey questioning (there’s always a cushion for formalities, pleasantries, getting settled and whatnot.) Somewhere in there, also, the interview teams had the time to reassure the interviewees of their honesty and good intentions, and double-check any questionable results. Read what the article claims went on at each interview:
“The survey purpose was explained to the head of household or spouse, and oral consent was obtained. Participants were assured that no unique identifiers would be gathered. No incentives were provided. The survey listed current household members by sex, and asked who had lived in this household on January 1, 2002. The interviewers then asked about births, deaths, and in-migration and out-migration, and confirmed that the reported inflow and exit of residents explained the differences in composition between the start and end of the recall period. …. Deaths were recorded only if the decedent had lived in the household continuously for 3 months before the event. Additional probing was done to establish the cause and circumstances of deaths to the extent feasible, taking into account family sensitivities. At the conclusion of household interviews where deaths were reported, surveyors requested to see a copy of any death certificate and its presence was recorded. Where differences between the household account and the cause mentioned on the certificate existed, further discussions were sometimes needed to establish the primary cause of death.”
Now, it’s tough to compare different cultures, but when I worked on the health surveys in Liberia, we’d figure on maybe 3 or 4 good interviews per day, by the time we were satisfied that the interviewees were understanding the questions correctly and we were understanding the answers correctly (and we always had at least one interviewer who was a native speaker of the dialect.) Canned political surveys in America tend to take over 5 minutes each, even though the interviewees pretty well know what to expect in terms of the questions, and the surveyors have no need to verify things like death certificates.
So, 15 minutes or so per survey? I guess it’s possible, since some might be very easy (“All six of us have lived here for many years, and no one has died”), but I’m suspicious whether the implied care was actually taken in the interview process.
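The time budget above can be laid out as a small calculation. The 40 households per day is from the article; the 8-hour field day and the split into two pairs are the assumptions made in the text, and the 9-minute overhead for walking and formalities is a hypothetical figure chosen only to reproduce the "maybe 15 minutes" estimate.

```python
households_per_cluster = 40  # from the article: one cluster per day
fieldwork_minutes = 8 * 60   # assumed 8-hour field day, as in the text

def minutes_per_household(parallel_units):
    """Upper bound on minutes per household if the team splits into
    `parallel_units` groups that interview at the same time."""
    households_each = households_per_cluster / parallel_units
    return fieldwork_minutes / households_each

per_pair = minutes_per_household(2)    # two male/female pairs
per_person = minutes_per_household(4)  # all four working alone

# Hypothetical allowance for walking, greetings and getting settled.
overhead_minutes = 9
questioning_minutes = per_pair - overhead_minutes

print(per_pair, per_person, questioning_minutes)
```

Even under the most generous split, with all four interviewers working alone, the ceiling is 48 minutes per household before any walking time is subtracted.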
D) Due to safety concerns, the survey teams were apparently allowed great latitude in changing the pre-determined cluster to a more convenient one.
In terms of statistical validity, this point is crucial. The article states:
“Decisions on sampling sites were made by the field manager. The interview team were given the responsibility and authority to change to an alternate location if they perceived the level of insecurity or risk to be unacceptable.”
The authors give us no information about how often these changes were forced to be made, and absent that information, this survey is, simply, worthless. No amount of advanced statistical massaging can fix a sampling of convenience. So, did the violence in Iraq force one change, two changes, forty changes? We don’t know. But what we do know is that there is a clear admission of selection bias in the sampling. Given the sectarian tensions in Iraq, even granting the alleged professionalism of the canvassing teams, it is impossible to tell the impact of these biases. The implication in the report is clearly that more deadly areas were underrepresented, but were more distant (possibly safe) areas also selected against because of the level of risk required to reach them? Were teams of Shia (resp. Sunni) doctors afraid to enter areas where they thought themselves unwelcome? Did coalition roadblocks or bombing campaigns lead to certain areas of the country being off-limits? I find it very troubling that while the authors of the article go out of their way to mention anecdotes like the fact that households where everyone was killed were not counted, and that some interviewees may have been afraid to admit that they have had family members killed, this essential bit of information (“how often were the survey teams forced to deviate from the pre-determined cluster, and what procedures did they implement to attempt to ensure that an equally representative cluster was selected”) was left out of both the article and the appendices (at least in the versions I’ve found. I’d appreciate it greatly if someone could point me to this information, if it’s published.)
Once again, it would be easy to jump to the conclusion that any deviations would lead the estimates to be low (this is clearly the authors’ implication in their wording: “if they perceived the level of insecurity or risk to be unacceptable"), but any deviations of this sort remove the survey from the realm of statistical science, into the realm of conjecture, anecdote or advocacy.
E) Given the freedom apparently allowed the survey teams to deviate from pre-selected cluster sites, and to determine the starting point for the cluster (on the street), as well as the above-mentioned concerns about veracity, the fact that there were only two survey teams involved in the entire survey, and that these teams had only two days of training, leads to the fact that any selection bias introduced by the survey teams will skew the results greatly, all in the same direction. If there were many teams, we might expect that some might be making selections that (consciously or unconsciously) minimize the reported number of deaths, while others might be making selections that maximize them, and others might be making selections that were essentially neutral. Given that there were only two teams, these biases have much less chance of canceling each other out, and much greater risk of increasing the actual margin for error.
Dr. Donald Berry, the chairman of the Applied Statistics and Bioscience Department at the University of Texas-Austin, put it this way:
“Selecting clusters and households that are representative and random is enormously difficult. Moreover, any bias on the part of the interviewers in the selection process would occur in every cluster and would therefore be magnified. The authors point out the possibility of bias, but they do not account for it in their report.” (see link below)
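The argument about the number of teams can be sketched numerically. This is a deliberately simple model and not anything from the study: it assumes each team carries its own independent selection bias, centered on zero, with the same standard deviation, so that the bias left over after averaging across teams shrinks with the square root of the number of teams.

```python
import math

def sd_of_average_bias(team_bias_sd, n_teams):
    """Standard deviation of the bias remaining after averaging over
    n_teams independent, zero-centered team biases (toy model)."""
    return team_bias_sd / math.sqrt(n_teams)

# Hypothetical: every team's bias has a standard deviation of 1 unit.
two_teams = sd_of_average_bias(1.0, 2)      # about 0.71
twenty_teams = sd_of_average_bias(1.0, 20)  # about 0.22

print(round(two_teams, 2), round(twenty_teams, 2))
```

Under these assumptions two teams leave roughly 71% of a single team's bias in the final figure, where twenty teams would leave about 22%. None of that residual bias is reflected in the published margin of error.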
3. The analysis of the results:
Here’s where the article is apparently on its most solid ground, but since none of the methodology involved in the analysis has been published, it’s hard to say. I’m guessing that this would be what any peer review would concentrate on, and given the quality of statistical software, it’s hard to make mistakes in statistical analysis. I’d be surprised if there were any grave flaws in analysis that I could uncover made by a PhD in statistics, which I definitely am not. As Mark Chu-Carroll notes at GoodMath/BadMath, it is, surprisingly, in this area that most of the attacks have concentrated, and, consequently, why most of the attacks can be dismissed as failures on the critics’ part to understand statistics.
Be that as it may, the authors do provide enough detail for me to find one criticism in their analysis, which they themselves allude to, but attempt to minimize:
"The population data used for cluster selection were at least 2 years old, and if populations subsequently migrated from areas of high mortality to those with low mortality, the sample might have over-represented the high-mortality
“[I]nternal population movement would be less likely to affect results appreciably [than emigration from Iraq.]”
As I pointed out in an earlier post (Part VI), the effects of faulty population estimates (due to massive internal migration) can have considerable impact on the extrapolations, because they have essentially double impact – first in making some more violent areas more likely to be sampled than their current populations would warrant, and then again, in calculating the estimates, since the same inflated numbers would be used to multiply out the projected values.
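A toy example of the weighting part of this argument, with entirely invented numbers: two areas, one violent and one calm, where 300,000 people have moved from the first to the second since the population figures were compiled. It assumes, for simplicity, that the death rate attaches to the area and not to the people who moved, which will not be strictly true in practice.

```python
# All figures are hypothetical and chosen only to make the arithmetic clear.
areas = {
    #        stale population, current population, deaths per 1,000
    "high": (1_000_000, 700_000, 20),
    "low":  (1_000_000, 1_300_000, 5),
}

def projected_deaths(use_stale):
    """Multiply each area's rate by its stale or current population."""
    total = 0
    for stale_pop, current_pop, rate_per_1000 in areas.values():
        population = stale_pop if use_stale else current_pop
        total += rate_per_1000 * population // 1000
    return total

stale_total = projected_deaths(use_stale=True)
current_total = projected_deaths(use_stale=False)

print(stale_total, current_total, round(stale_total / current_total, 2))
```

In this made-up case the stale figures inflate the projection by a bit over 20%. How large the effect is for the actual survey depends on migration numbers that are not available here.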
Supporters of this study have latched onto this criticism, accepting it (for the sake of the argument) and then pointing out that even still, it would only lower estimates by a few percent (even lowering it by 25% would still leave the estimates many times higher than others, after all), and so I was a bit hesitant to bring it up, as providing an opportunity to ignore the other concerns that, if valid, go to the heart of any legitimacy whatsoever for this study. But, since I noticed it, and it seems a true potential for error, I’m pointing it out, again.
My purpose, throughout this critique, is not to claim that particular errors in the study would lead the reported results to be too high, or too low, or balance out. As I always tell my students, if you really knew the effect of a bias, there’d be no need to do the study to begin with. You could just use your amazing reasoning powers and puzzle out how many deaths there really have been in Iraq, due to the Coalition’s actions, and then yell at everybody about how smart you are, and then they'd all believe you. My purpose is to point out the places where this study failed to use good statistical methodology, and to show up evidence that leads me to conclude that the survey teams’ reports, themselves, are suspect. These suspicions (about the survey teams) are not necessarily grounds to deduce intentional bias. From personal experience, I know that many amateur data collectors under-report the difficulties they have in obtaining responses, believing that a higher non-response rate reflects negatively on their own skills, and over-report things like how many deaths they were able to validate by certificate for the same reason. Further, the less thorough the interview process, and I’ve pointed out how quickly they must have been conducted, the more the interviewers’ biases influence the reported responses, even when the survey teams believe they are recording the results fairly. Finally, given the extremely high sectarian tensions in Iraq, it would seem unlikely that a mere two teams of 4 physicians each, given a high degree of selection autonomy, would produce unquestionably unbiased results under the hectic conditions they were experiencing in Iraq.
Two final notes about those death certificates: 1. The authors, it seems to me, do a good job of explaining why we would expect to see a high discrepancy between the number of death certificates that family can produce, and the number of officially recorded deaths at the national level. What they do not address, is why no attempt was made to double-check the totals locally, at least in areas of less chaos. This would have provided a good check on the representativeness of their sample. Steven Moore makes a similar point, regarding basic demographic information (which a bunch of his critics in the blogs have misunderstood entirely.) Had the surveyors conducted a basic survey of demographic data per household (men, women, old, young, whatever), this could have been used to compare with the other population estimates to get a check on whether their particular clusters were, at least in these respects, representative. Instead, they were left, far too often, with no legitimate means of checking for representativeness of a particular cluster.
2. The question must be addressed as to whether there would be any incentive for Iraqis to falsify death certificates. I don’t know the answer to this. But it is an important question. Given the corruption, chaos and confusion that is a fact of life in Iraq today, it would be very easy to forge death certificates, and if there is any market for such forged documents, we must assume that they exist in great numbers. Are coalition forces making cash payments for collateral deaths? Are families hiding members for their own safety by falsifying death records? I don’t know the answer to these questions, but it would be foolish to accept the validity of the certificates without an analysis of the incentive to falsify them. Again, I speak from experience with West African nations, where the levels of corruption and confusion are probably not as high as currently in Iraq, but where, if there’s a need for an official document, it can always be created, for the right price.
I noticed how the response rate issue got addressed over at Tim Lambert's, namely along the lines of, ah well, many surveys in Iraq come up with very high response rates, one of them even has one of 100%.
That may be due to people being much more willing to answer questions, or, ..., after reading what you had to say, it may tell us a lot more about those doing the surveying in Iraq than those being surveyed.
If you choose a random 100 households, and ask for the number of washing machines or who they voted for at the last election, you'll get say 70 saying they've got a washing machine and 25 saying that the head of household voted for a Republican.
But if you ask how many had a baby die in 2002, you'll get either 0, 1 or 2 as the answer, with 0 being by far the most likely result.
To get an accurate estimate of infant mortality clearly requires asking a lot more people than asking for the number of washing machines, or how many people think al Maliki is doing a good job.
Very interesting posts you've done on the Lancet study. Do you have any thoughts on the above?
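The rare-event point in the comment above can be illustrated with the usual standard error of a proportion. This sketch assumes simple random sampling, which the cluster survey was not, and the 1% rate for the rare event is a hypothetical stand-in.

```python
import math

def relative_standard_error(p, n):
    """Standard error of a sample proportion divided by the proportion,
    under simple random sampling of n households."""
    return math.sqrt(p * (1 - p) / n) / p

n_households = 100
common = relative_standard_error(0.70, n_households)  # e.g. washing machines
rare = relative_standard_error(0.01, n_households)    # hypothetical rare event

print(f"{common:.1%}", f"{rare:.1%}")
```

With 100 households the common item is pinned down to within several percent of its own value, while the estimate for the rare event has a standard error about as large as the quantity being measured.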