Conflicts between Dan Ariely's statement and Footnote #14 (DataColada #98)

So cool that another fraudulent paper was discovered and outed. I noticed conflicts between the author's statement (he seems to blame his industry partner?) and other facts of the case. I want to highlight those conflicts here because they need to be explained better if we are going to trust this author going forward. The author is Dan Ariely, by the way, and the case is the one covered in Data Colada #98.

First of all, let's look at Dan Ariely's statement:

The data were collected, entered, merged and anonymized by the company and then sent to me. This was the data file that was used for the analysis and then shared publicly. I was not involved in the data collection, data entry, or merging data with information from the insurance database for privacy reasons. [link]

But what are the conflicts with Dan's statement?

  1. According to the Excel metadata, Dan Ariely both created the Excel file and was the last person to modify it before sending it in its fraudulent form to coauthor Nina Mazar (she still has the email; footnote 14). Update: A member of Dan's lab, who joined years after the incident and has no inside information on what happened, correctly pointed out that if a company sends you a .csv and you save it as an .xls file, you will show as the author. Update 2: Apparently saving a .csv file as .xls doesn't explain the problem, because .csv files don't store font information and the fraudster used two different fonts (Cambria and Calibri). Thanks to Krzysztof Cipora for pointing that out.
  2. Dan Ariely admitted to miscoding a variable: "First, the effect observed in the data file that Nina received was in the opposite direction from the paper’s hypothesis. When Nina asked Dan about this, he wrote that when preparing the dataset for her he had changed the condition labels to be more descriptive and in that process had switched the meaning of the conditions, and that Nina should swap the labels back. Nina did so." (footnote 14). You can't miscode a variable if someone else does all the data work and you didn't touch it.
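The font point in Update 2 can be illustrated with a small sketch. This is not the real file: the cell values and the in-memory "spreadsheet" structure below are made up for illustration. It only shows that a CSV round trip keeps cell values and nothing else, so any font differences would have to be introduced after the CSV stage.

```python
import csv
import io

# Hypothetical in-memory "spreadsheet": each cell is (value, font).
# The values and layout are invented for illustration only.
cells = [
    [("baseline", "Calibri"), ("updated", "Cambria")],
    [("12345", "Calibri"), ("23456", "Cambria")],
]

def to_csv(rows):
    """Serialize only the cell values, which is all CSV can represent."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow([value for value, _font in row])
    return buffer.getvalue()

def from_csv(text):
    """Parse CSV text back into rows of plain strings."""
    return list(csv.reader(io.StringIO(text)))

csv_text = to_csv(cells)
round_tripped = from_csv(csv_text)

print(csv_text)        # plain comma-separated text, no font names anywhere
print(round_tripped)   # rows of strings; the font column is gone
```

Whatever program opens the resulting text has to pick its own font for every cell, which is why a file with two fonts is hard to explain as a plain CSV import.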

And a logic problem:

Why on earth would an insurance company fabricate data in such a way as to support Dan Ariely's hypotheses? 

Elaboration on #2 above:

This is all from footnote 14: Dan Ariely emailed an Excel file to Nina Mazar in which the effect was the opposite of what they expected (see right). When she contacted him about it, he said that "when preparing the dataset for her he had changed the condition labels to be more descriptive and in that process had switched the meaning of the conditions, and that Nina should swap the labels back." That means Dan Ariely had mislabeled every single value of column A (13,488 values). Of course, Excel has ways to do this in bulk, so he didn't literally change the values one at a time, but the point remains: he admits to changing every single value of the variable, and changing it so that it was incorrect. He manipulated the data and failed to mention that in his statement. Perhaps he didn't remember. Perhaps there is a harmless explanation. For now, I still consider it to be a conflict.
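To make the label issue concrete, here is a minimal sketch of why swapped condition labels reverse the direction of an effect. The condition names (`condition_a`, `condition_b`) and the numbers are hypothetical, not taken from the actual file; the only point is that relabeling every row flips the sign of the difference between group means.

```python
from statistics import mean

# Made-up rows of (condition label, outcome). Not the real data.
rows = [
    ("condition_a", 26.0), ("condition_a", 27.0), ("condition_a", 25.0),
    ("condition_b", 23.0), ("condition_b", 24.0), ("condition_b", 22.0),
]

def effect(data, treatment, control):
    """Difference in mean outcome: treatment minus control."""
    treated = mean(v for label, v in data if label == treatment)
    controls = mean(v for label, v in data if label == control)
    return treated - controls

# Relabel every row in bulk, exchanging the two condition names.
swap = {"condition_a": "condition_b", "condition_b": "condition_a"}
swapped_rows = [(swap[label], value) for label, value in rows]

original_effect = effect(rows, "condition_a", "condition_b")
swapped_effect = effect(swapped_rows, "condition_a", "condition_b")

print(original_effect, swapped_effect)  # same size, opposite sign
```

The outcomes themselves are untouched here. Only the label column changes, yet that is enough to turn a result in one direction into the same-sized result in the other.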

Comments

  1. "Why on earth would an insurance company fabricate data in such a way as to support Dan Ariely's hypotheses?" Maybe it was to save time rather than to support his hypothesis. Sounds like it might have taken about an hour to fabricate the data, but it would have taken very much longer to collect, enter and check genuine data.

    1. Why would they care to? They can just tell him to bugger off and that they don't have the data.

      Why would they spend the effort to collect data for one of the variables and stop short for the other one?

      And even if they were lazy and didn't want to collect it, why would they fabricate the data only in an Excel file instead of in their database?

    2. If the insurance company was too lazy to collect the data and instead just generated it, it seems reasonable that you would have gotten data that supports no result. To generate data that supports Dan's hypothesis you need to (1) understand it and (2) generate the data in a way that supports it. It is much easier to just generate data uniformly at random, and this would give no result.

    3. However, if the insurance company had done the manipulation and sent it over as a .csv, all the fonts would be the same. CSV files don't carry any information about font types, so Excel applies a single font on import.

      So, even if they'd meddled with the data, that would not explain the different fonts.

  2. Great points and, IMO, you asked the million-dollar question about motives (under "And a logic problem:"). I wonder if there is an insurance company in this case.

    1. Thanks! It's interesting because it is a question of motives, but I feel like there is a bit more to it than that. There is also a question of awareness and competency. Like, how did the company know what Ariely wanted, and how did it know enough about it to give it to him? I think that awareness is a lesser hurdle to overcome than the motives question, but it all adds up and it all points in one direction, evidence-wise.

  3. Having an Excel file carry your name while hosting someone else's data is as easy as a copy and paste from another sheet, which is commonly done when you receive data from someone else and want to run statistical analyses on it. In addition, the researcher did not claim they didn't "touch it"; they said they were not involved with collection, entry, or merging. Changing labels is far from any of that, and makes complete sense when preparing statistical tests, which are separate from the data.

  4. Why, when you received the data, wouldn't you at least check its statistical properties? Wouldn't you expect this mileage data to be close to normally distributed? So wouldn't you check with a histogram and some simple tests to see if the data are skewed or have fat tails? And when the histogram looks like a uniform distribution, and not at all like a normal, wouldn't you wonder why the mileage data exhibits such an unexpected shape? At that point wouldn't a competent researcher conclude that either something very odd is going on or the data are bogus?
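The kind of check described in the last comment could be sketched roughly as follows. Everything here is simulated: the ranges, means, and sample sizes are arbitrary assumptions, and the sketch takes the commenter's premise (that real mileage would look roughly bell-shaped) at face value. It bins the values into a crude histogram and computes excess kurtosis, which is about 0 for a normal distribution and about -1.2 for a uniform one.

```python
import random
from statistics import fmean, pstdev

def bin_counts(values, bins=10):
    """Crude histogram: count values in equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)  # clamp the max value
        counts[index] += 1
    return counts

def excess_kurtosis(values):
    """Population excess kurtosis: ~0 for normal, ~-1.2 for uniform."""
    m = fmean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 4 for v in values) / len(values) - 3.0

rng = random.Random(0)  # fixed seed so the run is repeatable
# Simulated samples with arbitrary, made-up parameters.
uniform_like = [rng.uniform(0, 50000) for _ in range(10000)]
bell_like = [rng.gauss(25000, 8000) for _ in range(10000)]

for name, sample in [("uniform-like", uniform_like), ("bell-like", bell_like)]:
    print(name, bin_counts(sample), round(excess_kurtosis(sample), 2))
```

A flat row of bin counts together with a strongly negative excess kurtosis is the sort of pattern that would prompt the questions in the comment, though a check like this only flags an odd shape and says nothing about who produced it.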
