Inner Join vs. Outer Join, and Why My Brain Sucks
This afternoon I have been having a bit of a group discussion on the perils of INNER JOINs in SQL. It started with a tweet that Inner Join's are dangerous and you should use them with caution. A bunch of people asked me what exactly I meant, and of course it's difficult to get that across in chunks of 140 characters or less, which brings us to the present and this blog post.
Consider that you have a relational database setup to store survey responses. Each respondent creates 1 ResponseSet, which gets a ResponseSetId, and then from there you can build up responses to the various questions. There are N questions. Each question can have M responses, all of them free-form text. There are probably a hundred ways to lay out the database to support this, but in the case of the database that inspired the tweet, I've got a Response table, with a foreign key to the ResponseSet table, which has a foreign key to the Survey table. A response belongs to a response set, which belongs to a survey.
In your mind, write some SQL to join these tables and get back the questions and answers for a given ResponseSet for a report (say, ResponseSetId 473240432, if that helps you think). Here is the first thing that pops into my head:
SELECT QuestionId , ResponseText FROM tblResponseSet rs INNER JOIN tblResponse r ON rs.ResponseSetId = r.ResponseSetId INNER JOIN tblQuestion q ON q.QuestionId = r.QuestionId WHERE rs.ResponseSetId = @responseSetId GROUP BY q.QuestionId ORDER BY q.QuestionSort
That's great, right? Unfortunately, no. At least, not in my case. Even assuming that all questions are mandatory (1+ responses required), this does not account for other buggy software I've written (we all do it!) that saves, copies, or modifies the data that may result in blank responses or no responses for a given question. Rather to the point, what I would rather see in the case of this poorly-maintained theoretical data (as modified by my buggy code) is the rows I expect, but with NULL values for missing responses.
In the event of the survey above, it would be more helpful to the developer trying to debug the application to know that NULLs are coming back as survey responses, than it would be to know that the question isn't being included in the result set. The join for the question data is fine — at least in as much as it isn't the cause of this theoretical row being dropped — it's the join on responses that's causing the problem. In effect, by eliminating the row altogether, we're creating a false-negative, or the appearance of a bug that doesn't exist. (The question table should be joined via outer join as well.)
Using this method of querying the data, each row in the result set will give the question and 1 of its responses. The question will repeat with each different response, making it easy to use the group attribute of ColdFusion's cfoutput tag to display each question with the various responses to it for the given Response Set. However, with the above query, if there are no responses for ResponseSetId 473240432 and QuestionId 42, then QuestionId 42 is dropped from the result set. That's a Bad Thing (TM)!
In essence, plan for the worst but hope for the best. Whenever I write a join condition, instead of asking myself "How is this data relationship supposed to work?" I should be asking myself "What do I want to happen if some data turns up missing?" By writing an inner join, I'm answering the second question with "drop the row". The answer I wanted to provide for that question was to "show nulls", which is written in SQL with an OUTER JOIN.
That is the inherant danger of INNER JOINs of which I was speaking.
Posted in Best Practices | Databases | 4 Responses
I do certainly agree it is a great way to test that table constraints and other data validation is working. I do think it is probably a matter of preference more than anything as I just hate null values in any table but others don't mind and there is really no right answer.
Thanks for the article!
But yeah, Adam, I see how we shoot ourselves in the foot with the habitual use of inner join when a well-placed outer join would be the better choice. Thanks for the reminder.