fusiongrokker

Entries Tagged as Databases

Inner Join vs. Outer Join, and Why My Brain Sucks

This afternoon I have been having a bit of a group discussion on the perils of INNER JOINs in SQL. It started with a tweet that Inner Join's are dangerous and you should use them with caution. A bunch of people asked me what exactly I meant, and of course it's difficult to get that across in chunks of 140 characters or less, which brings us to the present and this blog post.

Consider that you have a relational database setup to store survey responses. Each respondent creates 1 ResponseSet, which gets a ResponseSetId, and then from there you can build up responses to the various questions. There are N questions. Each question can have M responses, all of them free-form text. There are probably a hundred ways to lay out the database to support this, but in the case of the database that inspired the tweet, I've got a Response table, with a foreign key to the ResponseSet table, which has a foreign key to the Survey table. A response belongs to a response set, which belongs to a survey.

In your mind, write some SQL to join these tables and get back the questions and answers for a given ResponseSet for a report (say, ResponseSetId 473240432, if that helps you think). Here is the first thing that pops into my head:

SELECT 
    QuestionId
    , ResponseText
FROM
    tblResponseSet rs
    INNER JOIN tblResponse r ON rs.ResponseSetId = r.ResponseSetId
    INNER JOIN tblQuestion q ON q.QuestionId = r.QuestionId
WHERE
    rs.ResponseSetId = @responseSetId
GROUP BY
    q.QuestionId
ORDER BY
    q.QuestionSort

That's great, right? Unfortunately, no. At least, not in my case. Even assuming that all questions are mandatory (1+ responses required), this does not account for other buggy software I've written (we all do it!) that saves, copies, or modifies the data that may result in blank responses or no responses for a given question. Rather to the point, what I would rather see in the case of this poorly-maintained theoretical data (as modified by my buggy code) is the rows I expect, but with NULL values for missing responses.

In the event of the survey above, it would be more helpful to the developer trying to debug the application to know that NULLs are coming back as survey responses, than it would be to know that the question isn't being included in the result set. The join for the question data is fine — at least in as much as it isn't the cause of this theoretical row being dropped — it's the join on responses that's causing the problem. In effect, by eliminating the row altogether, we're creating a false-negative, or the appearance of a bug that doesn't exist. (The question table should be joined via outer join as well.)

Using this method of querying the data, each row in the result set will give the question and 1 of its responses. The question will repeat with each different response, making it easy to use the group attribute of ColdFusion's cfoutput tag to display each question with the various responses to it for the given Response Set. However, with the above query, if there are no responses for ResponseSetId 473240432 and QuestionId 42, then QuestionId 42 is dropped from the result set. That's a Bad Thing (TM)!

In essence, plan for the worst but hope for the best. Whenever I write a join condition, instead of asking myself "How is this data relationship supposed to work?" I should be asking myself "What do I want to happen if some data turns up missing?" By writing an inner join, I'm answering the second question with "drop the row". The answer I wanted to provide for that question was to "show nulls", which is written in SQL with an OUTER JOIN.

That is the inherant danger of INNER JOINs of which I was speaking.

Posted in Best Practices | Databases | 4 Responses February 02 2010

Modifying Illudium to return the ID of an inserted row in CF8

I've been working on a project where we're using Illudium to generate some objects (beans, dao's, and gateways) to speed up our development process. We also happen to be using ColdFusion 8, which means we have access to great new features, like the primary key return value from an insert query.

During a presentation on how to use Illudium and how it saves us time and effort, my team saw that I was: (1) Inserting a record, then (2) Creating a bean with all of the properties of that record except its missing id value, and then (3) searching for the matching record — so that I could use its ID in related code; and asked if there was any way to improve that. That's when it dawned on me that CF8 returns the primary key value (in most cases — Oracle just returns the "RowID" which can be used to look up the primary key), for just this reason. By the end of the day I had modified the XSLT template for the Create function in the DAO object that's packaged with Illudium to return this value instead of true or false indication of success — and with no prior knowledge or experience with XSLT.

It was simple to extend my code so that it wouldn't error on earlier versions of ColdFusion (instead of the primary key value, it will return 1/-1 for success/failure), and also to account for other database types. All types documented by Adobe are supported here.

Here's the updated code for create.xsl. Note that the majority of what I changed is at the end:

[viewcode] src="create.xsl.txt" showsyntax=no geshi=xml scroll="yes" scrollheight="300px" link="yes" [/viewcode]

And here's an example create function I made from the ART table of the CFArtGallery example datasource:

[viewcode] src="fn_create.txt" showsyntax=no geshi=cfm scroll="yes" scrollheight="300px" [/viewcode]

Brian Rinaldi, the creator of Illudium, offered to help me out if I got stuck, but honestly he's done such a good job architecting Illudium to be easily extended that it was a piece of cake. Thanks again for the awesome code generator, Brian!

Posted in ColdFusion | ColdFusion 8 | Databases April 28 2008

Free license for Aqua Data Studio (OS X) for O.S. Developers

Since switching to my MacBook last December I have had but one regret: The lack of MS SQL Server Enterprise Manager. I quickly found Aqua Data Studio, but as a hobbyist I can't justify the cost of a personal license, and as one of few Apple nerds at my company, I couldn't convince them to buy the license for me. I have been able to get by using phpMyAdmin (gasp! php!) for MySQL databases, and a Windows laptop when I absolutely must, for MS SQL Server.

Today I saw Ray Camden's post on how Aqua Fold gave him a free license because he develops Open Source software, and I fired off an email of my own. Within the hour (I bet this is also reflected in their tech support), not only did they reply asking for my personal information to fill out the license registration, but they offered me the beta of version 6.5 of ADS (which should be a public-beta in a few weeks, and launched around September).

There has been some recent discussion in the CFBloggers discussion group about advertising in your blog, and in particular, giving the impression that you are presenting an objective view, even when the possibility exists that your opinion has been biased (as it might if you were given a free license). So to keep this as objective as possible, here are some cold, hard, facts:

Draw from that whatever you want. For me, it means more productivity, and one new icon in my dock.

Posted in Apple | Databases July 30 2007