Adam Tuttle

Entries Tagged as ColdFusion

Introducing Zoidbox

A couple of episodes of CFHour ago, Scott made a joke that Luis (of ColdBox, TestBox, ..., *Box fame) should name his next product ZoidBox -- it didn't matter what it did, he just wanted a *Box product named in his likeness (Scott goes by BoyZoid in many places)...

I jumped at the chance, and created created Zoidbox, the IRC bot that attempts to bring the knowledge and wit of @boyzoid to ##coldfusion on freenode. (It's an IRC bot that does helpful and funny things in the chat.)

Over the next week or two other people have gotten involved, and what once was a shoddy script with a mish-mash of messy code has become something —dare I say— beautiful.

We refactored the core down to just a few lines of code and setup a system for plugins to be easily added. It's hosted on a free-tier heroku cloud instance, and uses continuous deployment: Any time one of us merges a pull request or pushes to the master branch, zoidbox is re-deployed automatically.

It's funny how the things that start out as a joke can become interesting, complex, and yes, beautiful. I'm really proud of the code we've put together for Zoidbox so far; and the work will probably continue for the foreseeable future. Not only is it fun to have the bot in the chat, it's fun to work on him.

Just one more reason you should join us in the ##coldfusion IRC channel!

Published 2014-10-13 @ 09:21 in ColdFusion Community

REALLY?! CFSpreadsheet Throws Exception When Column Name Contains a Comma

I don't even know where to start with this one. So let's just look at some code, shall we?

<cfspreadsheet action="read" src="someFile.xlsx" headerrow="1" />

Pretty simple, innocuous code, right?

Now, in your spreadsheet, create a column name with at least 1 comma in it. In my case: "Committees, Chapters, & Networks Selected"

If you're getting data from non-technical people, you're likely to get stuff like this — and really, there's no good reason not to support this behavior.

What does CF do with this?

[Table (rows 3 columns Username, Password, <snip>, Committees, Chapters, & Networks Selected, Education, <snip> ] is not indexable by COMMITTEES

It's pretty obvious what's going on here. Someone had the bright idea to use a comma-delimited list of column names when converting the imported spreadsheet file into a Query object; and my single column name "Committees, Chapters, & Networks Selected" now appears to be 3 distinct columns, and it can't find the data for them.

I am simultaneously gobsmacked and awestruck.

Raised 3825042.

This also happens to remind me of a bug I filed back when CFSpreadsheet was in beta before the release of CF8 (785,511 bug reports ago), that the header row should not be included in the resulting data. At the time the answer I got from the engineering team was that it was "functioning as designed." But if there's anything I think we should have learned from (the late) Adam Cameron, it would be not to take crappy responses like that lying down. So I think it's time to put the spotlight back on that one.

I found it in the bug database (3039531) but it's marked as "withdrawn" as a duplicate of ... something. The bug number of the supposed original is provided, but it's not found. I'm going to try to draw attention to this existing bug instead of submitting a new one.

Update: Rupesh was kind enough to point out, on that older bug, that an attribute named "excludeHeaderRow" was added to resolve this second bug. I seem to recall hearing about this before, now that it's mentioned, but I obviously had long forgotten it. So at least that one's resolved.

Please vote accordingly, if you agree.

Published 2014-09-16 @ 08:20 in ColdFusion

HQL Keywords

I was recently bitten by the fact that apparently "Member" is a reserved keyword in Hibernate's HQL. I was further disgusted to find that, apparently, no canonical list of HQL keywords exists. All we have to go on is references to individual keywords in the documentation ("after the keyword...") and complaints of people that have accidentally run into them.

I'm not the first to run into this issue and hope to raise awareness by blogging about it, but I have been known to read that blog from time to time and this still bit me, so I figured it was worth reverberating it a bit.

As far as I can find, all of the words on this list are reserved keywords, either as reported by someone that ran into an issue using one as a table/entity name, or by references to them as keywords in the documentation.

  • Member
  • Event
  • Class
  • State
  • Extends ("when using the extends keyword")
  • From ("the FROM keyword is optional")
  • Versioned ("adding the VERSIONED keyword")
  • Update ("after the UPDATE keyword")
  • As ("the AS keyword is optional")
  • With ("the hql WITH keyword")
  • Join ("the JOIN keyword")
  • Distinct ("the DISTINCT keyword")
  • All ("the ALL keyword")
  • True
  • False ("the keywords TRUE and FALSE")
  • Connect ("the CONNECT keyword")

Presumably, additional SQL-ish words like SELECT, WHERE, GROUP, ORDER, LIKE, etc are also reserved (or at least troublesome). I would avoid any of the above for Entity/Table names, if I were you.

In my case, I had a table and entity named Member, which is a pretty logical thing to do when you're modeling an organization with Members.

Since my system has been running fine with an entity named Member for a while now, and it would be a real pain to go back and change everything, I tried to find the easiest possible solution. Here's what I changed:

  • Renamed Member.cfc (entity) to xMember.cfc
  • Updated entity relationships from cfc="Member" to cfc="xMember"
  • Updated any HQL references from from Memberto from xMember
  • Updated any EntityLoad("Member") references to EntityLoad("xMember")

So far, so good.

Published 2014-09-08 @ 08:30 in ColdFusion Hibernate

In Search of Performance

One of the more fun tasks in the new product I'm building has been a method to find the intersection (or in SQL terms, Inner Join) of an arbitrary set of arrays.

While this task would be well suited for SQL, some of our requirements prohibit that from being an option. Not all of the data will come from a database; some will come from web service calls. So I need a method that returns an array that contains only the elements that were present in all of the input arrays, and satisfies this interface:

array function intersection( array arrays ){}

Whatever the solution is, it needs to be fast because each of these input arrays could theoretically contain several hundred thousands of elements. The only other requirement was that it needs to run on Adobe ColdFusion 10 (as opposed to 11, which I will come back to in due time).

Counting

As you can imagine, there are dozens of ways to skin this cat. The first that came to mind was to count the instances of each element and then build a new array of only the elements whose count matched the number of input arrays:

array function arrayInnerJoinByCounting( array arrays ){
    var ht = {};
    var reqCount = arrayLen( arrays );
    var result = [];
    //count instances of each array value
    for (var arr in arrays){
        //if any array is empty, the result will be an empty array
        //so just short-circuit the whole process
        if (arrayLen(arr) == 0){
            return [];
        }
        for (var i in arr){
            if (!structKeyExists(ht,i)){
                ht[i] = 1;
            }else{
                ht[i]++;
            }
        }
    }
    //make a new array of values that appear the minimum number of times
    for (var k in ht){
        if (ht[k] == reqCount){
            arrayAppend(result, k);
        }
    }
    return result;
}

This method works, but as you probably expect, is not the most performant option. Counting is easy to understand, but turns out to be pretty unperformant inperformant imperformant slow.

Pairing Off

My next thought was to iterate over the set of arrays, reducing them a pair at a time. If I start with the shortest array, it should help reduce memory usage and the number of insertion operations on my hashtable. Note that I'm not sorting the actual arrays, I'm just arranging them in order of shortest array to longest array.

array function arrayInnerJoinByCFHashtableSorted( array arrays ){
    arraySort(arrays, function(a, b){ return (arrayLen(a) < arraylen(b) ? -1 : 1); });

    var left = arrays[1];
    var right = 0;
    for (var cursor = 2; cursor <= arrayLen(arrays); cursor++){
        right = arrays[cursor];
        left = arrayIntersectionByCFHashtable( left, right );
    }
    return left;
}
array function arrayIntersectionByCFHashtable( array left, array right ){
    var ht = {};
    var result = [];
    var smaller = larger = [];
    if (arrayLen(left) > arrayLen(right)){
        smaller = right;
        larger = left;
    }else{
        smaller = left;
        larger = right;
    }

    for (var i in smaller){
        ht[i] = 1;
    }
    for (var i in larger){
        if ( structKeyExists( ht, i ) ){
            arrayAppend( result, i );
        }
    }
    return result;
}

I also put together a benchmarking harness to measure the performance of these methods with different array sizes. I wanted something semi-realistic in terms of amount of items in the arrays and their expected overlap, and I also made sure to run each method 10 times and average their scores, to help smooth out any differences due to other system load. I'll share the benchmarking harness at the end of the post, but here are the results of comparing these two methods with ~200k items. The values are all times in milliseconds. The arrays are the times for each iteration, and the root named keys are their averages.

first benchmark results

In order to save a little bit of space, I'm going to suppress the details of the benchmark iterations and only show the averages, but the benchmark methodology will be the same: average of 10 iterations.

So this new method was faster than my original counting approach by a small but consistent margin — but I wondered if it could be beaten. I also wondered if the sorting was costing more than its benefit, so I made a copy that did the same thing with unsorted arrays:

Pairing Off (Unsorted)

array function arrayInnerJoinByCFHashtable( array arrays ){
    var left = arrays[1];
    var right = 0;
    for (var cursor = 2; cursor <= arrayLen(arrays); cursor++){
        right = arrays[cursor];
        left = arrayIntersectionByCFHashtable( left, right );
    }
    return left;
}

Is sorting worth it? Not in this case!

At least with these inputs, sorting did in fact cost ever so slightly more than it was worth. However, when cranked up to ~400k items in each array, the tables turned ever so slightly in favor of sorting. I'm calling the difference here statistically insignificant.

Java RetainAll

Then my coworker Steve turned me on to Java's retainAll() method, and I figured that was worth a shot. The code was certainly simpler! If you're not familiar, retainAll will remove items from the original array that don't appear in the argument.

array function arrayInnerJoinByJavaRetainAll( array arrays ){
    var shortestIx = 1;
    var minLength = arrayLen(arrays[1]);
    for ( var i=2; i <= arrayLen( arrays ); i++ ){
        var length = arrayLen( arrays[i] );
        if ( length < minLength ){
            minLength = length;
            shortestIx = i;
        }
    }
    var shortest = arrays[ shortestIx ];
    for (var i=1; i<=arrayLen(arrays); i++){
        if ( i != shortestIx ){
            //java voodoo
            shortest.retainAll( arrays[i] );
        }
    }
    return shortest;
}

With this method, it made sense to start with the shortest array, but a full sort was not necessary. To my surprise, retainAll() was extremely slow! I figured that being closer to the metal would give it an unfair advantage, but that doesn't seem to be the case. I would guess that it is designed with smaller datasets in mind.

Java does not have an unfair advantage here

In fact, I had to turn down the number of items in the array to 10k instead of the usual 200k just to get this benchmark to run. As you can see, retainAll() gets blown out of the water.

Java HashSet

By this time, I was talking about the fun I was having in IRC, too, and was getting a few more suggestions there. One particularly good one came from Ryan Guill: A fork of my pairing-off technique but using a Java Hashset class instead of a CF structure. If I understand it correctly, it works very similarly to my CF structure, except that there's no "value" associated with the key. Either the key is in the hash table or it is not. (Please do correct me if this understanding is incorrect!)

array function arrayInnerJoinByJavaHashset( array arrays ){
    var shortestIx = 1;
    var minLength = arrayLen(arrays[1]);
    for ( var i=2; i <= arrayLen( arrays ); i++ ){
        var length = arrayLen( arrays[i] );
        if ( length < minLength ){
            minLength = length;
            shortestIx = i;
        }
    }
    var shortest = arrays[ shortestIx ];
    for (var i=1; i<=arrayLen( arrays ); i++){
        if ( i != shortestIx ){
            shortest = arrayIntersectionByJavaHashset( shortest, arrays[ i ] );
        }
    }
    return shortest;
}
array function arrayIntersectionByJavaHashset( required array shorter, required array longer ){
    var hs = createObject("java", "java.util.HashSet").init( arrayLen(shorter) );
    for (var i in shorter){
        hs.add(i);
    }
    var result = [];
    for (var i in longer){
        if (hs.contains(i)){
            arrayAppend( result, i );
        }
    }
    return result;
}

And the results:

Java Hashset results

As it turns out, the Hashset is significantly faster for these inputs. (Who knows how performance will change with more/less and larger/smaller arrays?)

So far, this has been the best approach I've been able to find. (Thanks, Ryan!)

Use SQL Anyway

Another approach that I thought of, and that was suggested by others as well, was to inject the data into a temp table in my database and let SQL do the heavy lifting in the joining department. It seemed like a decent enough idea, so I took a stab at creating a file that I could use with MySQL's LOAD DATA LOCAL directive, which I've had success with in other tasks; because running almost a million insert statements before doing the join would have no chance of winning. (If I were using MSSQL, I would consider something like fnSplit)

I tried using a StringBuilder to put together one string per array, and append each of them to a file, but CF couldn't even get that done for reasonably small sets of data without choking and eventually timing out. For that reason I never finished this approach, but I thought it was worth sharing how far I got so that people can point out any flaws they might see.

array function arrayInnerJoinBySQL( array arrays ){
    //didn't get any further than this because CF has been choking either on the string
    //building or the file writing. Either way, it has not been a successful avenue thus far.
    var namespace = CreateUUID();
    var filePath = sanitizePath(getTempDirectory()) & '/iq-arrayjoin-data-load-#namespace#.txt';
    if ( fileExists( filePath ) ){
        fileDelete( filePath );
    }
    fileWrite( filePath, "" );
    var fileObj = fileRead( filePath, "append" );
    for (var a = 1; a <= arrayLen( arrays ); a++){
        fileWrite( fileObj, arrayToString( arrays[a], a ) );
    }
    fileClose( fileObj );
}
string function arrayToString( array data, numeric ix ){
    var sb = createObject("java", "java.lang.StringBuilder").init("");
    for(var i in data){
        sb.append(ix);
        sb.append(',');
        sb.append(i);
        sb.append(chr(10));
    }
    return sb.toString();
}
function sanitizePath( path ){
    path = replace( path, "\", "/", "ALL" );
    if ( right( path, 1 ) == '/' ){
        path = left( path, len( path ) - 1 );
    }
    return path;
}

Honorable Mention: CF 11 Closure Functions

This approach was suggested by Adam Cameron and looks pretty slick, if for no other reason than its brevity and use of Closure. Unfortunately most of these functions are only available in CF11, even ignoring the member-functions aspect; for example, there is no arrayReduce() method in CF10. But I thought it worth mentioning here, for anyone that might come along later and not have the same constraint as I do. (Bear in mind, you should benchmark it against other method(s) before going ahead with it!)

array function arrayInnerJoinByCFClosures( array arrays ){
    var result = arrays.reduce( function( reduction, current, index ){
        if ( index == 1 ) return reduction;
        return current.filter( function( element ){
            return reduction.contains( element );
        });
    }, arrays[1]);
    return result;
}

Benchmarking Harness

I promised I would share my code for getting these benchmarks, so here you go. It's a little bit of a mess as I was changing things around a good bit in order to get the right screen shots for this post. You will, of course, just have to uncomment various blocks and copy in the functions from above in order to run it.

<cfsetting requesttimeout="600" />
<cfscript>

    variables.size = 200000;
    variables.samples = 10;
    variables.data = [ [], [], [], [] ];

    /* GENERATE TEST DATA */
    for ( i = 1; i <= size; i++ ){
        arrayAppend(variables.data[1], i);
    }
    for ( i = 1; i <= size*2; i = i+2 ){
        arrayAppend(variables.data[2], i);
    }
    for ( i = 1; i <= size*2; i = i+3 ){
        arrayAppend(variables.data[3], i);
    }
    for ( i = 1; i <= size*4; i = i+4 ){
        arrayAppend(variables.data[4], i);
    }
    /* END DATA GENERATION */

    // for ( i = 1; i <= samples; i++){
    //  start("counting");
    //  arrayInnerJoinByCounting( data );
    //  end("counting");
    // }

    // for ( i = 1; i <= samples; i++ ){
    //  start("Java Hashset");
    //  arrayInnerJoinByJavaHashset( data );
    //  end("Java Hashset");
    // }

    // for ( i = 1; i <= samples; i++ ){
    //  start("retainAll");
    //  arrayInnerJoinByJavaRetainAll( data );
    //  end("retainAll");
    // }

    // for ( i = 1; i <= samples; i++){
    //  start("cf hashtable unsorted");
    //  arrayInnerJoinByCFHashtable( data );
    //  end("cf hashtable unsorted");
    // }

    // for ( i = 1; i <= samples; i++ ){
    //  start("cf hashtable sorted");
    //  arrayInnerJoinByCFHashtableSorted( data );
    //  end("cf hashtable sorted");
    // }

    //===============================================

    function start( string metricName ){
        param name="request.metrics" default={ _detail: {}, _instances: {} };
        request.metrics._detail[ metricName ] = getTickCount();
    }

    function end( string metricName ){
        var tick = getTickCount();
        if ( !structKeyExists( request.metrics._detail, metricName ) ){
            throw(message="Metric named `#arguments.metricName#` hasn't been started.");
        }
        if (!structKeyExists(request.metrics._instances, metricName)){ request.metrics._instances[ metricName ] = []; }
        arrayAppend(request.metrics._instances[ metricName ], (tick - request.metrics._detail[ metricName ]) );
    }

    function computeAverages(){
        structDelete( request.metrics, "_detail" );
        for (var k in request.metrics._instances){
            var total = 0;
            var count = 1;
            for (count; count <= arrayLen(request.metrics._instances[k]); count++){
                total += request.metrics._instances[k][count];
            }
            request.metrics[k] = round( total / count );
        }
        structDelete( request.metrics, "_instances" );
    }

</cfscript>

<cfset computeAverages() />
<cfdump var="#request.metrics#" label="metrics" />

Conclusion

For now, I'll be using the Hashset-based algorithm. If you think you can do better, by all means, post proof!

Published 2014-09-03 @ 01:00 in ColdFusion