That’s right, I’m still using DataTables. Why? Because they best fit the requirements for what I am doing. I’m also taking advantage of parallel processing via the TPL. A lot of the operations I need to do would be much easier via PLINQ and POCO’s such as joins, grouping and searching.

Because I’m working with many disparate DataTables and the join predicates are complex it just isn’t feasible (performance wise) to use DataRelation. To get a working prototype I decided to just use nested loops to find matching DataRows but I’ve found that nested loops when using the TPL results in poor performance and is generally a bad idea when working with large item sets anyway.

Normally I would build a HashTable and create a key from each row based on the columns in the predicate, then use that to match rows with a single loop. However, when using parallel loops, one has to take care to avoid concurrency/multi-threading issues that arise with non thread safe collections. Dictionary<K,V> is normally satisfactory when using them, but they are not thread safe. ConcurrentDictionary<K,V> would seem like an obvious candidate to replace the normal dictionary, but it buckles under the pressure. Memory consumption goes through the roof even for simple, small data processing and performance takes a serious hit.

Another issue with my row key by column approach is that the code has to be written each time and normally I use string concatenation for the keys. Instead, I wanted a nice way to build an index that I can easily query, but I also wanted a generic solution so any key could be used.

I went with the following method

public static Dictionary<TKey, List<DataRow>> BuildIndex<TKey>(this DataTable source, Func<DataRow, TKey> processRow)
{
    ConcurrentBag<Index<TKey, DataRow>> indexBag = new ConcurrentBag<Index<TKey, DataRow>>();

    Parallel.For(0, source.Rows.Count, new Action<int>((i) =>
        {
            indexBag.Add(new Index<TKey, DataRow>()
            {
                Key = processRow(source.Rows[i]),
                Row = source.Rows[i]
            });

        }));

    var result = (from i in indexBag
                    group i by i.Key into g
                    select g
                        ).ToDictionary(c => c.Key, d => d.Select(e=>e.Row).ToList() );

    return result;
}
private class Index<TKey, URow>
{
    public TKey Key { get; set; }
    public URow Row { get; set; }
}

This method  takes in a generic type TKey to define the index key and then requires a predicate to build the key from each row. The performance of this is excellent. The generic Index model is simply a nice way to organize the keys as things are being built. I chose to return a dictionary because I’m satisfied with the performance and functionality and since it’s only being used for reading and not writing, we don’t need a thread safe collection. I could have easily went with a IEnumerable<Index<TKey,URow>> but why bother?

Since this ends up as an extension method, using it is pretty simple.

private void Loop(DataTable table1, DataTable table2)
{
    //Build index for table2
    var idx = table2.BuildIndex<Tuple<int, long, int>>((dr) =>
    {
        return new Tuple<int, long, int>(
            dr.Field<int>("userid"),
            dr.Field<long>("orderid"),
            dr.Field<int>("domainid"));
    });

    //Loop over table1
    Parallel.For(0, table1.Rows.Count, new Action<int>((i) =>
    {
        var rowKey = new Tuple<int, long, int>(
            table1.Rows[i].Field<int>("userid"),
            table1.Rows[i].Field<long>("orderid"),
            table1.Rows[i].Field<int>("domainid"));

        if (idx.ContainsKey(rowKey))
        {
            //Exists
        }
        else
        {
            //Doesn't exist
        }
    }));
}

private void Join(DataTable table1, DataTable table2)
{
    //Build index for table1
    var idx1 = table1.BuildIndex<Tuple<int, long, int>>((dr) =>
    {
        return new Tuple<int, long, int>(
            dr.Field<int>("userid"),
            dr.Field<long>("orderid"),
            dr.Field<int>("domainid"));
    });

    //Build index for table2
    var idx2 = table2.BuildIndex<Tuple<int, long, int>>((dr) =>
    {
        return new Tuple<int, long, int>(
            dr.Field<int>("userid"),
            dr.Field<long>("orderid"),
            dr.Field<int>("domainid"));
    });

    //Do a join
    var joined = from i in idx1
                    join j in idx2 on i.Key equals j.Key
                    select new { Key = i.Key, t1Rows = i.Value, t2Rows = j.Value };

}

Tuples are great to use as keys because they’re immutable and have a smaller footprint than strings (generally speaking).

Conclusion: If I could use POCOs instead of DataTables I certainly would but for now I do what I can. I find the indexing method I’ve come up with a great way to work around performance and threading issues (even when only reading) of DataTables.

I’m interested in hearing your thoughts.

Advertisements