When the British use the term sorting, they are talking about fixing something, e.g. "I'll sort you out" or "I'll get it sorted." If you are an American, when you hear the term sorting you probably have visions of cleaning your room, or perhaps more specifically, trying to find that one Lego piece you need to complete your masterpiece.
When we talk about sorting in an analytic context, we are talking about structuring large amounts of information to support further analysis. Sorting in this context is a preparatory function.
This method is effective when the raw information can be broken down into definable broad categories, which can in turn be broken down into subcategories. If you have done work in database creation or administration, this is likely a familiar concept to you. And if you're thinking that this method lends itself to use in a program such as Excel or Access, you are correct.
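The category-and-subcategory idea can be sketched in a few lines of code. This is just an illustration of the concept, not part of the Chicago dataset; the field names and values here are made up for the example.

```python
from collections import defaultdict

# Hypothetical incident records; the field names and values are
# illustrative only, not drawn from any real dataset.
records = [
    {"category": "Theft", "subcategory": "Pocket-picking", "id": 1},
    {"category": "Theft", "subcategory": "Shoplifting", "id": 2},
    {"category": "Assault", "subcategory": "Simple", "id": 3},
]

# Sort records into broad categories, then subcategories -- the same
# structure a database schema would normalize into related tables.
grouped = defaultdict(lambda: defaultdict(list))
for rec in records:
    grouped[rec["category"]][rec["subcategory"]].append(rec["id"])

for category, subs in grouped.items():
    for subcategory, ids in subs.items():
        print(category, "->", subcategory, "->", ids)
```

Once the data is in this shape, differences between categories line up side by side, which is the whole point of sorting as a preparatory step.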
When things are lined up "dress right dress," it is easy to see differences or trends that go unnoticed when the analyst is "zoomed in" on one piece of the puzzle.
The danger, of course, is improperly categorizing data. That will take a small problem and magnify it.
So let's look at an example.
Let me pause right here and say that while I'm proficient with Excel, I don't see myself as a power user. If you are reading this and have a better way of doing things, I'm always interested in hearing about it.
I stumbled on a site called Data.gov that has a crazy amount of information free to download. I'm not going to vouch for the reliability or credibility of this data, but for the purpose of this exercise it will work.
I found a dataset containing crimes reported to the City of Chicago from 2001 to the present. This satisfied the requirement for a large dataset. In fact, the data is so large it can't be opened in Excel (it contains more rows than Excel's limit of 1,048,576).
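If you want to confirm a file is over the row limit before fighting with Excel, you can stream it with Python's standard `csv` module without loading it into memory. The path name below is a placeholder, not the actual download name.

```python
import csv

def count_rows(path):
    """Count data rows in a CSV by streaming it, so even a file far too
    large for Excel never has to fit in memory at once."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)

# Example (placeholder filename):
# count_rows("crimes.csv")
```

Anything over 1,048,576 means you'll need to split or summarize the file before Excel will open it.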
So the first task is to get the data into a manageable size to use with the tools I have available. Like anything else, there are multiple ways to do this. I'm going to use the Data Model method described here. I didn't import it into a pivot table, just the actual worksheet.
Just looking at the sheet, we see several things that stick out, most notably rows with data that looks to be corrupt. In rows 46, 76, 153 and others, we see that the use of quotation marks screwed up the import from the CSV. Zooming out on the sheet, you can see that each time this happens there is data in the X and Y columns. So let's sort the sheet to see if there are any other glitches like this.
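The underlying problem is how embedded quotation marks interact with comma splitting. A quick sketch with an invented sample line shows the difference between naively splitting on commas and using a real CSV parser:

```python
import csv
import io

# An invented line in the spirit of the dataset, where one field
# contains commas inside quotation marks.
line = '123,THEFT,"SCHOOL, PUBLIC, BUILDING",41.88,-87.63\n'

# A naive comma split breaks the quoted field apart, shifting every
# later value one or more columns to the right.
naive = line.strip().split(",")

# A proper CSV parser honors the quotes and keeps the columns aligned.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), "pieces vs", len(parsed), "fields")
```

When an importer falls into the naive behavior, you get exactly the symptom on the sheet: the row's later values spill rightward into columns where they don't belong.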
Luckily, Excel makes this easy. Just select the whole sheet, then click Sort & Filter, then Filter. Looking at the resulting drop-down, there are two instances where quotation marks are used, so let's select both of them.
The resulting sort shows that our hunch about the quotation marks was correct: every time a quotation mark was used, it broke the import. So let's see if we can fix that. There may be an easier way, but I sorted column H again to show just the "School" entries. Unfortunately, that returned 24,488 results. So I replaced all of column H with "Public school building," deleted the contents of columns I and J, and then moved the data in columns K through Y over. I repeated this for the other set of data with quotes.
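The same repair can be scripted. This is a sketch under the assumption, matching the worksheet fix above, that the bad import split one quoted field into three cells starting at column H, pushing everything after it two columns to the right; the helper name and default label are mine, not part of the dataset.

```python
# Column letters map to 0-based indexes: H -> 7, I -> 8, J -> 9.
def repair_row(row, bad_col=7, extra_cells=2, label="Public school building"):
    """If the cell at bad_col still carries the stray opening quote from
    the broken import, collapse the fragmented field into one consistent
    label and pull the trailing columns back into place."""
    if row[bad_col].startswith('"'):
        return row[:bad_col] + [label] + row[bad_col + extra_cells + 1:]
    return row  # row imported cleanly; leave it alone
```

Applied to every row, this does in one pass what the replace-and-shift steps did by hand: consistent terminology in column H, and columns K through Y moved back where they belong.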
The data is now sorted, with consistent terminology in each category, or column.
Depending on the question you are attempting to answer, this may be as far as you need to go.
Overall, it's a simple concept. But as Old Carl said, everything in war is simple, and yet the simplest thing is difficult to do.
We will keep this dataset to use with future analytic techniques.