It has been a busy week for data cleaning. The data for one project was spread across several different files and formats. Reorganizing data takes a lot of time and patience, and it is easy to get lost or spin your wheels at this stage.
To keep things on track I take a modular approach to extracting and aligning data: I divide each step (or small series of steps) into separate queries and then use those queries to build new ones. The advantage of this approach is twofold: you document the methodology (business rules, decisions and criteria), and you create opportunities to see how the data changes with each business rule or criterion you apply. For example:
Table_Data contains all the data records
Query_1 selects a subgroup from Table_Data
Query_2 joins the Query_1 subgroup to another data set or applies a new business rule
This becomes very useful if you are challenged on the accuracy or relevance of the data table. Not only can you show how the data was handled, but you can also show how the outcomes change as different criteria and business rules are applied. Any change you make to a query ripples through the rest and updates the data table as quickly as you can click on it.
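The layered approach above can be sketched as a chain of views, each building on the last. This is a minimal illustration using Python's built-in sqlite3 module; the table, column names and sample rows are all made up for the example (note `grp` stands in for a group column, since `group` is a reserved word in SQL):

```python
import sqlite3

# Hypothetical data standing in for Table_Data
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Table_Data (location TEXT, grp TEXT, testscore INTEGER)")
con.executemany("INSERT INTO Table_Data VALUES (?, ?, ?)",
                [("North", "A", 72), ("North", "B", 55), ("South", "A", 90)])

# Query_1: select a subgroup from Table_Data
con.execute("CREATE VIEW Query_1 AS SELECT * FROM Table_Data WHERE grp = 'A'")

# Query_2: apply a further business rule to the Query_1 subgroup
con.execute("CREATE VIEW Query_2 AS SELECT * FROM Query_1 WHERE testscore >= 80")

print(con.execute("SELECT * FROM Query_2").fetchall())  # [('South', 'A', 90)]
```

Because Query_2 is defined on top of Query_1 rather than on a copied table, editing the rule in Query_1 automatically flows through to Query_2 the next time it is run.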
When cleaning data I tend to work in SQL (Structured Query Language). SQL is supported in many different software packages and database systems. For those who are familiar with SPSS, SQL is similar to the syntax view, where you can see the underlying text that describes the actions and sequences applied to the data.
One of the things I needed to do with this week’s data was to extract a small group of records from four different years of assessments (contained in four different files). To do this I broke the process down into two steps: extraction and aggregation. First, I created an extraction query for each of the four data sets. To avoid confusion, and to allow for the possibility of aggregating by year, I needed to add a column to each query identifying the year of the assessment data. Adding a constant (something I hadn’t done before) is handled in a very straightforward way in the selection line:
SELECT location, group, 2006 AS Year, testscore_1, testscore_2
I wanted the constant to appear between the group field and the first testscore field. In this example the constant is the year 2006, and the field will be labelled “Year” in the table. Similar queries were created for 2007, 2008 and 2009.
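Here is the constant-column trick run end to end, again via sqlite3. The table name `tbl_2006` and the sample row are invented for the sketch (`grp` replaces the reserved word `group`); the point is that the literal `2006` becomes a column in the result, positioned exactly where it appears in the select list:

```python
import sqlite3

# Hypothetical 2006 file loaded into a table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl_2006 (location TEXT, grp TEXT, "
            "testscore_1 INTEGER, testscore_2 INTEGER)")
con.execute("INSERT INTO tbl_2006 VALUES ('North', 'A', 72, 68)")

# The literal 2006 is selected as a constant column labelled Year,
# placed between the group field and the first testscore field
rows = con.execute(
    "SELECT location, grp, 2006 AS Year, testscore_1, testscore_2 FROM tbl_2006"
).fetchall()
print(rows)  # [('North', 'A', 2006, 72, 68)]
```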
With the data from all four assessments contained in four queries, the next step was to create a new table for the aggregated data. Using UNION ALL, the records were appended into a single table:
SELECT * FROM qry_Year2006
UNION ALL
SELECT * FROM qry_Year2007
UNION ALL
SELECT * FROM qry_Year2008
UNION ALL
SELECT * FROM qry_Year2009;
In this SQL segment, the * is a wildcard meaning “select all the columns”. Keep in mind that this only works if all the queries have the same number of columns; it also assumes the format (column order, data types) is the same in each query.
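The append step can be verified with a small sqlite3 sketch. For brevity this builds only two of the four per-year queries, with made-up table and score values (`grp` again in place of the reserved word `group`); an ORDER BY is added so the result order is deterministic:

```python
import sqlite3

# Two hypothetical per-year tables and their extraction queries
con = sqlite3.connect(":memory:")
for year, score in ((2006, 72), (2007, 81)):
    con.execute(f"CREATE TABLE tbl_{year} (location TEXT, grp TEXT, testscore INTEGER)")
    con.execute(f"INSERT INTO tbl_{year} VALUES ('North', 'A', {score})")
    con.execute(f"CREATE VIEW qry_Year{year} AS "
                f"SELECT location, grp, {year} AS Year, testscore FROM tbl_{year}")

# UNION ALL appends the rows from each query into one result;
# every query must have the same column count and column order
rows = con.execute(
    "SELECT * FROM qry_Year2006 "
    "UNION ALL "
    "SELECT * FROM qry_Year2007 "
    "ORDER BY Year"
).fetchall()
print(rows)  # [('North', 'A', 2006, 72), ('North', 'A', 2007, 81)]
```

UNION ALL, unlike plain UNION, keeps duplicate rows, which is what you want when appending assessment records rather than deduplicating them.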
While this is far from the most efficient approach (I can imagine DBAs and computer science students screaming in horror as they read this), it saves a lot of time later when requests come in to modify the construction of the table. What it loses in elegance, it gains in documentation, and it makes reviews (and changes) of methodology much easier.