Making EQAO data easyR to work with

Academic data, like most data sets, usually consumes more time in cleaning and reshaping than in analysis and visualization. One of the appeals of R is the ability to re-use code, and it is in that spirit that I’ve written the following function – to make my life (and hopefully the lives of a few other education researchers) a little easier with basic re-coding tasks.

IEPs are a common category for grouping records, and in EQAO records they reside in separate columns.  The following function works with any dataframe that contains all of the SIF columns (it works with both Primary and Junior records): IEP.EQAO(dataframe)

The dataframe is returned with a new column that identifies, in plain language, the IEP that was assigned to each record.

IEP.EQAO <- function(x){
  # Concatenate the SIF IEP indicator columns into a single 13-character code
  # (the remaining SIF column names were truncated in the original post)
  x$IEP <- paste0(x$SIF_IEP)
  x$IEP <- ifelse(x$IEP == "0000000000000", "No IEP",
           ifelse(x$IEP == "1000000000000", "IEP no IPRC",
            ifelse(x$IEP == "1100000000000", "Behaviour",
             ifelse(x$IEP == "1010000000000", "Autism",
              ifelse(x$IEP == "1001000000000", "Deaf",
               ifelse(x$IEP == "1000100000000", "Language",
                ifelse(x$IEP == "1000010000000", "Speech",
                 ifelse(x$IEP == "1000001000000", "Learning",
                  ifelse(x$IEP == "1000000100000", "Giftedness",
                   ifelse(x$IEP == "1000000010000", "MildIntellectual",
                    ifelse(x$IEP == "1000000001000", "Developmental",
                     ifelse(x$IEP == "1000000000100", "Physical",
                      ifelse(x$IEP == "1000000000010", "Blind",
                       ifelse(x$IEP == "1000000000001", "Multiple", "BadCode"))))))))))))))
  x
}
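As a side note, the same recoding can be expressed with a named lookup vector instead of nested ifelse calls. Here is a minimal sketch using only three of the codes above and made-up data (the vector name and sample codes are hypothetical, not part of the function):

```r
# Named lookup vector: code string -> plain-language label
iep_labels <- c("0000000000000" = "No IEP",
                "1000000000000" = "IEP no IPRC",
                "1010000000000" = "Autism")

# Hypothetical concatenated codes, including one invalid entry
codes <- c("0000000000000", "1010000000000", "9999999999999")

# Index the lookup by code; anything unmatched becomes NA, then "BadCode"
labels <- unname(iep_labels[codes])
labels[] <- "BadCode"
labels
# "No IEP"  "Autism"  "BadCode"
```

The lookup table is easier to extend than a fourteen-deep ifelse chain, at the cost of departing from the original function’s style.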

The code is also available on Github here and is the beginning of what I hope will collaboratively evolve into an EQAO Package.

Future development will include re-coding for Secondary data files. Any comments or interest in collaboration are always welcome.

*Update May 29: Code has been modified to work with any ISD file (3,6,9,10) going back to 2011.

Posted in R | Tagged , | 2 Comments

Find out….and teach it.

For any group interested in engaging in “Results Based Accountability” one of the first steps is to establish a common language.  The development of a common language is not to benefit people “on the inside”.  Instead, it is the first act of accountability that makes the work transparent to everyone “on the outside”.  No acronyms, no complex terms, no jargon.  Language that is easily understood by anyone you may be chatting with.

In a speech to the British Institute of Management in 1977, Kingman Brewster Jr (an educator, president of Yale University, and American diplomat) commented that “Incomprehensible jargon is the hallmark of a profession.”  There is no doubt that education is a profession, but it left me wondering: what would “Edu speak” look like if it were re-written into straightforward, common language?

  • Differentiated Learning: Find out what each student doesn’t know and teach it to them.
  • Gap Closing: Find out what groups of students don’t know (but would be expected to know) and teach it to them.
  • Interventions: Teaching students things they do not know (but need to know).
  • Inquiry based learning: Find out what a student(s) doesn’t know/wants to know and explore the answer alongside them.
  • Diagnostic Assessment: Find out what a student doesn’t know.  Use this information to know what to teach them.
  • Formative Assessment: Find out what a student still doesn’t know.  Use this information to know what to teach them.
  • Summative Assessment: Find out what a student knows.  Give this information to the next teacher so they know what to teach them.

The pattern is easy to see with these examples.  Have I made it too simple? There is the action of “finding out” and the act of arriving at new knowledge by “teaching them” (which includes exploration, inquiry etc.).

After years of academic study, practical experience and in-services, educators are quickly drawn to the second action of “teaching them”.  Hours are devoted to long range plans and lesson plans.  But who is better off if those plans do not relate to an actual student need? It is the “finding out” (often called assessment or evaluation) that takes time and, more importantly, determines what action is going to be most meaningful.

The statement “If you do not know where you are going, then any road will get you there” is often attributed to Lewis Carroll.  However, the exchange this misquotation is based on is more interesting:

“Would you tell me, please, which way I ought to go from here?”
“That depends a good deal on where you want to get to,” said the Cat.
“I don’t much care where–” said Alice.
“Then it doesn’t matter which way you go,” said the Cat.
“–so long as I get SOMEWHERE,” Alice added as an explanation.
“Oh, you’re sure to do that,” said the Cat, “if you only walk long enough.”

Teaching will get you somewhere.  There will be lots of hours in class, lots of plans written, lots of expectations of students and you will always arrive at a new year with a new class.  But what is needed by the students in the class right now? If you hear the statement “You should know this by now” you can be certain that the person saying it recognizes a need/a learning gap.  The only question left is whether action will follow.

Knowledge and action.  You shouldn’t have one without the other.

Knowledge without action is trivia.
Action without knowledge is busywork.

At the end of the day, it should be easy to see, easy to understand, and easy to explain in common language.



Highlights from OERS16

The 11th annual Ontario Education Research Symposium was held from February 9th to February 11th with the theme “Networking & Partnerships: The Core of Achieving Excellence in Education”. Over 500 people from networks, organizations and stakeholder groups across the education sector participated in the event, which featured:

  • 27 speakers,
  • 18 workshops,
  • 6 Mobilizing sessions,
  • 4 Provocative Speaker sessions,
  • 1 Fireside chat,
  • 1 spectacular student jazz band, and
  • students as symposium attendees

Throughout the conference, participants were active on Twitter using #OERS16. As with previous years, I have compiled the tweets over the course of the Symposium using Martin Hawksey’s TAGS 6.0 utility (click here to access the tweet archive).

In an attempt to make the tweets more useful and accessible, this year I used R to extract links that were shared (click here for more information on the process) and then created a series of pdf resources that compile the shared links:

Tweeting Trivia

At the time of this summary there were 2,031 tweets from 298 different people. Using the interactive viewer with TAGS 6.0, we can see what this kind of networking looks like:


As you can see, the majority of tweets are isolated, with only a few key people connecting and interacting through Twitter (the largest names have the most connecting lines) – though that isn’t to say that Twitter hasn’t facilitated face-to-face interactions.

Over the course of the first day of the conference, I took a series of snapshots of a sentiment analysis (using an online utility “Sentiment Viz: Tweet Sentiment Visualization” developed at NC State University).  I created an animated gif to see how sentiment changed at five points in the day (morning, morning break, lunch, afternoon break, evening):

OERS animated sentiment

The left half of the oval (blue dots) represents tweets with “unpleasant” terms and the right half (green dots) represents tweets with “pleasant” terms.  Dots closer to the top of the oval represent tweets with “active” terms, while dots closer to the bottom represent tweets with “subdued” terms.

Throughout the entire day the tweets were very positive; those scoring more negative in the sentiment analysis were generally tweets relaying the challenges that many of the speakers are addressing through their work.  (Note: in the interactive version you can highlight a dot to see the underlying tweet, with the terms that were coded in the sentiment analysis highlighted.)

Top Tweeters

This year, the top 10 tweeters from #OERS16 were:

  • @DrKatinaPollock (182)
  • @CarolCampbell4 (157)
  • @avanbarn (151)
  • @ResearchChat (109)
  • @naturallycaren (101)
  • @Jan__Murphy (98)
  • @GregRousell (75)
  • @KNAER_RECRAE (66)
  • @OISENews (50)
  • @HeidiSiwak (41)

However, of those 2,031 tweets, 47% were retweets (tweets that begin with RT) leaving 1,102 original tweets. Considered from the perspective of original tweets vs. retweets, the top tweeters begin to look very different:


This adjustment highlights two different but important approaches to the use of social media.  On the one hand, @avanbarn’s generation of so much “original content” is an example of using social media for note-taking (paraphrasing presenters, highlighting speaking points, sharing links to referenced material, sharing reflections and questions inspired by a presenter).  On the other hand, @DrKatinaPollock’s and @CarolCampbell4’s high levels of retweets are examples of cross-network dissemination. As these two approaches work in tandem, the key messages of the symposium presenters reach far beyond the room of attendees and broaden opportunities for discussion and additional inquiry.

Adjusted for the percentage of original tweets, the top ten tweeters become:

  • @avanbarn 147 (97%)
  • @ResearchChat 106 (97%)
  • @GregRousell 63 (84%)
  • @KNAER_RECRAE 48 (73%)
  • @Jan__Murphy 65 (66%)
  • @HeidiSiwak 25 (61%)
  • @naturallycaren 61 (60%)
  • @DrKatinaPollock 65 (36%)
  • @CarolCampbell4 54 (34%)
  • @OISENews 11 (22%)
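For readers curious how such counts are derived, original tweets and retweets can be separated with a simple prefix test on the tweet text. A minimal sketch with made-up data (in practice the screen_name and text columns come from the TAGS archive):

```r
# Hypothetical mini-archive: one row per tweet
tweets <- data.frame(
  screen_name = c("alice", "alice", "bob", "bob", "bob"),
  text = c("RT @x: great talk", "My notes from session 1",
           "RT @y: slides here", "RT @z: worth a read", "Keynote reflections"),
  stringsAsFactors = FALSE
)

# A tweet is treated as a retweet if its text starts with "RT"
tweets$original <- !grepl("^RT", tweets$text)

# Original-tweet count and share per user
orig_n  <- tapply(tweets$original, tweets$screen_name, sum)
total_n <- tapply(tweets$original, tweets$screen_name, length)
round(100 * orig_n / total_n)
# alice 50, bob 33
```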

@GregRousell has also been archiving tweets from #OERS16 using the R package twitteR. In an upcoming post on the Data User Group, we will share a detailed overview of each of our approaches along with the benefits and challenges of each.


R as GIS: Working Out Projections

In addition to the convenience that R offers for data cleaning, analysis and automating reporting, it also has the capacity to complete a variety of mapping (GIS) tasks.  Following are a few R snippets to help get started using the example of plotting schools (as a point file) within their catchment areas (boundaries described by polygons).

Shapefiles: Polygons

The package maptools has a couple of useful functions that will load ESRI shape files into R.  The first is readShapePoly() which, like read.csv, loads polygon .shp files into R as an object.  The second is proj4string() which defines the projection of the shape file:

#load the library
library(maptools)

#load the School_Boundary.shp file into School.bndry object

School.bndry <- readShapePoly("School_Boundary") #loads School_Boundary.shp

#attach the projection used by the School_Boundary.shp file

proj4string(School.bndry) <- "+proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs"

This last step, attaching the projection, requires you to:

  1. Know the projection that was used in the creation of the shape file
  2. Have this projection in proj4 format

If your .shp file also has a .prj file, you can use QGIS to get the proj4 string by:

  • Opening the shapefile in QGIS
  • Right clicking on the boundary file and selecting properties
  • Clicking on “Metadata”
  • Scrolling to the bottom of the window you will find the proj4 text.  Copy and paste it into R

Working without projection information

If your .shp file does NOT have a .prj file things are a little more challenging.  Here are a few suggestions:

  1. If you have another file created by the same organization, check to see what projection it uses.  Organizations tend to be consistent in their file creation and will likely use the same projections from project-to-project and data product-to-data product.
  2. Go to the following website, zoom in to your location on the map and click on “PROJ4”:
  3. If you are in Southern Ontario, most of the files I have come across work with UTM NAD 83 zone 17 which is the following in proj4:
    +proj=utm +zone=17 +ellps=GRS80 +datum=NAD83 +units=m +no_defs
  4. Brute Force: Open a shape file in QGIS that has a defined projection.  Open the “mystery” shape file and change its projection “on the fly” until you find one that lines up correctly with your first file.  Whichever one lines up is likely the projection you need to use.
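As a further option (an addition of mine, not from the original post): when a .prj file does exist, the rgdal package’s readOGR() reads it automatically, so the CRS arrives already attached and no manual proj4string() assignment is needed:

```r
library(rgdal)

# readOGR() picks up School_Boundary.prj automatically when it exists
School.bndry <- readOGR(dsn = ".", layer = "School_Boundary")

proj4string(School.bndry)  # inspect the CRS that was read from the .prj
```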

Shapefiles: Points

The point file works in a similar manner.  This time the shape file is loaded using the maptools function readShapePoints()

School.point <- readShapePoints("School_points")   #loads School_points.shp

proj4string() is used again to define the projection:

proj4string(School.point) <- "+proj=utm +zone=17 +ellps=GRS80 +datum=NAD83 +units=m +no_defs "

Dealing with different projections

If the polygon and point files had the same projection I would be ready to create a map.  However, the two files have different projections and need to be transformed to a common one (whichever I choose).  In this instance, I’ll create two new objects that transform both files to NAD83 UTM zone 17 (both are transformed here for illustrative purposes, even though the point file is already in that projection):

School.bndry.83 = spTransform(School.bndry, CRS("+init=epsg:26917"))

School.point.83 = spTransform(School.point, CRS("+init=epsg:26917"))

Plotting the Maps with ggplot2

If you are familiar with the ggplot2 package, you will be pleased to know that in addition to plotting histograms, scatterplots etc. it can also plot maps with all the same features. However, to take advantage of ggplot there is one additional step required for the point file. It needs to be converted to a dataframe:

School.point.83.df <-

With the shape files loaded, projections defined and the point file available as a dataframe the data is now ready to be plotted with ggplot:

ggplot(School.bndry.83) +            #Use the school boundary data 
     aes(long, lat, group = group) + 
     geom_polygon() +                #to draw the polygons 
     geom_path(color = "white") +    #make the border of the polygons white 
     geom_point(data = School.point.83.df,  #add the points to the map 
                aes(X, Y, group = NULL, fill = NULL), 
                color = "blue", alpha = I(8/10))  #make each point blue with some transparency 

Full Code:


#load the libraries
library(maptools)
library(ggplot2)

#load the School_Boundary.shp file into School.bndry object
School.bndry <- readShapePoly("School_Boundary")

#attach the projection used by the School_Boundary.shp file
proj4string(School.bndry) <- "+proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs"

School.point <- readShapePoints("School_points")   #loads School_points.shp
proj4string(School.point) <- "+proj=utm +zone=17 +ellps=GRS80 +datum=NAD83 +units=m +no_defs"

School.bndry.83 = spTransform(School.bndry, CRS("+init=epsg:26917"))
School.point.83 = spTransform(School.point, CRS("+init=epsg:26917"))

School.point.83.df <-

ggplot(School.bndry.83) +              #Use the school boundary data
     aes(long, lat, group = group) +
     geom_polygon() +                  #to draw the polygons
     geom_path(color = "white") +      #make the border of the polygons white
     geom_point(data = School.point.83.df,   #add the points to the map
           aes(X, Y, group = NULL, fill = NULL),
           color = "blue", alpha = I(8/10))      #make each point blue with some transparency

AERO 2015: Making Shared Twitter Links Useful with R

On Friday December 4th, AERO hosted its annual fall conference at the Old Mill.  The speakers included:

  • Dr. Joe Kim, McMaster University, “The Science of Durable Learning”
  • Don Buchanan, Hamilton Wentworth  DSB , E-BEST, “Putting education in ‘educational’ apps: Lessons from the science of learning”
  • Dr. Daniel Ansari, Western University, “Building blocks of mathematical abilities: Evidence from brain and behaviour”

Twitter was again a staple at the conference (#AEROAOCE), with backroom discussions and the sharing and extending of resources and articles highlighted by the speakers. As with previous years, an archive of the social media exchanges was created using Martin Hawksey’s TAGS 6.0 utility.  Twitterfall was also used as a live Twitter feed so everyone could see what was resonating.

Although the compilation of tweets is straightforward, it is seldom in a format that I would share with other stakeholders.  To facilitate the cleaning process, I use a small R file that extracts the shared URLs and then expands them from their shortened forms. Following are the code snippets with descriptions of each step.  If you are more interested in the resources that were shared than in the process used to clean them, scroll down to the bottom of this post.

The following code is saved as the file twittercleaner.r.  Each time I use it I change the name of the dataframes to reflect the conference tweets that have been compiled (in this case AERO).  The file begins by loading the three packages dplyr, stringr and longurl:

library(dplyr)
library(stringr)
library(longurl)
Load the data file containing the tweets (a csv extract from the TAGS 6.0 archive):

AERO <- read.csv("C:/00_Data/AERO2015_Enduring_Learning.csv")

Identify the characters that may be contained in a url:

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
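To see what the pattern captures, here is a quick check against a made-up tweet (using base R’s regmatches() so the sketch stands alone; the post itself uses stringr for the real extraction):

```r
url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

# Hypothetical tweet text containing a shortened link
tweet <- "Great slides from the keynote #AEROAOCE"

# Extract the first URL match from the tweet text
regmatches(tweet, regexpr(url_pattern, tweet, perl = TRUE))
# ""
```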

Use the stringr package to create a new column in the dataframe and extract the urls into it:

AERO$ContentURL <- str_extract(AERO$text, url_pattern)

Use dplyr to create a new dataframe, then (%>%) remove the rows where no URL was found (! and then (%>%) keep only the column with the URLs:

AEROurl <- AERO %>%
filter(! %>%
select(ContentURL)

Remove the duplicate URLs (keep unique URLs):

AEROurl <- unique(AEROurl$ContentURL)

Remove the rownames from the table:

attr(AEROurl, "rownames") <- NULL

Up to this point the URLs included in the tweets are still in their shortened form.  The following step uses the longurl package to expand them:

AEROExpanded <- expand_urls(AEROurl, check=TRUE, warn=TRUE)

Remove URLs that could not be expanded (and result in a Null value):

AEROExpanded <- filter(AEROExpanded, !

Create a .csv file containing the extracted and expanded URLs:

write.csv(AEROExpanded, "C:/AEROurl.csv")

Full version:

library(dplyr)
library(stringr)
library(longurl)

AERO <- read.csv("C:/00_Data/AERO2015_Enduring_Learning.csv")

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

AERO$ContentURL <- str_extract(AERO$text, url_pattern)
AEROurl <- AERO %>%
        filter(! %>%
        select(ContentURL)

AEROurl <- unique(AEROurl$ContentURL)
attr(AEROurl, "rownames") <- NULL

AEROExpanded <- expand_urls(AEROurl, check=TRUE, warn=TRUE)

AEROExpanded <- filter(AEROExpanded, !

write.csv(AEROExpanded, "C:/AEROurl.csv")

Results (sorted):
