I have a lookup table that only has 4 columns (bool, date, number, string) with 9,998,081 rows.
The CSV is relatively small (285 MB), but it "blows up" when I upload it to CartoDB. I suspect this is because of the other columns that CartoDB adds (the_geom, created_at, updated_at, etc.).
It eats up the quota of my account and the table becomes 819.62 MB.
Would it be possible to tell CartoDB that this is just a lookup table that I'll use in joins, and I won't need these extra columns?
I had a similar problem where datasets in CartoDB would double or triple in size. The solution was to run the command VACUUM FULL in the SQL console.
Hypermap registry: an open source, standards-based geospatial registry and search platform
On the web there is a large number of useful geospatial datasets available, exposed via web map services using open standards or open protocols. Just as web search engines enable users to reliably search and find relevant documents, a similar capability is needed to return the most useful and reliable geospatial datasets.
Hypermap Registry is an open source platform, developed by the Center for Geographic Analysis (CGA) of Harvard University, which attempts to address the general problem of geospatial data search and discovery.
5 Programming for GIScience and Spatial Analysis
This week is, again, heavily practical-oriented, with the practical taking up the majority of our time.
You’ll find many additional explanations of key programming concepts - such as selection, slicing and pipes - integrated within this practical.
As always, we have broken the content into smaller chunks to help you take breaks and come back to it as and when you can over the next week.
Week 5 in Geocomp
This week’s content introduces you to the foundational concepts associated with Programming for Spatial Data Analysis, where we have three new areas of work to focus on:
- Data wrangling in programming (using indexing, selection and slicing)
- Using spatial libraries in R to store and manage spatial data
- Using visualisation libraries in R to map spatial data
This week’s content is split into 4 parts:
This week, we have 2 lectures (15 mins and 40 mins), and an additional instructional video to help you with the completion of this week’s practical.
A single Key Reading is found towards the end of the workshop.
After promising to set a Mini-Project during Reading Week, I appreciate the delivery of this material is late, so I will not be setting the Project as promised. Instead, I would like you to spend time going through the practical and experimenting with the visualisation code at the end. There is also an extension that I would like you to complete, if possible over Reading Week.
Part 4 is, as usual, the main part of analysis for our Practical for this week - all programming this week is within Part 4, which is a little longer than usual to account for this.
If you have been unable to download R-Studio Desktop or cannot access it via Desktop@UCL Anywhere, you will have access to our R-Studio Server website instead. Instructions on how to access this are provided in the previous week’s workshop.
By the end of this week, you should be able to:
- Understand how spatial analysis is being used within data science applications
- Recognise the differences and uses of GUI GIS software versus CLI GIS software
- Understand which libraries are required for spatial analysis in R/R-Studio
- Conduct basic data wrangling in the form of selection and slicing
- Create a map using the tmap visualisation library
We will continue to build on the data analysis we completed last week and look to further understand crime in London by looking at its prevalence on a month-by-month basis but this time, from a spatial perspective.
Spatial Analysis for Data Science Research
Over a decade ago, when I first became involved in the GIScience world, the term “data science” barely existed - fast-forward to today, and not a day goes by without hearing the phrase and the hubris surrounding its potential to help solve the many grand challenges the modern world faces.
Whilst there is much hubris about (and not a huge amount of evidence of) data science’s ability to “save the world”, on a more fundamental level, data science, and the community of practice associated with it, is having a transformational impact on how we think about and “do” data-focused (and primarily quantitative) research.
For us geographers and geographically-minded analysts, our traditional use of GIScience and spatial analysis is most certainly not immune to this transformation - many of the datasets associated with data science do have a locational component and thus we have seen an increasing interest in and entry into the spatial analysis field from more “generalised” data analysts or data scientists.
Furthermore, the increasing popularity of data science amongst ourselves as geographers is also having a significant impact on how we “do” spatial analysis.
We have, as a result, seen a greater focus on the use of programming as a primary tool within spatial analysis, concomitant to a new prioritisation of openness and reproducibility in our research and documentation of our results.
Hence why, a decade later, an Undergraduate module on GIScience now focuses on “Geocomputation”, a precursor to spatial data science, rather than a more generalised understanding of the GIS industry and the traditional applications of GIS and spatial analysis, such as:
- Supply Chain Management
- Generalised Urban Planning
- Environmental modelling
Whilst these traditional applications and industries still utilise GIS software (and there is substantial potential to build careers in these areas, particularly through the various Graduate Schemes offered by related companies such as Arup, Mott MacDonald and Esri, to name a few), with data science emerging as a dominant area of growth in spatial analysis, it is important to prioritise the skills you will need to compete in the relevant sectors that are hiring “spatial data scientists”, i.e. learning to code effectively and efficiently.
Once you have acquired these skills, the outstanding question becomes: how will I apply them in my future career?
Whilst the majority of spatial analysis using programming is not too different from spatial analysis using GIS software, the addition of programming skills has opened up spatial analysis to many different applications and, of course, novel datasets.
Within academia and research itself, we see the use of spatial analysis within data science research for:
1. Analysis of distributions, patterns, trends and relationships within novel datasets
The most basic application of spatial analysis - but one that now utilises large-scale novel datasets, such as mobile phone data, social media posts and other human ‘sensor data’.
To get a better understanding of the various applications, a key recommendation is to look at Carto’s introduction video to their Spatial Data Science conference held this year, where they highlighted how spatial data science has been used for various applications within COVID-19.
As a commercial firm, they seem to have a bit of cash to make great videos, but I’d also recommend looking at the various talks held at the conference this year that show the diversity of applications using spatial data science from the various participants.
Carto’s take on the use of spatial data science for COVID-19
2. Supplementing the analysis of traditional datasets for augmented information
Adding a ‘spatial data science’ edge to traditional analysis, supplementing “small” datasets with big data (or vice versa) to provide new insights into both datasets.
An example of this is the recent combination of geodemographic classification (Week 9) with big data information on mobility (e.g. mobile phone data, travel card data) to understand different types of commuter flows and thinking through how this can inform better urban planning policy.
A recent paper that did just this is from Liu and Cheng (2020), with the following abstract:
Plentiful studies have discussed the potential applications of contactless smart card from understanding interchange patterns to transit network analysis and user classifications. However, the incomplete and anonymous nature of the smart card data inherently limit the interpretations and understanding of the findings, which further limit planning implementations. Geodemographics, as ‘an analysis of people by where they live’, can be utilised as a promising supplement to provide contextual information to transport planning. This paper develops a methodological framework that conjointly integrates personalised smart card data with open geodemographics so as to pursue a better understanding of the traveller’s behaviours. It adopts a text mining technology, latent Dirichlet allocation modelling, to extract the transit patterns from the personalised smart card data and then use the open geodemographics derived from census data to enhance the interpretation of the patterns. Moreover, it presents night tube as an example to illustrate its potential usefulness in public transport planning.
(Yunzhe Liu & Tao Cheng (2020) Understanding public transit patterns with open geodemographics to facilitate public transport planning, Transportmetrica A: Transport Science, 16:1, 76-103, DOI: 10.1080/23249935.2018.1493549)
We’ll be looking at this in a little more detail in Week 9.
3. Creation of new datasets from both traditional and novel datasets
Opening up spatial analysis to novel datasets has enabled many researchers to identify opportunities in the creation of new datasets that can ‘proxy’ certain human behaviours and characteristics for which we currently either have no data, or for which the data is old, insufficient or not at the right scale.
A good example of this is my previous research group at the University of Southampton: Worldpop.
Worldpop create population and socio-economic datasets for every country across the world utilising (primarily) bayesian modelling approaches alongside both census data and more innovative datasets, such as mobile phone data or tweets.
You can watch this incredibly cheesy but informative video made by Microsoft about the group below:
What does Worldpop do?
There are plenty of examples in recent GIS and spatial analysis research where new datasets are/have been created for use in similar applications. Another example is Facebook, which is using a lot of its social network data to create mobility and social connectivity datasets with its ‘Data For Good’ platform (see more here).
4. Creation of new methods and datasets
Finally, the intersection of data science and spatial analysis has also seen geographers adapt data science techniques to create new methods and analytical algorithms to pursue the creation of more new datasets and/or new insight.
An example of this is the increased use and adaptation of the DB-Scan algorithm (Week 7) within urban analytics, seen within the various papers:
Xinyi Liu, Qunying Huang & Song Gao (2019) Exploring the uncertainty of activity zone detection using digital footprints with multi-scaled DBSCAN, International Journal of Geographical Information Science, 33:6, 1196-1223, DOI: 10.1080/13658816.2018.1563301
Arribas-Bel, D., Garcia-López, M. À., & Viladecans-Marsal, E. (2019). Building (s and) cities: Delineating urban areas with a machine learning algorithm. Journal of Urban Economics, 103217.
Jochem, W. C., Leasure, D. R., Pannell, O., Chamberlain, H. R., Jones, P., & Tatem, A. J. (2020). Classifying settlement types from multi-scale spatial patterns of building footprints. Environment and Planning B: Urban Analytics and City Science. https://doi.org/10.1177/2399808320921208
Beyond these research-oriented applications, we can also think of many ‘data sciencey’ applications that we use in our day to day lives that use spatial analysis as a key component.
From the network analysis behind route-planning within mapping applications to searching travel apps for a new cafe or restaurant to try, not only does spatial analysis underpin much of the distance and location-based metrics these applications rely on, it also helps to integrate many of the novel datasets - such as traffic estimations or social media posts - that augment these distance metrics and become invaluable to our own decision-making.
Applications of Spatial Analysis with ‘Data Science’ Applications
A short blog piece by Esri on the insight that can be derived from spatial analysis can be found here.
Spatial Analysis Software and Programming
This week - and the previous - is your first introduction in our module to using R-Studio for the management and analysis of spatial data. Prior to this, we’ve been using traditional GIS software in the form of QGIS.
As we’ve suggested above, the increasing popularity of data science is having a significant impact on how we “do” spatial analysis, with a shift in focus to using programming as our primary tool rather than traditional GIS-GUI software.
GUI-GIS software still has its place and purpose, particularly in the wider GIScience and GIS industry - but when we come to think of data science, the command line has become the default.
Behind this shift in focus, alongside the need to have a tool that is capable of handling large datasets, has been a focus on improving openness and reproducibility within spatial analysis research.
As Brunsdon and Comber (2020) propose:
Notions of scientific openness (open data, open code and open disclosure of methodology), collective working (sharing, collaboration, peer review) and reproducibility (methodological and inferential transparency) have been identified as important considerations for critical data science and for critical spatial data science within the GIScience domains.
(Brunsdon, C., Comber, A. Opening practice: supporting reproducibility and critical spatial data science. J Geogr Syst (2020). https://doi.org/10.1007/s10109-020-00334-2)
As part of this move towards openness and reproducibility within spatial data science, we can look to the emerging key principles of data science research to explain why programming is becoming the primary tool for spatial analysis research.
Key principles of data science research
When thinking about spatial analysis, we can identify the key principles of data science as:
1. Repeatability: the idea that a given process will produce the same (or nearly the same) output given similar inputs. Instruments and procedures need to be consistent.
2. Reproducibility: There are three types of reproducibility when we think of data science research.
- Statistical reproducibility: an analysis is statistically reproducible when detailed information is provided about the choice of statistical tests, model parameters, threshold values, etc.
- Empirical reproducibility: an analysis is empirically reproducible when detailed information is provided about non-computational empirical scientific experiments and observations. In practice, this is enabled by making data freely available, as well as details of how the data was collected.
- Computational reproducibility: an analysis is computationally reproducible if there is a specific set of computational functions/analyses (in data science, almost always specified in terms of source code) that exactly reproduce all of the results in an analysis.
3. Collaboration: an analysis workflow that makes it easy to share work and collaborate with others, preferably in real-time, alongside easy integration with version control.
4. Scalability: this can be thought of at three levels.
- At its most basic, an analysis that can easily re-run the same processing, with simple adjustment of variables and parameters to include additional data.
- At an intermediate level, the analysis and workflow can be easily expanded to include larger datasets (which require more processing).
- At the most advanced, the workflow is suitable for distributed/multiple core computing.
We can use these principles to review the different tools/software available to us for spatial analysis, in order to be confident moving forward, that we use the appropriate tools for the tasks we have at hand.
A Review of Spatial Analysis Software
Spatial Analysis in R-Studio
We have now seen that, to complete spatial analysis research that adheres to these data science principles, we need to focus on using programming tools, such as R and R-Studio, rather than traditional GIS GUI software.
But the question is, how do we use R and R-Studio as a piece of GIS software?
As you’ll already have seen, there are quite a few aesthetic differences between R-Studio and Q-GIS - for one, there is no "map canvas" area where we’ll see our data as we load it.
There are also quite a few other differences in terms of how we work with spatial data and our spatial analysis outputs.
To help you understand these differences, the following longer lecture (approximately 40 minutes) provides you with a thorough introduction into how we use R-Studio as a GIS software:
Using spatial data in R/R-Studio
Practical 4: Analysing Crime in 2020 in London from a spatial perspective
Now we’ve had our introduction to using R-Studio as a GIS software, it’s time to get started using it ourselves for spatial analysis.
As outlined earlier, we’ll be completing an analysis of our crime dataset in London, but rather than solely looking at crime change over time, we’re going to add a spatial component to our analysis and understand how crime has changed across our wards over the year.
To do this, what we’ll first do is head back to our script from last week, run our script - and then write our all_theft_df to a csv file.
If you had saved your environment from last week, keeping your variables in the memory, theoretically you won’t need to export the dataframe as you should have access to this variable within your new script - but it would be good practice to write out the data - and then load it back in.
We’re going to be adding a few additional libraries to our script today, and we’ll explain them as and when we use them. For now, just add them to the library section of your script when instructed to below.
Overall, our workflow will be:
1. Take our all_theft_df and wrangle it to produce a dataframe with a ward per row with a crime count for each month in our fields.
2. Join this dataframe to our ward_population_2019 shapefile (in your working folder) and then produce a crime rate for each month, for each ward.
3. Create a map for January 2020 using the tmap library.
4. Extension: Create a new dataframe that represents crime from a quarterly perspective and create four maps ready for export.
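The first two steps of this workflow can be sketched roughly as below. This is only an illustration on toy data: the column names ward_code and month, the population figures and the "per 10,000 residents" scaling are all assumptions made for the example, not the exact code used later in this practical.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for all_theft_df once each crime has a ward attached
# (ward_code and month are assumed column names, for illustration only)
theft_toy <- tibble(
  ward_code = c("W1", "W1", "W1", "W2", "W2"),
  month     = c("2020-01", "2020-01", "2020-02", "2020-01", "2020-02")
)

# Step 1: one row per ward, with one column of crime counts per month
theft_by_ward <- theft_toy %>%
  count(ward_code, month) %>%
  pivot_wider(names_from = month, values_from = n, values_fill = 0)

# Toy stand-in for the attribute table of the ward population data
ward_pop_toy <- tibble(ward_code = c("W1", "W2"), POP2019 = c(10000, 5000))

# Step 2: join the counts to the population and derive a rate
# (here per 10,000 residents, an assumed scaling)
ward_rates <- ward_pop_toy %>%
  left_join(theft_by_ward, by = "ward_code") %>%
  mutate(rate_2020_01 = `2020-01` / POP2019 * 10000)

print(ward_rates)
```

Counting first and reshaping second keeps each step easy to check: the intermediate table has one row per ward and month, which you can inspect before it is widened.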
Write out / export our dataframe from last week
Open up R-Studio (Server or Desktop), and make sure you open up your GEOG0030 project.
Next open your script from Week 4 - it should be saved as: wk4-csv-processing.r and should be visible in your files from your GEOG0030 project.
First check your Environment box - if you have a variable in your Global Environment with the name all_theft_df then you do not need to run your script. If you do not have a variable saved, go ahead and run your script up to and including the code that filters our large all_crime_df to only the all_theft_df :
We should all now have an all_theft_df variable in our environment that we’re ready to export to a csv.
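A minimal sketch of the export step, assuming the readr library's write_csv() function (the counterpart of read_csv()). The toy data frame and its crime_type column are invented so that the example runs on its own, and a temporary folder is used in place of your project folder.

```r
library(readr)

# Toy stand-in so the sketch runs on its own; in the practical,
# all_theft_df already exists in your Environment
all_theft_df <- data.frame(
  lsoa_code  = c("E01000001", "E01000002"),
  crime_type = c("Theft from the person", "Theft from the person")
)

# In the practical the target would be "data/raw/crime/all_theft_2020.csv";
# a temporary folder is used here so nothing in your project is overwritten
out_path <- file.path(tempdir(), "all_theft_2020.csv")
write_csv(all_theft_df, out_path)

# Confirm the file was written
file.exists(out_path)
```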
Remember, if using a Windows machine, you’ll need to substitute your forward-slashes ( / ) with backslashes, and in this case, within R, it will need to be two backslashes ( \\ ).
You should now see a new csv within your raw crime data folder (data -> raw -> crime).
Setting up your script
Open a new script within your GEOG0030 project (Shift + Ctl/Cmd + N) and save this script as wk5-crime-spatial-processing.r .
At the top of your script, add the following metadata (substitute accordingly):
As you’ll have heard in our lecture, we’ll be using sf to read and load our spatial data, use the tidyverse libraries to complete our data wrangling and then use the tmap library to visualise our final maps.
The here library enables easy reference to our working drive, janitor cleans the names of our data frame, whilst magrittr allows us to use the pipe function ( %>% ) which we’ll explain in a bit more detail below.
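As a quick, hedged illustration of what the pipe does (the numbers are arbitrary): the pipe passes the result on its left into the function on its right, so a nested call can be rewritten to read from left to right.

```r
library(magrittr)

values <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- sum(sqrt(values))

# Piped call: read from left to right, each result feeding the next function
piped <- values %>% sqrt() %>% sum()

# Both approaches give the same answer
identical(nested, piped)
```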
Loading our datasets
We’re going to load both of the datasets we need today straight away: 1) the all_theft_2020.csv we have just exported and 2) the ward_population_2019.shp we created in Week 3.
- First, let’s load our all_theft_2020.csv into a dataframe called all_theft_df . You should see we use the same read_csv code as last week.
- For those of you with the variable still stored in your Environment, you can still add this code to your script - it will simply overwrite your current variable (which essentially stores the same data that is contained in the csv).
We can double-check what our csv looks like by either viewing our data or simply calling the head() function on our dataframe.
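A small sketch of reading a csv and checking it with head(). To keep the example self-contained it first writes a toy csv to a temporary folder; the file name and columns are invented, and in the practical you would read your exported all_theft_2020.csv instead.

```r
library(readr)

# Write a small toy csv so the sketch is self-contained
toy_path <- file.path(tempdir(), "toy_theft.csv")
write_csv(
  data.frame(lsoa_code = sprintf("E0100000%d", 1:8), month = "2020-01"),
  toy_path
)

# Read it back into a dataframe, as we would with our exported file
all_theft_toy <- read_csv(toy_path)

# head() returns the first six rows by default
head(all_theft_toy)
```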
You should see these rows display in your console. Great, the dataset looks as we remember, with the different fields, including, importantly for this week, the lsoa_code which we’ll use to process and join our data together (you’ll see this in a second!).
Next, let’s load our first ever spatial dataset into R-Studio - our ward_population_2019.shp . We’ll store this as a variable called ward_population and use the sf library to load the data:
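A hedged sketch of loading a shapefile with sf's st_read() function. Because the exact path to your ward_population_2019.shp depends on your own project folders, the example uses the North Carolina shapefile that ships with the sf library as a stand-in, and the variable name ward_population_demo is hypothetical.

```r
library(sf)

# Stand-in file: the North Carolina shapefile bundled with sf.
# In the practical, point st_read() at your ward_population_2019.shp instead.
shp_path <- system.file("shape/nc.shp", package = "sf")

# quiet = TRUE suppresses the driver and layer summary printed on load
ward_population_demo <- st_read(shp_path, quiet = TRUE)

# The result is a data frame with an extra "sf" class
class(ward_population_demo)
```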
You should now see the ward_population variable appear in your Environment window.
As this is the first time we’ve loaded spatial data into R, let’s go for a little exploration of how we can interact with our spatial data frame.
Interacting with spatial data
The first thing we want to do when we load spatial data is, of course, map it to see its ‘spatiality’ (I’m going to keep going with that word..) or rather how the data looks from a spatial perspective.
To do this, we can use a really simple command from R’s base library: plot() .
As we won’t necessarily want to plot this data everytime we run this script in the future, we’ll type this command into the console as a “one-off”.
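To show what this looks like without needing the ward file, here is a sketch that builds two toy square "wards" and plots them. The ward names and population figures are invented; in the practical you would simply call plot() on ward_population.

```r
library(sf)

# Helper that builds a unit square starting at x0 (toy geometry only)
sq <- function(x0) st_polygon(list(rbind(
  c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0)
)))

# Two toy "wards" with a British National Grid CRS (EPSG:27700)
wards_toy <- st_sf(
  ward_name = c("A", "B"),
  POP2019   = c(12000, 8500),
  geometry  = st_sfc(sq(0), sq(1), crs = 27700)
)

# One small map per attribute column, coloured by that column's values
plot(wards_toy)
```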
You should see your ward_population plot appear in your Plots window - as you’ll see, your ward dataset is plotted ‘thematically’ by each of the fields within the dataset, including our POP2019 field we created last week.
Note, this plot() function is not to be used to make maps - but simply as a quick way of viewing our spatial data.
We can also find out more information about our ward_population data.
We should see our data is an sf dataframe , which is great as it means we can utilise our tidyverse libraries with our ward_population .
We can also use the attributes() function we looked at last week to find out a little more about our “spatial” data frame.
We can see how many rows we have, the names of our rows and a few more pieces of information about our ward_population data - for example, we can see that the specific column holding our spatial information ( $sf_column ) in our dataset is called geometry .
We can investigate this column a little more by selecting it in our console, which returns its contents.
You should see new information about our geometry column display in your console.
From this selection we can find out the dataset’s:
- geometry type
- bbox (bounding box)
- CRS (coordinate reference system)
And also the first five geometries of our dataset.
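These inspection steps can be sketched on a toy spatial data frame as follows. The two points and their names are invented; the same calls would apply to ward_population.

```r
library(sf)

# Toy spatial data frame: two points in British National Grid (EPSG:27700)
pts_toy <- st_as_sf(
  data.frame(name = c("a", "b"), x = c(530000, 531000), y = c(180000, 181000)),
  coords = c("x", "y"),
  crs = 27700
)

class(pts_toy)                 # includes both "sf" and "data.frame"
attributes(pts_toy)$sf_column  # name of the column holding the geometries
pts_toy$geometry               # prints geometry type, bbox, CRS and geometries
```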
This is really useful as one of the first things we want to know about our spatial data is what coordinate system it is projected with.
As we should know, our ward_population data was created and exported within British National Grid, therefore seeing the EPSG code of British National Grid - 27700 - as our CRS confirms to us that R has read in our dataset correctly!
We could also actually find out this information using the st_crs() function from the sf library.
You’ll see we actually get a lot more information about our CRS beyond simply the code using this function.
This function is really important to us as users of spatial data as it allows us to retrieve and set the CRS of our spatial data (the latter is used in the case the data does not come with a .prj file but we do know what projection system should be used).
To reproject data, we actually use the st_transform() function - but we’ll take a look at this in more detail in Week 7.
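The difference between retrieving, setting and transforming a CRS can be sketched on a single toy point (the coordinates are arbitrary values chosen to fall in central London):

```r
library(sf)

# A point that already carries British National Grid as its CRS
pt <- st_sfc(st_point(c(530000, 180000)), crs = 27700)

# Retrieve: the EPSG code stored with the data
st_crs(pt)$epsg

# Set: for data that arrives without a CRS. The coordinates do not change,
# we only declare which system they are already in.
pt_unknown <- st_sfc(st_point(c(530000, 180000)))
is.na(st_crs(pt_unknown))
st_crs(pt_unknown) <- 27700

# Transform: the coordinates are recalculated in the new system (here WGS84)
pt_wgs84 <- st_transform(pt, 4326)
st_crs(pt_wgs84)$epsg
```

Setting a CRS that does not match the coordinates will place your data in the wrong location, so only set one when you know which system the data was created in.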
The final thing we might want to do before we get started with our data analysis is to simply look at the data table part of our dataset, i.e. what we’d call the Attribute Table in Q-GIS, but here it’s simply the table part of our data frame.
To do so, you can either use the View() function in the console or click on the ward_population variable within our environment.
Processing our crime data to create our required output data frame
Now we have our data loaded, our next step is to process our data to create what we need as our final output for analysis: a spatial dataframe that contains a theft crime rate for each ward for each month (of available data) in 2020.
But wait - if we look at our all_theft_df , we do not have a field that contains the wards our crimes have occurred in.
We only have two types of spatial or spatially-relevant data in our all_theft_df :
- The approximate (“snap point”) latitude and longitude of our crime in WGS84.
- The Lower Super Output Area (LSOA) in which it occurred.
From Week 3’s practical, we know we can map our points using the coordinates and then provide a count by using a point-in-polygon (PIP) operation.
However to do this for each month, we would need to filter our dataset for each month and repeat the PIP operation - when we know a little more advanced code, this might end up being quite simple, but for now, when all we’re trying to do is some basic table manipulation, surely there must be a quicker way?
Adding Ward Information to our all_theft_df dataframe
Yes, there is! All we need to do is figure out which Ward our LSOAs fall within and then we can add this as an additional attribute, or rather column, to our all_theft_df - so how do we do this?
From a GIScience perspective, there are many ways to do this - but the most straightforward is to use something called a look-up table.
Look-up tables are an extremely common tool in database management and programming, providing a very simple approach to storing additional information about a feature (such as a row within a dataframe) in a separate table that can quite literally be “looked up” when needed for a specific application.
In our case, we will actually join our look-up table to our current all_theft_df to get this information “hard-coded” to our dataframe for ease of use.
To be able to do this, we therefore need to find a look-up table that contains a list of LSOAs in London and the Wards in which they are contained.
Lucky for us, after a quick search of the internet, we can find out that the Office for National Statistics provides this for us in their Open Geography Portal.
They have a table that contains exactly what we’re looking for: Lower Layer Super Output Area (2011) to Ward (2018) Lookup in England and Wales v3.
As the description on the website tells us, "this file is a best-fit lookup between 2011 lower layer super output areas, electoral wards/divisions and local authority districts in England and Wales as at 31 December 2018".
As the LSOAs within the police data are still from 2011, we know this is the file we’ll need to complete our look-up.
In addition, the description tells us what field names are included in our table, which we can have a good guess at which we’ll need before we even open the data: LSOA11CD, LSOA11NM, WD18CD, WD18NM, WD18NMW, LAD18CD, LAD18NM.
(Hint, it’s the ones beginning with LSOA and WD!)
We therefore have one more dataset to download and then load into R.
Move this file into your data -> raw -> boundaries folder and rename it to lsoa_ward_lookup.csv, so that its path is “data/raw/boundaries/lsoa_ward_lookup.csv”.
Now we have our lookup table, all we are going to do is extract the relevant ward name and code for each of the LSOAs in our all_theft_df .
To do so, we’re going to use one of the join functions from the dplyr library.
Joining data by fields in programming
We’ve already learnt how to complete Attribute Joins in Q-GIS via the Joins tab in the Properties window - so it should come as no surprise that we can do exactly the same process within R.
To conduct a Join between two dataframes (spatial or non-spatial, it does not matter), we use the same principles of selecting a unique but matching field within our dataframes to join them together.
As we have seen from the list of fields above - and with our knowledge of our all_theft_df dataframe - we know that we have at least two fields that should match across the datasets: our lsoa codes and lsoa names.
We of course need to identify the precise fields that contain these values in each of our data frames, i.e. LSOA11CD and LSOA11NM in our lsoa_ward_lookup dataframe and lsoa_code and lsoa_name in our all_theft_df dataframe, but once we know what fields we can use, we can go ahead and join our two data frames together.
But how do we go about joining them in R?
Within R, you have two options to complete a data frame join:
- The first is to use the Base R library and its merge() function:
- By default the data frames are merged on the columns with names they both have, but you can also provide the columns to match separately by using the parameters: by.x and by.y .
- E.g. your code would look like: merge(x, y, by.x = "xColName", by.y = "yColName") , with x and y each representing a dataframe.
- The rows in the two data frames that match on the specified columns are extracted, and joined together.
- If there is more than one match, all possible matches contribute one row each, but you can also tell merge whether you want all rows, including ones without a match, or just rows that match, with the arguments all.x , all.y and all .
- The second option is to use the dplyr library and one of its mutating join functions:
- The names of dplyr’s join functions follow SQL database syntax.
- There are four types of joins possible (using this SQL syntax) with the dplyr library.
- inner_join() : includes only the rows that have a match in both x and y.
- left_join() : includes all rows in x.
- right_join() : includes all rows in y.
- full_join() : includes all rows in x or y.
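The behaviour of the base merge() function described above can be sketched on two toy data frames (the column names and values are invented for illustration):

```r
# Toy data frames: x has one key ("c") with no match in y, and vice versa ("d")
x <- data.frame(xColName = c("a", "b", "c"), x_val = 1:3)
y <- data.frame(yColName = c("a", "b", "d"), y_val = c(10, 20, 40))

# Default: only rows that match in both data frames are returned
both <- merge(x, y, by.x = "xColName", by.y = "yColName")

# all.x = TRUE keeps every row of x, filling unmatched y columns with NA
keep_x <- merge(x, y, by.x = "xColName", by.y = "yColName", all.x = TRUE)

# all = TRUE keeps every row from both data frames
keep_all <- merge(x, y, by.x = "xColName", by.y = "yColName", all = TRUE)

# Compare how many rows each option returns
c(both = nrow(both), keep_x = nrow(keep_x), keep_all = nrow(keep_all))
```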
So which approach should I choose?
In all cases moving forward, we will use one of the dplyr join approaches.
There are three reasons for using the dplyr approach:
- The base merge() function can behave in ways you do not expect: by default it keeps only matching rows and re-sorts the result by the join columns, which can create errors in your joining if you are not aware of it.
- dplyr’s joins are modelled on SQL joins, generally run faster than merge() and work very well on dataframes.
- All tidyverse functions treat NAs as a part of the data, because a missing value can itself carry information that the “identified” data cannot, and they will not drop NAs during processing. Dropping NAs without you realising can affect your data and its reliability quite substantially.
- When using the tidyverse, we therefore often need to use a specific function to drop NA values, e.g. na.omit() , or find ways of replacing NAs, as we’ll see later.
One thing to note is that there is another package in the “game” of data wrangling in R, called data.table . We won’t look into this package now, because it is best suited to really large datasets, but it is one to make a note of if you end up dealing with datasets for your dissertations that have millions of entries.
Joining our two tables using the left_join() function from dplyr
Now we know what set of functions we can use to join our data, let’s go ahead and join our lsoa_ward_lookup dataframe to our all_theft_df dataframe so we can get our ward information.
Several crimes can fall within the same LSOA, so a single row of our lsoa_ward_lookup may need to be matched to many rows of our all_theft_df . As a result, we’re going to need to use a function that allows us to keep all rows in our all_theft_df dataframe, but we do not need to keep all rows in our lsoa_ward_lookup if those wards are not within our dataset.
Let’s have a look in detail at how the four different types of joins from dplyr work:
It looks like we’re going to need to use our left_join() function as we want to join matching rows from our lsoa_ward_lookup dataframe to our all_theft_df dataframe but make sure to keep all rows in the latter.
- Within your script, create a join between our two dataframes and store as a new variable:
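As a rough sketch of what such a join can look like, the example below uses tiny stand-in versions of the two dataframes. The column name lsoa_code for the LSOA field in all_theft_df is an assumption, as are all the values; only LSOA11CD and WD18CD are field names taken from this practical.

```r
library(dplyr)

# Hypothetical stand-ins: the real dataframes have many more rows and columns.
# "lsoa_code" is an assumed name for the LSOA field in all_theft_df.
all_theft_df <- tibble(
  crime_id  = 1:4,
  lsoa_code = c("E01000001", "E01000001", "E01000002", "E01000099")
)
lsoa_ward_lookup <- tibble(
  LSOA11CD = c("E01000001", "E01000002", "E01000003"),
  WD18CD   = c("E05000001", "E05000002", "E05000002")
)

# Keep every theft record (x) and attach the matching ward from the lookup (y)
all_theft_ward_df <- left_join(
  all_theft_df,
  lsoa_ward_lookup,
  by = c("lsoa_code" = "LSOA11CD")
)

all_theft_ward_df
```

The last theft has an LSOA code that is not in the lookup, so its ward is NA, but the row itself is still kept.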
Let’s go ahead and check our join - we want to check that our LSOA codes and names match across our new dataframe.
You should now see that you have 19 variables: 12 from all_theft_df , plus 7 from lsoa_ward_lookup .
Note, the join does not keep the ‘join key’ fields from both dataframes by default. It keeps only the field from the “left” dataframe - hence we are now missing LSOA11CD .
To keep both fields in future, we would need to add the keep parameter into our code, and set this to TRUE as so:
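A minimal sketch of the keep parameter on made-up data follows. The table names thefts and lookup and the column lsoa_code are hypothetical; keep = TRUE is an argument of dplyr's join functions in dplyr 1.0 and later.

```r
library(dplyr)

# Hypothetical miniature tables, for illustration only
thefts <- tibble(lsoa_code = c("E01000001", "E01000002"))
lookup <- tibble(
  LSOA11CD = c("E01000001", "E01000002"),
  WD18CD   = c("E05000001", "E05000002")
)

# keep = TRUE retains the join key from both tables in the output
joined_keep <- left_join(
  thefts,
  lookup,
  by = c("lsoa_code" = "LSOA11CD"),
  keep = TRUE
)

names(joined_keep)
```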
Do not add this to your script, it is just provided as an example!
Now we have our joined dataset, we can move forward with some more data wrangling.
The thing is, our data frame is getting quite busy - we have duplicate fields and some fields we just won’t need.
It would be good if we could trim down our dataframe to only the relevant data that we need moving forward, particularly, for example, if we wanted to go ahead and write out a hard copy of our theft data that now contains the associated ward.
To be able to “trim” our data frame, we have two choices in terms of the code we might want to run.
First, we could look to drop certain columns from our data frame.
Alternatively, we could create a subset of the columns we want to keep from our data frame and store this as a new variable or simply overwrite the currently stored variable.
To do either of these types of data transformation, we need to know more about how we can interact with a data frame in terms of indexing, selection and slicing.
Data Wrangling: Introducing Indexing, Selection and Slicing
Everything we will be doing today as we progress with our data frame cleaning and processing relies on us understanding how to interact with and transform our data frame - this interaction itself relies on knowing about how indexing works in R as well as how to select and slice your data frame to extract the relevant cells, rows or columns and then manipulate them - as we’ll be doing in this practical.
Whilst there are traditional programming approaches to this using the base R library, dplyr is making this type of data wrangling easier by the year!
If you’ve not used R before - or have but are not familiar with how to index, select and slice - I would highly recommend watching the following video, which explains the process from both a base R perspective and using the dplyr library. It also includes a good explanation of what our pipe function, %>% , does.
I’d love to have time to make this video for you all myself, but this is currently not possible - and this video provides a very accessible explanation. I’ll add some detailed notes as and when we use these functions in the next section of the practical, but for an audio/visual explanation, I’d highly recommend watching this video.
Selection and slicing in R
As you can see from the video, there are two common approaches to selection and slicing in R, which rely on indexing and/or field names in different ways.
The following summarises the above video, for ease of reference during the practical:
Base R approach to selection and slicing (common programming approach)
The most basic approach to selecting and slicing within programming relies on the principle of using indexes within our data structures.
Indexes actually apply to any type of data structure, from single atomic vectors to complicated data frames as we use here.
Indexing is the numbering associated with each element of a data structure.
For example, if we create a simple vector that stores seven strings:
R will assign each element (i.e. string) within this simple vector with a number: Aa = 1, Bb = 2, Cc = 3 and so on.
Now we can go ahead and select each element by using the base selection syntax which is using square brackets after your element’s variable name, as so:
Which should return the first element, our first string containing “Aa”. You could change the number in the square brackets to any number up to 7 and you would return each specific element in our vector.
However, say you don’t want the first element of our vector but the second to fifth elements.
To achieve this, we conduct what is known in programming as a slicing operation, where, using the [] syntax, we add a : (colon) to tell R where to start and where to end in creating a selection, known as a slice:
You should now see our 2nd to 5th elements returned. You’ve created a slice!
Now what is super cool about selection and slicing is that we can add in a simple - (minus) sign to essentially reverse our selection.
So for example, we want to return everything but the 3rd element:
And with a slice, we can use the minus to slice out parts of our vector, for example, remove the 2nd to the 5th elements (note the use of a minus sign for both):
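The selections described above can be sketched in base R as follows, assuming a seven-element vector. The name simple_vector and its values beyond "Aa", "Bb" and "Cc" are placeholders chosen for illustration.

```r
# A made-up vector of seven strings (values are placeholders)
simple_vector <- c("Aa", "Bb", "Cc", "Dd", "Ee", "Ff", "Gg")

first_element  <- simple_vector[1]      # select the first element
slice_2_to_5   <- simple_vector[2:5]    # slice: 2nd to 5th elements
without_third  <- simple_vector[-3]     # everything except the 3rd element
without_2_to_5 <- simple_vector[-2:-5]  # remove the 2nd to 5th elements

first_element
slice_2_to_5
without_third
without_2_to_5
```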
This use of square brackets for selection syntax is common across many programming languages, including Python, but there are often some differences you’ll need to be aware of if you pursue other languages.
- Python always starts its index from 0, whereas, as we can see here, the index in R starts at 1.
- R is unable to index the characters within strings - this is something you can do in Python, but in R, we’ll need to use a function such as substring() - more on this another week.
But ultimately, this is all there is to selection and slicing - and it can be applied to more complex data structures, such as data frames. Let’s take a look.
Selection and slicing in data frames
We apply these selection techniques to data frames, but we will have a little more functionality as our data frames are made from both rows and columns.
This means when it comes to selections, we can utilise an amended selection syntax that follows a specific format to select individual rows, columns, slices of each, or just a single cell:
[ rows, columns]
There are many ways we can use this syntax, which we’ll demonstrate below using our lsoa_ward_lookup data frame.
First, before looking through and executing these examples (in your console) familiarise yourself with the lsoa_ward_lookup data frame:
To select a single column from your data frame, you can use one of two approaches.
First we can follow the syntax above carefully and simply set our column parameter in our syntax above to the number 2:
You should see your second column display in your console.
Second, we can actually select our column by only typing in the number (no need for the comma).
By default, when there is only one argument present in the selection brackets, R will select the column from the data frame, not the row:
Note, this is different to when we “accessed” the properties of the column last week using the $ syntax - we’ll look at how we use this in later practicals.
To select a specific row, we need to add in a comma after our number - this will tell R to select the relevant row instead:
You should see your second row appear.
Now, to select a specific cell in our data frame, we simply provide both arguments in our selection parameters:
What is also helpful in R is that we can select our columns by their field names by passing these field names to our selection brackets as a string.
Or for many columns, we can supply a combined vector:
This approach to selecting multiple columns is also possible using the indexing, but in this case we use the slicing approach we saw earlier (note, you cannot slice using field names but need to provide each individual field name within a vector as above).
To retrieve our 2nd - 4th columns in our data frame, we can use either approach:
We can also apply the negative approach to remove columns:
If you do not want a slice, we can also provide a combined list of the columns we want to extract:
We can apply this slicing approach to our rows:
As well as a negative selection:
(Note we now have fewer rows than in the original data frame!)
And if it’s not a slice you want to achieve, you can also provide a list of the rows (akin to our approach with the columns above)!
And of course, for all of these, we can store the output of our selection as a new variable or pipe it to another function.
That’s obviously what makes selection and slicing so useful - however it can be at times a little confusing.
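The selections walked through in this section can be tried on a small stand-in data frame, as in the sketch below. The data frame lookup is a made-up miniature of lsoa_ward_lookup ; the field names LSOA11NM and WD18NM and all values are assumptions.

```r
# A made-up miniature of lsoa_ward_lookup (field names partly assumed)
lookup <- data.frame(
  LSOA11CD = c("E01000001", "E01000002", "E01000003"),
  LSOA11NM = c("Area 001A", "Area 001B", "Area 001C"),
  WD18CD   = c("E05000001", "E05000002", "E05000002"),
  WD18NM   = c("Ward A", "Ward B", "Ward B"),
  stringsAsFactors = FALSE
)

col_2_comma <- lookup[, 2]                     # second column, using [rows, columns]
col_2_only  <- lookup[2]                       # one argument, no comma: a column
row_2       <- lookup[2, ]                     # second row
cell_2_2    <- lookup[2, 2]                    # single cell: row 2, column 2
by_name     <- lookup["WD18CD"]                # column selected by field name
by_names    <- lookup[c("LSOA11CD", "WD18CD")] # several columns by name
cols_2_to_4 <- lookup[2:4]                     # slice of columns
no_col_1    <- lookup[-1]                      # negative selection of a column
cols_1_3    <- lookup[c(1, 3)]                 # combined list of columns
rows_1_to_2 <- lookup[1:2, ]                   # slice of rows
no_row_2    <- lookup[-2, ]                    # negative selection of a row
rows_1_3    <- lookup[c(1, 3), ]               # combined list of rows
```

Note that on a plain data.frame, lookup[, 2] returns the column as a vector, whereas lookup[2] returns a one-column data frame.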
Dplyr approach to selection and slicing (making our lives easy!)
We’re quite lucky, therefore, as potential data wranglers that the dplyr library has really tried to make this more user-friendly.
Instead of using the square brackets [] syntax, we now have functions that we can use to select or slice our data frames accordingly:
For columns, we use the select() function, which enables us to select one or more columns using their column names or a range of “helpers”, such as ends_with() , to select specific columns from our data frame.
For rows, we use the slice() function, which enables us to select one or more rows using their position (i.e. similar to the process above).
For both functions, we can also use the negative / - approach we saw in the base R approach to “reverse a selection”, e.g.:
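A minimal sketch of select() and slice() , including the negative approach, on a made-up miniature of the lookup table follows. The table lookup , the field names LSOA11NM and WD18NM and all values are assumptions for illustration.

```r
library(dplyr)

# A made-up miniature lookup table (field names partly assumed)
lookup <- tibble(
  LSOA11CD = c("E01000001", "E01000002", "E01000003"),
  LSOA11NM = c("Area 001A", "Area 001B", "Area 001C"),
  WD18CD   = c("E05000001", "E05000002", "E05000002"),
  WD18NM   = c("Ward A", "Ward B", "Ward B")
)

codes_by_name   <- select(lookup, LSOA11CD, WD18CD)  # columns by name
codes_by_helper <- select(lookup, ends_with("CD"))   # columns via a helper
no_lsoa_name    <- select(lookup, -LSOA11NM)         # drop a column
first_two_rows  <- slice(lookup, 1:2)                # rows by position
no_first_row    <- slice(lookup, -1)                 # drop a row
```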
We’ll be using these functions throughout our module, so we’ll leave our examples there for now!
In addition to these index-based functions, within dplyr , we also have:
- filter() that enables us to easily filter rows within our data frame based on specific conditions (such as being a City of London ward). This function requires a little bit of SQL knowledge, which we’ll pick up throughout the module and look at further in Week 6.
In addition, dplyr provides lots of functions that we can use directly with these selections to apply certain data wrangling processes to only specific parts of our data frame, such as mutate() or count() .
We’ll be using quite a few of these functions in the remaining data wrangling section below - plus throughout our module, so I highly recommend downloading (and even printing off!) the dplyr cheat sheet to keep track of what functions we’re using and why!
One thing to note is that in either the base R or dplyr approach, we can use the magrittr pipe - %>% - to ‘pipe’ the outputs of our selection into another function. This is explained in more detail in another section.
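As a small illustration of filter() combined with the pipe, the sketch below filters a made-up lookup table and pipes the result into select() . The column LAD18NM (a borough name field) and all values are assumptions, not fields confirmed by this practical.

```r
library(dplyr)

# Made-up lookup rows; LAD18NM is an assumed borough-name field
lookup <- tibble(
  WD18CD  = c("E05000001", "E05000002", "E05000003"),
  WD18NM  = c("Ward A", "Ward B", "Ward C"),
  LAD18NM = c("City of London", "Camden", "Camden")
)

# Keep only City of London rows, then keep only the ward fields.
# The pipe passes the output of each step in as the first argument of the next.
city_wards <- lookup %>%
  filter(LAD18NM == "City of London") %>%
  select(WD18CD, WD18NM)

city_wards
```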
As we have seen above, whilst there are two approaches to selection, using either the base R library or the dplyr library, we will continue to focus on using functions directly from the dplyr library to ensure efficiency and compatibility within our code.
Within dplyr, as you also saw, whether we want to keep or drop columns, we always use the same function: select .
To use this function, we provide it with a single column or a “list” (not a programmatic list, just a list) of the columns we want to keep - or if we want to drop them, we use the same approach, but add a - before our selection. (We’ll use this drop approach in a little bit).
Let’s see how we can extract just the relevant columns we will need for our future analysis - note, in this case we’ll overwrite our all_theft_ward_df variable.
- In your script, add the following code to extract only the relevant columns we need for our future analysis:
You should now see that your all_theft_ward_df data frame should only contain 9 variables - you can go and view this data frame or call the head() function on the data in the console if you’d like to check out this new formatting.
Improving efficiency in our code
Our current workflow looks good - we now have our data frame ready for use in wrangling… but wait, could we not have made this whole process a little more efficient?
Well, yes! There is a quicker way - particularly if I’m not writing out explanations for you to read through - but generally, yes, we could make our code way more “speedy” by using the pipe function, %>% , introduced above, which, for those of you that remember, we used in our work last week.
As explained above, a pipe is used to pipe the results of one function/process into another - when “piped”, we do not need to include the first “data frame” (or whichever data structure you are using) in the next function. The pipe “automates” this and pipes the results of the previous function straight into this function.
It might sound a little confusing at first, but once you start using it, it really can make your code quicker and easier to write and run - and it stops us having to create lots of additional variables to store outputs along the way. It also enabled the code we used last week to load/read all the csvs at once - without the pipe, that code breaks!
Let’s have a think through what we’ve just achieved through our code, and how we might want to re-write our code.
- Joined our two data frames together
- Remove the columns not needed for our future analysis
Let’s see how we can combine this process into a single line of code:
Option 1: Original code, added pipe
You should see that we now end up with a data frame akin to our final output above - the same number of observations and variables, all from one line of code.
We could also take another approach in writing code, by completing our selection prior to our join, which would mean having to write out fewer field names when piping this output into our join.
Option 2: New code - remove columns first
You’ll see in this approach, we now have 14 variables instead of the 9 as we haven’t really “cleaned” up the fields from the original all_theft_df - we could drop these fields by piping our output into another select() function, but we may end up creating even more coding work for ourselves this way.
What these two options do show is that there are multiple ways to achieve the same output, using similar code - we just need to always think through what outputs we want to use.
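The two options can be sketched on made-up data as below. The miniature dataframes, the column names crime_id , lsoa_code and context , and all values are assumptions; the real field lists in the practical are longer.

```r
library(dplyr)

# Hypothetical miniature versions of the two dataframes
all_theft_df <- tibble(
  crime_id  = 1:3,
  month     = c("2020-01", "2020-01", "2020-02"),
  lsoa_code = c("E01000001", "E01000002", "E01000002"),
  context   = NA_character_
)
lsoa_ward_lookup <- tibble(
  LSOA11CD = c("E01000001", "E01000002"),
  LSOA11NM = c("Area 001A", "Area 001B"),
  WD18CD   = c("E05000001", "E05000002"),
  WD18NM   = c("Ward A", "Ward B")
)

# Option 1: join first, then pipe the result into select()
all_theft_ward_df_speedy_1 <-
  left_join(all_theft_df, lsoa_ward_lookup, by = c("lsoa_code" = "LSOA11CD")) %>%
  select(crime_id, month, lsoa_code, WD18CD, WD18NM)

# Option 2: trim the lookup first, then pipe it into the join as y.
# The "." marks where the piped table goes (here, the second argument).
all_theft_ward_df_speedy_2 <- lsoa_ward_lookup %>%
  select(LSOA11CD, WD18CD, WD18NM) %>%
  left_join(all_theft_df, ., by = c("lsoa_code" = "LSOA11CD"))

ncol(all_theft_ward_df_speedy_1)
ncol(all_theft_ward_df_speedy_2)
```

In this toy version, Option 2 keeps the unneeded context column from all_theft_df , which mirrors the extra variables noted above.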
Pipes help us improve the efficiency of our code - the one thing however to note in our current case is that by adding the pipe, we would not be able to check our join prior to the selection - so sometimes, it’s better to add in this efficiency, once you’re certain that your code has run correctly.
For now, we’ll keep our original all_theft_ward_df data frame that you would have created prior to this info box - but from now on, we’ll use pipes in our code when applicable.
Go ahead and remove the speedy variables from your environment: rm(all_theft_ward_df_speedy_1, all_theft_ward_df_speedy_2) .
We now FINALLY have our dataset for starting our last bit of data wrangling: aggregating our crime by ward for each month in 2020.
Aggregate crime by ward and by month
To aggregate our crime by ward for each month in 2020, we need to use a combination of dplyr functions.
First, we need to group our crime by each ward and then count - by month - the number of thefts occurring in each ward.
To do so, we’ll use the group_by() function and the count() function.
The group_by() function creates a “grouped” copy of the table (in memory) - then any dplyr function used on this grouped table will manipulate each group separately (i.e. our count by month manipulation) and then combine the results to a single output.
If we solely run the group_by() function, we won’t really see this effect on its own - instead we need to add our summarising function - in our case the count() function, which counts the number of rows in each group defined by the variables provided within the function, in our case, month .
- Pipe our grouped table into the count function to return a count of theft per month for each Ward in our all_theft_ward_df data frame:
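A minimal sketch of this grouping and counting step on made-up records follows. The four rows below are invented; only the field names WD18CD and month and the variable names come from this practical.

```r
library(dplyr)

# Made-up theft records: one row per theft
all_theft_ward_df <- tibble(
  WD18CD = c("E05000001", "E05000001", "E05000001", "E05000002"),
  month  = c("2020-01", "2020-01", "2020-02", "2020-01")
)

# Group by ward, then count the rows per month within each ward
theft_count_month_ward <- all_theft_ward_df %>%
  group_by(WD18CD) %>%
  count(month)

theft_count_month_ward
```

The result has one row per ward and month combination, with the count stored in a field called n .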
To understand our output, go ahead and View() the variable.
We have 3 fields - with 4490 rows.
You should see that we’ve ended up with a new table that lists each ward (by the WD18CD column) eleven times, to detail the number of thefts for each month - with the months represented as a single field.
But does this table adhere to the Tidyverse principles we read about this and last week?
Not really - whilst it is just about usable for a statistical analysis - if we think about joining this data to our ward_population dataset, we are really going to struggle to add each monthly count of crime in this format.
What we would really prefer is to have our crime count detailed as one field for each individual month, i.e. 2020-01 as a single field, then 2020-02 , etc.
To do this, we need to figure out how to transform our data to present our months as fields - and yes, before you even have a chance to guess it, the Tidyverse DOES have a function for that!
(Do you see why using the Tidyverse is an excellent choice for our R-Studio learning?)
This time, we look to the tidyr library which has been written to quite literally:
“help to create tidy data, where each column is a variable, each row is an observation, and each cell contains a single value. ‘tidyr’ contains tools for changing the shape (pivoting) and hierarchy (nesting and ‘unnesting’) of a dataset, turning deeply nested lists into rectangular data frames (‘rectangling’), and extracting values out of string columns. It also includes tools for working with missing values (both implicit and explicit).”
And even in our explanation of the tidyr library, we may have found our solution in tools for changing the shape (pivoting).
To change the shape of our data, we’re going to need to use tidyr’s pivot functions.
Note, do not get confused here with the traditional sense of pivot in data processing in terms of pivot tables. If you’ve never used a pivot table before in a spreadsheet document (or R-Studio for that matter), they are primarily used to summarise the data of a more extensive table. This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way.
In our case, the application of the word pivot is not quite the same - here, our pivot() functions will change just the shape of our data (and not the values).
In the tidyr library, we have the choice of two pivot() functions: pivot_longer() or pivot_wider() .
- pivot_wider() “widens” data, increasing the number of columns and decreasing the number of rows.
- pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns.
Well, our data is already pretty long - and we know we want to create new fields representing our months, so I think we can make a pretty comfortable guess that pivot_wider() is the right choice for us.
We just need to first read through the documentation to figure out what parameters we need to use and how.
You should now see the documentation for the function.
We have a long list of parameters we may need to use with the function - but we need to figure out what we need to use to end up with the data frame we’re looking for from our data:
If we read through the documentation, we can figure out that our two parameters of interest are names_from and values_from . We use the names_from parameter to set our month column as the column from which to derive our output fields, and the values_from parameter to set our n field (count field) as the source of our values.
We do not need to set the id_cols parameter: by default, every column not named in names_from or values_from (here, just WD18CD ) is used to identify each row.
We will therefore need to state the parameters in our code to make sure the function reads in our fields for the right parameter.
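A minimal sketch of pivot_wider() with these two parameters, using a made-up count table, could look like this. The three rows are invented; the variable names and the fields WD18CD , month and n follow this practical.

```r
library(tidyr)

# Made-up long table: one row per ward and month, with a count field n
theft_count_month_ward <- tibble::tibble(
  WD18CD = c("E05000001", "E05000001", "E05000002"),
  month  = c("2020-01", "2020-02", "2020-01"),
  n      = c(2L, 1L, 1L)
)

# Each distinct month becomes its own field, filled with the values of n
theft_by_ward_month_df <- pivot_wider(
  theft_count_month_ward,
  names_from  = month,
  values_from = n
)

theft_by_ward_month_df
```

Ward and month combinations that are absent from the long table, such as the second ward in 2020-02 here, appear as NA in the wide table.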
Have a look at the resulting data frame - does it look like you expect?
Trial and error your code
When you come across a new function you’re not quite sure how to use, I can highly recommend just trialling different inputs for your parameters until you get the output right.
To do this, just make sure you don’t overwrite any variables until you’re confident the code works.
In addition, always make sure to check your output against what you’re expecting.
In our case here, we can check our original theft_count_month_ward data frame values against the resulting theft_by_ward_month_df dataframe - for example, do we see 30 thefts in January for ward E05000026?
As long as you do, we’re ready to move forward with our processing.
One final thing we want to do is clean up the names of our fields to mean a little more to us. Let’s transform our numeric dates to text dates (and change our WD18CD in the process).
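One way to do this renaming is dplyr's rename() function, sketched below on a made-up wide table. The new name ward_code is an assumption; jan_2020 matches the column name used later in this practical, and feb_2020 simply follows the same pattern.

```r
library(dplyr)

# Made-up wide table with numeric-date field names
theft_by_ward_month_df <- tibble(
  WD18CD    = c("E05000001", "E05000002"),
  `2020-01` = c(2L, 1L),
  `2020-02` = c(1L, NA)
)

# rename() takes new_name = old_name pairs; backticks are needed
# around names that are not valid R identifiers, such as 2020-01
theft_by_ward_month_df <- rename(
  theft_by_ward_month_df,
  ward_code = WD18CD,
  jan_2020  = `2020-01`,
  feb_2020  = `2020-02`
)

names(theft_by_ward_month_df)
```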
And we’re now done! We have our final data frame to join to our ward_population spatial data frame. Excellent!
Let’s just do one final bit of data management and write out this completed theft by ward by month table to a new csv for easy reference/use in the future.
Join our theft data frame to our ward population data frame
We’re now getting to the final stages of our data processing - we just need to join our completed theft table, theft_by_ward_month_df to our ward_population spatial data frame and then compute a theft crime rate.
This will then allow us to map our theft rates per month by ward - exactly what we wanted to achieve within this practical.
Luckily for us, the join approach we used earlier between our all_theft_df and our lsoa_ward_lookup is the exact same approach we need for this, even when dealing with spatial data.
Let’s go ahead and use the same left_join function to join our two data frames together - in this case, we want to keep all rows in our ward_population spatial data frame, so this will be our x data frame, whilst the theft_by_ward_month_df will be our y .
To double-check our join, we want to do one extra step of “quality assurance” - we’re going to check that each of our wards has at least one occurrence of crime over the eleven months.
We do this by computing a new column that totals the number of thefts over the 11 month period.
By identifying any wards that have zero entries (i.e. NAs for each month), we can double-check with our original theft_by_ward_month_df to see if this is the correct “data” for that ward or if there has been an error in our join.
We should actually remember from last week, that only those wards in the City of London (that are to be omitted from the analysis) should have a total of zero.
We can compute a new column by using the mutate() function from the dplyr library. We use the rowSums() function from the base library to compute the sum of each row, and the across() function from the dplyr library to choose which columns go into that sum.
This code is actually a little complicated - and not wholly straightforward to identify from reading through dplyr documentation alone.
And believe it or not, I do not know every single function available within our various R libraries - so how did I figure this out?
Well, just through simple searching - it might take a few attempts to find the right solution, but the great thing about programming is that you can try things out easily and take steps back.
You can find the original post where I found this code on Stack Overflow and what you’ll notice is that there is a variety of answers to try - and believe me, I certainly did! Luckily the final answer provided a good solution to what we needed.
- Summarise all thefts for each ward by computing a new totals column using the mutate() and rowSums() functions:
You can now View() our updated all_theft_ward_sdf spatial data frame - and sort our columns to see those with a theft_total of 0.
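A minimal sketch of this totals column on a made-up, non-spatial table is shown below. The table wards , the ward codes and the values are invented, and only two month columns are included to keep the example short.

```r
library(dplyr)

# Made-up ward table with two month columns; NA means no recorded thefts
wards <- tibble(
  ward_code = c("A", "B", "C"),
  jan_2020  = c(2, NA, 5),
  feb_2020  = c(1, NA, NA)
)

# across() picks out the month columns; rowSums() adds them up per row.
# na.rm = TRUE treats NAs as zero in the sum, so an all-NA ward totals 0.
wards_total <- wards %>%
  mutate(theft_total = rowSums(across(jan_2020:feb_2020), na.rm = TRUE))

wards_total
```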
What you should see is that we have approximately 20 City of London wards without data, whilst we do indeed have 10 additional wards without data.
The question is why? Do we have errors in our processing that we need to investigate? Or do these areas simply have no theft?
If we had not completed this analysis in Q-GIS prior to this week’s practical, we would need to conduct a mini-investigation into the original theft dataset and search for these individual wards within the dataset to confirm to ourselves that they are not present within this original dataset. Luckily, having done the practical two weeks before, I can very much confirm that these wards do not have any records of thefts within them.
We can therefore move forward with our dataset as it is, but what we’ll need to do is adjust the values present within these wards prior to our visualisation analysis - these should not have “NA” as their value but rather 0. In comparison our City of London wards should only contain “NAs”.
To make sure our data is as correct as possible prior to visualisation, we will remove our City of London wards that do not have any data (crime or population), and then convert the NAs in our theft counts to 0.
- Filter out the City of London wards with a theft count of 0 and then replace the NAs in our theft columns with 0.
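One possible way to express these two steps is sketched below on made-up data. It assumes, purely for illustration, that the City of London wards can be recognised by a missing POP2019 value; the practical's own filter condition may differ.

```r
library(dplyr)

# Made-up table: ward C stands in for a City of London ward with no data
wards <- tibble(
  ward_code = c("A", "B", "C"),
  POP2019   = c(1000, 2000, NA),
  jan_2020  = c(2, NA, NA),
  feb_2020  = c(1, NA, NA)
)

cleaned <- wards %>%
  # Assumption: wards to remove are those without a population value
  filter(!is.na(POP2019)) %>%
  # Replace remaining NAs in the month columns with 0
  mutate(across(jan_2020:feb_2020, ~ ifelse(is.na(.x), 0, .x)))

cleaned
```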
The final thing we need to do before we can map our theft data is, of course, compute a crime rate per month for our all_theft_ward_sdf data frame.
We have our POP2019 column within our all_theft_ward_sdf data frame - we just need to figure out the code that allows us to apply our calculation that we’ve used in our previous practicals (i.e. using the Attribute/Field Calculator in Q-GIS: value/POP2019 * 10000) to each of our datasets.
Once again, after a bit of searching, we can find out that the mutate() function comes in handy - and we can follow a specific approach in our code that allows us to apply the above equation to all of our columns within our data frame.
Now this is certainly a big jump in terms of the complexity of our code. Below, we are going to store our own function within our crime_rate variable. This function calculates the crime rate for a given value, currently called x . Through our second line of code, x will become each individual cell within the month columns of our all_theft_ward_sdf spatial data frame (using the mutate_at() function).
How this code works - for now - is not something you need to worry about too much, but it shows you that a simple task that we completed easily in Q-GIS can, actually, be quite complicated when it comes to writing code.
What is great is that you now have this code that you’ll be able to refer to in the future if and when you need it - and you can of course trial and error different calculations to include with the function.
For now, let’s get on with calculating our theft crime rate.
We’re going to create a new dataframe to store our crime rate as when we apply our calculation to our current data frame, we are actually transforming the original values for each month and not creating a new column per se for each month.
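The idea can be sketched on a made-up, non-spatial table as below. Note that this sketch uses mutate() with across() rather than mutate_at() (the scoped verbs such as mutate_at() are superseded by across() in recent dplyr versions), and that the two-argument form of crime_rate is an illustrative choice, not necessarily how the practical's own function is written.

```r
library(dplyr)

# Our own function: rate per 10,000 people for a count x and a population
crime_rate <- function(x, population) {
  (x / population) * 10000
}

# Made-up ward table with a population column and two month columns
wards <- tibble(
  ward_code = c("A", "B"),
  POP2019   = c(1000, 2000),
  jan_2020  = c(2, 0),
  feb_2020  = c(1, 4)
)

# Overwrite each month column with its rate; the counts themselves are
# replaced, which is why the result is stored under a new name
theft_crime_rate <- wards %>%
  mutate(across(jan_2020:feb_2020, ~ crime_rate(.x, POP2019)))

theft_crime_rate
```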
Have a look at your new theft_crime_rate_sdf spatial data frame - does it look as you would expect?
Complexities of coding
These last few chunks of code are the most complicated pieces of code we have come across so far in this module - and not something I would expect you to be able to write on your own.
And to be honest, neither could I.
Much of programming is figuring out what you need to do - trying out different approaches and if you get stuck, searching online for solutions - and then copy and pasting!
You then use trial and error to see if these solutions work - and if not, find a new option.
What is important is to recognise what inputs and outputs you need for the functions you are using - and to start from there. This is knowledge you’ll only gain from programming more, so do not worry at this stage if this feels a little overwhelming, because it will.
Just keep going and you’ll find in six weeks time, you’ll be able to re-read the code above and make a lot more sense out of it!
Now we have our final data frame, we can go ahead and make our maps.
Making Maps in R-Studio: Grammar of Graphics
Phew - we are so nearly there - as we now have our dataset ready, we can start mapping.
This week, we’re going to focus on using only one of the two visualisation libraries mentioned in the lecture - and we’ll start with the easiest: tmap .
tmap is a library written around thematic map visualisation. The package offers a flexible, layer-based, and easy to use approach to create thematic maps, such as choropleths and bubble maps.
It is also based on the grammar of graphics, and resembles the syntax of ggplot2 and so provides a reasonable introduction into understanding how to make maps in R.
What is really great about tmap is that it comes with one quick plotting method for a map called: qtm - it quite literally stands for quick thematic map.
We can use this function to plot the theft crime rate for one of our months really really quickly.
Let’s create a crime rate map for January 2020.
- Within your script, use the qtm function to create a map of theft crime rate in London in January 2020.
In this case, the fill argument is how we tell tmap to create a choropleth map based on the values in the column we provide it with - if we simply set it to NULL , we would only draw the borders of our polygon (you can try this out in your console if you’d like).
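A minimal sketch of qtm() with the fill argument follows, using a tiny made-up spatial data frame of two square "wards" built with the sf library. The object toy_sdf , the helper make_square() and all values are invented for illustration; your own map would use theft_crime_rate_sdf instead.

```r
library(sf)
library(tmap)

# Helper (hypothetical) that builds a 1 km square polygon from its corner
make_square <- function(x0, y0) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + 1000, y0), c(x0 + 1000, y0 + 1000),
    c(x0, y0 + 1000), c(x0, y0)
  )))
}

# Made-up spatial data frame with a jan_2020 rate column (British National Grid)
toy_sdf <- st_sf(
  ward_code = c("A", "B"),
  jan_2020  = c(20, 5),
  geometry  = st_sfc(
    make_square(530000, 180000),
    make_square(531000, 180000),
    crs = 27700
  )
)

# Quick thematic map: polygons filled by the values in jan_2020
jan_map <- qtm(toy_sdf, fill = "jan_2020")
# Typing jan_map in the console draws the map
```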
Within our qtm function, we can pass quite a few different parameters that would enable us to change specific aesthetics of our map - if you go ahead and search for the function in the Help window, you’ll see a list of these parameters.
We can, for example, set the lines of our ward polygons to white by adding the borders parameter.
Yikes - that doesn’t look great! But at least we tried to change our map a little bit.
Setting colours in R
Note, when it comes to setting colours within a map or any graphic (using ANY visualisation library), we can either pass through a colour word, a HEX code or a pre-defined palette when it comes to graphs and maps.
You can find out more here, which is a great quick reference to just some of the possible colours and palettes you’ll be able to use in R, but we’ll look at this in more detail in the second half of our module.
For now, you can use the options I’ve chosen within my maps - or if you’d like, experiment a little bit and see what works!
We can continue to add and change parameters in our qtm function to create a map we are satisfied with (we just need to read the documentation to figure out what parameters do what).
The issue with the qtm function is that it is extremely limited in its functionality to:
Change the classification breaks used within the Fill parameter
Add additional data layers, such as an underlying ward polygon layer to show our City of London wards that are missing.
Instead, when we want to develop more complex maps using the tmap library, we want to use their main plotting method which uses a function called tm_shape() , which we build on using the ‘layered grammar of graphics’ approach.
Using the tm_shape() function and the “grammar of graphics”
The main approach to creating maps in tmap is to use the “grammar of graphics” to build up a map based on what is called the tm_shape() function.
Essentially this function, when populated with a spatial data frame, takes the spatial information of our data (including the projection and geometry/“shapes” of our data) and creates a spatial “object”. This object contains some information about our original spatial data frame that we can override (such as the projection) within this function’s parameters, but ultimately, by using this function, you are instructing R that this is the object from which to “draw my shape”.
But to actually draw the shape, we next need to add a layer to specify the type of shape we want R to draw from this information - in our case, our polygon data. We therefore need to add a function that tells R to “draw my spatial object as X” and within this “layer”, you can also specify additional information to tell R how to draw your layer.
You can then add in additional layers, including other spatial objects (and their related shapes) that you want drawn on your map, plus specify your layout options through a layout layer. Hence the “layered” approach of making maps mentioned in the lecture.
This all sounds a little confusing - and certainly not as straight-forward as using the Print Layout on Q-GIS.
However, as with Everything In Programming, the more times you do something, the clearer and more intuitive it becomes.
For now, let’s see how we can build up our first map in tmap .
Building a map: theft in January 2020
To get started with making a map, we first need to specify the spatial object we want to map.
In our case, this is our theft_crime_rate_sdf spatial data frame, so we set this to our tm_shape() function.
However, on its own, if you try, you’ll see that we have “no layer elements defined after tm_shape”.
For the following lines of code, I want you to build on the first line by adding the extra pieces of code I’ve added at each step. DO NOT duplicate the entire code at each step (i.e. copy and paste below one another!). In the end you only want ONE CHUNK of code that plots our map.
- Set our tm_shape() equal to our theft_crime_rate_sdf spatial data frame. Execute the line of code and see what happens:
We therefore need to tell R that we want to map this object as polygons.
To do so we use the tm_polygons() function and add this function as a layer to our spatial object by using a + sign:
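As a minimal sketch of this step, the chunk below uses the tmap version 3 syntax that this practical follows (argument names changed in tmap 4, so check the documentation for your installed version). The small grid of made-up "wards" is only a stand-in, and is built only if theft_crime_rate_sdf is not already loaded in your session:

```r
library(sf)
library(tmap)

# Stand-in data, used only if the practical's object is not already loaded:
# a 3 x 3 grid of square "wards" with made-up rates (not real crime data).
if (!exists("theft_crime_rate_sdf")) {
  toy_box <- st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 3000, ymax = 3000),
                               crs = st_crs(27700)))
  theft_crime_rate_sdf <- st_sf(
    jan_2020 = c(0.4, 1.1, 2.3, 3.8, 5.2, 6.9, 9.5, 12.4, 18.7),
    geometry = st_make_grid(toy_box, n = c(3, 3))
  )
}

# tm_shape() names the spatial object; tm_polygons() is the layer that draws it.
# The + sign joins the two into one map.
tm_shape(theft_crime_rate_sdf) +
  tm_polygons()
```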
As you should now see, we have now mapped the spatial polygons of our theft_crime_rate_sdf spatial data frame - great! A step in the right direction.
However, this is not the map we want - we want to have our polygons represented by a choropleth map where the colours reflect the theft crime rate in January, rather than the default grey polygons we see before us.
To do so, we use the col= parameter that is within our tm_polygons() shape.
The col= parameter is used to “fill” our polygons with a specific fill type, of either:
- a single color value (e.g. “red”)
- the name of a data variable that is contained in shp. Either the data variable contains color values, or values (numeric or categorical) that will be depicted by a specific color palette. In the latter case, a choropleth is drawn.
- “MAP_COLORS”. In this case polygons will be colored such that adjacent polygons do not get the same color.
Let’s go ahead and pass our jan_2020 column within the col= parameter and see what we get.
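A sketch of what this looks like, again in tmap version 3 syntax (in tmap 4 the fill variable is passed through a differently named argument, so treat col = here as an assumption about your installed version). The toy grid is a made-up stand-in, used only if theft_crime_rate_sdf is not already loaded:

```r
library(sf)
library(tmap)

# Stand-in data (made-up rates), built only if the real object is missing
if (!exists("theft_crime_rate_sdf")) {
  toy_box <- st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 3000, ymax = 3000),
                               crs = st_crs(27700)))
  theft_crime_rate_sdf <- st_sf(
    jan_2020 = c(0.4, 1.1, 2.3, 3.8, 5.2, 6.9, 9.5, 12.4, 18.7),
    geometry = st_make_grid(toy_box, n = c(3, 3))
  )
}

# Passing a column name (as a string) to col = shades each polygon by that column
tm_shape(theft_crime_rate_sdf) +
  tm_polygons(col = "jan_2020")
```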
Awesome! It looks like we have a choropleth map.
We are slowly getting there.
But there are two things we can notice straight away that do not look right about our data.
The first is that our classification breaks do not really reflect the variation in our dataset - this is because tmap has defaulted to its favourite break type: pretty breaks, whereas, as we know, using an approach such as natural breaks, aka jenks, may reveal better variation in our data.
So how do we state our classification breaks in tmap ?
To figure this out, once again, we need to visit the documentation for tm_polygons() and read through the various parameters to find out what we might need…
Hmm, if we scroll through the parameters, the three that stick out are: n, style, and breaks. It seems like these will help us create the right classification for our map:
- n : state the number of classification breaks you want
- style : state the style of breaks you want, e.g. “cat”, “fixed”, “sd”, “equal”, “pretty”, “quantile”, “kmeans”, “hclust”, “bclust”, “fisher”, “jenks”, “dpih”, “headtails”, and “log10_pretty”.
- breaks : state the numeric breaks you want to use when using the fixed style approach.
There are some additional parameters in there that we might also want to consider, but for now we’ll focus on these three and specifically the first two today.
Let’s say we want to change our choropleth map to have 5 classes, determined via the jenks method - we simply need to add the n and style parameters into our tm_polygons() layer.
- Add the n and style parameters into our tm_polygons() layer. Note we pass the jenks style as a string.
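A sketch of this step in tmap version 3 syntax (an assumption about your installed version, as these arguments moved in tmap 4). The toy grid is a made-up stand-in, used only if theft_crime_rate_sdf is not already loaded:

```r
library(sf)
library(tmap)

# Stand-in data (made-up rates), built only if the real object is missing
if (!exists("theft_crime_rate_sdf")) {
  toy_box <- st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 3000, ymax = 3000),
                               crs = st_crs(27700)))
  theft_crime_rate_sdf <- st_sf(
    jan_2020 = c(0.4, 1.1, 2.3, 3.8, 5.2, 6.9, 9.5, 12.4, 18.7),
    geometry = st_make_grid(toy_box, n = c(3, 3))
  )
}

# n sets the number of classes; style names the classification method (a string)
tm_shape(theft_crime_rate_sdf) +
  tm_polygons(col = "jan_2020", n = 5, style = "jenks")
```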
We now have a choropleth that reflects better the distribution of our data - but I’m not that happy with the classification breaks used by the “jenks” approach - they’re not exactly as readable as our pretty breaks.
Therefore, I’m going to use these breaks, but round them down a little to get a compromise between my two classification schemes.
To do so, I need to change the style of my map to fixed and then supply a new argument, breaks, that contains these rounded classification breaks.
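The exact rounded breaks used for the January map are not reproduced here, so the sketch below shows one possible way to get to a set of rounded breaks: compute the jenks breaks with the classInt package, round them to whole numbers, and pass the result to breaks = with style = "fixed". The rounding rule and the toy data are illustrative assumptions, not the values used in the practical:

```r
library(sf)
library(tmap)
library(classInt)

# Stand-in data (made-up rates), built only if the real object is missing
if (!exists("theft_crime_rate_sdf")) {
  toy_box <- st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 3000, ymax = 3000),
                               crs = st_crs(27700)))
  theft_crime_rate_sdf <- st_sf(
    jan_2020 = c(0.4, 1.1, 2.3, 3.8, 5.2, 6.9, 9.5, 12.4, 18.7),
    geometry = st_make_grid(toy_box, n = c(3, 3))
  )
}

rates <- theft_crime_rate_sdf$jan_2020

# Jenks breaks for 5 classes: a vector of 6 values (lower end, 4 inner, upper end)
jenks_breaks <- classIntervals(rates, n = 5, style = "jenks")$brks
inner_breaks <- jenks_breaks[-c(1, length(jenks_breaks))]

# Round the inner breaks, and widen the two ends so no value falls outside;
# unique() drops any duplicates that rounding creates
rounded_breaks <- unique(c(floor(min(rates, na.rm = TRUE)),
                           round(inner_breaks),
                           ceiling(max(rates, na.rm = TRUE))))

tm_shape(theft_crime_rate_sdf) +
  tm_polygons(col = "jan_2020", style = "fixed", breaks = rounded_breaks)
```

In practice you would probably look at the jenks breaks and type in your own readable values by hand, e.g. breaks = c(...), rather than rounding automatically.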
That looks a little better from the classification side of things.
We still have one final “data-related” challenge to solve, before we start to style our map - and that is showing the polygons for City of London wards, even though we have no data for them.
We always want to create a map that contains as much information as possible and leaves no room for interpretation error - by leaving our CoL wards as a white space within our current map, we are not telling our map readers anything other than there is a mistake in our map that they’ll question!
We therefore want to include our CoL wards in our map, but we’ll symbolise them differently so we will be able to explain why we do not have data for the CoL wards when, for example, we’d write up our analysis or present the map on a poster. This explanation is something you’d add into a figure caption or footnotes, for example, depending on how long it needs to be!
For now, we want to add these polygons to our map - and the easiest way to do so is to simply add another spatial object to our map that symbolises our polygons as grey (/“gray”, alas US-centric programming here ) wards.
Let’s go ahead and try this out using the “layered” approach of graphic making, where we simply rinse and repeat and add our layer (and spatial object) as another addition to our map.
This time, we’ll use our original spatial data frame that we loaded into our script, ward_population , so we do not get confused between our layers.
As per our tm_polygons() layer, we simply add this using the + sign.
- Create a new tm_shape(), equal to our ward_population spatial data frame and draw as grey polygons.
We have our grey polygons of our ward - and what appears to be the legend for our choropleth - but no choropleth map?
What could have gone wrong here?
Maybe this is to do with the LAYERED approach - ah! As we have added our new shape-layer, we have simply added this on top of our original map.
So it seems how we order our tmap code is really important!
As we build our maps, we need to be conscious of the order in which we layer our objects and polygons. Whatever comes first in our code is drawn first, and then the next layer is drawn on top of that and so on!
This should be a simple fix - and requires just a little rearranging of our code.
- Re-arrange our code to have our grey CoL wards first, and then our Jan 2020 theft crime rate choropleth map:
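A sketch of the re-arranged code in tmap version 3 syntax (an assumption about your installed version). Both toy objects are made-up stand-ins, built only if the real ones are not loaded: the toy ward_population grid has one extra column of cells with no rate, playing the part of the City of London wards. The jenks style is used here simply to avoid inventing fixed breaks:

```r
library(sf)
library(tmap)

# Stand-in choropleth data (made-up rates), built only if the real object is missing
if (!exists("theft_crime_rate_sdf")) {
  toy_box <- st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 3000, ymax = 3000),
                               crs = st_crs(27700)))
  theft_crime_rate_sdf <- st_sf(
    jan_2020 = c(0.4, 1.1, 2.3, 3.8, 5.2, 6.9, 9.5, 12.4, 18.7),
    geometry = st_make_grid(toy_box, n = c(3, 3))
  )
}

# Stand-in for the full ward layer: a 4 x 3 grid, so one column has no rate data
if (!exists("ward_population")) {
  ward_box <- st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 4000, ymax = 3000),
                                crs = st_crs(27700)))
  ward_population <- st_sf(geometry = st_make_grid(ward_box, n = c(4, 3)))
}

# Layers are drawn in the order they are written:
# grey wards first (underneath), then the choropleth on top
tm_shape(ward_population) +
  tm_polygons(col = "gray") +
  tm_shape(theft_crime_rate_sdf) +
  tm_polygons(col = "jan_2020", n = 5, style = "jenks")
```

Swapping the two tm_shape() blocks reproduces the problem described above: the grey layer is drawn last and covers the choropleth.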
Now we have our data displayed, we want to go ahead and start styling our map.
Styling maps using tmap
And now things start to get even more complicated…
As you’ve seen, getting to the point where we have a choropleth map in R takes a lot of knowledge about how to use the tmap library successfully.
Whilst ultimately it is only four functions so far, it is still A LOT to learn and understand to make a good map, compared to, for example, the Q-GIS Print Layout.
To style our map takes even more understanding and familiarity with our tmap library - and it is something you’ll only really learn by having to make your own maps.
As a result, I won’t go into explaining exactly every aspect of map styling - but I will provide you with some example code that you can use as well as experiment with to try to see how you can adjust aspects of the map to your preferences.
Fundamentally, the key functions to be aware of:
- tm_layout() : Contains parameters to style titles, fonts, the legend etc
- tm_compass() : Contains parameters to create and style a North arrow or compass
- tm_scale_bar() : Contains parameters to create and style a scale bar
To be able to start styling our map, we need to interrogate each of these functions and their parameters to trial and error options to ultimately create a map we’re happy with.
Here, for example, is a first pass at styling our above map to contain a title, change the colour palette of our map, plus change the position of the legend, add a north arrow and a scale bar - whilst also formatting the font:
Example code: feel free to implement/adjust:
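The sketch below is one possible first pass rather than the exact styling used for the map in this workbook. It uses tmap version 3 function and argument names (tm_scale_bar() and several of these arguments were renamed in tmap 4). The title text, palette and positions are illustrative choices, the grey ward layer is left out to keep the chunk short, and the toy grid is a made-up stand-in used only if theft_crime_rate_sdf is not loaded:

```r
library(sf)
library(tmap)

# Stand-in data (made-up rates), built only if the real object is missing
if (!exists("theft_crime_rate_sdf")) {
  toy_box <- st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 3000, ymax = 3000),
                               crs = st_crs(27700)))
  theft_crime_rate_sdf <- st_sf(
    jan_2020 = c(0.4, 1.1, 2.3, 3.8, 5.2, 6.9, 9.5, 12.4, 18.7),
    geometry = st_make_grid(toy_box, n = c(3, 3))
  )
}

tm_shape(theft_crime_rate_sdf) +
  # palette changes the colour ramp; border.col changes the polygon outlines
  tm_polygons(col = "jan_2020", n = 5, style = "jenks",
              palette = "Blues", border.col = "white") +
  # title, font and legend position all live in tm_layout()
  tm_layout(main.title = "Theft crime rate, January 2020",
            main.title.fontface = 2,
            fontfamily = "sans",
            legend.position = c("left", "top")) +
  # north arrow and scale bar are separate layers
  tm_compass(type = "arrow", position = c("right", "bottom")) +
  tm_scale_bar(position = c("left", "bottom"))
```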
Well, this is starting to look a lot better, but I’m still not happy with certain aspects.
For example, I think moving the legend outside of the map might look better - plus I’d prefer that the legend also has a different title that is more informative.
Let’s see what small adjustments we can make, by adding:
- a title = argument into the tm_polygons() layer for the theft_crime_rate_sdf
- legend.outside = TRUE, legend.outside.position = "right" to the tm_layout() layer.
More example code - feel free to add and implement:
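A sketch of these two adjustments, in tmap version 3 syntax (an assumption about your installed version). The legend title wording is a placeholder for you to replace, the grey ward layer and other styling are left out to keep the chunk short, and the toy grid is a made-up stand-in used only if theft_crime_rate_sdf is not loaded:

```r
library(sf)
library(tmap)

# Stand-in data (made-up rates), built only if the real object is missing
if (!exists("theft_crime_rate_sdf")) {
  toy_box <- st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 3000, ymax = 3000),
                               crs = st_crs(27700)))
  theft_crime_rate_sdf <- st_sf(
    jan_2020 = c(0.4, 1.1, 2.3, 3.8, 5.2, 6.9, 9.5, 12.4, 18.7),
    geometry = st_make_grid(toy_box, n = c(3, 3))
  )
}

tm_shape(theft_crime_rate_sdf) +
  # title = replaces the default legend title (the column name)
  tm_polygons(col = "jan_2020", n = 5, style = "jenks", palette = "Blues",
              title = "Theft crime rate") +
  # move the legend out of the map frame, to the right-hand side
  tm_layout(legend.outside = TRUE, legend.outside.position = "right")
```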
Well I’m pretty happy with that!
There’s only a few more things I’d want to do - and that would be to add an additional legend property to state why the City of London wards are grey as well as our data source information.
Remember - all our maps contain data from the Ordnance Survey and Office for National Statistics and this needs to be credited as such (I’ve put this for now in our Acknowledgements section of the workshop).
This could all be added in an additional text box within our map using the tm_credits() function - but I’m not happy with the display that R creates (feel free to experiment with this if you’d like!). I haven’t quite figured out how to get the tm_credits() box to appear outside the main plotting area!
For now, I would add this in post-production - the next step in my own R map-making learning curve is to figure out how to make an additional box outside the map area. Let’s see what we get up to in the second half of term!
Exporting our final map to a PNG
Once we’re finished making our map, we can go ahead and export it to our maps folder.
To do so, we need to save our map-making code to a variable and then use the tmap_save() function to save the output of this code to a picture within our maps folder.
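A sketch of the export step. The variable name jan_2020_map and the file name are hypothetical choices, the map itself is a cut-down version in tmap version 3 syntax, and the toy grid is a made-up stand-in used only if theft_crime_rate_sdf is not loaded:

```r
library(sf)
library(tmap)

# Stand-in data (made-up rates), built only if the real object is missing
if (!exists("theft_crime_rate_sdf")) {
  toy_box <- st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 3000, ymax = 3000),
                               crs = st_crs(27700)))
  theft_crime_rate_sdf <- st_sf(
    jan_2020 = c(0.4, 1.1, 2.3, 3.8, 5.2, 6.9, 9.5, 12.4, 18.7),
    geometry = st_make_grid(toy_box, n = c(3, 3))
  )
}

# Store the map in a variable instead of printing it straight away
jan_2020_map <- tm_shape(theft_crime_rate_sdf) +
  tm_polygons(col = "jan_2020", n = 5, style = "jenks")

# Make sure the maps folder exists, then write the map to a PNG inside it
dir.create("maps", showWarnings = FALSE)
tmap_save(jan_2020_map, filename = "maps/jan_2020_theft_crime_map.png")
```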
We also want to export the rest of our hard work in terms of data wrangling that we’ve completed for this practical - so let’s go ahead and export our data frames so we can use them in future projects, whether during GEOG0030 or beyond.
What we’ll do is export both the all_theft_ward_sdf spatial data frame and theft_crime_rate_sdf as shapefiles.
This means we’ll have both datasets to use in the future - you can, if you like, also export the all_theft_ward_sdf spatial data frame as a csv.
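A sketch of the data export using st_write() from the sf library. The output folder and file names are hypothetical, so swap in your own paths. The toy grid is a made-up stand-in used only if theft_crime_rate_sdf is not loaded, and the all_theft_ward_sdf export only runs if that object exists in your session:

```r
library(sf)

# Stand-in data (made-up rates), built only if the real object is missing
if (!exists("theft_crime_rate_sdf")) {
  toy_box <- st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 3000, ymax = 3000),
                               crs = st_crs(27700)))
  theft_crime_rate_sdf <- st_sf(
    jan_2020 = c(0.4, 1.1, 2.3, 3.8, 5.2, 6.9, 9.5, 12.4, 18.7),
    geometry = st_make_grid(toy_box, n = c(3, 3))
  )
}

out_dir <- "data"  # hypothetical output folder
dir.create(out_dir, showWarnings = FALSE)

# The .shp extension tells st_write() to use the shapefile driver;
# delete_dsn = TRUE replaces any earlier export with the same name
st_write(theft_crime_rate_sdf,
         file.path(out_dir, "theft_crime_rate_sdf.shp"),
         delete_dsn = TRUE, quiet = TRUE)

if (exists("all_theft_ward_sdf")) {
  st_write(all_theft_ward_sdf,
           file.path(out_dir, "all_theft_ward_sdf.shp"),
           delete_dsn = TRUE, quiet = TRUE)
  # Optional csv copy: drop the geometry column first so it is a plain table
  write.csv(st_drop_geometry(all_theft_ward_sdf),
            file.path(out_dir, "all_theft_ward.csv"), row.names = FALSE)
}
```

Bear in mind that the shapefile format limits field names to 10 characters, so longer column names will be shortened on export.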
And that’s it - we’ve made it through our entire practical - awesome work and well persevered!
You will have learnt a lot going through this practical that we’ll keep putting into action as we move forward in Geocomputation.
Therefore, as I always say, do not worry if you didn’t understand everything we’ve covered as we’ll revisit this over the next five weeks - and you’ll of course always have this page to look back on.
To consolidate our learnings, I have a small task for you to complete - as I’ve said earlier, I won’t set the mini-project I had planned, but what I would like you to do is complete a small assignment in time for our seminar in Week 6.
Assignment: Making maps for another month!
For your assignment for this week, what I’d like you to do is to simply make a map for a different month of 2020 and export this to submit within your respective seminar folder.
If you navigate to your folder from here, you’ll see I’ve added folders within each seminar for the different maps we’re making within our practicals.
What I’d like you to do is check this folder to see what months are already covered within the folder - and then make a map for the month that isn’t yet made!
To help, when you export your map, make sure to use the name of the month at the start of your title (i.e. as prescribed above!).
You’ll of course see that January 2020 is already taken - by me! But it’d be great to get maps for every single month of the year within each seminar folder.
But what if all the months are now done?
Please go ahead and make a duplicate map (not of January, of course!) - the more the merrier, and if you can look into different colour palettes and styling effects, even better!
Remember, you’ll need to really think about your classification breaks when you change to map a different month from January, as my breaks are based on January’s distribution! We won’t worry about standardising our breaks across our maps for now - just make sure you represent the distribution of your data well!
If you have any issues with this, please get in touch!
Wow - that’s been a lot to get through, but over the last two weeks, you really have had a crash-course in how to use programming for statistical and spatial analysis.
In this week’s workshop, you’ve learnt about why we use programming for spatial analysis, including how the four key principles of data science have affected how we “do” spatial analysis.
You’ve then had a thorough introduction into how we use R and R-Studio as a GIS - and as we can see through our practical, in comparison to Q-GIS, there is a lot more to learn, as we need to know a lot about programming, particularly to “wrangle” our data - before we even get to map-making.
Furthermore, when it comes to map-making in R, this isn’t even as straight-forward! We need to know all about this “grammar of graphics” and how to layer our data and what parameters do what, which, compared to drawing a few boxes etc. in Q-GIS, is a whole lot more complicated!
You can therefore see that Geocomputation really requires a combination of foundational concepts in GIScience, Cartography and Programming in order to understand precisely what you’re doing - and even when you’ve had this foundational introduction, it can still feel overwhelming and a lot to learn - and that’s because it is!
I do not expect you to “get this” all at once, but this workbook is here for you to refer to as and when you need to get your “Aha” moments, that you’ll get a) on this course and b) as you, for example, complete your own independent research projects, such as your dissertations.
Take this all in good time, and we’ll get there in the end - and I will revisit lots of the concepts we’ve looked at over the last two weeks time and time again!
What you should realise however is that once you have this code written - you can just come back to it and copy and paste from your scripts, to use in other scripts, for example, changing variables, data files and, of course, parameter settings.
And that’s how you end up building up a) your scripts in the first place but b) your understanding of what this code does!
If you’re concerned that you need to know and understand every function - I can whole-heartedly say - no, you don’t. It takes time, experimenting and research to learn R.
For example, last week I had you clean the names of our crime dataset manually - I found out this week there is a great package called janitor that has a function called clean_names() that would do that all for us. We’ll use this in Week 6 for some data cleaning, so we won’t deviate now.
Ultimately programming - and increasing your “vocabulary” of packages and functions - is an iterative learning process and only one you’ll build upon by writing more and more scripts!
To help with all of this new learning, I recommend only one key reading for now:
Geocomputation with R (2020) by Robin Lovelace, Jakub Nowosad and Jannes Muenchow, which is found online here.
I’d recommend reading through Chapters 1, 2, 3 and 8.
We’ll continue to build on everything we’ve learnt over the last five weeks as we move into the second half of the module, where we focus more on spatial analysis techniques.
You’ll be probably happy to know we will focus less on programming concepts and more on spatial analysis concepts - and use what we know so far with programming to conduct the spatial analysis.
This should mean that our practicals will be a little shorter in terms of reading - and even more active in terms of doing!
Extension: Facet Mapping
So you’ve got this far and really want more work? Really? Are you serious?
Ok, well here we go! (And for those of you that don’t, do not worry, as we’ll be looking at this in more detail at Week 10!)
So how cool would it be if we could make a map for all 11 (12) months of data in an instant using code…?
Well that’s exactly what faceting is for!
According to Lovelace et al (2020):
Faceted maps, also referred to as ‘small multiples’, are composed of many maps arranged side-by-side, and sometimes stacked vertically (Meulemans et al. 2017). Facets enable the visualization of how spatial relationships change with respect to another variable, such as time. The changing populations of settlements, for example, can be represented in a faceted map with each panel representing the population at a particular moment in time. The time dimension could be represented via another aesthetic such as color. However, this risks cluttering the map because it will involve multiple overlapping points (cities do not tend to move over time!). Typically all individual facets in a faceted map contain the same geometry data repeated multiple times, once for each column in the attribute data. However, facets can also represent shifting geometries such as the evolution of a point pattern over time.
In our case, we want to create facet maps that show our theft rate over the 11 months and to do so, we need to add two bits of code to our original tmap approach.
First, in our tm_polygons() layer, we add all our months as a combined vector. We make this easy for ourselves by creating a month variable that stores these values from a selection of the names() function on our spatial data frame.
Second, we add a tm_facets() function that tells tmap to facet our maps, with a specific number of columns.
The code below shows how to create a basic facet map using this code.
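A sketch of a basic facet map in tmap version 3 syntax (an assumption about your installed version). It assumes the monthly rate columns all end in _2020, like jan_2020, and selects them by name pattern; if your columns are named differently, select them by position from names() instead. The toy grid, with three made-up months, is a stand-in used only if theft_crime_rate_sdf is not loaded:

```r
library(sf)
library(tmap)

# Stand-in data (made-up rates for three months), built only if the real object is missing
if (!exists("theft_crime_rate_sdf")) {
  toy_box <- st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 3000, ymax = 3000),
                               crs = st_crs(27700)))
  theft_crime_rate_sdf <- st_sf(
    jan_2020 = c(0.4, 1.1, 2.3, 3.8, 5.2, 6.9, 9.5, 12.4, 18.7),
    feb_2020 = c(0.6, 0.9, 2.8, 3.1, 4.7, 7.4, 8.8, 13.0, 17.2),
    mar_2020 = c(0.3, 1.4, 1.9, 4.2, 5.9, 6.1, 10.3, 11.6, 19.5),
    geometry = st_make_grid(toy_box, n = c(3, 3))
  )
}

# Collect the month column names into one character vector
month <- grep("_2020$", names(theft_crime_rate_sdf), value = TRUE)

# A vector of column names in col = gives one panel per column;
# tm_facets() sets how many panels sit side by side
tm_shape(theft_crime_rate_sdf) +
  tm_polygons(col = month) +
  tm_facets(ncol = 3)
```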
What I’d like you to do is figure out how to make this facet map more aesthetically pleasing - including changing the location of the legend (or removing it?) as well as altering the colours etc.
If you manage to create a facet map you are happy with, please export this and upload it to your relevant seminar folder!
Users can take digital photos in the field and link them to GPS coordinates in the GIS database. This allows users to establish a visual record of important features and their precise locations. By comparing photos of the same location taken at different times, users can notice changes to the property. (This can be particularly helpful for monitoring easements and identifying potential violations.)
GPS allows users to document the coordinates of property boundaries. In the past, surveyors used landmarks (which can be destroyed or moved over time) to define boundaries. Since GPS uses exact coordinates rather than relational landmarks, it produces measurements that remain accurate no matter what happens to the surrounding land or physical objects used as landmarks. (Note that accurate surveying of property boundaries necessitates the use of survey-grade equipment; see the heading “Survey-Grade” below. Also, depending on the purpose of the survey, the law may require the work to be completed by a licensed surveyor.)
- Access US Census data through American Factfinder and navigate the download tool to extract two years of data
- Clean US Census data to isolate variables of interest in a spreadsheet and OpenRefine
- Create a database to conduct a join on the data
- Export Joined data to a spreadsheet and create basic chart visualizations of that data
- Let's begin with a discussion of a fascinating use of census data to visualize distributions of people by race in the USA. Investigate this "Racial Dot Map" tool created by researchers at the University of Virginia. Consider these discussion questions:
- Where did you explore first? What were your first impressions of this data?
- What makes this an effective data visualization tool? What are its limitations?
- Explore the researcher's data portal by clicking the "what am I looking at?" Link. What principles of good data analysis are exhibited?
- What additional layers of data would you like to add to this racial dot map? What conclusions or ideas would adding this data allow viewers to consider or conclude?
- Browse the American Community Survey data to find two geographies of interest over two time periods in which you can investigate change in some set of variables. You'll need to make sure that you have data on the same geography level for the tables you choose in both years.
- Download both tables and import them into a spreadsheet
- Clean the columns by developing sensible column names and deleting the columns you don't want (probably the margins of error for this practice activity). Remember: no strange columns and no spaces in variable names!
- Save this file and import into OpenRefine to clean up fields. Delete records in which there is very little data. Replace no-value markers with 0 so we can use numeric functions on the fields
- Export the data from Open Refine back into a spreadsheet
- Copy the cleaned data into LibreOffice Base and create a master table with joined data on a key column for export
- Extract data from the database back into the spreadsheet for visualization
- Visualize the data you've gathered and do the write-up in the shared google doc located here.
- Prepare to give a short presentation on this data at the start of the next class.
Abdelrahman Mohammed Helmi is a Ph.D. candidate in the Faculty of Computers and Information, Helwan University. He received the M.Sc. in information systems from the Faculty of Computers and Information, Helwan University, Egypt, in 2017. His research focuses on Cloud Computing, Geographical Information Systems, Business Intelligence and the Internet of Things.
Marwa Salah Farhan holds a Ph.D. in Information Systems. She is a lecturer, information systems department, Faculty of Computers and information, Helwan University, Egypt. Her research interest focuses on Cloud Computing, Advanced Database Management, Big Data, Data Science, and Software Engineering.
Mona Mohamed Nasr is an Associate Professor and Ex-Vice Dean of Faculty of Computers and Information, Helwan University for Community Service and Environmental. Ex-Vice Dean at the Canadian International College, El Sheikh Zayed Campus. She received the M.Sc. in information systems from Faculty of Computers and Information, Helwan University, Egypt, 2000. Ph.D. in Information Systems from Faculty of Computers and Information, Helwan University, Egypt, 2006.
Reviews processed and recommended for publication to the Editor-in-chief by associate editor Dr. L. Bittencourt.
Census boundary data
Census area statistics provide counts of people or households for geographical areas broken down by socio-demographic characteristics such as age, gender or employment. Digitised boundary datasets (sometimes referred to as 'DBDs' or 'boundary data') are a digitised representation of the underlying geography of the census. They are often used within Geographical Information Systems (GIS) or Computer-Aided Design (CAD) systems.
Figure 1: Digital boundary example
Copyright statement: Contains National Statistics data © Crown copyright and database right 2012. Contains Ordnance Survey data © Crown copyright and database right 2012.
The geography of the census consists of a hierarchical subdivision of UK local government areas of various types down to sub-authority areas, such as wards, to lower levels created specifically for census purposes such as enumeration districts in 1971, 1981 and 1991 or output areas in 2001 and 2011. The smallest units can then be aggregated to produce larger areas. For the 2011 Census these include 'Super Output Areas', which come in two forms: Lower SOAs and Middle SOAs, with the latter being the larger. New geographies also exist for the 2011 Census, specifically the Workplace Zones and Census Merged Wards. Readers should consult the ONS product guide for a fuller description.
For example, the geography of the 1991 Census for England consisted of a 4-level hierarchy: enumeration districts (EDs) at the lowest level nest within wards, districts and counties.
Figure 2: 2011 Census Geography hierarchy going from Country to Local Authority to Middle Layer Super Output Area (MSOA) to Lower Layer Super Output Area (LSOA) to Output Area
The digitised co-ordinates (points, lines, areas) which make up these census geographies are available as digitised boundary datasets. These form the areal representation 'buckets' against which various census statistics e.g. counts of households, proportion of males:females etc. can be associated and subsequently visualised and analysed.
What can digitised boundary datasets tell us?
Census area statistics contain a pointer (generally a code such as 'E09000022' which represents the 2011 code for the London Borough of Lambeth), to the geographical census areas to which they relate. By linking census area statistics with the corresponding digitised boundary datasets for a specific census year, the census attributes can be visualised as a map. Mapping census datasets in this way allows for an exploration of the characteristics of census datasets geographically and may provide additional demographic, socio-economic and cultural insights into the census data.
As an example, it is possible to explore the patterns of housing tenure recorded in the census - such as the proportion of people who live in local authority housing. By linking the census statistics to DBDs of county boundaries or output areas within a specific region/area and producing a shaded choropleth map of the numerical values held in the census dataset, it can be shown how housing in one region/area differs from another and whether there are any interesting patterns in the geographical distribution of census variables.
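The linking step described above can be sketched in R with the sf library. Everything in this example is illustrative: the two unit squares stand in for real boundary polygons, the column names are hypothetical, and the values are made up. Only the idea of joining on a shared area code is taken from the text:

```r
library(sf)

# Helper that builds a unit square starting at x0 (a stand-in for a real boundary)
square <- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0))))
}

# Boundary data: one polygon per area, identified by its census code
boundaries <- st_sf(
  area_code = c("E09000022", "E09000033"),
  geometry = st_sfc(square(0), square(1))
)

# Census area statistics: a plain table carrying the same codes (made-up values)
census_stats <- data.frame(
  area_code = c("E09000033", "E09000022"),
  example_value = c(20, 10)
)

# Join the statistics to the boundaries on the shared code
joined <- merge(boundaries, census_stats, by = "area_code")

# A quick choropleth of the joined attribute
plot(joined["example_value"])
```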
Figure 3: Choropleth map showing proportion of people working more than 49 hours per week by South East England Local Authority as recorded by the 2011 Census.
Using the census statistics and boundaries in a Geographical Information System (GIS) allows for spatial analysis of the census data and its combination with other non-census geographically referenced datasets.
Digitised boundary datasets can be used for:
- map production for research articles
- data synthesis and development of residential neighbourhoods
- geostatistical analysis of demographic or employment change
- small area analysis and deprivation studies
- health care research – incidence mapping and analysis
- historical demographic research
The geography of the decennial census is not fixed. For the same physical local area, the output geography used in the 1971, 1981, 1991, 2001 or 2011 Censuses may be quite different.
Figure 4a: 2001 Census Output Areas in Leeds city centre drawn on top of a 2011 Ordnance Survey map.
Figure 4b: 2011 Census Output Areas for the same location in Leeds city centre. Some 2001 Census Output areas have been split for 2011 to ensure Output Area population thresholds are retained given new urban housing developments between 2001 and 2011.
Significantly, different research questions may require mapping of the same census statistic at different scales and in different locations.
Figure 5a: 2011 Census population by all English and Welsh Local Authorities.
Figure 5b: 2011 Census population by all Lower Layer Super Output Areas within Leeds Local Authority. The same census statistic can be analysed at different geographic scales.
UK Data Service Census Support provides support for and access to a variety of facilities and tools by which users can use the full collection of digitised boundary datasets and supporting datasets, including geographic look-up tables. These datasets are available either pre-packaged or through dynamic user-driven interfaces permitting user-defined custom selection of boundaries and look up tables.
Quick access to the most regularly requested boundaries as ready-to-use national datasets is also provided.
Functionally more complex data extraction facilities allow users to select boundaries for any specific area required, for the census year required, and in a range of different data output formats. This flexibility allows users to download census output areas for several counties or for a specific ward or district. During the boundary selection process, the chosen boundaries can be previewed over a topographic back-drop map before finally being extracted in one of several data formats for use with different GIS and mapping packages.
The range of tools available includes:
This facility lets users quickly download the most regularly requested census boundaries available in popular formats.
Boundary Data Selector
This facility lets users select the boundaries they want, for the areas they want, in the format they want.
This facility lets users obtain and manipulate complex geographical and postcode data in a straightforward way.
Postcode Data Selector
This facility allows users to download the set of postcodes that they want from postcode directories released between 2001 and the present day.
Postcode Directory Download
This facility allows users to download complete versions of current and historical postcode directories (sometimes referred to as look-up tables).
This facility allows users to create customised choropleth maps from their own statistical data.
WICID (Web-based Interface to Census Interaction Data)
This facility allows users to select and download migration and journey-to-work flow data collected by the Census of Population.
This facility contains boundary data bundled with census aggregate data for the 2001 and 1991 Censuses.
Many boundary types are available for England, Wales, Scotland and Northern Ireland (2001 and 2011 data only) including:
- Census boundaries e.g. 2011, 2001, 1991, 1981 and 1971 Census boundaries
- Administrative boundaries e.g. districts, unitary authorities, health boundaries
- Electoral boundaries e.g. wards, parliamentary constituencies
- Environmental boundaries e.g. national parks, urban footprints
- Postal boundaries and postcode-related boundaries
- Historical boundaries pre-1971 census and administrative boundaries from 1840 onwards
- Other boundaries e.g. synthetic neighbourhood localities
Important supporting datasets include geographic look-up tables. These include versions of the ONS Postcode Directory (ONSPD) from the Office for National Statistics, which provides details of the locations of current and historic postcodes along with details of other geographic areas in which the postcode is located.
Such datasets provide a valuable means by which events or occurrences (such as disease, crimes, customer residence etc.) can be allocated from a postcode to another area such as an electoral ward or health area.
Access to certain digitised boundary datasets and look-up tables requires acceptance by registered users of additional 'special conditions'.
Principally these restrict use of the digitised boundary datasets and associated data to teaching and research purposes.
Integrated 2001 and 1991 Census Digitised Boundary Data
We also provide users with digital boundary data to accompany 2001 and 1991 census data in a combined form - that is, boundary data already joined with census statistical data. These downloads are provided in a range of standard GIS formats ready for mapping and spatial analysis.
UK Census geography
Rees P., Martin D.M. and Williamson P. (2002) The census data system, Chichester: Wiley.
Handling spatial data and GIS
Longley P.A., Goodchild M.F., Maguire D.J. and Rhind D.W. (2001), Geographic information systems and science, Chichester: Wiley.
Martin, D. (1996) Geographic information systems: socioeconomic applications, London: Routledge.
Monmonier, M. (1996) How to lie with maps, Chicago: University of Chicago Press.
Walford, N. (2002) Geographical data: characteristics and sources, Chichester: Wiley.
How to download Boundary data
View our video tutorial on how to download boundary data. The service offers a range of digitised boundary data, including boundaries designed for use with census data, in several GIS (geographic information system) formats.
Geocoding systems evaluated
Five desktop geocoding systems were evaluated. The geocoding systems used in this analysis were chosen from among the members of the Cooperative Research Centre for Spatial Information (CRC-SI). All 43 industrial partners of the CRC-SI were solicited to participate in this project through an expression of interest (EOI) process which requested information on the geocoding platforms provided by each partner. A set of conditions had to be met, the main one being that the platform had to be a stand-alone desktop system. Of those that responded, five were able to provide evaluation licenses and reference data that could be installed and tested as part of the evaluation. Four of the five systems represent state-of-the-art and well-known commercial geocoding system offerings from companies that provide geocoding solutions for Australia and elsewhere in the world. All systems remain anonymous in this paper as per non-disclosure agreements and are indicated simply by the names “Geocoder A” through “Geocoder E”; position in this list of five (A to E) was assigned randomly. Each geocoding system was tested using each applicable reference data source and input data combination.
Reference data sources
The reference data sources utilized in these experiments include the most up-to-date and accurate reference data files available for both the state of Western Australia (WA) and the entire country of Australia. The state-level files used were the Property Street Address (PSA) data files distributed by the Western Australian Land Information Authority (Landgate). These files include digital parcel boundaries (polygons) and parcel centroids (points) for all addresses in WA. Also used was an extension to the PSA, called PSA+ within this report, which included spatially referenced place names also known as “alias tables”. These files are updated continuously and are the official government land records of the state, which include the current postal address associated with each property.
The national-level files used in this study were the Geocoded National Address File (G-NAF) maintained and distributed by the Public Sector Mapping Agency (PSMA) Australia Limited. These files are the nation-wide authoritative address data sources for the entire country of Australia. These data are collected from local, state, and national-level government agencies (including Landgate for WA), cleaned, integrated, and prepared for dissemination by PSMA. These data include the digital parcel boundaries (polygons) and parcel centroids (points) for nearly all addresses in Australia, along with the current postal address associated with each property.
Input data sources
The input data used for this study were chosen to represent three tiers of data types: health service utilization data, administrative list data, and gold standard data. The quality of these data ranges from exceptionally clean data that have been manually corrected, which all geocoding systems should be able to process correctly, to exceptionally dirty data that are known to contain high levels of challenging geocoding scenarios, which should cause errors in all geocoding systems. These diverse sets of input data were chosen in order to compare how each of the geocoding systems handled different input data qualities and to tease out how the internal geocoder processing techniques added to or subtracted from the resulting geocode quality produced by each system. Data use agreements with the data stewards responsible for the collection, curation, and maintenance of the data sets (including the gold standard data) used in this evaluation preclude the naming of the data sets or the government agencies that provided them.
Gold standard data
The gold standard data used for this study represent an exceptionally clean data set (data set A, n = 2,203): a data source with no errors, which should be correctly processed by all geocoding systems; non-matches in this system would be considered false negatives. This data set contained address data drawn from a previous, larger study. Each of the records in this data set represented an address that was not capable of being successfully geocoded using an automated geocoding system. These records were manually reviewed and processed to improve their output quality by verifying and/or correcting postal address attributes and the true location of the geocoded point following a method similar to that presented in Goldberg et al. (2008). The records were ground truthed using a variety of methods including aerial imagery, online “street view” software, contact of the parties responsible for the address to confirm address attributes, and linkage with official government records and public domain data sources. The result of these painstaking efforts was the construction of an input data set of addresses with attribute data (number, street name, suffix, locality, postcode, etc.) that were manually confirmed to be correct.
Administrative data
The administrative data set (data set B, n = 1,364,058) used for this study was drawn from official records of a large WA administrative database. These data contain the official addresses of a subset of residents of WA, and represent input address data that should be of fairly high quality. These data are representative of many administrative lists that are used to send out government mailings, confirm postal delivery addresses, and support other essential government services.
Health service utilization data
The health service utilization data set (data set C, n = 1,264,941) used for this study was chosen to represent a data source with numerous errors in the input address which would be the most difficult to geocode and result in the highest number of non-matches, false positive matches (incorrect matches), and false negative non-matches (incorrect non-matches). These data were drawn from the health service utilization records of a specific Western Australian health agency and are representative of the quality of data that occur when data are collected through a patient-facing organization where the patient self-reports his/her postal address.
The primary challenges of these data were threefold:
- Blank fields in addresses, resulting in input data with limited input address fields, sometimes with just a locality and/or just a postcode;
- Named places such as prisons, nursing homes, and Aboriginal communities, instead of street addresses; and
- Historical data which includes many versions of data input systems, all of which captured data in different ways, ranging over a number of years.
Variations to data collection procedures through time include:
- Truncations to save characters;
- Transposition and introduction of new fields as user interfaces were updated; and
- Use of various codes for unknown/missing information (e.g., entering postcode 9999 when the postcode was unknown versus leaving it blank or entering 0000).
These data included numerous types of other frequently occurring errors including misspellings to all components of the input address (number, street name, suffix, locality, postcode, etc.), the use of incorrect locality names and postcodes, and all combinations of missing attributes for all fields of the input address.
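To make the missing-value problem concrete, the sketch below shows one way placeholder postcodes such as 9999 and 0000 could be recognised and the usable fields of a record listed. It is illustrative only: the field names and helper functions are assumptions, and the study itself applied no such preprocessing before geocoding.

```python
# Placeholder codes named in the text; a real data set may use others.
MISSING_POSTCODE_CODES = {"", "0000", "9999"}


def clean_postcode(raw):
    """Return the trimmed postcode, or None when it is blank or a placeholder."""
    value = (raw or "").strip()
    if value in MISSING_POSTCODE_CODES:
        return None
    return value


def available_fields(record):
    """List the address fields that still carry usable content."""
    usable = []
    for name, value in record.items():
        text = (value or "").strip()
        if name == "postcode":
            # Treat placeholder postcodes the same as an empty field.
            text = clean_postcode(text) or ""
        if text:
            usable.append(name)
    return sorted(usable)


# Hypothetical record: only the locality is usable once 9999 is discounted.
sparse_record = {"number": "", "street": " ", "locality": "Perth", "postcode": "9999"}
usable_fields = available_fields(sparse_record)
```

A count of usable fields per record is one simple way to quantify how much of a data set consists of the limited-field addresses described above.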
The experiments performed for this research attempted to apply the framework and metrics described above in the context of the Western Australia (WA) Department of Health (DoH) as a test-case for evaluating their applicability for comparing a set of available geocoding platforms. To do so, the characteristics of each geocoding system were assessed across each aspect of the evaluation framework presented earlier. Table 9 was constructed in consultation with the WA DoH as the features and capabilities of geocoding systems which were important to the organization. Each system was evaluated based on published literature and documentation of the geocoding systems. Additional communication with each vendor was necessary to determine all capabilities because not all vendors use the same terminology for all items.
The project team attempted to install each system “out-of-the-box” without customization as much as possible. This included importing reference data layers into some of the systems as necessary, i.e., those that did not include the reference data as part of the software, instead requiring a geocoding reference data layer to be constructed or specified. An exception to this is the programming required to install Geocoder A, which is described below.
The three input data sets were batch-processed through each of the geocoding systems on the same team member’s computer in sequence. No data filtering, data cleansing, address standardization, or address normalization operations were applied to any of the input data prior to geocoding being performed. All data were processed directly as received from the data custodians, although the first step in most batch geocoding systems is to standardize and normalize the input data internally within the geocoding system.
The experiments controlled for differences in geocoding quality due to the three main components of geocoding systems: (a) input data quality; (b) geocoding algorithms, which include all components of the geocoding system that are beyond the control of a geocode user (address standardization and normalization, feature matching, and feature interpolation); and (c) the reference data layers used. To do so, each of these three components was evaluated separately by constructing usage scenarios that varied one component while holding the other two constant.
For example, to test the effect of input data quality across each geocoding system, all three data sets were processed by each geocoder using the same reference data sources (as could be achieved based on different reference data set support per geocoder). Holding the reference data sets static and changing the input data set allowed for analysis of the overall effect of excellent (Gold Standard), moderate (Administrative), and poor (Health) quality data on each geocoding system. Similarly, the effect of reference data set usage was evaluated by holding the input data set constant and processing it with different combinations of reference data layers, per geocoding system.
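The vary-one, hold-two design can be sketched as a filter over the full set of runs. The Python below is a hedged illustration: the `SUPPORTED` mapping of which geocoder accepts which reference data is invented (the paper does not list it here), and the function names are hypothetical.

```python
from itertools import product

GEOCODERS = ["A", "B", "C", "D", "E"]
REFERENCE_DATA = ["PSA", "PSA+", "G-NAF"]
INPUT_DATA = ["gold", "administrative", "health"]

# Assumption for illustration only: Geocoder A cannot use PSA+.
SUPPORTED = {
    "A": {"PSA", "G-NAF"},
    "B": {"PSA", "PSA+", "G-NAF"},
    "C": {"PSA", "PSA+", "G-NAF"},
    "D": {"PSA", "PSA+", "G-NAF"},
    "E": {"PSA", "PSA+", "G-NAF"},
}


def all_runs(supported):
    """Every applicable (geocoder, reference data, input data) combination."""
    return [
        (g, r, i)
        for g, r, i in product(GEOCODERS, REFERENCE_DATA, INPUT_DATA)
        if r in supported[g]
    ]


def vary_input(runs, geocoder, reference):
    """Runs that differ only in input data quality."""
    return [run for run in runs if run[0] == geocoder and run[1] == reference]


def vary_reference(runs, geocoder, input_data):
    """Runs that differ only in the reference data layer."""
    return [run for run in runs if run[0] == geocoder and run[2] == input_data]


runs = all_runs(SUPPORTED)
```

Comparing results within one `vary_input` or `vary_reference` group isolates a single factor, which is the point of the design described above.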
Is there a way to upload non-spatial lookup tables in CartoDB? - Geographic Information Systems
National Stream Quality Accounting Network and National Monitoring Network Basin Boundary Geospatial Dataset, 2008-13 1.0 vector digital data
https://water.usgs.gov/lookup/getspatial?ds641_nasqan_wbd12 Nancy T. Baker
National Stream Quality Accounting Network and National Monitoring Network Basin Boundary Geospatial Dataset, 2008-13 1 U.S. Geological Survey Data Series Report Data Series Data Series 641
This dataset was created to assist in analysis and interpretation of the U.S. Geological Survey NASQAN water-quality data. Geospatial watershed data are often used in conjunction with water-quality data for modeling, calculation of loads, and calculation of selected basin characteristics.
Basin boundaries were obtained from four sources: (1) the majority of the basin boundaries were derived from the WBDHU12_03Feb2011_ArcGIS9.2_File.gdb 12-digit Hydrologic Unit National Watershed Boundary Dataset (WBD12) (U.S. Department of Agriculture, 2009); (2) the basin boundary from the NASQAN station location to the nearest WBD12 Subwatershed boundary was generated by interpreting topographic information and digitizing from the ArcGIS Online 1:24,000 scale USA topographic maps (National Geographic Society, 2011); (3) basins extending into Canada farther than the WBD12 were derived from 1:1,000,000 "canadwshed_1m_v6-0_shp" data (Canadian Geospatial Data Infrastructure, 2009); and (4) the portion of the Rio Grande near Brownsville, TX (08475000) that extends into Mexico was obtained from Patino and others (2004). Additional information about closed and noncontributing basins was obtained from Seaber and others (1987) for the Arkansas River, and Lurry and others (1998) for the Rio Grande River.
Special Note for Lower Mississippi River and Atchafalaya River
Basins for stations 07381590 Wax Lake Outlet at Calumet, LA and 07381600 Atchafalaya River at Morgan City, LA should be considered together. The Atchafalaya River bifurcates below Melville, LA with a portion of the flow draining through the main channel to Morgan City and a portion draining into Wax Lake Outlet into the Gulf of Mexico. The basin boundary for 07381590 Wax Lake Outlet includes only the area draining Bayou Teche and a small part of the Atchafalaya main stem near the Wax Lake Outlet. There is no reliable way to determine the areal portion of the Atchafalaya River Basin that drains through the Wax Lake Outlet and the portion that drains through the main stem. Users of the NASQAN water-quality data for Wax Lake Outlet and/or the Atchafalaya River at Morgan City should consider both basin boundaries and combined water-quality data for both stations.
In addition, a portion of flow on the lower Mississippi River is diverted through the Old River Outflow Channel to the Atchafalaya River. Affected stations on the Mississippi River side include 07373420 near St. Francisville, LA, 07374000 at Baton Rouge, LA and 07374525 at Belle Chasse, LA. Affected stations on the Atchafalaya River side of the diversion include 07381495 at Melville, LA, 07381590 Wax Lake at Calumet, LA, and 07381600 at Morgan City, LA. All of these stations should be considered together when assessing water quality and flow data. Users of water-quality data for these stations should also consider flow records for U.S. Geological Survey station Old River Outflow Channel near Knox Landing, LA (07294800), Mississippi River at Tarbert Landing, MS (07295100), and Atchafalaya River at Simmesport, LA (07381490).
Special Note for the Rio Grande River
Much of the flow in the Rio Grande upstream from the Brownsville, TX station (08475000) is diverted into the Anzalduas Canal, which flows through the Anzalduas Dam near Mission, TX and then into Arroyo Colorado. Almost all the water withdrawn from the Arroyo Colorado or the Rio Grande for irrigation and municipal purposes is returned to the Arroyo Colorado. The Arroyo Colorado drains into the Laguna Madre (WBD12 HUCs 1211020801--1211020809), which effectively becomes an estuary for the Rio Grande during spring and summer irrigation seasons. Users of the NASQAN water-quality data for the Rio Grande near Brownsville, TX should consider flow and water-quality data for U.S. Geological Survey station Arroyo Colorado at Harlingen, TX (08470400). The Laguna Madre is included in the basin delineation for station 08475000.
The drainage area for this delineation of the Rio Grande River near Brownsville is 215,270 square miles, with 177,415 square miles contributing area, 35,291 square miles noncontributing area, and 2,564 square miles for the Laguna Madre diversion basin. The drainage area published in U.S. Geological Survey NWIS records is 176,333 square miles. The value published in NWIS does not include the noncontributing area or the 2,564 square miles for the Laguna Madre basin. When these two factors are taken into account, the resulting comparable drainage area is 177,415 square miles for this delineation, a 0.61 percent difference between the NWIS published area and the comparable drainage area for this version.
Special Note for the Colorado River
Much of the flow upstream of the U.S. Geological Survey station Colorado River at the Northern International Boundary (NIB), AZ (09522000) is diverted into the All-American Canal. The All-American Canal diversion is downstream of station Colorado River above Imperial Dam, Arizona-California (09429490). Flow data from this station should be considered when assessing water-quality records for the NIB station.
2007 2011 This dataset is for NASQAN and NMN stations in the network as of October 2007: http://water.usgs.gov/nasqan/
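The drainage-area comparison quoted above for the Rio Grande near Brownsville can be checked with a few lines of Python. The figures are taken from the text; the variable names are illustrative, and taking the percent difference relative to the NWIS value is an assumption (using the delineated area as the base also rounds to 0.61).

```python
# Areas in square miles, as quoted in the text for station 08475000.
CONTRIBUTING = 177_415
NONCONTRIBUTING = 35_291
LAGUNA_MADRE = 2_564
NWIS_PUBLISHED = 176_333

# Total area of this delineation is the sum of its three parts.
total_area = CONTRIBUTING + NONCONTRIBUTING + LAGUNA_MADRE

# Removing the two parts NWIS excludes gives the comparable area.
comparable_area = total_area - NONCONTRIBUTING - LAGUNA_MADRE

# Percent difference, assumed here to be relative to the NWIS value.
percent_difference = 100 * (comparable_area - NWIS_PUBLISHED) / NWIS_PUBLISHED
```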
Maintenance and update frequency: As needed
Bounding coordinates: west -127.87, east -65.35, north 48.24, south 22.87
USGS Thesaurus inland waters hydrologic units watershed basin boundary drainage area WBD ISO 19115 Topic Category Geoscientific Information Inland Waters
Geographic Names Information System
Nancy T. Baker U.S. Geological Survey, Indiana Water Science Center GIS Specialist mailing and physical address 5957 Lakeside Blvd Indianapolis Indiana
United States 317-290-3333 x185 317-290-3313 [email protected] 7:00 am to 3:30 pm M-F
Water Boundary Dataset (WBD12) (U.S. Department of Agriculture, 2009) 03Feb2011_ArcGIS9.2 vector digital data
U.S. Department of Agriculture
Accessed March 2011 http://datagateway.nrcs.usda.gov online 2009 publication date WBD12 Spatial and attribute information--primary source for basin boundary linework National Geographic Society
USA Topographic Maps Not specified image files served interactively
ArcGIS website served by National Geographic Society
U.S. Geological Survey original publisher of topographic maps
Accessed May 2011 http://server.arcgisonline.com/ArcGIS/rest/services/USA_Topo_Maps/MapServer online 2009 publication date TOPO Spatial information--provided underlay of 1:24,000 topographic maps so that the drainage divide could be digitized from the WBD12 line to the basin mouth (U.S. Geological Survey gaging station) Patino, C., McKinney, D.C., Maidment, D.R.
Development of a Hydrologic Geodatabase for the Rio Grande/Bravo Basin, AWRA Spring Specialty Conference: Geographic Information Systems (GIS) and Water Resources III, Nashville, TN, May 17-19, 2004 1.0 vector digital data
University of Texas, Center for Research in Water Resources
Accessed August, 2011 http://www.crwr.utexas.edu/riogrande.shtml online 2004 publication date RG Spatial information for station b08475000 (basin boundary south of US/Mexican border) Lurry, D.L., Reutter, D.C., and Wells, F.C.
Monitoring the Water Quality of the Nation's Largest Rivers Rio Grande NASQAN Program None fact sheet
Accessed August, 2011 http://water.usgs.gov/nasqan/docs/riogrndfact/riogrndfactsheet.html online 1998 publication date RGFS Supplemental information for station b08475000 (basin boundary south of US/Mexican border) Canadian Geospatial Data Infrastructure
Canadian Watershed Boundaries v6 vector digital data
Canada Land Inventory Level-I Digital Data.
Accessed December, 2010 http://geogratis.cgdi.gc.ca online 2009 publication date CAWBD Spatial information for stations extending north of the Canadian border and north of the WBD12 boundaries in Canada--b04267331, b14246900, and b15565447. Seaber, P.R., Kapinos, F.P., and Knapp, G.L.
Hydrologic Unit Maps None paper maps and text
Water Supply Paper 2294 https://pubs.usgs.gov/wsp/wsp2294/ paper maps 1987 publication date WSP2294 Spatial information for stations extending north of the Canadian border and north of the WBD12 boundaries in Canada--b04267331, b14246900, and b15565447.
Step 1: Used the Select By Attributes tool in ArcMap to select all 12-digit hydrologic unit (HU) watersheds upstream of the NASQAN gaging station from the WBD12 dataset. The selected HUs were exported to .shp files.
Step 2: For watersheds in which the mouth (gaging station) did not fall on a 12-digit HU line, the mouth of the watershed was digitized to the nearest HU line. 1:24,000 scale USA Topographic online maps were used to determine the basin divide from the nearest HU to the mouth of the basin.
Step 3: Dissolved all internal HU lines so that only the basin boundary polygon remained.
Step 4: Projected all shapefiles into Albers Equal Area projection, converted them into ArcInfo coverages, and edited as needed.
Step 5: For basins that extended into Canada farther than the WBD12, basin boundaries were delineated in the same manner described above using the 1:1,000,000 scale canadwshed_1m_v6-0_shp data (Canadian Geospatial Data Infrastructure, 2009). The Canadian portion of watersheds extending into Canada was then appended to the WBD12 derived polygons for the U.S. side of the basins using the ArcInfo Append command.
Step 6: The portion of the Rio Grande near Brownsville, TX (08475000) that extends into Mexico was obtained from the Rio Grande basin boundary generated by Patino and others, 2004. The Mexican portion of watershed was then appended to the WBD12 derived polygon for the U.S. side of the basin using the ArcInfo Append command.
Step 7: Converted all ArcInfo coverages into shapefiles for publication.
Step 8: Closed and noncontributing sub-basins were extracted from the WBD12 data. All sub-basins labeled "closed" in the WBD12 HUC_10_DS or HUC_12_DS attribute were selected. All sub-basins with NCONTRB_A greater than 0 and not labeled "closed" in the WBD12 HUC_10_DS or HUC_12_DS attribute were selected. The Intersect command in ArcGIS was then used to assign the closed or noncontributing sub-basins contained within each NASQAN or NMN basin. A separate shapefile containing all the closed and noncontributing sub-basins was generated for each NASQAN or NMN basin when those basins existed. The shapefile naming convention for these basins is the STAID preceded by the letter "n".
Step 9: Additional closed sub-basins were added to the shapefiles generated in Step 8 where the WBD12 information is incomplete. Often sub-basins upstream of a closed system are not labeled as closed. These basins were determined by visual inspection and added to the appropriate basin file. Seaber and others (1987) was also used to identify subwatersheds that should be included in the closed category.
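The attribute selection in Step 8 can be sketched in plain Python over attribute records. This is not the ArcGIS workflow itself: the exact attribute value ("closed", compared case-insensitively) and the function names are assumptions, while the attribute names HUC_10_DS, HUC_12_DS and NCONTRB_A and the "n" prefix come from the text.

```python
def classify_subbasin(record):
    """Return 'closed', 'noncontributing', or None for one WBD12 record."""
    labels = (record.get("HUC_10_DS") or "", record.get("HUC_12_DS") or "")
    # Assumption: the label value is the word "closed", in any letter case.
    if any(label.strip().lower() == "closed" for label in labels):
        return "closed"
    # Noncontributing: positive NCONTRB_A and not labeled closed.
    if (record.get("NCONTRB_A") or 0) > 0:
        return "noncontributing"
    return None


def closed_basin_name(staid):
    """Shapefile name stem: the STAID preceded by the letter 'n'."""
    return "n" + staid


# Hypothetical attribute records for three sub-basins.
records = [
    {"HUC_10_DS": "Closed", "HUC_12_DS": "", "NCONTRB_A": 0},
    {"HUC_10_DS": "", "HUC_12_DS": "", "NCONTRB_A": 12.5},
    {"HUC_10_DS": "", "HUC_12_DS": "", "NCONTRB_A": 0},
]
classes = [classify_subbasin(r) for r in records]
```

The spatial part of Step 8 (the Intersect command) and the visual inspection in Step 9 are not covered by this sketch.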