Category Archives: Computing

Notes on processing large files with GDAL

Context

I recently had a project where I was provided with a fair number of large aerial image files: 10-15 of them, at 600-800 MB each. I needed to cut out the area of interest (which covered some part of each of the images), mosaic the images into a single representation of the site, and find a way to share the data with non-specialists.

The workflow in my head was Clip – Mosaic – Webmap

Things to note

The images were too slow to render for me to mess about trying to do this in a GUI, and the GDAL functions in QGIS 3 (for me at least) don't seem to be up and running properly (mileage may vary).

A couple of things I have found when doing this in GDAL: it's best to define a data type for the bands, and to structure the data as a cloud optimised geotiff. Note that COPY_SRC_OVERVIEWS only does anything if the source file already has overviews (e.g. built with gdaladdo). The easiest way to create cloud optimised geotiffs is to use the following:

gdal_translate in.tif out.tif -co TILED=YES -co COPY_SRC_OVERVIEWS=YES -co COMPRESS=DEFLATE

Initially I was going to use gdal_merge.py to mosaic the clipped images as that is what is used in QGIS. However, gdal_merge.py reads everything into memory and I quickly found out that 16GB of RAM wasn’t enough as my swap partition started to be used and the whole thing ground to a halt.

The trick is to avoid loading everything into memory at once. First make a virtual raster using gdalbuildvrt, and then use gdal_translate (which processes the data in manageable chunks) to convert the vrt file to whatever format you want (e.g. cloud optimised geotiff). This was actually really fast, and used the structure found in the following commands:

gdalbuildvrt output.vrt /path/to/folder/of/*.tif
gdal_translate -of GTiff output.vrt mosaic.tif

This ends up with a relatively large image (1.2 GB in my case), which is still tricky to share. But fear not! We can use gdal2tiles.py to create a whole load of tiles and automatically generate some Leaflet code that lets you view the data in a browser. Then it's just a question of moving that folder onto a web server and sending your contacts the link.

A quick thing to note/remember is that the webmap zoom factors relate to the following:

  • 0 represents the whole world (1:500,000,000)
  • 19 is very close up (1:1000)
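As a rough rule of thumb (assuming the scale denominator halves with each zoom step, starting from 1:500,000,000 at zoom 0), the approximate scale for any zoom level can be worked out like this:

```shell
# Approximate web map scale for a given zoom level, assuming
# 1:500,000,000 at zoom 0 and a halving of scale per zoom step.
zoom=19
awk -v z="$zoom" 'BEGIN { printf "zoom %d is roughly 1:%d\n", z, 500000000 / 2^z }'
# prints: zoom 19 is roughly 1:953
```

That lines up with the figures above: zoom 19 comes out at roughly 1:1000.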

Workflow

The workflow that I used in the end was as follows:

First crop each aerial image to the site shapefile

maindir="/path/to/folder/of/incoming/aerial/files"
cd ${maindir}

for f in *.tif
do
  gdalwarp -ot Byte -of GTiff -cutline siteBoundary.shp \
    -crop_to_cutline -dstnodata 0.0 -overwrite \
    -co TILED=YES -co COPY_SRC_OVERVIEWS=YES -co COMPRESS=DEFLATE \
    "${f}" "outputfolder/${f%.tif}_cropped.tif"
done

Then build the virtual image

gdalbuildvrt output.vrt /path/to/folder/of/*.tif

Output the virtual image to a cloud optimised geotiff

gdal_translate -ot Byte -of GTiff -co TILED=YES \
  -co COPY_SRC_OVERVIEWS=YES -co COMPRESS=DEFLATE output.vrt mosaic.tif

Create the web tiles and map file using the following command

gdal2tiles.py -s EPSG:27700 -z 11-19 mosaic.tif

Sources

https://lostingeospace.blogspot.com/2011/04/rapid-mosaicking-with-gdal.html

http://www.cogeo.org/


Putting together the Scene From Above podcast

In December 2017 I started the Scene From Above podcast with my co-host Andrew Cutts. The following are a series of notes on how I host the podcast. As ever with this blog site, these notes are written for my own needs, but if they help anyone else then that's great.

I’m hosting the podcast episodes on Amazon’s S3 service because it is cheap. The S3 service is effectively a series of online folders, although I know that a) that isn’t what AWS calls them (they are buckets) and b) that isn’t what they are from a technology standpoint (but I’m skipping over that because we just need to think of them as folders for this use case).

I’m going to assume that we have access to an AWS account and are able to set up a new bucket with public permissions. First up, to make the S3 bucket web ready, I uploaded an HTML file called index.html that included the following code:

[Screenshot: the index.html redirect code]
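A minimal version of such a redirect page might look like this (the target URL here is a placeholder to swap for your own podcast page):

```html
<!DOCTYPE html>
<html>
  <head>
    <!-- Placeholder URL: point this at your own podcast page -->
    <meta http-equiv="refresh" content="0; url=https://example.com/podcast/">
    <title>Redirecting</title>
  </head>
  <body>
    <p>Redirecting to the podcast page…</p>
  </body>
</html>
```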

This redirects traffic from the S3 bucket back to the podcast page on my company website. To make this work, and for the RSS feed to work, I enabled static hosting on the S3 bucket by clicking Properties | Static Website Hosting | Enable website hosting and listing the index.html file as both the Index Document and Error Document.

However, navigating to the hosted website will likely result in a permission error, so under Properties | Permissions I chose Add bucket policy and added the following (where name.of.the.bucket was changed to be just that!):

[Screenshot: the S3 bucket policy]
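I can't be sure exactly which policy the screenshot showed, but the tutorials linked in the sources use a public-read policy along these lines (name.of.the.bucket is the placeholder to change):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::name.of.the.bucket/*"
    }
  ]
}
```

In this sketch the * is the Principal, i.e. it grants everyone read access to the objects in the bucket.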

I must confess that I’m not entirely sure what this does, but the instructions I was following called for this, and it looks like it’s allowing all (*) actions on the bucket.

As per the instructions I was following I created a data folder, images folder, and rss folder and made everything public. The podcast.xml file sits in the rss folder and links to the appropriate data files in the other folders.

Now to the bit that caused the biggest headache – creating the podcast.xml file.

The Scene From Above podcast XML feed can be found here for reference, and the feed it was built on can be found here. I tried a number of different example feeds but this one worked best for me.

Each episode is encapsulated within <item> tags and has the following mix of standard and iTunes specific tags:

[Screenshot: the episode XML tags]
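For illustration, an episode entry follows roughly this shape (all URLs, dates, lengths and durations here are made-up placeholders):

```xml
<item>
  <title>Episode 1 - placeholder title</title>
  <description>Placeholder episode description.</description>
  <pubDate>Wed, 06 Dec 2017 10:00:00 +0000</pubDate>
  <!-- The enclosure tag carries the episode location, length (bytes) and format -->
  <enclosure url="https://s3.amazonaws.com/name.of.the.bucket/data/episode1.mp3"
             length="12345678" type="audio/mpeg" />
  <guid>https://s3.amazonaws.com/name.of.the.bucket/data/episode1.mp3</guid>
  <itunes:author>Placeholder author</itunes:author>
  <itunes:duration>45:00</itunes:duration>
  <itunes:summary>Placeholder episode description.</itunes:summary>
  <itunes:explicit>no</itunes:explicit>
</item>
```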

Although all these tags seem to be required, the most important one is the <enclosure> tag, as that provides the information about the episode location, length and format. To get this information I perform a slightly convoluted process: I take the link for the episode mp3 file and paste it into a blog post I call Test (making sure I don't share it on social media). In WordPress I then go to WP Admin for my website, choose Settings | Media | Podcasts, and open the podcast feed in another tab. All being right with the world, the enclosure tag should be in there – I copy it and paste it into the episode details in the updated podcast.xml file. I then delete the Test post.

The reason we use the podcast.xml file is to enable finer control over the information in the feed. For this to work, it assumes that the instructions at this address have been followed: https://en.support.wordpress.com/audio/podcasting/

The elements that describe the podcast itself, rather than each episode, are laid out below:

[Screenshot: the channel XML tags]
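Again for illustration only (titles, URLs and categories below are placeholders, and the Rawvoice tags are left out), the channel-level wrapper follows roughly this shape:

```xml
<rss version="2.0"
     xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <title>Scene From Above</title>
    <link>https://example.com/podcast/</link>
    <language>en-gb</language>
    <description>Placeholder podcast description.</description>
    <itunes:author>Placeholder author</itunes:author>
    <itunes:image href="https://s3.amazonaws.com/name.of.the.bucket/images/cover.jpg"/>
    <itunes:category text="Placeholder category"/>
    <itunes:explicit>no</itunes:explicit>
    <!-- episode <item> entries go here -->
  </channel>
</rss>
```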

Yet again, there are iTunes specific and generic tags. I think that Rawvoice is a different aggregator, similar to the iTunes store.

Apparently Apple can be quite strict in their review process, so I needed to check that the feed worked. I used Cast Feed Validator, PodBase and the W3C Validator to check my feed file. It took me a while to get the feed working; in many cases the validators accepted the xml file whilst the iTunes desktop player didn't (File | Subscribe To Podcast and enter the path to the xml file).

The two causes of this were:

  • The podcast image has to be between 1200 and 3000 pixels in size, but also has to be perfectly square. Oh, and it can't be a large file – about 300k is preferred, so I made it a jpg.
  • I saved my podcast in .ogg format. This is fine for every podcast app out there APART FROM iTUNES! Make sure you use mp3 format when exporting from Audacity.

Once it worked in Podcast Addict and the iTunes desktop player, and I could embed the audio link into my blog, I was ready to submit the feed to the iTunes store. In the iTunes desktop player, click on the store and then look for the Submit a Podcast link on the right hand side of the window. This took me to a Podcasts Connect page where I logged in with my Apple ID and then followed the simple instructions. Within 4 or 5 hours the feed had been accepted into the store. Now a search for Scene From Above in the store will return the podcast details, or they are available online here.


Source material:

https://www.thepolyglotdeveloper.com/2016/04/host-a-podcast-for-cheap-on-amazons-s3-service/

https://www.thepolyglotdeveloper.com/2016/02/create-podcast-xml-feed-publishing-itunes/

https://techraptor.net/content/how-to-set-up-podcast-hosting-on-amazon-s3


VirtualBox shared folders

This helped when I didn't have access rights to the host folder (host OSX, guest Ubuntu Mate):

https://darrenma.wordpress.com/2012/07/18/you-do-not-have-the-permissions-necessary-to-view-the-contents-of-shared_folder/

Basically, type this in Linux (changing username to that of your user):

sudo usermod -a -G vboxsf username


ECW

I’ve been looking at getting ECW support in GDAL on Ubuntu 16.04. It looks as if the last version supported via ubuntugis was 12.04.

Then I found this: https://gis.stackexchange.com/questions/94917/how-do-i-add-and-view-ecw-raster-images-in-qgis-2-2-0-on-ubuntu-14-04-lts/200532#200532

I was going to try and compile my own GDAL (honest, I was) but it was quicker and easier to install gvSIG – which just works.


Rasterio conflict in Anaconda

Based on a comment from ocefpaf at https://github.com/conda-forge/gdal-feedstock/issues/69

I recently had an issue where my Anaconda environment failed whenever I wanted to import gdal or rasterio. I was getting the following error when I tried to import rasterio:

ImportError: libmfhdf.so.0: cannot open shared object file: 
No such file or directory

It seems to be a reasonably common issue based on online searches. The following fix worked for me, and I was able to install the other packages I needed into the new env; it (so far) seems to be working OK.

conda create -n rasterio_test_env python=3.5 rasterio --yes -c conda-forge 
source activate rasterio_test_env
python -c "import rasterio; print(rasterio.__version__)"

 
