Friday, November 20, 2015

Parsing an xml-file in Python with a Conditional Search to get Corner Coordinates of Satellite Images

I still find it somewhat tedious to search through xml-files. The following describes a solution for finding variables in an xml-file that meet certain conditions:

A Radarsat satellite image comes with a product.xml file. Deep in the xml-hierarchy, there is a long list of image tie points providing coordinates for selected image pixels:
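
A single tie point entry looks roughly like this (a sketch; the tag names are inferred from the index-based access used further below and the values are purely illustrative):

    <imageTiePoint>
        <imageCoordinate>
            <line>0.0</line>
            <pixel>0.0</pixel>
        </imageCoordinate>
        <geodeticCoordinate>
            <latitude>76.370271635301</latitude>
            <longitude>-16.11154750824</longitude>
        </geodeticCoordinate>
    </imageTiePoint>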


From this long list, I want to retrieve those tie points that are the corner coordinates of the image. In order to do this, I need to read two things from the xml-file:

1) the number of pixels and the number of lines of the image, in order to determine the corners
2) the tie points containing latitude and longitude information for these corners

A good starting point to understand how xml-files can be parsed in Python can be found here.

Using the ElementTree module in Python, I first parse the xml-file and get its root:

import xml.etree.ElementTree as etree

# xml_file_extracted holds the path to the product.xml file
tree = etree.parse(xml_file_extracted)
root = tree.getroot()

I can look at the immediate subelements of the root level using a simple loop:

>>> root = tree.getroot()
>>> for child in root:
...     print child.tag, child.attrib
... 
{http://www.rsi.ca/rs2/prod/xml/schemas}productId {}
{http://www.rsi.ca/rs2/prod/xml/schemas}documentIdentifier {}
{http://www.rsi.ca/rs2/prod/xml/schemas}sourceAttributes {}
{http://www.rsi.ca/rs2/prod/xml/schemas}imageGenerationParameters {}
{http://www.rsi.ca/rs2/prod/xml/schemas}imageAttributes {}
>>> 

Note that the name of the element is preceded by "{http://www.rsi.ca/rs2/prod/xml/schemas}", so if you search for an element named rasterAttributes, the full name is in fact "{http://www.rsi.ca/rs2/prod/xml/schemas}rasterAttributes".

I now need to search for this element "rasterAttributes". I do not know its exact path within the xml-file, but I know from looking at the file that this element contains the number of pixels and the number of lines. Using "root.iter" I can iterate through all elements with a given name; I use it here as a simple way to find the single rasterAttributes element:


    for attributes in root.iter('{http://www.rsi.ca/rs2/prod/xml/schemas}rasterAttributes'):
        # getchildren() returns a list of the direct subelements
        attribute_list = attributes.getchildren()
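
To save typing the long namespace string each time, it can be kept in a variable; an equivalent sketch:

    ns = '{http://www.rsi.ca/rs2/prod/xml/schemas}'
    for attributes in root.iter(ns + 'rasterAttributes'):
        attribute_list = attributes.getchildren()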


The attribute_list gives me all the subelements, the children of the rasterAttributes element:

[<Element '{http://www.rsi.ca/rs2/prod/xml/schemas}dataType' at 0x6158588>,
<Element '{http://www.rsi.ca/rs2/prod/xml/schemas}bitsPerSample' at 0x61585f8>,
 <Element '{http://www.rsi.ca/rs2/prod/xml/schemas}numberOfSamplesPerLine' at 0x6158630>,
 <Element '{http://www.rsi.ca/rs2/prod/xml/schemas}numberOfLines' at 0x6158668>,
 <Element '{http://www.rsi.ca/rs2/prod/xml/schemas}sampledPixelSpacing' at 0x61586a0>,
 <Element '{http://www.rsi.ca/rs2/prod/xml/schemas}sampledLineSpacing' at 0x61586d8>,
 <Element '{http://www.rsi.ca/rs2/prod/xml/schemas}lineTimeOrdering' at 0x6158710>,
 <Element '{http://www.rsi.ca/rs2/prod/xml/schemas}pixelTimeOrdering' at 0x6158748>]

If I look at the rasterAttributes element in the xml-file, it looks roughly like the following sketch (element names as in the list above, values purely illustrative):
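
    <rasterAttributes>
        <dataType>Complex</dataType>
        <bitsPerSample>16</bitsPerSample>
        <numberOfSamplesPerLine>10176</numberOfSamplesPerLine>
        <numberOfLines>9984</numberOfLines>
        <sampledPixelSpacing>2.0</sampledPixelSpacing>
        <sampledLineSpacing>2.0</sampledLineSpacing>
        <lineTimeOrdering>Increasing</lineTimeOrdering>
        <pixelTimeOrdering>Decreasing</pixelTimeOrdering>
    </rasterAttributes>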



It is the numberOfSamplesPerLine and the numberOfLines elements I want to retrieve. I could simply get these from the attribute_list using:

attribute_list[2].text
attribute_list[3].text

attribute_list[2] reads the third element of the list, which is "numberOfSamplesPerLine", and ".text" is the ElementTree way to access its value.

This approach of accessing the attribute_list by position would work if the structure of the xml-file always remained the same. However, different products from Radarsat have slightly different xml-files: for SLC products, for example, the wanted elements are at the fourth and fifth position of the attribute_list. The two lines above will therefore not always retrieve the correct element, so one needs to check that the correct element is read. In order to do this, I loop through the attribute_list, which contains all subelements within the rasterAttributes element, and if an element has the correct name, I read its value into a variable:


for attribute in attribute_list:
    if attribute.tag == "{http://www.rsi.ca/rs2/prod/xml/schemas}numberOfSamplesPerLine":
        numberOfSamplesPerLine = float(attribute.text) - 1
    if attribute.tag == "{http://www.rsi.ca/rs2/prod/xml/schemas}numberOfLines":
        numberOfLines = float(attribute.text) - 1

Why the minus 1? The tie point locations are zero-based: the first pixel of the first line has the location (0,0), so the "address" of the last pixel is one less than the number of pixels and lines. An image with 10176 samples per line, for example, has pixel positions 0 through 10175.

Now I know the number of pixels and the number of lines of my image. The next step is to search through all elements called "imageTiePoint" and find the ones matching the corners of the image. The upper left corner is the one where the pixel location is at line 0 and pixel 0, the upper right is at line 0 and pixel "numberOfSamplesPerLine", and so on. Again, I loop through all elements called imageTiePoint, create a list "child_list" containing all subelements, and then read the pixel and line values. If these match a corner, I read the corresponding latitude and longitude:


for tiepoint in root.iter('{http://www.rsi.ca/rs2/prod/xml/schemas}imageTiePoint'):
    child_list = tiepoint.getchildren()

    # child_list[0] is the image coordinate (line, pixel),
    # child_list[1] the geodetic coordinate (latitude, longitude)
    line      = float(child_list[0][0].text)
    pixel     = float(child_list[0][1].text)
    latitude  = float(child_list[1][0].text)
    longitude = float(child_list[1][1].text)

    #check if upper left
    if (line == 0.0) and (pixel == 0.0):
        sar_upperleft_x = longitude
        sar_upperleft_y = latitude

    #check if upper right
    if (line == 0.0) and (pixel == numberOfSamplesPerLine):
        sar_upperright_x = longitude
        sar_upperright_y = latitude

    #check if lower left
    if (line == numberOfLines) and (pixel == 0.0):
        sar_lowerleft_x = longitude
        sar_lowerleft_y = latitude

    #check if lower right
    if (line == numberOfLines) and (pixel == numberOfSamplesPerLine):
        sar_lowerright_x = longitude
        sar_lowerright_y = latitude


For these I actually do access the values through their position within the child_list, since within the imageTiePoint element the order appears to be the same for all Radarsat products. It would of course be safer to check the element names, the same way as I did for the rasterAttributes element.
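
A sketch of such a name-based lookup; the nested tag names (imageCoordinate, geodeticCoordinate, line, pixel, latitude, longitude) are assumptions here and should be verified against the actual product.xml:

    ns = '{http://www.rsi.ca/rs2/prod/xml/schemas}'
    for tiepoint in root.iter(ns + 'imageTiePoint'):
        # look the values up by name instead of by position;
        # verify these tag names in your product.xml
        line      = float(tiepoint.find(ns + 'imageCoordinate/' + ns + 'line').text)
        pixel     = float(tiepoint.find(ns + 'imageCoordinate/' + ns + 'pixel').text)
        latitude  = float(tiepoint.find(ns + 'geodeticCoordinate/' + ns + 'latitude').text)
        longitude = float(tiepoint.find(ns + 'geodeticCoordinate/' + ns + 'longitude').text)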

With these values I can create a WKT representation of a polygon, which is then the bounding box of the satellite image (note that only the upper left and lower right corners are used, giving an axis-parallel box in longitude/latitude):
    wkt_sarimage = "POLYGON((" + str(sar_upperleft_x) + " " +  str(sar_upperleft_y) + "," \
                  + str(sar_upperleft_x) + " " + str(sar_lowerright_y) + \
                  "," + str(sar_lowerright_x) + " " + str(sar_lowerright_y) + "," + str(sar_lowerright_x) \
                  + " " + str(sar_upperleft_y) + "," + str(sar_upperleft_x) + " " +  str(sar_upperleft_y) + "))"

Friday, October 23, 2015

Batch Downloading Sentinel Images from ESA's Scihub with Python

ESA's Sentinel Satellite Images can be downloaded from https://scihub.esa.int/ with batch scripting as described here: https://scihub.esa.int/userguide/5APIsAndBatchScripting

The following describes a simple script for batch-downloading Sentinel images over an area of interest using Python. The search request in the version below is hard-coded into the script, but it solves the main task of downloading the files. This is a first attempt and will certainly be improved; I will update this page then.

Please note that scihub now has two alternative download locations. The one called "dhus" currently contains data going further back in time, but is not very stable. The one called "apihub" contains only the last 30 days, but is dedicated to script downloads and is more stable. To use apihub, replace "dhus" with "apihub" in all links below.


Authenticate at the Webpage

Since scihub requires registration, any web-request passed to scihub needs authentication. So first you have to set up authenticated access to this page. The tools from Python's urllib2 library needed for this are described at this page.

    
    import urllib2, urllib

    # The url to authenticate at
    url = 'https://scihub.copernicus.eu/dhus/'
    
    # Create a password manager
    # as described at 
    # https://docs.python.org/2.7/howto/urllib2.html#openers-and-handlers
    username = 'username'
    password = 'password'
    p = urllib2.HTTPPasswordMgrWithDefaultRealm() 
    p.add_password(None, url, username, password)
    handler = urllib2.HTTPBasicAuthHandler(p)

    # create "opener" (OpenerDirector instance)
    opener = urllib2.build_opener(handler)
    
    # Install the opener.
    # Now all calls to urllib2.urlopen use our opener.
    urllib2.install_opener(opener)
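
A quick, optional smoke test for the opener: request a single row from the search API. With valid credentials this prints 200; a failed login raises an HTTPError:

    # q=* matches everything; we only want to see that the request succeeds
    test = urllib2.urlopen('https://scihub.copernicus.eu/dhus/search?q=*&rows=1')
    print test.getcode()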


Create the query and save resulting xml-response

The query itself is a weblink created as described here: https://scihub.esa.int/twiki/do/view/SciHubUserGuide/5APIsAndBatchScripting#Open_Search

As an example, the following weblink searches for all GRD type products ingested in scihub between the given dates within the area of interest defined by the polygon:

    https://scihub.esa.int/dhus/search?q=productType:GRD AND ingestionDate:[2015-09-30T00:00:00.000Z TO 2015-09-30T23:59:59.999Z] AND footprint:"Intersects(POLYGON((-16.11154750824 69.190260085686,-2.7521725082398 69.190260085686,-2.7521725082398 76.370271635301,-16.11154750824 76.370271635301,-16.11154750824 69.190260085686)))"&rows=10000&start=0

The first issue, however, is to reformat the link above so that the urllib library can read it. Pasting the link into the browser will give you the list of results, but inspecting the request closely shows that what is actually passed on is a link in which some, but not all, special characters are replaced: the spaces become %20 and the double quotes become %22, while characters such as ':', '(' and ')' stay as they are.

In order to replace the needed characters, you can use the urllib.quote(string[, safe]) command. The safe parameter defines the characters that are not to be replaced, and after some trial and error I found that this string needs to be
':()[]/?=,&'
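
A quick interactive check of what this does:

>>> import urllib
>>> urllib.quote('footprint:"Intersects(POLYGON((-16.1 69.2)))"', ':()[]/?=,&')
'footprint:%22Intersects(POLYGON((-16.1%2069.2)))%22'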

So you pass the url through urllib.quote, then open it, read the result and write it to an xml-file:


    urlrequest = urllib.quote('https://scihub.esa.int/dhus/search?q=productType:GRD AND ingestionDate:[2015-09-30T00:00:00.000Z TO 2015-09-30T23:59:59.999Z] AND footprint:"Intersects(POLYGON((-16.11154750824 69.190260085686,-2.7521725082398 69.190260085686,-2.7521725082398 76.370271635301,-16.11154750824 76.370271635301,-16.11154750824 69.190260085686)))"&rows=10000&start=0', ':()[]/?=,&')
    
    # Read page response into variable
    page = urllib2.urlopen(urlrequest).read()
    
    # Write returned xml list to textfile
    textfile = open('/home/max/Documents/test.xml', 'w')
    textfile.write(page)
    textfile.close()

The result is an xml-file containing a list of all matching products.

Parsing the xml file

The next step is now to parse this xml-file for the matching files and download these. How to do this I found nicely described in "Dive Into Python". It is worth playing a bit with the different examples from chapter 12 to understand how the ElementTree library works.

Basically, you create a list containing all elements called "entry", since these contain the matching scenes. For each entry element you need to find the identifier in the "id" element and the title.

The id element may, for example, contain the uuid "18f7993d-eae1-4f7f-9d81-d7cf19c18378", from which we know that the file can be downloaded from the address
https://scihub.copernicus.eu/dhus/odata/v1/Products('18f7993d-eae1-4f7f-9d81-d7cf19c18378')/$value

The name of the image itself is in the "title" element and is needed to name the downloaded Sentinel-file.
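
The skeleton of the returned xml-file looks roughly like this (a sketch showing only the elements used below; the title is a made-up example):

    <feed xmlns="http://www.w3.org/2005/Atom">
        <entry>
            <title>S1A_EW_GRDM_1SDH_20150930T...</title>
            <id>18f7993d-eae1-4f7f-9d81-d7cf19c18378</id>
        </entry>
    </feed>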

This code loops through all entries, finds the uuid element and the name of the Sentinel image, and creates the link where the file can be downloaded:

    tree = etree.parse('/home/max/Documents/test.xml')
    entries = tree.findall('{http://www.w3.org/2005/Atom}entry')
    for entry in range(len(entries)):
        uuid_element = entries[entry].find('{http://www.w3.org/2005/Atom}id')
        sentinel_link = "https://scihub.copernicus.eu/dhus/odata/v1/Products('" + uuid_element.text + "')/$value"
        title_element = entries[entry].find('{http://www.w3.org/2005/Atom}title')

Downloading each matching image


The last task is to download each file. I found some hints on this here, and adding them to the for-loop above, I get this code:

    tree = etree.parse('/home/max/Documents/test.xml')
    entries = tree.findall('{http://www.w3.org/2005/Atom}entry')
    for entry in range(len(entries)):
        uuid_element = entries[entry].find('{http://www.w3.org/2005/Atom}id')
        sentinel_link = "https://scihub.copernicus.eu/dhus/odata/v1/Products('" + uuid_element.text + "')/$value"
        title_element = entries[entry].find('{http://www.w3.org/2005/Atom}title')

        # Full path where downloaded file is to be stored
        destinationpath = "/home/max/Documents/SentinelDownloads/" + title_element.text + '.zip'
        
        #Open the file and read the data
        print "Downloading ", title_element.text
        downloadfile = urllib2.urlopen(sentinel_link)
        data =  downloadfile.read()
        #Write the data to the destination zipfile
        with open(destinationpath, "wb") as code:
           code.write(data)
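
Note that downloadfile.read() loads the complete product into memory before anything is written, which can be a lot for large Sentinel scenes. A sketch of a chunked alternative (the 1 MB chunk size is arbitrary):

    # Read and write the download in 1 MB chunks instead of all at once
    downloadfile = urllib2.urlopen(sentinel_link)
    with open(destinationpath, "wb") as code:
        while True:
            chunk = downloadfile.read(1024 * 1024)
            if not chunk:
                break
            code.write(chunk)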

To use all this, you now have to create your own query and replace the urlrequest variable, add your username and password, and define your destinationpath:

The complete script


import urllib2, urllib
import os
import xml.etree.ElementTree as etree

# Hard coded link returning list of matching scenes
# https://scihub.esa.int/twiki/do/view/SciHubUserGuide/5APIsAndBatchScripting#Open_Search

# Authenticate at scihub webpage
url =  'https://scihub.copernicus.eu/dhus/'
username = 'your_username'
password = 'your_password'
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

password_mgr.add_password(None, url, username, password)
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(handler)
urllib2.install_opener(opener)

# The request query
urlrequest = urllib.quote('https://scihub.copernicus.eu/dhus/search?q=productType:GRD AND ingestionDate:[2015-09-30T00:00:00.000Z TO 2015-09-30T23:59:59.999Z] AND footprint:"Intersects(POLYGON((-16.11154750824 69.190260085686,-2.7521725082398 69.190260085686,-2.7521725082398 76.370271635301,-16.11154750824 76.370271635301,-16.11154750824 69.190260085686)))"&rows=10000&start=0', ':()[]/?=,&')
# Read the response into page and write it to an xml-file
page = urllib2.urlopen(urlrequest).read()
textfile = open('/home/max/Documents/test.xml', 'w')
textfile.write(page)
textfile.close()

#Parsing the xml-file, the entry tag contains the results
tree = etree.parse('/home/max/Documents/test.xml')
entries = tree.findall('{http://www.w3.org/2005/Atom}entry')
print "Number of Scenes Found: ", len(entries)
for entry in range(len(entries)):
    #The uuid element allows to create the path to the file
    uuid_element = entries[entry].find('{http://www.w3.org/2005/Atom}id')
    sentinel_link = "https://scihub.copernicus.eu/dhus/odata/v1/Products('" + uuid_element.text + "')/$value"
    
    #the title element contains the corresponding file name
    title_element = entries[entry].find('{http://www.w3.org/2005/Atom}title')
    
    #Destinationpath with filename where download to be stored
    destinationpath = "/home/max/Documents/SentinelDownloads/" + title_element.text + '.zip'
    
    print
    print "Scene ", entry + 1, "of ", len(entries)

    #Check if file already was downloaded
    print sentinel_link
    if os.path.exists(destinationpath):
        print title_element.text, ' already downloaded'
        continue
    
    #Download file and read
    print "Downloading ", title_element.text
    downloadfile = urllib2.urlopen(sentinel_link)
    data =  downloadfile.read()
    
    # Write the download into the destination zipfile
    with open(destinationpath, "wb") as code:
        code.write(data)

print "Done downloading"