Friday, November 20, 2015

Parsing a xml-file in Python with a Conditional Search to get Corner Coordinates of Satellite Images

I still find it somewhat tedious to search through xml-files. The following describes a solution on how to find variables in a xml-files that meet certain conditions:

A Radarsat satellite image comes with a product.xml file. In the depth of the xml-hierarchy, there is a long list of image tie points providing coordinates for selected image pixels:

From this long list, I want to retrieve those tie points being the corner coordinates of this image. In order to do this, I need to read from the xml-file

1) the number of pixels and the number of lines of this image in order to determine the corners
2) the tiepoints containing latitude and longitude information for these corners

A good starting point to understand how xml-files can be parsed in Python can be found here.

Using the elementtree module in Python, I first parse the xml file and get the root of the xml-file:

import xml.etree.ElementTree as etree      
tree = etree.parse(xml_file_extracted)
root = tree.getroot()

I can look at the immediate subelements of the root level using a simple loop:

>>> root = tree.getroot()
>>> for child in root:
...     print child.tag, child.attrib
{}productId {}
{}documentIdentifier {}
{}sourceAttributes {}
{}imageGenerationParameters {}
{}imageAttributes {}

Note that the name of the element is preceded by "{}", so if you search an element named rasterAttribute, the full name is in fact "{}rasterAttribute"

I now need to search for this element "rasterAttribute." I do not know its exact path within the xml-file, but I know from looking at the file that this element contains the number of pixels and the number of lines. Using "root.iter" I can iterate through all elements with a given name, I use it here as a simple way to find the one rasterAttribute element:

    for attributes in root.iter('{}rasterAttributes'):
        attribute_list = attributes.getchildren()

The attribute_list gives me all the subelements, the children of the rasterAttribute element:

[<Element '{}dataType' at 0x6158588>,
<Element '{}bitsPerSample' at 0x61585f8>,
 <Element '{}numberOfSamplesPerLine' at 0x6158630>,
 <Element '{}numberOfLines' at 0x6158668>,
 <Element '{}sampledPixelSpacing' at 0x61586a0>,
 <Element '{}sampledLineSpacing' at 0x61586d8>,
 <Element '{}lineTimeOrdering' at 0x6158710>,
 <Element '{}pixelTimeOrdering' at 0x6158748>]

If I look at the rasterAttribute element in the xml-file, it would look like this:

It is the  numberOfSamplesPerLine and the numberOfLines elements I want to retrieve. I could simply get these from the attribute_list using:


attribute_list[2] reads the third element of the list being "numberOfSamplesPerLine" and the ".text" is the ElementTrees command to access its valus.

This approach of accessing the attribute_list would work if the structure of the xml-file always remained the same. However, different products from Radarsat have slightly different xml-files. For SLC products, the wanted elements are for example at the fourth and fifth position of the attribute_list.  The two lines above will therefore not always retrieve the correct element, so one needs to check that the correct element is read. In order to do this, I loop through the attribute_list, which contains all subelements within the rasterAttribute element, and if the element has the correct name, then I read the value into a variable:

for attribute in attribute_list:
      if attribute.tag == "{}numberOfSamplesPerLine":
           numberOfSamplesPerLine = float(attribute.text) - 1
      if attribute.tag == "{}numberOfLines":
           numberOfLines = float(attribute.text) - 1

Why the minus 1? The first pixel at line 1 and pixel 1 has the location (0,0), so the number of lines and rows is one higher than the "address" of the pixel.

Now I know the number of pixels and the number of lines of my image. The next step is to search through all elements called "imageTiePoint", finding the ones matching the corners of the image. The upper left corner is then the one where the pixel location is at line 0 and at pixel 0. The upper right at line 0 and pixel "numberOfSamplesPerLine" and so on ... Again, I loop through all elements called imageTiePoint, create a list "child_list" containing all subelements and then read the pixel and line values. If these match with the corners, I read the corresponding latitude and longitude:

for attribute in attribute_list:
    for tiepoint in root.iter('{}imageTiePoint'):
        child_list = tiepoint.getchildren()
        line      = float(child_list[0][0].text)
        pixel     = float(child_list[0][1].text)
        latitude  = float(child_list[1][0].text)
        longitude = float(child_list[1][1].text)
        #check if upper left
        if (line == 0.0) and (pixel == 0.0):
            sar_upperleft_x = longitude
            sar_upperleft_y = latitude
        #check if upper right
        if (line == 0.0) and (pixel == numberOfSamplesPerLine):
            sar_upperright_x = longitude
            sar_upperright_y = latitude
        #check if lower left    
        if (line == numberOfLines) and (pixel == 0.0):
            sar_lowerleft_x = longitude
            sar_lowerleft_y = latitude
        #check if lower right
        if (line == numberOfLines) and (pixel == numberOfSamplesPerLine):
            sar_lowerright_x = longitude
            sar_lowerright_y = latitude

For these I actually do access the values through its position within the child_list since within the imageTiePoint element they appear to be the same for all Radarsat products. It would be safer of course to check the name of the element the same way as I did for the rasterAttribute element.

I can with these values create a WKT representation of a polygon, which is then the bounding box of the satellite image:
    wkt_sarimage = "POLYGON((" + str(sar_upperleft_x) + " " +  str(sar_upperleft_y) + "," \
                  + str(sar_upperleft_x) + " " + str(sar_lowerright_y) + \
                  "," + str(sar_lowerright_x) + " " + str(sar_lowerright_y) + "," + str(sar_lowerright_x) \
                  + " " + str(sar_upperleft_y) + "," + str(sar_upperleft_x) + " " +  str(sar_upperleft_y) + "))"

1 comment:

  1. awesome post presented by you..your writing style is fabulous and keep update with your blogs.

    python online training india