Parsing Proc Tabulate table using Python

It has been a while since my last post about SASpy so I thought maybe a good idea to try something that has been on my mind for a while. I must say in advance this is not a perfect solution but is a step forward in (I guess) the right direction.

Some of the challenges I faced were the lack of python modules in SAS Studio. While I get that BeautifulSoup is a nice-to-have, the lxml package should be mandatory! Nonetheless there are good souls out there sharing scripts and I managed to borrow one for this example.

For this post, I am going to submit the below proc tabulate and convert the output table to a python dictionary.

As you know the submit statement returns a dictionary with 2 keys: [LOG,LST]. We are interested in parsing LST as it contains the output of our tabulate.

Firstly, we create a session:

import saspy
sas = saspy.SASsession()
Using SAS Config named: default
SAS Connection established. Subprocess id is 19451

Secondly, we define our tabulate:

tblt = """
proc tabulate data=sashelp.cars; 
var MSRP;
class TYPE;
table MSRP*TYPE*mean;
run;
"""

and submit it to the SAS server:

sbmt = sas.submit(tblt)
# print(sbmt('LST')) # contains proc tabulate output in html

sbmt('LST') is a html that we need to parse to extract the table produced by proc tabulate. Instead of writing my own parser I checked the www and found that my good friend Josua Schmid (don’t really know him) wrote class that extracts html tables. Link to github repo here

from html.parser import HTMLParser
class HTMLTableParser(HTMLParser):
    """ This class serves as a html table parser. It is able to parse multiple
    tables which you feed in. You can access the result per .tables field.
    """
    def __init__(
        self,
        decode_html_entities=False,
        data_separator=' ',
    ):

        HTMLParser.__init__(self)

        self._parse_html_entities = decode_html_entities
        self._data_separator = data_separator

        self._in_td = False
        self._in_th = False
        self._current_table = []
        self._current_row = []
        self._current_cell = []
        self.tables = []

    def handle_starttag(self, tag, attrs):
        """ We need to remember the opening point for the content of interest.
        The other tags (<table>, <tr>) are only handled at the closing point.
        """
        if tag == 'td':
            self._in_td = True
        if tag == 'th':
            self._in_th = True

    def handle_data(self, data):
        """ This is where we save content to a cell """
        if self._in_td or self._in_th:
            self._current_cell.append(data.strip())

    def handle_charref(self, name):
        """ Handle HTML encoded characters """

        if self._parse_html_entities:
            self.handle_data(self.unescape('&#{};'.format(name)))

    def handle_endtag(self, tag):
        """ Here we exit the tags. If the closing tag is </tr>, we know that we
        can save our currently parsed cells to the current table as a row and
        prepare for a new row. If the closing tag is </table>, we save the
        current table and prepare for a new one.
        """
        if tag == 'td':
            self._in_td = False
        elif tag == 'th':
            self._in_th = False

        if tag in ['td', 'th']:
            final_cell = self._data_separator.join(self._current_cell).strip()
            self._current_row.append(final_cell)
            self._current_cell = []
        elif tag == 'tr':
            self._current_table.append(self._current_row)
            self._current_row = []
        elif tag == 'table':
            self.tables.append(self._current_table)
            self._current_table = []

So what we need to do is feed our parser with the SAS generated html.

p = HTMLTableParser()
p.feed(sbmt['LST'])
print(p.tables)
[[['MSRP'], ['Type'], ['Hybrid', 'SUV', 'Sedan', 'Sports', 'Truck', 'Wagon'], ['Mean', 'Mean', 'Mean', 'Mean', 'Mean', 'Mean'], ['19920.00', '34790.25', '29773.62', '53387.06', '24941.38', '28840.53']]]

and we get a list of lists with every item in our table. We can then manipulate it to create a dictionary with the data:

data = {}
for v in zip(p.tables[0][2],p.tables[0][4]):
    data[v[0]] = float(v[1])
data
{'Hybrid': 19920.0,
 'SUV': 34790.25,
 'Sedan': 29773.62,
 'Sports': 53387.06,
 'Truck': 24941.38,
 'Wagon': 28840.53}
print(data['SUV'])
34790.25

We also can create a more complicated structure:

data2 = {p.tables[0][0][0]:{}}
for v in zip(p.tables[0][2],p.tables[0][4]):
    data2[p.tables[0][0][0]][v[0]] = float(v[1])
data2

{'MSRP': {'Hybrid': 19920.0,
  'SUV': 34790.25,
  'Sedan': 29773.62,
  'Sports': 53387.06,
  'Truck': 24941.38,
  'Wagon': 28840.53}}
print(data2['MSRP']['SUV'])
34790.25

Here is where I stop - I suppose we could write a more complex parser to return a structure that can accommodate any output of proc tabulate but whilst it is a nice exercise I am struggling to see any practical application of it.

Hope this sparks some nice ideas at your end!

Comments

comments powered by Disqus