Arduino – Ethercard – Web Scraping – Example 2

As a follow-up to my previous post on web scraping with the Ethercard library for Arduino, I’ve written an example showing how to retrieve data from a page that doesn’t have a unique marker pattern (text) immediately before the text that needs to be retrieved.

This example retrieves the current temperature and wind direction for London Heathrow airport from this web page: http://www.metoffice.gov.uk/weather/uk/observations/

This web page contains data for a large number of weather stations. Each weather station report is just a row in a large table.

If you navigate to the page, view the source, and then search for Heathrow, you can see the <tr> tag for the whole Heathrow row. The data changes all the time as the weather changes, but the general structure will look like this:

<tr>
	<td><a href="/weather/uk/se/heathrow_latest_weather.html">Heathrow</a></td>
	<td align="center"><img src="/lib/images/symbols/w5.gif" alt="Mist" title="Mist" width="26" height="24"></td>
	<td class="value">2.6</td>
	<td class="unit">&deg;C</td>
	<td align="center">N</td>
	<td class="value">3</td>
	<td class="unit">mph</td>
	<td class="value"></td>
	<td class="unit"></td>
	<td class="value">6</td>
	<td class="unit">km</td>
	<td align="center">1016&nbsp;hPa, Falling</td>
</tr>

In the HTML code above, the temperature is 2.6 (degrees). The text immediately before it is <td class="value">, but this is not a unique pattern, as it’s also used by other cells in the same row (and by the other rows in the same table).

So before searching for this non-unique pattern, we first need to find the row in the table for Heathrow. This moves the search (pointer) position to the character after the text Heathrow. Note that in the code I’ve just commented out the part that extracts some text just after the search position. The example is a modified version of my last example, and it makes things clearer if the “found state” for Heathrow stays in the code.

Now we can go ahead and find the text just before the temperature value, i.e. <td class="value">.

Once the multiBufferFindPattern function has found this pattern, we can retrieve the text, as we know the character that follows the data is the less-than sign <.

The example then continues to get the data from the next column which is the wind direction.
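
The search order described above can be tried out on a desktop compiler before moving to the Arduino. The following is only a minimal sketch, assuming the whole page is available as a single string (on the Arduino it arrives packet by packet). The names extractAfter, Reading and scrapeHeathrow are mine for illustration and are not part of the Ethercard library or of the sketch further down, and the Gatwick row in the sample is made up.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper (illustration only, not part of Ethercard):
// finds `pattern` at or after *pos, then copies the text that follows it,
// up to but not including endMarker, into `out`.
// On success *pos is left pointing at the end marker.
static bool extractAfter(const char **pos, const char *pattern,
                         char endMarker, std::string &out)
{
    assert(pos && *pos && pattern);
    const char *match = std::strstr(*pos, pattern);
    if (!match) return false;                  // pattern not present
    const char *start = match + std::strlen(pattern);
    const char *end = std::strchr(start, endMarker);
    if (!end) return false;                    // data is not terminated
    out.assign(start, end);
    *pos = end;
    return true;
}

// Shortened copy of the Heathrow row, preceded by a made-up row for another
// station so that "value" really is a non-unique pattern in the sample.
static const char samplePage[] =
    "<tr><td><a href=\"/x.html\">Gatwick</a></td>"
    "<td class=\"value\">3.1</td><td class=\"unit\">&deg;C</td>"
    "<td align=\"center\">NE</td></tr>"
    "<tr><td><a href=\"/weather/uk/se/heathrow_latest_weather.html\">Heathrow</a></td>"
    "<td align=\"center\"><img src=\"/lib/images/symbols/w5.gif\" alt=\"Mist\"></td>"
    "<td class=\"value\">2.6</td><td class=\"unit\">&deg;C</td>"
    "<td align=\"center\">N</td></tr>";

struct Reading {
    std::string temperature;
    std::string windDirection;
    bool ok;
};

static Reading scrapeHeathrow(const char *page)
{
    Reading r;
    r.ok = false;

    // Step 1: find the row. Nothing is extracted here, the search position
    // is simply moved to the character after "Heathrow".
    const char *pos = std::strstr(page, "Heathrow");
    if (!pos) return r;
    pos += std::strlen("Heathrow");

    // Step 2: the next "value" cell after that position is the temperature.
    if (!extractAfter(&pos, "\"value\">", '<', r.temperature)) return r;

    // Step 3: carry on from there to the next centred cell, the wind direction.
    if (!extractAfter(&pos, "<td align=\"center\">", '<', r.windDirection)) return r;

    r.ok = true;
    return r;
}
```

Calling scrapeHeathrow(samplePage) gives a temperature of 2.6 and a wind direction of N. The Gatwick values are ignored because the search for "value" only starts once the Heathrow row has been found.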

For more complex web pages, you may need more states (or a more complex type of state machine with substates inside each state), and you may also need to skip over multiple columns. For example, to get the wind direction but not the temperature, you’d still need to skip over the temperature column, by doing the same thing that I’ve done when searching for “Heathrow”.
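
One way to keep a growing number of states manageable is to drive the state machine from a table of steps. The following is a rough desktop sketch of that idea, not the approach used in the sketch below. The names Step, StreamScraper and sampleChunks are hypothetical, and the chunks stand in for the packets delivered to the Ethercard callback. Each step names a pattern to find and says whether the text after it should be kept, so skipping a column is just a step with capture set to false.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

struct Step {
    const char *pattern;  // text to search for (must not be empty)
    bool capture;         // true: keep the text that follows, up to '<'
};

class StreamScraper {
public:
    StreamScraper(const Step *steps, std::size_t count)
        : steps_(steps), count_(count), step_(0), matched_(0), capturing_(false)
    {
        assert(steps != 0 && count > 0);
    }

    // Feed the next chunk of the page. Returns true once every step is done.
    bool feed(const char *chunk)
    {
        for (const char *p = chunk; *p && step_ < count_; ++p) {
            if (capturing_) {
                if (*p != '<') { results_.back() += *p; continue; }
                capturing_ = false;            // end marker reached
                if (++step_ >= count_) break;  // all steps finished
                // fall through: this '<' may begin the next pattern
            }
            const char *pat = steps_[step_].pattern;
            matched_ = advance(pat, matched_, *p);
            if (pat[matched_] == '\0') {       // whole pattern seen
                matched_ = 0;
                if (steps_[step_].capture) {
                    capturing_ = true;
                    results_.push_back(std::string());
                } else {
                    ++step_;                   // nothing to keep, move on
                }
            }
        }
        return step_ >= count_;
    }

    const std::vector<std::string> &results() const { return results_; }

private:
    // Number of pattern characters matched after consuming c, given that
    // `matched` characters were matched before it. On a mismatch this falls
    // back to the longest shorter prefix that still fits.
    static std::size_t advance(const char *pat, std::size_t matched, char c)
    {
        for (std::size_t k = matched + 1; k > 0; --k) {
            if (pat[k - 1] == c &&
                std::strncmp(pat, pat + (matched - (k - 1)), k - 1) == 0)
                return k;
        }
        return 0;
    }

    const Step *steps_;
    std::size_t count_;
    std::size_t step_;     // index of the current step (the "state")
    std::size_t matched_;  // partial match carried over between chunks
    bool capturing_;
    std::vector<std::string> results_;
};

// Wind direction only: the temperature column is found but not kept.
static const Step windOnly[] = {
    { "Heathrow", false },
    { "\"value\">", false },
    { "<td align=\"center\">", true }
};

// The Heathrow row cut into chunks in awkward places, the way packet
// boundaries can fall in the middle of a pattern.
static const char *const sampleChunks[] = {
    "<td>Heathrow</a></td><td align=\"center\"><img alt=\"Mist\"></td><td class=\"val",
    "ue\">2.6</td><td class=\"unit\">&deg;C</td><td align=\"cen",
    "ter\">N</t",
    "d><td class=\"value\">3</td>"
};
```

Feeding the sample chunks in order to a StreamScraper built from windOnly returns false for the first two chunks and true on the third, with N as the only captured result. Once it has returned true, further chunks are ignored, which plays the same role as state 999 in the sketch below.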

// Demo of web scraping 2
// By Roger Clark. 
// www.rogerclark.net
// 2012/11/24
// http://opensource.org/licenses/mit-license.php
//
// This demo retrieves 2 items (strings) of weather data and prints them to the console.
// More items could be added by extending the state machine code in the my_callback function
// State 999 is used when all data has been found to terminate further packets being received (and the callback being called unnecessarily)
// Notes.
// * This code was written as a demonstration and is not heavily optimised. Code size could be reduced.
// * The example assumes that the multiple items of data on the web page will always appear in the same order. 
//   If this is not the case, a more complex search system would need to be written.
// * Normal issues applying to web page "scraping" ( http://en.wikipedia.org/wiki/Web_scraping ) apply to this code.
// * The code does not do extensive error checking. For robust applications more error checking should be applied.

#include "EtherCard.h"
//#define DEBUG 1

// ethernet interface mac address, must be unique on the LAN
static byte mymac[] = { 0x74,0x69,0x69,0x2D,0x30,0x31 };
const char website[] PROGMEM = "www.metoffice.gov.uk";// must be const to be placed in PROGMEM with current compilers
byte Ethernet::buffer[700];
static uint32_t timer;

// Patterns and pattern buffer pointer
char *searchPattern[] = { "Heathrow",
                          "\"value\">",
                          "<td align=\"center\">"};
char *searchPatternProgressPtr;

// Output buffer and pointer to the buffer
char displayBuff[96];
char *outputBufferPtr;

int foundPatternState;// Initialised in loop()

// Utility functions
void removeSubstring(char *s,const char *toremove)
{
  while( s=strstr(s,toremove) )
  {
    memmove(s,s+strlen(toremove),1+strlen(s+strlen(toremove)));
  }
}

// Function to find a string in a buffer, and return a pointer to the first char after the end of the match
// Returns NULL if the pattern is not found
// This function is designed to find the search string across multiple buffers, and uses the patternInProgress pointer to store partial pattern matches
// patternInProgress is passed by reference so that a partial match is still there on the next call.
// The calling code needs to maintain patternInProgress between subsequent calls to the function.
char *multiBufferFindPattern(char *buffer,char *searchString,char *&patternInProgress)
{
    while (*buffer && *patternInProgress)
    {
        if (*buffer == *patternInProgress)
        {
            patternInProgress++;
        }
        else
        {
            patternInProgress=searchString;// reset to start of the pattern
            if (*buffer == *patternInProgress)
            {
                patternInProgress++;// the mismatched character may itself be the start of a new match
            }
        }
        buffer++;
    }
    if (!*patternInProgress)
    {
        return buffer;
    }
    return NULL;
}

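// Copies characters from inputBuffer to the output buffer until endMarker is found.
// Returns 1 when the end marker has been found (the output is then zero terminated), otherwise 0.
// outputBuffPtr is passed by reference so that, if the data is split across two packets,
// the next call carries on writing where this one stopped.
// The output buffer must be filled with non-zero characters and end with a zero byte, which marks the end of the buffer.
int getData(char *inputBuffer, char *&outputBuffPtr, char endMarker)
{
    while(*inputBuffer && *inputBuffer!=endMarker && *outputBuffPtr)
    {
        *outputBuffPtr=*inputBuffer;
        outputBuffPtr++;
        inputBuffer++;
    }
    if (*inputBuffer==endMarker && *outputBuffPtr!=0)
    {
        *outputBuffPtr=0;
        // end character found
        return 1;
    }
    else
    {
      return 0;
    }
}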

// Called for each packet of returned data from the call to browseUrl (as persistent mode is set just before the call to browseUrl)
static void browseUrlCallback (byte status, word off, word len) 
{
   char *pos;// current position in ethernet packet buffer of the search.
   pos=(char *)(Ethernet::buffer+off);// Set position to start of data in the ethernet buffer
   Ethernet::buffer[off+len] = 0;// set the byte after the end of the buffer to zero to act as an end marker (also handy for displaying the buffer as a string)

   // Before each search the searchPatternProgressPtr pointer needs to be initialised with the pattern that needs to be found.
   if (foundPatternState==0)
   {
     // initialise pattern search pointers 
     searchPatternProgressPtr=searchPattern[0];
     foundPatternState=1;// Carry on to next state
   }

   // Do the search for the pattern which has been setup in the previous state.
   if (foundPatternState==1)
   {
       pos = multiBufferFindPattern(pos,searchPattern[0],searchPatternProgressPtr);
       if (pos)
       {
         foundPatternState=2;// Pattern has been found. Move to next state.
         outputBufferPtr=displayBuff; 
         memset(displayBuff,'0',sizeof(displayBuff));// fill the output buffer with '0' characters (not zero bytes), as getData() stops writing when it reaches a zero byte
         displayBuff[sizeof(displayBuff)-1]=0;//end of buffer marker
       }  
       else
       {
         return;// Need to wait for next buffer, so just return to save processing the other if states
       }
   }     

   if (foundPatternState==2)
   {
     // We don't need to retrieve any text from the first search. The first search was only performed to find the row of the table which contained data for Heathrow airport
     // So carry straight on to state 3, the search for the data within the row.
     foundPatternState=3;

     /*
     if (getData(pos,outputBufferPtr,'\"'))
     {
          Serial.print("Data 1 is "); 
          Serial.println(displayBuff);  
          foundPatternState=3;  
     }
     else
     {
       // end marker is not found, stay in same findPatternState and when the callback is called with the next packet of data, outputBufferPtr will continue where it left off
     }
     */
   }

// Setup for next search
  if (foundPatternState==3)
   {
        searchPatternProgressPtr=searchPattern[1];
        foundPatternState=4;
   }
   if (foundPatternState==4)
   {
       pos = multiBufferFindPattern(pos,searchPattern[1],searchPatternProgressPtr);
       if (pos)
       {
         foundPatternState=5;
         outputBufferPtr=displayBuff; // Reset outputBufferPtr ready to receive new data
         memset(displayBuff,'0',sizeof(displayBuff));// fill the output buffer with '0' characters (not zero bytes), as getData() stops writing when it reaches a zero byte
         displayBuff[sizeof(displayBuff)-1]=0;//end of buffer marker
       }  
       else
       {
         return;// Need to wait for next buffer, so just return to save processing the other if states
       }
   }     

   // Previous search pattern has been found.
   if (foundPatternState==5)
   {
     // retrieve data until end marker character < is found
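     if (getData(pos,outputBufferPtr,'<'))
     {
          Serial.print("Temperature is ");
          Serial.println(displayBuff);
          foundPatternState=6;
     }
     else
     {
       return;// end marker not found yet, wait for the next packet
     }
   }

   // NOTE: the listing is damaged from the getData call above to the end of this callback.
   // States 6 to 999 are a reconstruction that repeats the pattern of states 3 to 5 for the wind direction.
   // Setup for next search
   if (foundPatternState==6)
   {
        searchPatternProgressPtr=searchPattern[2];
        foundPatternState=7;
   }
   if (foundPatternState==7)
   {
       pos = multiBufferFindPattern(pos,searchPattern[2],searchPatternProgressPtr);
       if (pos)
       {
         foundPatternState=8;
         outputBufferPtr=displayBuff; // Reset outputBufferPtr ready to receive new data
         memset(displayBuff,'0',sizeof(displayBuff));// fill with non-zero characters, see getData()
         displayBuff[sizeof(displayBuff)-1]=0;//end of buffer marker
       }
       else
       {
         return;// Need to wait for next buffer
       }
   }

   if (foundPatternState==8)
   {
     // retrieve data until end marker character < is found
     if (getData(pos,outputBufferPtr,'<'))
     {
          Serial.print("Wind direction is ");
          Serial.println(displayBuff);
          foundPatternState=999;
     }
   }

   if (foundPatternState==999)
   {
     // All data has been found. Turning persist off here is an assumption about how further packets are stopped.
     ether.persistTcpConnection(false);
   }
}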

void setup ()
{
  Serial.begin(115200);
  Serial.println("\n[Web scraper example]");

  if (ether.begin(sizeof Ethernet::buffer, mymac) == 0)
  {
    Serial.println( "Error:Ethercard.begin");
    while(true);
  }

  if (!ether.dhcpSetup())
  {
    Serial.println("DHCP failed");
    while(true);
  }

  ether.printIp("IP:  ", ether.myip);
  ether.printIp("GW:  ", ether.gwip); 
  ether.printIp("DNS: ", ether.dnsip);

  // Wait for the link to come up - this speeds up the dnsLookup in the current version of the Ethercard library
  while (!ether.isLinkUp())
  {
      ether.packetLoop(ether.packetReceive());
  }
  if (!ether.dnsLookup(website,false))
  {
    Serial.println("DNS failed. Unable to continue.");
    while (true);
  }
  ether.printIp("SRV: ", ether.hisip);
}

void loop ()
{
  ether.packetLoop(ether.packetReceive());

  if (millis() > timer)
  {
    timer = millis() + 60000;// every 60 secs

    Serial.println("\nSending request for page /weather/uk/observations/");
    foundPatternState=0;// Reset state machine
    ether.persistTcpConnection(true);// enable persist, so that the callback is called for each received packet.
    ether.browseUrl(PSTR("/weather/uk/observations/"), "", website, browseUrlCallback);
  }
}