Arduino – Ethercard – Web Scraping – Example 1

I’ve been working on an Arduino based project to retrieve the local weather conditions from various online resources. This has many applications including automated warning systems.

I’ve now written an Arduino sketch which retrieves specific text strings from the Weather.com page for Melbourne Australia.

The major problem with retrieving the data was that it is only available on public websites, on pages which are not designed to be machine readable, i.e. the pages are just intended to be viewed by normal web browsers and contain lots of formatting and descriptive text as well as the data I wanted to access.

The first hurdle to overcome was that the JCW/Ethernet Library only appeared to retrieve a maximum of 512 bytes of a page, because it only retrieves one TCP/IP packet before the library terminates the connection to the server.

Initially, I thought I would need to modify the packetLoop at the core of the tcpip.cpp module. However, after posting my solution to the Ethercard team at GitHub (note this issue may now be closed), I had a response from vicatcu that the latest version of the library had a new flag (and setter function) called persistTcpConnection.

 

Setting this to true, with ether.persistTcpConnection(true); stops the current TCP connection from being terminated after the first packet has been received, and allows the response callback to be called multiple times, once for each successive packet of data, until the whole web page has been received.

But the problem still remained of how to extract a few small pieces of text from a large web page which is delivered to the Arduino callback in small 512 byte chunks.

The generic name for the process is web scraping, and there is plenty of code on the web to do this on platforms with plenty of resources, but I could not find an existing solution that would run in the small amount of program space and memory on an Arduino.

So I’ve written my own simple web scraping tools, which attempt to do multi-packet scraping for multiple pieces of data (texts). The code has one main limitation, in that it assumes that the order of the items on the page doesn’t change, i.e. it looks for the first item, then the second item, and so on. The code could be modified to allow the order to change, but it would require more pointers to buffers etc. I’m not sure how worthwhile adding that feature would be, because if a web page changes the order of the items, it’s quite likely that the entire format of the page has been updated and the search strings may no longer be valid anyway.

The basic principle is that there is a unique piece of text which always occurs just before the text that needs to be retrieved, and there is always a termination / marker character after the text to be retrieved.
Web pages which have more complex formatting, where these conditions are not met, would require more complex detection algorithms.

I’ll update the blog later with a more detailed description of how it works.

I’ve posted my code below as a work in progress.
Note. Although I have attempted to search across buffer boundaries, there may be some instances where this logic doesn’t work.

// Demo of web scraping
// By Roger Clark.
// 2012/11/08
// http://opensource.org/licenses/mit-license.php
//
// This demo retrieves 3 items (strings) of weather data and prints them to the console.
// More items could be added by extending the state machine code in the my_callback function
// State 999 is used once all data has been found, to stop further packets being received (and the callback being called unnecessarily)
// Notes.
// * This code was written as a demonstration and is not heavily optimised. Code size could be reduced.
// * The example assumes that the order of the multiple items of data on the web page will always appear in the same order.
//   If this is not the case, a more complex search system would need to be written.
// * Normal issues applying to web page "scraping" ( http://en.wikipedia.org/wiki/Web_scraping ) apply to this code.
// * The code does not do extensive error checking. For robust applications more error checking should be applied.
 
 
#include "EtherCard.h"
//#define DEBUG 1
 
// ethernet interface mac address, must be unique on the LAN
static byte mymac[] = { 0x74,0x69,0x69,0x2D,0x30,0x31 };
const char website[] PROGMEM = "www.weather.com"; // const is required by newer avr-gcc versions for PROGMEM data
byte Ethernet::buffer[700];
static uint32_t timer;
 
// Patterns and pattern buffer pointer
char *searchPattern[] = { "<meta property=\"og:description\" content=\"",
                          "<span class=\"wx-temp\">",
                          "<span class=\"wx-temp\" itemprop=\"visibility-mile\">"};
char *searchPatternProgressPtr;
 
// Output buffer and pointer to the buffer
char displayBuff[64];
char *outputBufferPtr;
 
int foundPatternState;// Initialised in loop()
 
 
// Utility functions
void removeSubstring(char *s,const char *toremove)
{
  while( s=strstr(s,toremove) )
  {
    memmove(s,s+strlen(toremove),1+strlen(s+strlen(toremove)));
  }
}
 
 
// Function to find a string in a buffer, and return a pointer to the first char after the end of the match
// Returns NULL if the pattern is not completed within this buffer
// This function is designed to find the search string across multiple buffers, and uses the patternInProgress pointer to store partial pattern matches
// patternInProgress is passed by reference, so that a partial match is kept between subsequent calls to the function.
char *multiBufferFindPattern(char *buffer,char *searchString,char *&patternInProgress)
{
    while (*buffer && *patternInProgress)
    {
        if (*buffer == *patternInProgress)
        {
            patternInProgress++;
        }
        else
        {
            patternInProgress=searchString;// reset to start of the pattern
            if (*buffer == *patternInProgress)
            {
                patternInProgress++;// the mismatched char may itself be the start of a new match
            }
        }
        buffer++;
    }
    if (!*patternInProgress)
    {
        return buffer;
    }
    return NULL;
}
 
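// Copies characters from inputBuffer to the output buffer until endMarker is found
// Returns 1 when the end marker has been found (and the output string terminated), or 0 if more data is needed
// outputBuffPtr is passed by reference, so that copying continues where it left off when the next packet arrives
// The output buffer must be pre-filled with non-zero bytes and end with a 0, which marks the end of the buffer
int getData(char *inputBuffer, char *&outputBuffPtr, char endMarker)
{
    while(*inputBuffer && *inputBuffer!=endMarker && *outputBuffPtr)
    {
        *outputBuffPtr=*inputBuffer;
        outputBuffPtr++;
        inputBuffer++;
    }
    if (*inputBuffer==endMarker && *outputBuffPtr!=0)
    {
        *outputBuffPtr=0;
        // end character found
        return 1;
    }
    else
    {
      return 0;
    }
}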
 
 
// Called for each packet of returned data from the call to browseUrl (as persistent mode is set just before the call to browseUrl)
static void browseUrlCallback (byte status, word off, word len)
{
   char *pos;// used for buffer searching
   pos=(char *)(Ethernet::buffer+off);
   Ethernet::buffer[off+len] = 0;// set the byte after the end of the buffer to zero to act as an end marker (also handy for displaying the buffer as a string)
 
   //Serial.println(pos);
   if (foundPatternState==0)
   {
     // initialise pattern search pointers
     searchPatternProgressPtr=searchPattern[0];
     foundPatternState=1;
   }
  
   if (foundPatternState==1)
   {
       pos = multiBufferFindPattern(pos,searchPattern[0],searchPatternProgressPtr);
       if (pos)
       {
         foundPatternState=2;
         outputBufferPtr=displayBuff;
         memset(displayBuff,'0',sizeof(displayBuff));// fill with non-zero placeholder characters ('0'); getData() stops at the final 0 byte
         displayBuff[sizeof(displayBuff)-1]=0;//end of buffer marker
       } 
       else
       {
         return;// Need to wait for next buffer, so just return to save processing the other if states
       }
   }    
  
   if (foundPatternState==2)
   {
     if (getData(pos,outputBufferPtr,'"'))
     {
          //Serial.print("Weather is ");
          removeSubstring(displayBuff,"&deg;");// Use utility function to remove unwanted characters
          Serial.println(displayBuff); 
          foundPatternState=3; 
     }
     else
     {
       // end marker is not found, stay in same foundPatternState and when the callback is called with the next packet of data, outputBufferPtr will continue where it left off
     }
   }
  if (foundPatternState==3)
   {
        searchPatternProgressPtr=searchPattern[1];
        foundPatternState=4;
   }
   if (foundPatternState==4)
   {
       pos = multiBufferFindPattern(pos,searchPattern[1],searchPatternProgressPtr);
       if (pos)
       {
         foundPatternState=5;
         outputBufferPtr=displayBuff; // Reset outputBufferPtr ready to receive new data
         memset(displayBuff,'0',sizeof(displayBuff));// fill with non-zero placeholder characters ('0'); getData() stops at the final 0 byte
         displayBuff[sizeof(displayBuff)-1]=0;//end of buffer marker
       } 
       else
       {
         return;// Need to wait for next buffer, so just return to save processing the other if states
       }
   }    
  
   if (foundPatternState==5)
   {
     if (getData(pos,outputBufferPtr,'<'))
     {
          Serial.print("Wind speed ");Serial.print(displayBuff);Serial.println(" mph");
          foundPatternState=6;  //Move to next state (start searching for the third item)
     }
     else
     {
       // end marker is not found, stay in same foundPatternState and when the callback is called with the next packet of data, outputBufferPtr will continue where it left off
     }
   }     
  
   if (foundPatternState==6)
   {
        searchPatternProgressPtr=searchPattern[2];
        foundPatternState=7;
   }
   if (foundPatternState==7)
   {
       pos = multiBufferFindPattern(pos,searchPattern[2],searchPatternProgressPtr);
       if (pos)
       {
         foundPatternState=8;
         outputBufferPtr=displayBuff; // Reset outputBufferPtr ready to receive new data
         memset(displayBuff,'0',sizeof(displayBuff));// fill with non-zero placeholder characters ('0'); getData() stops at the final 0 byte
         displayBuff[sizeof(displayBuff)-1]=0;//end of buffer marker
       } 
       else
       {
         return;
       }
   }    
  
   if (foundPatternState==8)
   {
              //Serial.println("Found HR start");
     if (getData(pos,outputBufferPtr,'<'))
     {
          Serial.print("Visibility ");Serial.println(displayBuff);
          foundPatternState=999;  //All items found, move to the final state
     }
     else
     {
       // end marker is not found, stay in same foundPatternState and when the callback is called with the next packet of data, outputBufferPtr will continue where it left off
     }
   } 
 
 
   if (foundPatternState==999)
   {
     // Found everything on this page. Disable persistence to stop any more callbacks.
     ether.persistTcpConnection(false);
   }
 }
 
void setup ()
{
  Serial.begin(115200);
  Serial.println("\n[Web scraper example]");
 
  if (ether.begin(sizeof Ethernet::buffer, mymac) == 0)
  {
    Serial.println( "Error:Ethercard.begin");
    while(true);
  }
 
  if (!ether.dhcpSetup())
  {
    Serial.println("DHCP failed");
    while(true);
  }
 
  ether.printIp("IP:  ", ether.myip);
  ether.printIp("GW:  ", ether.gwip); 
  ether.printIp("DNS: ", ether.dnsip);
 
 
  // Wait for the link to come up - this speeds up the dnsLookup in the current version of the Ethercard library
  while (!ether.isLinkUp())
  {
      ether.packetLoop(ether.packetReceive());
  }
  if (!ether.dnsLookup(website,false))
  {
    Serial.println("DNS failed. Unable to continue.");
    while (true);
  }
  ether.printIp("SRV: ", ether.hisip);
}
 
void loop ()
{
  ether.packetLoop(ether.packetReceive());
 
  if (millis() > timer)
  {
    timer = millis() + 60000;// every 60 secs
 
    Serial.println("\nSending request for page /weather/right-now/ASXX0075:1");
    foundPatternState=0;// Reset state machine
    ether.persistTcpConnection(true);// enable persist, so that the callback is called for each received packet.
    ether.browseUrl(PSTR("/weather/right-now/ASXX0075:1"), "", website, browseUrlCallback);
  }
}

Zipped source code

10 thoughts on “Arduino – Ethercard – Web Scraping – Example 1”

  1. Another approach to consider would be creating a Yahoo Pipe that returns just the data you want.

  2. admin on

    Thanks for the comment.

    I wanted to create a solution that didn’t require any other servers.

  3. Oscar on

    First of all, thank you for putting this up!
    This will prove very useful!

    Second, I can’t get this to compile.

    I get error from the compiler:
    java.lang.StackOverflowError
    at java.util.Vector.addElement(Unknown Source)
    at java.util.Stack.push(Unknown Source)
    at com.oroinc.text.regex.Perl5Matcher._pushState(Perl5Matcher.java)

    This seems to be caused by some pre processing of the code
    http://arduino.cc/en/Guide/Troubleshooting#toc4
    under: “Why do I get a java.lang.StackOverflowError when I try to compile my program?”

    “this is what’s happening. Look for unusual sequences involving “double-quotes”, “single-quotes”, \backslashes, comments, etc. For example, missing quotes can cause problems and so can the sequence '\"' (use '"' instead).”

    And I’ve narrowed it down to be one line in your function

    ->static void browseUrlCallback (byte status, word off, word len)

    if (getData(pos,outputBufferPtr,'\"')) <---- some weirdness here with '\"'

    Do you know a way around this?

  4. Roger Clark on

    It looks like the code I pasted into the blog, got mangled by WordPress and it stripped out some of the quotes and also some html which is part of the search strings, but got interpreted in the blog as formatting information.

    I’ve tried to re-paste the code into the blog, so hopefully this will fix it.

    I’ll see if I can attach the actual .INO file, as that is guaranteed to work.

    PS.
    Those Java errors are often caused by non matching quotes, and seem to completely prevent the code from compiling.
    I had this issue some time ago, and did a bit of research, and the problems with quotes etc, cause the Arduino preprocessor to fail.

  5. Roger Clark on

    Hi Kevin,

    I was just using these weather sites as an example. The actual data I wanted was available as a CSV or XML, but I posted a general example rather than the one for the dataset I was personally interested in.

    Cheers

    Roger

  6. Christophe on

    Get the same problem, stack overflow. Could you recheck the code above ?

    Exception in thread “Thread-5” java.lang.StackOverflowError
    at java.util.regex.Pattern$CharProperty$1.isSatisfiedBy(Pattern.java:3337)

  7. Roger Clark on

    Hi Christophe,

    No worries. I thought this was caused by WordPress messing up the HTML and quotes in the code and that I’d fixed it and attached the file to the posting, but I’ll take another look.

    Cheers

    Roger

  8. Hi Christophe,

    I’ve tested both the onscreen version in the posting and the downloadable ZIP and they both compiled OK for me.

    I’m using Arduino 1.01 on Mac. I’ll double check if there is an issue on the PC.

    BTW. Have you tried downloading the zipped version of the INO file ??

  9. Update.

    I’ve tracked down the problem, where my example would not compile.

    It compiled fine on the Mac, but would not compile on the PC.

    The problem was that I had escaped the quote character like this '\"' and the Mac pre-compiler was happy with this, however the PC pre-compiler had an issue with the double quote on its own.

    So I’ve removed the backslash and the code now reads '"' and seems to work fine on both PC and Mac.