Tag Archives: Data Mining and Scraping

Dr. Myspace – Create RSS Feeds From Myspace Artist Tour Dates

If you haven’t read my post on using PHP and Simple HTML DOM Parser to scrape artist tour dates off Myspace, check that out first, since this post starts where that one ended. I’ve received countless emails, and my Google Analytics shows that a large percentage of visitors reach my site by searching for Myspace RSS feeds, or Myspace artist tour dates in RSS format to be more precise. I’ve had this idea for quite some time but never built it. Tonight I spent about 4 hours putting together what I’d like to introduce as the first version of Dr. Myspace, built specifically to let you create RSS feeds from Myspace artist tour dates. It’s extremely simple in its architecture but extremely useful at the same time.

Introducing Dr. Myspace

[Screenshot: the Dr. Myspace wizard with artists added to the RSS feed]
[Screenshot: sample of the RSS feed created by Dr. Myspace]

Create an RSS Feed from a Myspace Artist’s Tour Dates

  1. Go to drmyspace.com
  2. Add your artists by artist name and artist ID
  3. Click “Create your RSS Feed” and copy the URL on the resulting page.
  4. You’re done!

Here is a sample Myspace RSS feed with The Killers and Kill the Noise

Be sure to check back regularly. I will be updating drmyspace.com constantly to keep up with feedback and demand. I will also be implementing several new features and releasing the full source code.

You’re welcome,
Bryan

Using PHP and Simple HTML DOM Parser to scrape artist tour dates off MySpace

UPDATED 2/11/2009 – Check out the new post on creating RSS feeds from Myspace artist tour dates and the launch of Dr. Myspace

Have you ever visited a web page and seen information that you’d like in another format or through a different medium, but you just don’t have that option? MySpace currently has a lot of problems, but one of the issues I see is that their information seems to be locked up. They’re not even sure how to utilize it properly. Let’s take the MySpace artist tour pages for example.

I’m fascinated with the idea of taking what I call “2D information” and forcing it to be “3D information.” Programmers would probably argue that all information is 3D and so on and so forth, but that literal reading isn’t what I have in mind. I’m speaking more of taking something that sucks and making it not suck: taking something that isn’t being utilized properly and making it valuable, and thus monetizable. Usually this results in some form of scraping. You can call it data mining; it’s apples to apples as far as I’m concerned.

Enough of that. It’s also true for me that most ideas are spun out of frustration with something that isn’t performing to my liking in its current state. It can be anything from a cup to a windshield wiper; hence, “necessity is the mother of invention.” That brings me straight to the point of this post.

Artist tour dates on MySpace

I’ll spare you the MySpace introduction. Here’s what an artist’s tour page looks like on MySpace:

Taken from Kill The Noise’s MySpace

Notice the “view all” button on the top right. Yeah, good stuff. That basically means we don’t even have to worry about looping and pagination. Thanks, MySpace. Like candy from a baby. Candy from a baby.

Here’s what the “view all” link page looks like:

It’s obviously the same dates, but I imagine that if the artist had more, say 25, the list would be paginated on the profile page but not on this page. This means we’ll have a permanent, reliable source to scrape for all of an artist’s tour dates.
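The “view all” page is addressed by the artist’s friend ID, using the URL pattern that appears in the scraper code later in this article. As a rough sketch, a URL for any artist could be built like this. Note that build_shows_url() is a hypothetical helper name of my own, and the query parameters are simply copied from that one known URL, so treat the pattern as an assumption rather than a documented MySpace API.

```php
<?php
/**
 * Build the "view all" shows URL for a MySpace friend ID.
 * build_shows_url() is a hypothetical helper, not part of any library.
 * The query parameters are copied from the single URL used in this article.
 */
function build_shows_url($friend_id)
{
    // Friend IDs are incremental integers, so reject anything else.
    if (!preg_match('/^[1-9][0-9]*$/', (string) $friend_id)) {
        throw new InvalidArgumentException('Friend ID must be a positive integer.');
    }

    $query = http_build_query(array(
        'fuseaction' => 'bandprofile.listAllShows',
        'friendid'   => $friend_id,
        'n'          => 'band',
    ), '', '&');

    return 'http://collect.myspace.com/index.cfm?' . $query;
}

echo build_shows_url(3087303), "\n";
```

Validating the ID up front keeps junk input out of the URL before any request is made.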

MySpace artist tour dates HTML key

As I was looking for ways to iterate over the HTML in a consistent fashion, I noticed something interesting: I found the key. I refer to it as the key because the more scraping you do, the more you will look for a unified pattern that wraps the data, one you can perform actions on and iterate over. You’re probably wondering, “Bryan, what did you say to yourself? What did you think about this key?” I just screamed to myself “that’s what I’m screamin’” and that is all.

The point of that long-ass, over-detailed previous paragraph was that MySpace puts all of the information on the page twice: once in an HTML table and again in hidden HTML input elements.

Check it out:

lol.

It’s Business Time

Our key is the hidden HTML input fields. These are the easiest to manipulate because of their form in HTML. We could even use a regular expression over this page and get results quickly.
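For the curious, here is a minimal sketch of the regular-expression route. Only the calEvt* field names come from the scraper in this article. The sample markup, its attribute order, its quoting, and every value in it are made up for illustration, and regex_scrape_shows() is a hypothetical name. A pattern this strict will break as soon as the attribute order changes, which is a good reason to prefer a real parser.

```php
<?php
// Hypothetical sample markup. Only the calEvt* names come from the article;
// the attribute order, quoting style and values are assumptions.
$sample_html = <<<'HTML'
<input type="hidden" name="calEvtLocation" value="Example Venue" />
<input type="hidden" name="calEvtTitle" value="Example Show" />
<input type="hidden" name="calEvtCity" value="Springfield" />
<input type="hidden" name="calEvtState" value="IL" />
<input type="hidden" name="calEvtZip" value="62701" />
<input type="hidden" name="calEvtDateTime" value="2009-03-01 20:00" />
<input type="hidden" name="calEvtLocation" value="Another Venue" />
<input type="hidden" name="calEvtTitle" value="Another Show" />
<input type="hidden" name="calEvtCity" value="Albany" />
<input type="hidden" name="calEvtState" value="NY" />
<input type="hidden" name="calEvtZip" value="12207" />
<input type="hidden" name="calEvtDateTime" value="2009-03-14 21:00" />
HTML;

function regex_scrape_shows($html)
{
    // Map each hidden field name to the array key used for a show.
    $fields = array(
        'calEvtLocation' => 'Location',
        'calEvtTitle'    => 'Title',
        'calEvtCity'     => 'City',
        'calEvtState'    => 'State',
        'calEvtZip'      => 'Zip',
        'calEvtDateTime' => 'Date',
    );

    // Assumes type, name and value appear in exactly this order.
    $pattern = '/<input\s+type="hidden"\s+name="(calEvt\w+)"\s+value="([^"]*)"/i';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $shows = array();
    $i = 0;
    foreach ($matches as $match) {
        list(, $name, $value) = $match;
        if (!isset($fields[$name])) {
            continue; // some other calEvt* field we don't care about
        }
        $shows[$i][$fields[$name]] = html_entity_decode($value, ENT_QUOTES);

        // The date is the last field of each show, so move to the next show.
        if ($name === 'calEvtDateTime') {
            $i++;
        }
    }

    return $shows;
}

$shows = regex_scrape_shows($sample_html);
print_r($shows);
```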

Simple HTML DOM Parser
It’s important to have the right tools at your disposal when you’re data mining or scraping. Simple HTML DOM is a library written in PHP that gives you dictator-like control over HTML. It’s the best PHP HTML DOM parser class I’ve ever used. It’s also a requirement for the code in this article to work, considering I’m using it. You can download Simple HTML DOM here and give the short but sweet manual a once-over before you continue (it’s brief, trust me). If you’re impatient, just download it (SourceForge) and let’s get going. You can also skip to the bottom of this article to download the source with or without Simple HTML DOM Parser included.

/**
 * Include Simple HTML DOM
 */
include('simplehtmldom/simple_html_dom.php');

/**
 * This is the URL to the "view all" link on myspace artist pages.
 * Myspace uses an incremental friendid in their system, so it starts at 1
 * and moves to.. wherever the newest member is.
 */
$myspace_url = "http://collect.myspace.com/index.cfm?fuseaction=bandprofile.listAllShows&friendid=3087303&n=band";

/**
 * Set the user_agent just in case myspace checks.
 */
ini_set('user_agent', 'Scrape/2.5');
$html = file_get_html($myspace_url);
if (!$html) {
    // Bail out early if the page could not be fetched or parsed.
    die('Could not load ' . $myspace_url);
}

/*
Create a variable to keep track of iterations for our array of shows.
Since we only want hidden elements, we apply the filter on our find.
Since myspace assigns a name attribute to each hidden element, we just check
to see which element we're on, and either add it to our array, or move on.
*/
$shows = array();
$i = 0;
foreach ($html->find('input[type="hidden"]') as $v) {
    /** You remember the key, don't you? ;)  */
    if ($v->name == 'calEvtLocation') {
        $shows[$i]['Location'] = $v->value;
    }

    if ($v->name == 'calEvtTitle') {
        $shows[$i]['Title'] = $v->value;
    }

    if ($v->name == 'calEvtCity') {
        $shows[$i]['City'] = $v->value;
    }

    if ($v->name == 'calEvtState') {
        $shows[$i]['State'] = $v->value;
    }

    if ($v->name == 'calEvtZip') {
        $shows[$i]['Zip'] = $v->value;
    }

    /**
     * This is the last element, so grab it, and lets increment our placeholder.
     */
    if ($v->name == 'calEvtDateTime') {
        $shows[$i]['Date'] = $v->value;
        $i++;
    }
}

print_r($shows);

When we run this code:

print_r($shows);

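It gives us an array with one entry per show, each holding the Location, Title, City, State, Zip, and Date keys.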

Now it’s easy for us to manipulate in its new, local, “3D information” style. Unfortunately, I’m leaving you to your imagination here. I’m sure you can come up with something.
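To get the imagination going, here is a small sketch of two things you might do with the array: keep only the shows in one state, and sort shows by date. The sample data is invented, the function names are hypothetical, and I am assuming the Date value is in a format strtotime() understands, which I have not verified against the real MySpace output.

```php
<?php
date_default_timezone_set('UTC');

// Invented sample data, shaped like the $shows array the scraper builds.
$shows = array(
    array('Location' => 'Example Venue', 'Title' => 'Example Show',
          'City' => 'Springfield', 'State' => 'IL', 'Zip' => '62701',
          'Date' => '2009-03-14 21:00'),
    array('Location' => 'Another Venue', 'Title' => 'Another Show',
          'City' => 'Albany', 'State' => 'NY', 'Zip' => '12207',
          'Date' => '2009-03-01 20:00'),
    array('Location' => 'Third Venue', 'Title' => 'Third Show',
          'City' => 'Chicago', 'State' => 'IL', 'Zip' => '60601',
          'Date' => '2009-04-02 19:30'),
);

// Keep only the shows in the given state (case-insensitive match).
function shows_in_state(array $shows, $state)
{
    return array_values(array_filter($shows, function ($show) use ($state) {
        return strcasecmp($show['State'], $state) === 0;
    }));
}

// Return the shows ordered from earliest to latest.
// Assumes every Date value can be parsed by strtotime().
function sort_shows_by_date(array $shows)
{
    usort($shows, function ($a, $b) {
        return strtotime($a['Date']) - strtotime($b['Date']);
    });
    return $shows;
}

$illinois = shows_in_state($shows, 'il');
$sorted   = sort_shows_by_date($shows);

foreach ($sorted as $show) {
    echo $show['Date'], ' ', $show['Title'], ' (', $show['City'], ', ', $show['State'], ")\n";
}
```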

Ideas

  • Database that allows you to view shows for multiple artists. Multiple bands, etc.
  • Dynamic RSS feed builder by Date – Band – Venue – Check out Dr. Myspace
  • Google Mapping feature

Demo and Download

See it in action

Download
MySpace artist tour page scraper with Simple HTML DOM
MySpace artist tour page scraper without Simple HTML DOM

I will be writing a few follow-up articles that build on this one and cover bigger, more involved data mining in the very near future. Stay tuned.

The regulator,
bryan