The XFN microformat and a spider in Python

One of the more widely used microformats is XFN. XFN stands for XHTML Friends Network and is about marking up links to the pages of your contacts and friends. You can also use XFN to mark links to your own pages (e.g. to link all your profiles together).

Doing that is quite easy: you simply add the HTML attribute rel="me" to the link tag, e.g.

<a href="http://flickr.com/photos/mrtopf" rel="me">my flickr pages</a>

If I have this on my profile page, it says that this flickr URL is also one of my pages. That way a spider can collect all the information about me. In my case I prepared a page on my site which serves as a starting point for such a spider, linking to many of my profiles.
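To make this concrete, here is a minimal sketch of how a spider might pull the rel="me" links out of a page. It uses only Python's standard html.parser module; the names RelMeParser and find_me_links are made up for this illustration and are not taken from my scripts.

```python
from html.parser import HTMLParser


class RelMeParser(HTMLParser):
    """Collect href values of <a> tags whose rel attribute contains "me"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # rel is a space-separated list of values, e.g. rel="me nofollow"
        rel_values = (attrs.get("rel") or "").lower().split()
        if "me" in rel_values and attrs.get("href"):
            self.links.append(attrs["href"])


def find_me_links(html):
    """Return all rel="me" link targets found in an HTML string."""
    parser = RelMeParser()
    parser.feed(html)
    return parser.links


page = '<a href="http://flickr.com/photos/mrtopf" rel="me">my flickr pages</a>'
print(find_me_links(page))  # ['http://flickr.com/photos/mrtopf']
```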

But what if somebody else links with rel="me" to one of my pages?

In this case it helps to check for symmetrical relationships. This means that the page I link to should also link the same way back.
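Such a symmetry check could look roughly like the sketch below. The function name is_confirmed, the lookup helper and the sample data (including the impostor.example URL and the assumption that the flickr page links back) are hypothetical; in a real spider the lookup would fetch and parse the pages.

```python
def is_confirmed(my_url, other_url, get_me_links):
    """True if my_url links to other_url with rel="me" and other_url links back."""
    return (other_url in get_me_links(my_url)
            and my_url in get_me_links(other_url))


# Hypothetical data standing in for fetched pages: URL -> rel="me" targets.
me_links = {
    "http://mrtopf.de/connect": {"http://flickr.com/photos/mrtopf"},
    "http://flickr.com/photos/mrtopf": {"http://mrtopf.de/connect"},
    "http://impostor.example/": {"http://mrtopf.de/connect"},
}


def lookup(url):
    """Stand-in for 'fetch the page and extract its rel="me" links'."""
    return me_links.get(url, set())


# Both pages point at each other, so the claim is confirmed.
print(is_confirmed("http://mrtopf.de/connect",
                   "http://flickr.com/photos/mrtopf", lookup))  # True
# The impostor links to my page, but my page does not link back.
print(is_confirmed("http://impostor.example/",
                   "http://mrtopf.de/connect", lookup))  # False
```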

What about different URLs pointing to the same page?

URLs can obviously be written in different ways, even if the difference is just an omitted "www." or a trailing slash. This can be a problem when identifying your profiles. One way to avoid it is to canonicalize the link addresses, as Google's Social Graph API does.
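A very simple canonicalization might look like the following sketch. The rules here (lowercase the host, drop a leading "www.", strip trailing slashes, drop the fragment) are my own assumptions for illustration, not the rules Google actually applies, and dropping "www." is only a heuristic since the two hosts can in principle serve different pages.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url):
    """Normalize a URL so that trivially different spellings compare equal."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Heuristic: treat www.example.org and example.org as the same site.
    if host.startswith("www."):
        host = host[4:]
    # Keep an explicit port if there is one.
    if parts.port:
        host = f"{host}:{parts.port}"
    # Ignore trailing slashes and drop the fragment.
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


print(canonicalize("http://www.flickr.com/photos/mrtopf/"))
print(canonicalize("http://flickr.com/photos/mrtopf"))
# both print http://flickr.com/photos/mrtopf
```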

For a website, the remaining problem is finding out which links belong together, since that always means spidering the web, which is hard if there is no link page like mine to start from. Google is of course in a better position here: it has already indexed all these pages, so finding those links is more or less a query against its own database. That's what Google's Social Graph API is for.

It’s also about friends

XFN is not only about your web sites but also about who your friends are. The same mechanism is used, but instead of "me" other values are used, and some of them can be combined, e.g.

<a href="http://tanya.example.org" rel="friend met colleague">...

This means that the person you link to is your friend, that you have met in real life, and that he or she is a colleague. A more general way of simply linking to contacts is to use the "contact" value. This is what Twitter, for example, does in its contact list. Here is an example with the hCard microformat mixed in:

<span class="vcard">
  <a href="http://twitter.com/Scobleizer"
     class="url" rel="contact" title="Scobleizer"><img
       alt="Scobleizer"
       class="photo fn" id="profile-image"
       src="...." height="24" width="24"></a>
</span>

In this case rel="contact" is the XFN part; the class values (vcard, url, photo, fn) belong to the hCard microformat.
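Since the rel attribute can carry several values at once, a parser has to split it and keep only the values XFN knows about. The sketch below does that with Python's standard html.parser; the class name XFNParser is made up for this example, and the value list is the XFN 1.1 vocabulary as I know it.

```python
from html.parser import HTMLParser

# The rel values defined by XFN 1.1.
XFN_VALUES = {
    "contact", "acquaintance", "friend", "met", "co-worker", "colleague",
    "co-resident", "neighbor", "child", "parent", "sibling", "spouse", "kin",
    "muse", "crush", "date", "sweetheart", "me",
}


class XFNParser(HTMLParser):
    """Collect (href, XFN values) pairs for every link that uses XFN."""

    def __init__(self):
        super().__init__()
        self.relations = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # Split the space-separated rel list and drop non-XFN values.
        values = set((attrs.get("rel") or "").lower().split()) & XFN_VALUES
        if values and attrs.get("href"):
            self.relations.append((attrs["href"], sorted(values)))


parser = XFNParser()
parser.feed('<a href="http://tanya.example.org" rel="friend met colleague">...</a>')
print(parser.relations)
# [('http://tanya.example.org', ['colleague', 'friend', 'met'])]
```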

Of course you need to crawl the web here to really retrieve all your contacts and to find out which friend links actually point to the same person. So you probably also need to follow all the rel="me" links for every friend. Otherwise you cannot know that

http://flickr.com/photos/mrtopf

and

are the same person unless there is some link graph of rel="me" links which connects them (e.g. by starting from my connect page).
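Once the rel="me" links have been collected, grouping URLs into persons is a connected-components problem on that link graph. The following sketch shows the idea; the function name group_identities and the example.org URLs are hypothetical, and for simplicity it treats every link as undirected, so it does not do the symmetry check mentioned earlier.

```python
from collections import defaultdict


def group_identities(me_edges):
    """Group URLs connected by rel="me" links into sets, one per person."""
    graph = defaultdict(set)
    for source, target in me_edges:
        graph[source].add(target)
        graph[target].add(source)

    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        # Depth-first walk over everything reachable from this URL.
        group, stack = set(), [start]
        while stack:
            url = stack.pop()
            if url in group:
                continue
            group.add(url)
            stack.extend(graph[url] - group)
        seen |= group
        groups.append(group)
    return groups


# Hypothetical (source page, rel="me" target) pairs found by a spider.
edges = [
    ("http://mrtopf.de/connect", "http://flickr.com/photos/mrtopf"),
    ("http://mrtopf.de/connect", "http://example.org/some-other-profile"),
    ("http://tanya.example.org", "http://example.org/tanya-photos"),
]
for group in group_identities(edges):
    print(sorted(group))
```

With this sample data the first three URLs end up in one group and the two tanya URLs in another.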

So using XFN is probably only useful if you want to write spiders. It might also mean that you cannot display information in real time, as spidering can take a while. Alternatively you can of course use Google's Social Graph API.

An XFN Spider in Python (actually two)

A while back I created a Python script which retrieves rel=me links starting from one page. It does no canonicalization and only outputs a list. But it should be easy to extend. There is one script I wrote initially and another one which does the same job but uses Linden Lab’s eventlet library for retrieving links in a non-blocking, more concurrent way.
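To give an idea of what such a spider does, here is a minimal sketch. It is not the code from the project: it swaps BeautifulSoup for the standard library's html.parser, works sequentially, and the names MeLinkParser, fetch and crawl_me_links as well as the max_pages limit are assumptions made for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class MeLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose rel contains "me"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "me" in (attrs.get("rel") or "").lower().split():
            if attrs.get("href"):
                self.hrefs.append(attrs["href"])


def fetch(url):
    """Download a page; network and URL errors yield an empty page."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def crawl_me_links(start_url, fetch=fetch, max_pages=100):
    """Breadth-first walk over rel="me" links, returning URLs in discovery order."""
    found, seen = [start_url], {start_url}
    queue, visited = deque([start_url]), 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        parser = MeLinkParser()
        parser.feed(fetch(url))
        for href in parser.hrefs:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                found.append(target)
                queue.append(target)
    return found
```

Calling crawl_me_links("http://mrtopf.de/connect") would start from my connect page. The fetch parameter can be replaced, e.g. by a function reading from a dictionary of saved pages, which makes the crawl logic easy to try out without network access.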

You can find both scripts at the Google Code project I created:

http://code.google.com/p/pydataportability/

Both scripts need the HTML parser BeautifulSoup to be installed, and the eventlet version obviously also needs eventlet (I used the SVN version).

Both scripts are called with a starting URL; optionally you can add -v for verbose output on what the script is doing.

Here is an example output:

./xfn2_eventlet.py http://mrtopf.de/connect
44 profiles found
http://mrtopf.de/connect
http://mrtopf.de
http://taotakashi.wordpress.com
http://www.slexchange.com/modules.php?name=Marketplace&MerchantID=13238
http://dev.comlounge.net
https://www.xing.com/profile/Christian_Scholz4
http://www.linkedin.com/in/mrtopf
http://mrtopf.tv
http://comlounge.tv
http://mrtopfde.blip.tv
http://taotakashi.blip.tv
http://flickr.com/photos/mrtopf
http://flickr.com/people/mrtopf
http://mrtopf.de
http://flickr.com/people/mrtopf/contacts
http://flickr.com/photos/taotakashi
http://flickr.com/people/taotakashi
http://flickr.com/people/taotakashi/contacts
http://www.facebook.com/profile.php?id=652229223

http://twitter.com/mrtopf/friends
http://pownce.com/mrtopf
http://pownce.com/mrtopf/friends
http://del.icio.us/mrtopf
http://www.facebook.com/profile.php?id=Tao Takashi
http://technorati.com/people/technorati/mrtopf
http://technorati.com/blogs/mrtopf.de%2Fblog
http://technorati.com/blogs/mrtopf.de%2Fpodcast
http://mrtopf.de/podcast
http://technorati.com/blogs/dev.comlounge.net
http://technorati.com/blogs/comlounge.net
http://technorati.com/blogs/mrtopf.blogspot.com
http://mrtopf.blogspot.com
http://technorati.com/blogs/mrtopf.tv%2Fvlog
http://mrtopf.tv/vlog
http://technorati.com/blogs/comlounge.tv%2Fblog
http://comlounge.tv/blog
http://technorati.com/blogs/twitter.com%2Fmrtopf
http://mrtopf.jaiku.com
http://www.last.fm/user/mrtopf
http://www.last.fm/user/mrtopf/friends
http://upcoming.yahoo.com/user/28980
http://www.slideshare.net/mrtopf

It's also interesting to see how the list keeps growing although I haven't added more services myself. It means that more and more services (like flickr) are adding those links themselves.
