Hacker News RSS Feed + Readability (hacketal.com)
44 points by nirmal on April 1, 2009 | hide | past | favorite | 24 comments


Now that this made it to the front page, I just realized that the page content is not pulled correctly by my Python code. :)

The Readability bookmarklet seems to do it just fine. I will need to investigate.

It seems to work fine with most blogging systems. I don't use one. I use Markdown, Python and rsync.

UPDATE: I received a request to add a Comments link to the top and did that. Makes sense if you like to check the comments before deciding what's worth reading. It will probably take a while before the code is reloaded.


Awesome! Thanks Nirmal! Talk about customer service.

Now, if only it had better twitter integration ...


Yeah, apostrophes seem to turn into some weird jumble (e.g. â€™) and so do other characters. Aside from that, very nice work. Added to my reader.


Any clues as to why that happens? The code simply inserts the HTML of the content sections of the linked sites into the RSS feed. It doesn't seem to happen with all sites.
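For what it's worth, a minimal sketch of that insertion step (the function name is mine, not from the actual code): the extracted HTML has to be XML-escaped (or CDATA-wrapped) before it goes into the feed, or the feed itself stops being well-formed:

```python
from xml.sax.saxutils import escape

def rss_item(title, link, content_html):
    # Escape the extracted HTML so the feed stays well-formed XML;
    # feed readers unescape the description back into markup.
    return ("<item>"
            "<title>%s</title>"
            "<link>%s</link>"
            "<description>%s</description>"
            "</item>") % (escape(title), escape(link), escape(content_html))

item = rss_item("Example post", "http://example.com/post",
                "<p>Hello &amp; welcome</p>")
```

Note that any entities already in the source HTML get double-escaped here (`&amp;` becomes `&amp;amp;`), which is correct: the reader unescapes once and ends up with the original markup.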


It probably happens when the real text encoding differs from the declared one (at least that's the problem I encountered when aggregating content from the unfiltered wild web).

If it really bothers you, you can use chardet [1] to try to detect the real encoding (BeautifulSoup should use it if it's installed). But even this is not 100% foolproof.

[1] http://chardet.feedparser.org/
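A rough sketch of that approach (the helper name is mine; chardet.detect() returns a dict with an "encoding" guess and a confidence score), with a fallback so it still runs where chardet isn't installed:

```python
def decode_page(raw, declared="utf-8"):
    # Ask chardet for its best guess when available; otherwise
    # trust the declared encoding. Neither is 100% foolproof.
    try:
        import chardet
        encoding = chardet.detect(raw)["encoding"] or declared
    except ImportError:
        encoding = declared
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Last resort: use the declared encoding and replace
        # any bytes that don't fit.
        return raw.decode(declared, errors="replace")

print(decode_page(b"plain ascii survives any guess"))
```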


Yep, â€™ means that the page contains the UTF-8 quote character, but the browser renders the byte stream as if it were a single-byte character stream.

The crux is that the basic alphabet is encoded the same. So you only notice it with special characters such as the curly quote and the em-dash.
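You can reproduce the jumble in a couple of lines, assuming the single-byte interpretation is Windows-1252 (the usual culprit in browsers):

```python
# U+2019, the curly apostrophe, is three bytes in UTF-8.
raw = "don\u2019t".encode("utf-8")   # b'don\xe2\x80\x99t'
# Reading that byte stream as Windows-1252 maps each byte to its own
# character: the ASCII letters survive, the quote becomes three glyphs.
print(raw.decode("cp1252"))          # donâ€™t
```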


Is there any concern here for the missed click-throughs for content creators for purposes of metrics and/or advertising revenue?

I've not been in a situation to have to worry about either, but I know that site owners make a conscious decision when they formulate the contents of their own RSS feeds. Some clearly don't want to provide all the goods in feed readers without an ad stream to complement it.

I haven't gone through a lot of the feed yet so I apologize if I am missing some way you've already addressed this.

Regardless, Thanks! I really appreciate this feed.


I suppose I could try to go to each linked article, find its RSS feed, and see if it normally exposes the content, but that seems quite complicated. I don't know if missed click-throughs are why pg doesn't do something similar for the existing HN RSS feed. I suspect that it's hard to make it always have the correct content. This is clearly evident in the fact that my own page doesn't get parsed correctly. :)


I'm not super familiar with pg's RSS feed but doesn't it just provide a link to the article? That doesn't bypass the click-through to the content provider. It encourages it, just like the HN homepage.


Right, so maybe that's why pg doesn't do something like this feed for the regular feed. Also, there's the fact that no heuristic will be 100% reliable in determining the actual content area.


"...that seems quite complicated."

That doesn't mean you shouldn't do it. I see you're a PhD student. What if I wrote a program that would let people republish your research in full without your prior consent? Yes, I could add a feature that limits this program to only working with research that's been approved for it, but that seems quite complicated.

This Readability service is wrong. I guess that's why it's popular?


I'm sorry that you mistook my description of the task difficulty as meaning that I wouldn't do something complicated. Like you said, I'm a PhD student, which incidentally means that I have many other demands on my time and do many difficult things every day.

"What if I wrote a program that would let people republish your research in full without your prior consent"

Do it.

Incidentally, my hack isn't the original Readability service. My inspiration is http://lab.arc90.com/experiments/readability/


Nirmal, that Readability service is incredible. I tried it on my blog - I hate to admit it, but it does look a lot better. One thing that makes it much easier to read is that it strips out the comments. I am not sure how I feel about this, but I suspect for 80% of readers (and likely many of the people who use the bookmarklet) the comments are just noise.

Similarly, stripping the navigation also makes it easier to read, but then it loses its value as a website.

I wonder if there is a readability wordpress plugin that could display the readability version in a JS pop-up overlay. I think if I can find one, I'll add that to my site.

Edit: I just tried it on DaringFireball. Beautiful.


That's not the point, though.

You may be ok with it, but what about all the other folks out there? Checking with each of them would be the right thing to do, but complicated.

The culture of lifting original information and data regardless of the wishes of the original creator is a bad thing.


Let's make this a constructive debate. Do you have any thoughts on a system that could be developed to secure permissions of the original creator?


Well, the traditional approach is to contact them and secure their permission. But as you alluded to, that doesn't scale.

So perhaps there's a startup opportunity here, for a system that secures permission from people who are ok with their content being reproduced without their prior consent.

Perhaps there could be a service that creates a list of web sites where the content is licensed Creative Commons. Then, if someone was going to create a web app like this, they could filter that list against their web app to only include sites that have been licensed creative commons.
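A toy sketch of that filter (the domain list and helper name are hypothetical, just to show the shape): expand full content only when the linked site's host is on an opt-in whitelist of CC-licensed sites:

```python
from urllib.parse import urlparse

# Hypothetical opt-in list of Creative Commons-licensed sites.
CC_WHITELIST = {"example.org", "blog.example.net"}

def may_embed_full_content(url):
    # Match the exact host or any subdomain of a whitelisted domain.
    host = urlparse(url).hostname or ""
    return host in CC_WHITELIST or any(
        host.endswith("." + domain) for domain in CC_WHITELIST)
```

Non-whitelisted links would then fall back to a plain title-plus-link item, like the regular HN feed.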


I just replied to an email about my server hanging when trying to retrieve the python code. If you have a similar problem just email me I will reply and attach the code.

Email address and twitter are in my profile.

Also, any fixes are much appreciated. Patch file not required. I've gotten a few emails that just show me where I should add code and why. :)


As also pointed out by BeautifulSoup's author, I would suggest you use lxml (http://codespeak.net/lxml/). It is much faster. I've migrated a couple of my projects to it.
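A quick sketch of the swap (with a crude stdlib fallback so it runs even where lxml isn't installed):

```python
html_doc = "<html><body><div id='content'><p>Hello world</p></div></body></html>"

try:
    # lxml wraps libxml2 in C, so parsing is much faster than
    # pure-Python parsers.
    from lxml import html
    text = html.fromstring(html_doc).text_content().strip()
except ImportError:
    # Crude tag-stripping fallback, fine for this toy document only.
    import re
    text = re.sub(r"<[^>]+>", "", html_doc).strip()

print(text)  # Hello world
```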



There seems to be a lot of extraneous HTML and CSS in the feed. Good to know about pipes though.


Nice. I only follow HN in Reader, and this makes that even easier. Thanks a lot. Now if we could only get full feeds from the NYTimes...


Same for me. This combined with the Google Reader keyboard shortcuts makes it much easier and faster to go through HN content.


Wow, combining my favorite bookmarklet with my favorite news site. Well done!


This is great, thanks so much for this!



