<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: The CouchDB indexer &#8211; lightweight search engine in hours</title>
	<atom:link href="http://knuthellan.com/2009/07/09/the-couchdb-indexer-lightweight-search-engine-in-hours/feed/" rel="self" type="application/rss+xml" />
	<link>http://knuthellan.com/2009/07/09/the-couchdb-indexer-lightweight-search-engine-in-hours/</link>
	<description>General geekyness and starting a company</description>
	<lastBuildDate>Sat, 05 Mar 2011 00:14:53 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>By: knuthellan</title>
		<link>http://knuthellan.com/2009/07/09/the-couchdb-indexer-lightweight-search-engine-in-hours/#comment-60</link>
		<dc:creator><![CDATA[knuthellan]]></dc:creator>
		<pubDate>Sat, 05 Mar 2011 00:14:53 +0000</pubDate>
		<guid isPermaLink="false">http://knuthellan.com/?p=185#comment-60</guid>
		<description><![CDATA[While writing that Quora post, I revisited this blog post and realized that it is indeed outdated. My presentation at JavaZone 2010, http://bit.ly/cO1AE5, shows how to do this with just a summing reducer, and I admit I should probably write a new blog post about that.]]></description>
		<content:encoded><![CDATA[<p>While writing that Quora post, I revisited this blog post and realized that it is indeed outdated. My presentation at JavaZone 2010, <a href="http://bit.ly/cO1AE5" rel="nofollow">http://bit.ly/cO1AE5</a>, shows how to do this with just a summing reducer, and I admit I should probably write a new blog post about that.</p>
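<p>A minimal sketch of what an index built on a summing reducer could look like. This is not the code from the presentation: the text field and the names mapDoc, sumReduce and runView are assumptions made for illustration, and the view engine is simulated locally instead of running inside CouchDB.</p>

```javascript
// Sketch only: a local simulation of a CouchDB view, not the code from the talk.
// Assumption: each document has an `_id` and a `text` field (hypothetical schema).

// Map step: emit one row per term occurrence, keyed by [term, docId], value 1.
function mapDoc(doc, emit) {
  const terms = doc.text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  for (const term of terms) emit([term, doc._id], 1);
}

// Reduce step: a plain sum. Adding partial sums gives the same total,
// so the same function also works when called as a rereduce.
function sumReduce(keys, values, rereduce) {
  return values.reduce((a, b) => a + b, 0);
}

// Tiny stand-in for the view engine: group rows by key, then reduce each group.
function runView(docs) {
  const groups = new Map();
  for (const doc of docs) {
    mapDoc(doc, (key, value) => {
      const k = JSON.stringify(key);
      if (!groups.has(k)) groups.set(k, []);
      groups.get(k).push(value);
    });
  }
  const result = {};
  for (const [k, values] of groups) result[k] = sumReduce(null, values, false);
  return result;
}

// Term counts per document for two small example documents.
const counts = runView([
  { _id: "a", text: "couch couch search" },
  { _id: "b", text: "search engine" },
]);
```

<p>Inside CouchDB the map function would call the global emit directly, and the built-in _sum reduce could stand in for sumReduce.</p>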
]]></content:encoded>
	</item>
	<item>
		<title>By: Quora</title>
		<link>http://knuthellan.com/2009/07/09/the-couchdb-indexer-lightweight-search-engine-in-hours/#comment-59</link>
		<dc:creator><![CDATA[Quora]]></dc:creator>
		<pubDate>Sat, 05 Mar 2011 00:03:24 +0000</pubDate>
		<guid isPermaLink="false">http://knuthellan.com/?p=185#comment-59</guid>
		<description><![CDATA[&lt;strong&gt;Is real-time search fundamentally different from various search services Google already provides?...&lt;/strong&gt;

The short answer is no. Search consists of data collection, creation of an index and ranking. Traditional data collection is done by crawlers following links. Obviously that&#039;s not optimal for real-time search unless you treat oft-changing sites differ...]]></description>
		<content:encoded><![CDATA[<p><strong>Is real-time search fundamentally different from various search services Google already provides?&#8230;</strong></p>
<p>The short answer is no. Search consists of data collection, creation of an index and ranking. Traditional data collection is done by crawlers following links. Obviously that&#8217;s not optimal for real-time search unless you treat oft-changing sites differ&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: knuthellan</title>
		<link>http://knuthellan.com/2009/07/09/the-couchdb-indexer-lightweight-search-engine-in-hours/#comment-33</link>
		<dc:creator><![CDATA[knuthellan]]></dc:creator>
		<pubDate>Sat, 21 Nov 2009 22:27:58 +0000</pubDate>
		<guid isPermaLink="false">http://knuthellan.com/?p=185#comment-33</guid>
		<description><![CDATA[A quick note: as of CouchDB 0.10, this approach will probably no longer work. That said, it&#039;s still fairly easy to build an indexer in CouchDB, but you have to either do more work in the mapper or summarize outside of CouchDB. While this takes some elegance out of the solution, I believe it will be more viable for larger data sets.]]></description>
		<content:encoded><![CDATA[<p>A quick note: as of CouchDB 0.10, this approach will probably no longer work. That said, it&#8217;s still fairly easy to build an indexer in CouchDB, but you have to either do more work in the mapper or summarize outside of CouchDB. While this takes some elegance out of the solution, I believe it will be more viable for larger data sets.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Scott</title>
		<link>http://knuthellan.com/2009/07/09/the-couchdb-indexer-lightweight-search-engine-in-hours/#comment-29</link>
		<dc:creator><![CDATA[Scott]]></dc:creator>
		<pubDate>Tue, 18 Aug 2009 17:09:28 +0000</pubDate>
		<guid isPermaLink="false">http://knuthellan.com/?p=185#comment-29</guid>
		<description><![CDATA[If you are interested, I wrote about my own experiences using CouchDB to compute TF-IDF: http://www.elusivesnark.com/2009/08/computing-tf-idf-with-couchdb-and.html

There doesn&#039;t seem to be much written about this yet, so I would be happy to hear your or others&#039; opinions.

Cheers!]]></description>
		<content:encoded><![CDATA[<p>If you are interested, I wrote about my own experiences using CouchDB to compute TF-IDF: <a href="http://www.elusivesnark.com/2009/08/computing-tf-idf-with-couchdb-and.html" rel="nofollow">http://www.elusivesnark.com/2009/08/computing-tf-idf-with-couchdb-and.html</a></p>
<p>There doesn&#8217;t seem to be much written about this yet, so I would be happy to hear your or others&#8217; opinions.</p>
<p>Cheers!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: knuthellan</title>
		<link>http://knuthellan.com/2009/07/09/the-couchdb-indexer-lightweight-search-engine-in-hours/#comment-27</link>
		<dc:creator><![CDATA[knuthellan]]></dc:creator>
		<pubDate>Fri, 14 Aug 2009 13:45:19 +0000</pubDate>
		<guid isPermaLink="false">http://knuthellan.com/?p=185#comment-27</guid>
		<description><![CDATA[I&#039;ve spent the last couple of days torturing CouchDB with this particular map reduce, and it seems to be too much for it, at least in the standard configuration.

I used pages from the Norwegian Wikipedia, and while I&#039;ve never seen the reduce_over_flow error, it quickly becomes dead slow. I started with 10000 articles, and the map reduce seemed to get almost stuck at a little above 300 tasks (out of 10000). I stopped it and tested with fewer documents. I didn&#039;t find the exact pain point, but it&#039;s somewhere between 100 and 10000; based on where it got stuck earlier, I guess it&#039;s around 300.

It might be that the reduce_over_flow you see is a hint that you need to handle rereduce.

Change the reduce function to the following, which handles rereduce:

function(keys, values, rereduce) {
  var docs = [];
  if (rereduce) {
    // values holds arrays returned by earlier reduce calls; flatten them
    for (var i = 0; i &lt; values.length; ++i) {
      for (var j = 0; j &lt; values[i].length; ++j) docs.push(values[i][j]);
    }
    return docs;
  }
  // first pass: values holds the raw emitted values
  for (var i = 0; i &lt; values.length; ++i) docs.push(values[i]);
  return docs;
}]]></description>
		<content:encoded><![CDATA[<p>I&#8217;ve spent the last couple of days torturing CouchDB with this particular map reduce, and it seems to be too much for it, at least in the standard configuration.</p>
<p>I used pages from the Norwegian Wikipedia, and while I&#8217;ve never seen the reduce_over_flow error, it quickly becomes dead slow. I started with 10000 articles, and the map reduce seemed to get almost stuck at a little above 300 tasks (out of 10000). I stopped it and tested with fewer documents. I didn&#8217;t find the exact pain point, but it&#8217;s somewhere between 100 and 10000; based on where it got stuck earlier, I guess it&#8217;s around 300.</p>
<p>It might be that the reduce_over_flow you see is a hint that you need to handle rereduce.</p>
<p>Change the reduce function to the following, which handles rereduce:</p>
<p>function(keys, values, rereduce) {<br />
  var docs = [];<br />
  if (rereduce) {<br />
    // values holds arrays returned by earlier reduce calls; flatten them<br />
    for (var i = 0; i &lt; values.length; ++i) {<br />
      for (var j = 0; j &lt; values[i].length; ++j) docs.push(values[i][j]);<br />
    }<br />
    return docs;<br />
  }<br />
  // first pass: values holds the raw emitted values<br />
  for (var i = 0; i &lt; values.length; ++i) docs.push(values[i]);<br />
  return docs;<br />
}</p>
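<p>To see what the rereduce branch is for, here is a small sketch that can be run outside CouchDB. The reduce function is the one above with the loops rewritten; the names reduceDocs and simulate and the chunking harness are assumptions for illustration and only imitate how a view engine might combine partial results.</p>

```javascript
// The reduce function from the comment, same behavior, loops rewritten.
function reduceDocs(keys, values, rereduce) {
  const docs = [];
  if (rereduce) {
    // values is a list of arrays returned by earlier reduce calls: flatten it.
    for (const partial of values) {
      for (const doc of partial) docs.push(doc);
    }
    return docs;
  }
  // First pass: values are the raw emitted values.
  for (const value of values) docs.push(value);
  return docs;
}

// Hypothetical harness: reduce the values chunk by chunk, then combine
// the partial results with a single rereduce call.
function simulate(values, chunkSize) {
  const partials = [];
  for (let start = 0; values.length > start; start += chunkSize) {
    const chunk = values.slice(start, start + chunkSize);
    partials.push(reduceDocs(null, chunk, false));
  }
  return reduceDocs(null, partials, true);
}

// Five emitted values reduced in chunks of two, then rereduced.
const combined = simulate(["doc1", "doc2", "doc3", "doc4", "doc5"], 2);
```

<p>The combined result is the same flat list a single reduce call would give. Note that the output still grows with the number of documents, which is likely why this reducer struggles on larger data sets.</p>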
]]></content:encoded>
	</item>
	<item>
		<title>By: Scott</title>
		<link>http://knuthellan.com/2009/07/09/the-couchdb-indexer-lightweight-search-engine-in-hours/#comment-26</link>
		<dc:creator><![CDATA[Scott]]></dc:creator>
		<pubDate>Tue, 11 Aug 2009 20:08:28 +0000</pubDate>
		<guid isPermaLink="false">http://knuthellan.com/?p=185#comment-26</guid>
		<description><![CDATA[It could also be down to my own inexperience :) I have only been playing with CouchDB for 2 days now. I pasted a small example to friendpaste:

http://friendpaste.com/2nQ841lJ7O8aP4iPBeRPgc

Apologies if I have misread something.]]></description>
		<content:encoded><![CDATA[<p>It could also be down to my own inexperience <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> I have only been playing with CouchDB for 2 days now. I pasted a small example to friendpaste:</p>
<p><a href="http://friendpaste.com/2nQ841lJ7O8aP4iPBeRPgc" rel="nofollow">http://friendpaste.com/2nQ841lJ7O8aP4iPBeRPgc</a></p>
<p>Apologies if I have misread something.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: knuthellan</title>
		<link>http://knuthellan.com/2009/07/09/the-couchdb-indexer-lightweight-search-engine-in-hours/#comment-25</link>
		<dc:creator><![CDATA[knuthellan]]></dc:creator>
		<pubDate>Tue, 11 Aug 2009 18:39:35 +0000</pubDate>
		<guid isPermaLink="false">http://knuthellan.com/?p=185#comment-25</guid>
		<description><![CDATA[I only ran this with about 100 documents for a demo application showing off something else, but I will definitely run a test as well. If CouchDB map reduce in general breaks down at 3000 documents, that means it needs fixing. It could of course be a combination of document length and number of documents, but 3000 documents is still surprisingly low; the limit should probably be much higher.]]></description>
		<content:encoded><![CDATA[<p>I only ran this with about 100 documents for a demo application showing off something else, but I will definitely run a test as well. If CouchDB map reduce in general breaks down at 3000 documents, that means it needs fixing. It could of course be a combination of document length and number of documents, but 3000 documents is still surprisingly low; the limit should probably be much higher.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Scott</title>
		<link>http://knuthellan.com/2009/07/09/the-couchdb-indexer-lightweight-search-engine-in-hours/#comment-24</link>
		<dc:creator><![CDATA[Scott]]></dc:creator>
		<pubDate>Tue, 11 Aug 2009 17:25:02 +0000</pubDate>
		<guid isPermaLink="false">http://knuthellan.com/?p=185#comment-24</guid>
		<description><![CDATA[Did you test this approach on a non-trivial database? I tried following your example on a database with ~3000 documents and got reduce_over_flow errors.]]></description>
		<content:encoded><![CDATA[<p>Did you test this approach on a non-trivial database? I tried following your example on a database with ~3000 documents and got reduce_over_flow errors.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

