Let's be Careful with Numbers, Okay?

S. Osokine.
Server Architect, Infolio.
6 Jul 2001.

   This is a comment on the recent articles about MusicCity (Kazaa) and Gnutella on the O'Reilly Network: P2P Articles:

Morpheus Out of the Underworld by Kelly Truelove and Andrew Chasin
Is All Music File-Sharing Piracy? by Richard Koman

   And when I say "a comment", I mean a comment - it is not a criticism. In fact, I liked both articles a lot. If anything, this "comment" is a case study in how information propagates over the Net, and in how myths are created.

   Anyway - while the first article provides a very thoughtful analysis of the MusicCity Network, it contains one very interesting statement:

   "...Gnutella, which Clip2 and Lime Wire have independently estimated with simultaneous users of about 40,000 (mean). The Morpheus-KaZaA count stands at over 300,000 at present..."

   This statement is entirely correct by itself - the estimates of both Clip2 and LimeWire are usually pretty accurate.

   What is missing here is the context. What these numbers - 40,000 and 300,000 - tell us is the number of machines in the corresponding networks at any given moment. That is not the same as the number of people who use each network.

   In fact, later on, Kelly and Andrew say:

   "...Also by default, Morpheus is launched and run in the background when the PC is booted. Further, the Morpheus application does not close when its window is closed; it simply minimizes to the system tray and keeps running in the background."

   Taken together, these two quotes paint a very clear picture of MusicCity as a network that uses "daemon clients" to maximize the clients' (and thus the P2P file servers') uptime and the content availability. A very sensible move, by the way. Over the past year I've probably spent a couple of thousand hours on Gnutella analysis and development, and I remember hearing similar suggestions on the Gnutella development forums and in private discussions quite a few times.

   However, I think that this suggestion was never implemented in the mainstream Gnutella clients. Normally the Gnutella client is just another application - not a daemon. You start it, you look for some files, you download them, and then you close the application. Your client is a part of the network, and your files are shared, only while you are looking for stuff yourself. If my memory serves me right, the average length of a Gnutella client session is just over one hour - probably several times shorter than a MusicCity client session. Naturally, with the same number of daily users, the Gnutella Network ends up several times smaller than the MusicCity Network at any given moment.
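
   The relationship between daily usage and network size can be made concrete with a rough steady-state estimate (Little's law): the average number of machines online at once is the number of sessions started per hour multiplied by the average session length in hours. The sketch below is only an illustration. The function name, the daily figure and the five-hour "daemon" session are made-up assumptions, not measurements of either network.

```python
def concurrent_hosts(daily_sessions, avg_session_hours):
    """Average number of hosts online at once (steady-state estimate).

    Little's law: concurrency = arrival rate * average time in system.
    Assumes sessions start evenly over a 24-hour day.
    """
    arrivals_per_hour = daily_sessions / 24.0
    return arrivals_per_hour * avg_session_hours


# Hypothetical figures, for illustration only.
daily_sessions = 240_000  # assumed identical daily usage for both clients

app_style = concurrent_hosts(daily_sessions, 1.0)     # roughly one-hour sessions
daemon_style = concurrent_hosts(daily_sessions, 5.0)  # assumed five-hour sessions

print(f"application-style client: {app_style:,.0f} hosts online")
print(f"daemon-style client:      {daemon_style:,.0f} hosts online")
```

   Under these assumptions the same daily usage yields a network five times larger for the daemon-style client, simply because each machine stays connected five times longer.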

   It is difficult to say why Gnutella clients work this way. Remember, Gnutella clients are developed by many independent vendors, and it is difficult to "reverse-engineer" their design decisions. If I had to guess, the reason might be that the current Gnutella Network with its 40,000 simultaneous users is bigger than the file search query propagation radius anyway - a search query can probably reach only about 5,000 to 10,000 hosts. (Which is a lot - you can find pretty much everything you'd want on such a network.) So the network already has more content than can be searched by a single client. Thus the thinking probably is: "why should we take measures to increase the network size if we cannot fully search even the current network?"

   Well - I think that this reasoning is a mistake, because the issue here is not only how many other clients you can search, but also how sure you can be that the other client won't be shut down in the middle of the file transfer. "Daemon clients" obviously increase the probability of a successful download, and failed downloads are right now the most serious problem for the Gnutella network. Still, every client developer has his own reasons for doing or not doing something. This diversity can be viewed as an advantage of the Gnutella network, in fact - no vendor can crash the network even if his latest client version has some very serious bugs. (And, by the way, I am sorry if there are daemon clients for Gnutella - I'm just not aware of them.)

   So Kelly and Andrew made a very clear and correct technical statement. Okay, I did wince when I read it, since I had a premonition that someone was going to misinterpret it, but still the statement was pretty clear if you kept the whole article in mind - it never mentioned the number of people using Gnutella or MusicCity over, say, a one-day or one-week period. The total number of users was also never mentioned.

   And then - just four days later - came the second article. In this article, Richard Koman says:

    "...MusicCity regularly boasts more than 300,000 simultaneous users, blowing away not only Napster but also Gnutella, which averages a mean of about 40,000 users."

   Note the use of the expression "blowing away" here. This is not a technical statement anymore, is it? What we have here is a subtle message that the wonderful MusicCity Network (no irony here, it really is pretty cool) attracts about eight times more people than the Gnutella Network.

   Now it's about time for the disclaimer. Here goes: I have no idea how many daily, weekly or total users there are in the Gnutella or MusicCity Networks.

   Okay, I might have some projections for the Gnutella Network - after all, I've seen quite a lot of statistical data about it. Still, I do not have enough data to compare the total numbers of users in these networks. For all I know, Gnutella can have more, fewer or the same number of users as MusicCity. What I'm saying is that nothing in Kelly and Andrew's article provides the information needed to make such a comparison even possible - much less to say that MusicCity "blows away" Gnutella in terms of the user base.
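
   To see why the two snapshot figures cannot settle the question, one can run a rough steady-state estimate backwards: the number of sessions per day is approximately the number of simultaneous hosts times 24, divided by the average session length in hours. In the sketch below only the 40,000 and 300,000 counts come from the quoted articles. The session lengths are hypothetical, the helper name is made up for this illustration, and a "session" is not the same thing as a distinct user.

```python
def daily_sessions(simultaneous_hosts, avg_session_hours):
    """Rough sessions per day implied by a snapshot count (steady state)."""
    return simultaneous_hosts * 24.0 / avg_session_hours


GNUTELLA_HOSTS = 40_000    # simultaneous hosts, as quoted in the article
MUSICCITY_HOSTS = 300_000  # simultaneous hosts, as quoted in the article

gnutella_session_hours = 1.0  # assumption: about one hour per session

# Try several hypothetical MusicCity session lengths (all assumptions).
for musiccity_session_hours in (1.0, 7.5, 15.0):
    g = daily_sessions(GNUTELLA_HOSTS, gnutella_session_hours)
    m = daily_sessions(MUSICCITY_HOSTS, musiccity_session_hours)
    print(f"MusicCity session of {musiccity_session_hours:4.1f} h -> "
          f"MusicCity/Gnutella daily ratio = {m / g:.2f}")
```

   Depending on the assumed session length, the same two snapshots are consistent with MusicCity having several times more daily sessions than Gnutella, the same number, or fewer. Without session data the comparison simply cannot be made.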

   Yes, the Gnutella network is smaller - but does it tell us anything about how many people prefer Gnutella to MusicCity or the other way around? No. It sure doesn't.

   On the other hand, I cannot blame Richard for falling into this trap - if I had not spent lots of time analyzing the Gnutella Network, I might have ended up with the same impression myself after reading the article by Kelly and Andrew (and again, they are not to blame either - everything they said was entirely correct, even though maybe just a little bit too vague).

   This is a very interesting situation: a piece of information (call it a meme or whatever) starts a life of its own. Sort of like a... not a virus, right? There's no executable code in these "40,000/300,000" statements, after all... Call it a "prion", maybe? A small piece of protein that cannot even replicate on its own, but is capable of transmitting Mad Cow Disease all right. I mean, I am afraid to even think how Richard's article will be quoted in a few days or weeks. How about this:

   "MusicCity Blows Gnutella Away!!!"

   And what is interesting, it is always Gnutella that is on the receiving end of such "prion diseases". I still remember the February 2001 article which stated that Gnutella is doomed because it does not scale. I was rolling on the floor laughing after I read it, because by then it was already pretty clear to all the insiders that today Gnutella is the only truly and infinitely scalable P2P network (if you wonder why, and don't mind working through a bit of math, see the flow control algorithm design document).

   The joke was on me, though. In the months that followed, every time I had a conversation with literally anyone who knew about P2P but was not intimately familiar with the Gnutella design, the first question I had to answer invariably was: "But I've heard that Gnutella does not scale?" Which translates as: "How come you are doing something that pathetic?"

   Oh well. Maybe it is just that the distributed, "P2P" marketing is less effective than the centralized one?

   And in any case, let's be careful with these numbers, okay? :-)

