Citeseer and Web Noise

Citeseer works by crawling web pages containing links to PDF documents (usually scientific or technical papers). These web pages have to be manually submitted by the author, although nothing is done to verify whether the author really submitted the page or whether it was someone else.

The Citeseer crawler then traverses those web pages and extracts elements such as titles, authors, and citations from the linked PDF documents. The PDF documents are cached, and the extracted information is indexed and made available through the Citeseer web site. All this sounds good enough. The problem is that this task is far from simple, and the success of Citeseer is modest. As a result, one side effect of Citeseer is that it adds a lot of noise to the Web.

By noise I refer, for example, to how common it is to find the same document multiple times with slightly different names or lists of authors, because the crawler found similar, but not identical, PDF documents and is not smart enough to recognise them as the same paper. Take for example this paper, which is missing the authors and pretty much all its metadata (date, journal, etc.) except for its abstract. The same paper seems to appear again, this time with the correct author but with the abstract missing. There are thousands of similar cases.
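To make the duplicate problem concrete, here is a minimal sketch of how near-identical records could be grouped by comparing normalised titles. This is not how Citeseer works internally (I do not know what it does); the function names, the record layout, and the 0.9 similarity threshold are all assumptions made up for illustration.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", ascii_only.lower())
    return " ".join(cleaned.split())


def likely_same_paper(title_a: str, title_b: str, threshold: float = 0.9) -> bool:
    """Return True if two titles are similar enough after normalisation.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    a, b = normalize_title(title_a), normalize_title(title_b)
    if not a or not b:
        return False  # an empty title gives us nothing to compare
    return SequenceMatcher(None, a, b).ratio() >= threshold


def group_duplicates(records, threshold: float = 0.9):
    """Greedily group records (dicts with a 'title' key) by title similarity.

    Each record is compared against the first record of every existing group.
    """
    groups = []
    for record in records:
        for group in groups:
            if likely_same_paper(record["title"], group[0]["title"], threshold):
                group.append(record)
                break
        else:
            groups.append([record])
    return groups
```

With something like this, the two records in the example above (one without authors, one without abstract) would land in the same group and their metadata could be merged instead of being listed twice. Real deduplication would also need to compare authors, years, and the document text, since different papers can have very similar titles.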

The problem with Unknown authors is serious enough, but because the search capability does not seem to work very well, searching for Unknown as the author yields only 4 results. Simply spending a few minutes browsing the site makes it obvious that there are several orders of magnitude more documents with that problem.

Another problem with Citeseer is that it caches the PDF files. This can be useful if the web page that originally contained them disappears. However, it is not very useful if the web page still exists, and it can even be a problem if the original PDF is updated at some point while the old version remains cached by Citeseer.
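The caching issue boils down to a simple decision, sketched below under my own assumptions: keep the cached copy only when the source is gone, and refresh it when the source has changed. The function names and the returned labels are hypothetical, and I am not claiming Citeseer does or does not do anything like this.

```python
import hashlib
from typing import Optional


def content_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying the exact file contents."""
    return hashlib.sha256(data).hexdigest()


def cache_action(cached: bytes, current: Optional[bytes]) -> str:
    """Decide what to do with a cached PDF.

    `current` is the PDF as found at the original location right now,
    or None if the original page or file no longer exists.
    """
    if current is None:
        # The source disappeared: this is the case where the cache is useful.
        return "keep-cached-copy"
    if content_digest(cached) != content_digest(current):
        # The author updated the PDF: the cached copy is now out of date.
        return "refresh"
    return "up-to-date"
```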

My experience with Citeseer

A few years ago I had the silly idea of submitting my own web page to Citeseer, only to find out soon after that all my papers were listed with serious errors or omissions like those mentioned above. I manually fixed those errors by directly editing the metadata in Citeseer, which took me hours. Such changes then take weeks or months to appear, because they are manually verified by the Citeseer staff. Unfortunately, the errors appeared again months after they had been fixed. I suppose the corrections are lost when the index is rebuilt, or something similar. I think I went through the same process three times with the same result.

At this point I decided it had been a mistake to submit my web page to Citeseer and a waste of time to try to fix those errors again. I therefore contacted Lee Giles, who is the director of Citeseer, and asked him to please remove those documents from Citeseer. Well, this is when I realised that Citeseer is not only quite questionable from a technical point of view, but that its people are also quite peculiar.

It seems to me that Lee Giles did not take it very well that I did not want to have my documents in Citeseer (and I am talking about second-rate publications with a rather small number of citations and little relevance in the field), because he then started to create all kinds of obstacles to having those documents removed. This included requiring me to provide authorisation from all coauthors of the papers. This is simply stupid, since those papers can be found on multiple other sites (including the coauthors' own web pages, Google Scholar, DBLP, etc.), and those coauthors did not have to provide authorisation to have their papers crawled by Citeseer (it would make no sense at all). This request not only made no sense, it was also impossible for me to fulfil, because I no longer have any contact with some coauthors of papers written many years ago (e.g. students who are long gone).

In the end I just decided to give up and forget about the whole thing, even though it means having a lot of noise about my papers in Citeseer. I just hope that anyone who wants to download or cite one of my papers does so using a version not found in Citeseer, for the sake of correctness.


My advice would be to be very careful before submitting something to Citeseer. They are not just very limited in their success, with huge downtimes and all the problems mentioned above; they also have a questionable attitude towards their users, and in general I found it to be a very unpleasant experience.