UNIVERSITY OF AARHUS
DEPARTMENT OF COMPUTER SCIENCE

Algorithms for Web Indexing and Searching, Fall 2002

Lecturers · Time and place · Course description · Schedule · Literature · Evaluation · Prerequisites · Literature · Course language · Credits

Lecturers

Gerth Stølting Brodal and Rolf Fagerberg.

Time and place

Mondays 13.15-14.00 and Wednesdays 9.15-11.00 in Auditorium D4.

Course description

With the growth of the web and other online resources in recent years, the problems associated with managing massive amounts of primarily textual data have become increasingly important. Besides classical string algorithms and data structures, a variety of algorithms and techniques have recently emerged for indexing, filtering, searching, and transmitting these online resources. As witnessed by e.g. Google(tm), application of such techniques can significantly improve performance for search engines on the web.

In this course we will study the algorithmic issues involved in efficient indexing and searching of the web. Starting with classic string algorithms and data structures, we will continue with recent techniques such as indexing schemes for massive amounts of data, link-based ranking algorithms, spectral methods for graph analysis, clustering and nearest neighbor search in high-dimensional spaces, statistical methods for clustering documents, efficient near-equality testing using randomized algorithms and cache replacement strategies. This list of subjects is tentative, and may evolve during the course.

[Pensum ]

Schedule

DateTopicsReading
September 2Introduction to course (slides ps | pdf)
Anatomy of Search Engines (slides ps | pdf)
[ACGPR01],[BP98]
September 4Anatomy of Search Engines
Web Crawling (slides ps | pdf)
[ACGPR01],[BP98],[NH01],[HN99],[NW01]
September 9Web Crawling and Web Dynamics (slides ps | pdf)
[RM02],[CG00]
September 12Focused crawling
Crawling the hidden web
[MPSR01],[CBD99],[RG00]
September 16Project discussion 
September 18I/O model (slides ps | pdf), Sorting (slides ps | pdf), B-trees
Indexing (slides ps | pdf)
[MRYG00],[WMB99],[GT98, Pages 660-663]
September 23Hashing[WMB99]
September 25Tries, Ternary trees[BS97],[MM90]
September 30 Project discussion 
October 2Suffix trees, Suffix arrays[U95],[MM90]
October 7Efficient computation of PageRank[H99]
October 9Efficient computation of PageRank, Authoritative sources[H99],[K99]
October 14Fall break 
October 16Fall break 
October 21Extensions of Kleinbergs algorithm [CDRRGK98],[CDGKRRT98],[BH98]
October 23Optimal merging (slides ps | pdf)
Classical information retrieval - models and ranking
[BB99, Chapter 3],[BR99, Section 2.5.1-3, 3.2, 8.4]
October 28Querying [BR99, Section 4.2-3]
October 30Fingerprinting, Document resemblance[B93],[B97],[BGNZ97]
November 4Clustering[B97],[BGNZ97]
November 6Clustering[HGI00]
November 11Clustering large matrices[DFKVV99]
November 13Latent semantic indexing[BDB94],[AFKMS01]
November 18Cancelled 
November 20Cancelled 
November 25Cancelled 
November 27Finding mirrored hosts[BBDH00]
December 2The web as a graph[GL02],[BKMRRSTW00],[KRRSTU00]
December 4Cancelled ("Åben dag") 
December 9Bayesian classifiers[L98]
December 11Cancelled (moved to December 12) 
December 12
15.15-17.00
Auditorium D4
Guest lecture
René DePont Christensen, Thomas Rask Thomsen (Dreamgate)
announcement
December 16SALSA[LM01]
December 18Computational game theory and mechanisms design[P01]

Literature

  1. [ACGPR01] Searching the Web, Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. ACM Transactions on Internet Technology, 1, p. 2-43, 2001.
  2. [BP98] The Anatomy of a Large-Scale Hypertextual Web Search Engine. Sergey Brin and Lawrence Page. Proceedings of the 7th International WWW conference, 1998.
  3. [NH01] High-Performance Web Crawling. Marc Najork and Allan Heydon. Compaq SRC Research Report 173
  4. [HN99] Mercator: A Scalable, Extensible Web Crawler. Allan Heydon and Marc Najork. In World Wide Web, December 1999, pages 219-229.
  5. [NW01] Breadth-First Search Crawling Yields High-Quality Pages. Marc Najork and Janet L. Wiener. In Proceedings of the Tenth Internal World Wide Web Conference, pages 114-118, May, 2001
  6. [RM02] Search engines and Web dynamics. Knut Magne Risvik and Rolf Michelsen. Computer Networks, Volume 39, Issue 3, 21 June 2002, Pages 289-302.
  7. [CG00] The Evolution of the Web and Implications for an Incremental Crawler. Junghoo Cho and Hector Garcia-Molina. Proceedings of 26th International Conference on Very Large Data Bases (VLDB), 10-14 September 2000, Cairo, Egypt, Pages 200-209.
  8. [MPSR01] Evaluating Topic-Driven Web Crawlers. Filippo Menczer, Gautam Pant, Padmini Srinivasan, and Miguel E. Ruiz. Proceedings SIGIR 01, September 9-12, 2001, New Orleans, Louisiana, USA, Pages 241-249.
  9. [CBD99] Focused crawling: a new approach to topic-specific Web resource discovery. Soumen Chakrabarti, Martin van den Berg, and Byron Dom. Eighth World Wide Web conference, Toronto, 1999.
  10. [RG00] Crawling the Hidden Web. Sriram Raghavan and Hector Garcia-Molina. Stanford Technical Report 2000-36. Full version of paper appearing in proceedings of 27th International Conference on Very Large Data Bases, Rome, Italy, September 11-14, 2001.
  11. [MRYG00] Building a Distributed Full-Text Index for the Web. Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. Stanford Technical Report 2000-55. Short version of paper appeared in the proceedings of the 10th International WWW conference, May 2-5, 2001, Hong Kong.
  12. [WMB99] Managing Gigabytes: Compressing and Indexing Documents and Images. Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Morgan Kaufmann Publishing, San Francisco, 1999. ISBN 1-55860-570-3.
  13. [BS97] Fast Algorithms for Sorting and Searching Strings. Jon Bentley and Robert Sedgewick. Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms. New Orleans, January, 1997. Pages 360- 369.
  14. [MM90] Suffix arrays: a new method for on-line string searches. Udi Manber and Gene Myers. Proceedings of the first annual ACM-SIAM symposium on Discrete algorithms, 1990. Pages 319 - 327.
  15. [U95] On-Line Construction of Suffix Trees. Esko Ukkonen. Algorithmica, 14, pages 249-260, 1995.
  16. [H99] Efficient Computation of PageRank. Taher H. Haveliwala. Stanford Technical Report 2000-36.
  17. [K99] Authoritative Sources in a Hyperlinked Environment. Jon M. Kleinberg. Journal of the ACM, 46(5), 604-632, 1999.
  18. [CDRRGK98] Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. Soumen Chakrabarti, Byron Dom, Prabhakar Raghavan, Sridhar Rajagopalan, David Gibson and Jon Kleinberg. Proceedings of the Seventh International World Wide Web Conference (WWW7). Brisbane, Australia. April 1998. Also appeared in: Computer Networks and ISDN Systems 30, 1998, pp.65-74
  19. [CDGKRRT98] Experiments in topic distillation. S. Chakrabarti, B. E. Dom, D. Gibson, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. In Proceedings of the ACM SIGIR Workshop on Hypertext Information Retrieval on the Web, Melbourne, Australia, 1998. ACM.
  20. [BH98] Improved Algorithms for Topic Distillationa in Hyperlinked Environments. Krishna Bharat and Monika R. Henzinger. Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 1998, pp. 104-111.
  21. [BB99] Understanding Search Engines: Mathematical Modeling and Text Retrieval. M.W. Berry and M. Browne. SIAM Book Series: Software, Environments, and Tools, (June 1999), ISBN: 0-89871-437-0.
  22. [BR99] Modern Information Retrieval. Ricardo Baeza-Yates, Berthier Ribiero-Neto. Addison Wesley Higher Education, 1999.
  23. [GT98] Data Structures and Algorithms in Java. M. Goodrich and R. Tamassia, Wiley, First edition, 1998.
  24. [B97] On the resemblance and containment of documents. Andrei Broder. In Compression and Complexity of Sequences (SEQUENCES'97), pages 21-29. IEEE Computer Society, 1998.
  25. [BGNZ97] Syntactic clustering of the Web. Andrei Broder, Steve Glassman, Mark Manasse, and Geoffrey Zweig. In Proceedings of the 6th International World Wide Web Conference, pages 391-404, April 1997, also appeared as SRC Technical Note 1997-015 (HTML).
  26. [B93] Some applications of Rabin's fingerprinting method. Andrei Z. Broder. In R. Capocelli, A. De Santis, U. Vaccaro (eds), Sequences II: Methods in Communications, Security, and Computer Science, Springer-Verlag, 1993.
  27. [HGI00] Scalable Techniques for Clustering the Web. Taher Haveliwala, Aristides Gionis, Piotr Indyk. Third International Workshop on the Web and Databases, 2000.
  28. [DFKVV99] Clustering in large graphs and matrices. P. Drineas, Alan Frieze, Ravi Kannan, Santosh Vempala, V. Vinay. Proceedings of the tenth annual ACM-SIAM symposium on Discrete algorithms, 1999.
  29. [BDB94] Using Linear Algebra for Intelligent Information Retrieval. Michael W. Berry and Susan T. Dumais, and Gavin W. O'Brien. December 1994. Published in SIAM Review 37:4 (1995), pp. 573-595.
  30. [AFKMS01] Spectral analysis of data. Yossi Azar, Amos Fiat, Anna R. Karlin, Frank McSherry, and Jared Saia. Proceedings of the 33rd annual ACM symposium on Theory of computing, pp 619-626, 2001.
  31. [BBDH00] A Comparison of Techniques to Find Mirrored Hosts on the WWW. Krishna Bharat, Andrei Broder, Jeffrey Dean, Monika R. Henzinger. Journal of the American Society for Information Science October 2000 Volume 51 Issue 12. Also in 1999 ACM Digital Library Workshop on Organizing Web Space (WOWS).
  32. [GL02] The Web Graph: an Overview. Jean-Loup Guillaume, Matthieu Latapy. In Proc. AlgoTel 2002.
  33. [BKMRRSTW00] Graph structure in the web. Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, Janet Wiener. In Proc. Ninth International World Wide Web Conference (WWW9), 2000.
  34. [KRRSTU00] Stochastic models for the Web graph. R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Proceedings of the 41th IEEE Symp. on Foundations of Computer Science. November 2000, pp. 57-65.
  35. [L98] Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval. D. D. Lewis. Proceedings of the 10th European Conference on Machine Learning (ECML-98), LNAI, Vol. 1398, pp. 4-18, Springer, April 21-23 1998.
  36. [LM01] SALSA: The Stochastic Approach for Link-Structure Analysis. R. Lempel and S. Moran. ACM Transactions on Information Systems 19(2), pp. 131-160, April 2001. Also in 9th International World Wide Web conference, 2000
  37. [P01] Algorithms, games, and the internet. Christos Papadimitriou.. Proc. 33rd Annual ACM Symposium on Theory of Computing, 2001, pp. 749-753, 2001.

Search engine project

Links to related pages


Evaluation
Implementation projects
Prerequisites
dADS
Literature
Handouts
Course language
Danish or English
Credits
2 points/10 ECTS

Page maintained by
Gerth Stølting Brodal <gerth@cs.au.dk>.