Master of Engineering (by research)
2002 - 2003

Novel Techniques for Web-based Data Mining

     The objective of this project is to develop new techniques for building powerful web-based data mining applications, which enable end-users as well as e-commerce applications to turn the wealth of data on the Internet into knowledge. For example, with the proposed techniques, e-commerce applications can be developed that allow managers to understand exactly what customers experience when contacting their company, and so identify the actions needed to improve productivity and the processes that affect customer satisfaction. To achieve that goal, various techniques need to be explored to see how the relevant information embedded in a Web document in a semi-structured form, such as HTML or XML, can be pre-processed and extracted before any data mining tools are applied.
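
     As a rough illustration of the pre-processing step described above, the sketch below pulls the title and the visible text out of an HTML page using only Python's standard `html.parser` module. The names `TextExtractor` and `extract_text` are hypothetical and chosen for this example; this is a minimal sketch of one possible approach, not the extraction technique developed in the project.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the page title and visible text, skipping script/style content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._skip_depth = 0    # > 0 while inside <script> or <style>
        self._in_title = False  # True while inside <title>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth > 0:
            return
        if self._in_title:
            self.title = text
        else:
            self.text_parts.append(text)


def extract_text(html):
    """Return (title, visible_text) for an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.text_parts)


if __name__ == "__main__":
    page = (
        "<html><head><title>Shop</title><style>p{color:red}</style></head>"
        "<body><h1>Price list</h1><p>Widget: <b>$5</b></p>"
        "<script>var x=1;</script></body></html>"
    )
    print(extract_text(page))
```

     The extracted text can then be passed to whatever representation or mining step follows; markup-specific cues (headings, links, tables) are deliberately ignored here to keep the sketch short.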

     More specifically, the datasets used for investigation come from Internet-based information sources, which are made up of "traditional" databases with specific attributes, Web documents such as HTML pages, and newer forms of Web documents based on XML, such as XHTML. These Web documents may also contain data generated by electronic marketplaces and commerce. Various techniques need to be investigated and applied in order to carry out content-based analyses on those datasets. It is hoped that techniques can be developed for the comprehensive analysis of those less-structured datasets, so that the datasets can be 'accurately' converted into a more structured form, critical attributes can be discovered, and existing mining tools can be applied. A feasible e-commerce application can then be developed to collect the attributes that may influence consumers, managers and end-users in making their decisions.

     In this project, we would like to focus on the representation and clustering of web documents. While there has been active research on Web content extraction using text-based techniques, documents are in fact 2-dimensional entities and often include multimedia content. Hence new methods for the analysis of multimedia content will be required. How to represent a Web document quantitatively and accurately based on its contents is still an open research topic. The representation may strongly affect the quality of web document clustering in terms of precision and recall. Considering the huge volume of dynamic data (web documents) on the Internet, existing data clustering algorithms may not be appropriate for the clustering tasks. A new clustering algorithm should be explored in order to achieve both the high efficiency and the high precision/recall that these applications require.
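
     The text above does not say how precision and recall are measured for a clustering. One common convention, assumed here purely for illustration, is to count pairs of documents: a pair placed in the same cluster is a correct decision if the two documents also share the same reference category. The function name `pairwise_precision_recall` is hypothetical.

```python
from itertools import combinations


def pairwise_precision_recall(predicted, actual):
    """Pair-counting precision and recall of a clustering.

    predicted[i] is the cluster assigned to document i and actual[i] is its
    reference category. Precision is the fraction of same-cluster pairs that
    share a category; recall is the fraction of same-category pairs that end
    up in the same cluster.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    both = same_cluster = same_category = 0
    for i, j in combinations(range(len(predicted)), 2):
        in_same_cluster = predicted[i] == predicted[j]
        in_same_category = actual[i] == actual[j]
        same_cluster += in_same_cluster
        same_category += in_same_category
        both += in_same_cluster and in_same_category

    precision = both / same_cluster if same_cluster else 0.0
    recall = both / same_category if same_category else 0.0
    return precision, recall


if __name__ == "__main__":
    clusters = [0, 0, 1, 1, 1]
    categories = ["a", "a", "a", "b", "b"]
    print(pairwise_precision_recall(clusters, categories))
```

     Pair counting needs no mapping between cluster labels and category names, which is why it is convenient for a sketch, but it is quadratic in the number of documents and would have to be sampled or replaced for Web-scale collections.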

     There is an evident need for further discussion to identify the role of document analysis in Web document content extraction and the role of data clustering in Web document clustering, and to develop suitable techniques accordingly. We will look into these two areas in our project. We believe that those techniques are highly important and closely related to e-commerce applications. Web documents should be converted and represented in a more structured format, for example a feature vector that closely reflects the contents of the web document. Web documents with similar content should be grouped together dynamically and accurately in order to provide additional knowledge for users.
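
     To make the idea of a feature vector and content-based grouping concrete, the sketch below uses a plain term-frequency vector, cosine similarity, and a single-pass grouping rule. These are stand-ins chosen for brevity, not the representation or clustering algorithm developed in the project; the function names and the 0.5 threshold are assumptions made for this example.

```python
import math
import re
from collections import Counter


def feature_vector(text):
    """Map a document to a sparse term-frequency vector (word -> count)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_documents(texts, threshold=0.5):
    """Single-pass grouping of documents by content similarity.

    Each document joins the first group whose representative (the group's
    first member) has cosine similarity >= threshold, otherwise it starts a
    new group. Returns a list of groups, each a list of document indices.
    """
    groups = []  # list of (representative_vector, member_indices)
    for index, text in enumerate(texts):
        vector = feature_vector(text)
        for representative, members in groups:
            if cosine_similarity(vector, representative) >= threshold:
                members.append(index)
                break
        else:
            groups.append((vector, [index]))
    return [members for _, members in groups]


if __name__ == "__main__":
    documents = [
        "cheap laptop deals online",
        "laptop deals cheap today",
        "football match results",
        "football results and match report",
    ]
    print(group_documents(documents))
```

     Because documents are processed one at a time, new pages can be assigned to groups as they arrive, which matches the "dynamic" grouping described above; the price is that the result depends on document order and on the chosen threshold.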

 


Publications

  • MEng thesis titled “iSEARCH Framework: Using Web Structure and Summarisation Techniques for Web Content Mining”. Available as a PDF electronic document under NTU Publications.

  • Conference paper titled “SVD: A Novel Content-Based Representation Technique for Web Documents” for the 4th International Conference on Information, Communications & Signal Processing and Pacific-Rim Conference on Multimedia (ICICS-PCM 2003)

  • Journal paper titled “Using Web structure and summarisation techniques for Web content mining” (Published in Information Processing & Management: Volume 41, Issue 5, September 2005, Pages 1225-1242) - DOI Bookmark: 10.1016/j.ipm.2004.08.003




Copyright© 2007 by Chue Wai Lian

Last updated on Wednesday, 26 December 2007

Please send all mails to wailian at hotmail dot com