Masters of Engineering (by research)
The objective of this project is to develop new techniques for building powerful web-based data mining applications that enable end-users and e-commerce applications to turn the wealth of data on the Internet into knowledge. For example, with the proposed techniques, e-commerce applications could be developed that allow managers to understand exactly what customers experience when contacting their company, and so to identify the actions needed to improve productivity and the processes that affect customer satisfaction. To achieve that goal, various techniques need to be explored to see how the relevant information embedded in a Web document in semi-structured form, such as HTML or XML, can be pre-processed and extracted before any data mining tools are applied.
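As an illustration of the pre-processing step described above, the following is a minimal sketch of pulling the visible text out of an HTML document using only the Python standard library. The class name `TextExtractor`, the function `extract_text` and the sample markup are hypothetical and chosen for illustration; the project does not prescribe a particular extraction method.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores the content of script and style tags."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []        # text fragments in document order
        self._skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML string as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical sample page, used only to exercise the sketch.
sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Order status</h1><p>Delivery was <b>late</b>.</p></body></html>"
)
text = extract_text(sample)
```

The extracted text could then be passed to whatever representation or mining step follows. Real pages would likely also need handling for encodings, malformed markup and layout information, which this sketch leaves out.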
More specifically, the datasets under investigation come from Internet-based information sources: "traditional" databases with specific attributes, Web documents in HTML, and newer forms of Web documents based on XML, such as XHTML. These Web documents may also contain data generated by electronic marketplaces and e-commerce. Various techniques need to be investigated and applied in order to carry out content-based analyses of those datasets. We hope to develop techniques so that these less-structured datasets can be analysed comprehensively, converted accurately into a more structured form, and searched for critical attributes, after which existing mining tools can be applied. A feasible e-commerce application could then be developed to collect the attributes that may influence consumers, managers and other end-users in making their decisions.
In particular, we would like to focus on the representation and clustering of Web documents in this project. While there has been active research on Web content extraction using text-based techniques, documents are in fact two-dimensional entities and often include multimedia content, so new methods for analysing multimedia content will be required. How to represent a Web document quantitatively and accurately based on its contents is still an open research topic. The representation may strongly affect the quality of Web document clustering in terms of precision and recall. Given the huge volume of dynamic data (Web documents) on the Internet, existing data clustering algorithms may not be appropriate for the clustering task. A new clustering algorithm should be explored to achieve both the high efficiency and the high precision and recall that these applications require.
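Precision and recall can be defined for a clustering in several ways. One common choice, assumed here purely for illustration and not stated in the project description, is the pairwise definition: a pair of documents counts as a true positive when it shares a cluster in both the predicted clustering and a reference labelling. The function name `pairwise_precision_recall` and the sample labels are hypothetical.

```python
from itertools import combinations


def pairwise_precision_recall(predicted, reference):
    """Pairwise precision and recall of a clustering.

    predicted and reference are equal-length sequences of labels,
    where position i holds the label of document i.
    """
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same length")

    tp = fp = fn = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_pred = predicted[i] == predicted[j]
        same_ref = reference[i] == reference[j]
        if same_pred and same_ref:
            tp += 1          # pair correctly placed together
        elif same_pred:
            fp += 1          # pair grouped together but should be apart
        elif same_ref:
            fn += 1          # pair split apart but should be together

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels for four documents.
precision, recall = pairwise_precision_recall([0, 0, 1, 1], ["a", "a", "a", "b"])
```

The loop visits every pair of documents, so its cost grows quadratically with the collection size. For Web-scale collections the counts would more plausibly be derived from a contingency table or estimated on a sample.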
Further discussion is clearly needed to identify the role of document analysis in Web document content extraction and the role of data clustering in Web document clustering, and to develop suitable techniques accordingly. We will look into these two areas in our project. We believe that these techniques are highly important and closely related to e-commerce applications. Web documents should be converted and represented in a more structured format, for example as a feature vector that closely reflects the contents of the document. Web documents with similar content should be grouped together dynamically and accurately in order to provide additional knowledge for users.
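To make the idea of a feature vector and dynamic grouping concrete, here is a rough sketch under several assumptions that are not part of the project description: documents are represented by raw term counts, similarity is measured by cosine similarity, and grouping uses a simple single-pass ("leader") scheme in which each incoming document joins the most similar existing cluster or starts a new one. The names `term_vector`, `cosine` and `single_pass_cluster`, and the threshold value, are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a document as a bag-of-words vector of lowercase term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(texts, threshold=0.3):
    """Assign each document, in arrival order, to the most similar cluster.

    A document starts a new cluster when no existing cluster reaches the
    (assumed) similarity threshold. Returns lists of document indices.
    """
    clusters = []  # each entry: {"centroid": Counter, "members": [indices]}
    for idx, text in enumerate(texts):
        vec = term_vector(text)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [idx]})
        else:
            # Summed counts act as the centroid; cosine ignores the scale.
            best["centroid"].update(vec)
            best["members"].append(idx)
    return [cluster["members"] for cluster in clusters]


# Hypothetical documents, used only to exercise the sketch.
docs = [
    "cheap flights to paris",
    "cheap paris flights and hotels",
    "python clustering tutorial",
    "tutorial on clustering in python",
]
groups = single_pass_cluster(docs)
```

Because each document is processed once on arrival, the scheme suits a changing collection, but its result depends on arrival order and on the threshold. A weighted representation such as TF-IDF, or features drawn from layout and multimedia content, could replace the raw counts without changing the overall structure.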
Copyright© 2007 by Chue Wai Lian
Last updated on Wednesday, 26 December 2007
Please send all mail to wailian at hotmail dot com