Abstract:
Current web search technologies are good at finding similar pages based on their
content and link structures. However, they are not sufficient for finding pages that are
related through dictionary word meanings or cross-linguistic meaning relevance.
This thesis focuses on finding similar pages on the web by combining known
techniques. Link gathering and semantic web metadata parsing are required for web
content and structure mining. This thesis differs from other web mining methods in its
use of dictionary word meanings and cross-linguistic meanings. All of this information
is collected by web crawlers and indexed as data for web mining.
The indexed data is cleaned of non-useful words and misleading web sites, such as
advertisement sites. The clean data is then processed with clustering, a data mining
technique. Data processing includes enriching page relations with link distance levels
and shared content-word values.
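
The cleaning and enrichment steps above can be illustrated with a minimal sketch. It assumes that a "content word joint value" can be approximated by the Jaccard overlap of two pages' content words and that a "link distance level" is the number of hops between pages in the link graph; the stop-word list and the function names are hypothetical and are not taken from the thesis.

```python
from collections import deque

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


def content_words(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return {w for w in words if w and w not in STOP_WORDS}


def shared_word_score(text_a, text_b):
    """Jaccard overlap of the two pages' content words (0.0 to 1.0)."""
    a, b = content_words(text_a), content_words(text_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def link_distance(links, start, target):
    """Breadth-first search over outgoing links.

    `links` maps a page to the pages it links to. Returns the number
    of hops from `start` to `target`, or None if it is unreachable.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, dist = queue.popleft()
        if page == target:
            return dist
        for nxt in links.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

Both values could then be attached to each page pair as features for the clustering step.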
For the web mining process, two clustering algorithms, K-means and EM
(expectation-maximization), are compared to decide which one gives better results.
The chosen method lists the pages similar to the page the user selected at the start
of the process.
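
One way such a comparison might be set up is sketched below with scikit-learn. The feature values are invented for illustration, EM is represented by a Gaussian mixture model, and the silhouette score is an assumed comparison criterion; the abstract does not state which measure the thesis uses.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Hypothetical per-page features relative to the selected page (row 0):
# [link distance, shared content-word score]. All values are invented.
X = np.array([
    [0.0, 1.00],
    [1.0, 0.80],
    [1.0, 0.85],
    [5.0, 0.10],
    [6.0, 0.05],
    [5.0, 0.15],
])

# K-means: hard assignment of each page to the nearest centroid.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# EM: fit a two-component Gaussian mixture, then take the most likely component.
em_labels = GaussianMixture(
    n_components=2, covariance_type="diag", random_state=0
).fit_predict(X)


def similar_pages(labels, selected):
    """Indices of pages that share a cluster with the selected page."""
    same = labels[selected]
    return [i for i, lab in enumerate(labels) if lab == same and i != selected]


# Assumed comparison criterion: a higher silhouette score means tighter,
# better separated clusters.
scores = {
    "k-means": silhouette_score(X, kmeans_labels),
    "EM": silhouette_score(X, em_labels),
}

print(scores)
print(similar_pages(kmeans_labels, selected=0))
print(similar_pages(em_labels, selected=0))
```

On data this cleanly separated, both methods are expected to produce the same grouping; differences would only show up on noisier, overlapping page features.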