Syllabus                                                          

“Geospatial Data Mining” (E-Learning)

English e-learning course, corresponding to the contents of a course held at the Institute of Statistics and Information Management, Universidade Nova de Lisboa.

Teachers

Prof. Doutor Fernando Lucas Bação

http://www.isegi.unl.pt/ensino/docentes/fbacao/index.html

 

Prof. Doutor Victor Lobo

http://www.isegi.unl.pt/docentes/vlobo/

 

 

Version 5

Date: 05.09.2007

 

This document is the roadmap for your study of “Geospatial Data Mining”. Everything you need to know will be here. The document will be updated frequently as the course advances, so please check this page regularly for news.

 

 

        Goals

Students completing this course should be able to:

·       Define Data Mining.

·       Explain the characteristic features of Data Mining.

·       Explain why Data Mining can be a valuable addition in the context of GIScience.

·       Analyse the implications of the geo prefix in Geographic Data Mining.

·       Understand the basic data preparation and pre-processing tasks.

·       Understand what a Self-Organizing Map is and how it works.

·       Use Self-Organizing Maps in unsupervised classification tasks.

·       Understand what a Multi-layer Perceptron is and how the backpropagation training algorithm works.

·       Understand what a Classification Tree is and how it works.

·       Use Classification Trees and Multi-Layer Perceptron Neural Networks in supervised classification tasks.

 

        Content

 

Part 1:

Part 1 of the course provides the basic concepts of data mining and knowledge discovery. The student is introduced to the different perspectives from which data mining can be viewed. Emphasis is placed on geospatial (or geographical) data mining and on what the geo prefix implies. Perspectives from different authors are presented so that students can produce their own synthesis.

Reading List:

1.      Definition of data mining, “Knowledge Discovery and Data Mining: towards a unifying framework”, Fayyad, Piatetsky-Shapiro and Smyth.

2.      Data Mining: Statistics and More?, by David J. Hand

3.      Is inductive machine learning just another wild goose (or might it lay the golden egg)? – the perspective from Mark Gahegan

4.      Geographical data mining: key design issues - the perspective from Stan Openshaw

5.      Geospatial data mining – my own biased perspective.

6.      Is spatial data special?, “On the Particular Characteristics of Spatial Data and its Similarities to Secondary Data Used in Data Mining”, Bação, F., Lobo, V., Painho, M., GIS PLANET 2005.

 

SELF-TEST

 

Part 2:

In Part 2 we deal with the concepts of unsupervised classification. The student starts by reading a document about unsupervised classification, followed by a short document on the fundamentals of data preparation and pre-processing. Next, two different tools are presented: the k-means algorithm and the self-organizing map. The part closes with a discussion of profiling and of the tools available to explore the unsupervised classification results.

The clustering/exploratory tools presented in this part are an important set of tools in the context of GIScience. In many circumstances the researcher just wants to improve his knowledge of the problem at hand, with few, if any, assumptions. In these cases clustering/exploratory tools are an important resource for reducing the data and improving knowledge about the phenomena. K-means clustering is a well-known and heavily used tool with a long history in the human and social sciences. Its simplicity and efficiency have contributed to its long-standing popularity, and it continues to be the most widely used clustering tool. The Self-Organizing Map (SOM) is a more recent tool: a neural network which can be seen as a “visualization and analysis tool for high dimensional data”. Depending on how the SOM is used, it may overlap with k-means. Nevertheless, the range of applications of the SOM is substantially larger, enabling the user to apply it in different situations.
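To make the idea behind k-means concrete, here is a minimal sketch in Python. It is not part of the course software. The function names, the optional `init` argument and the toy points are our own assumptions, chosen only for illustration. The sketch alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Basic k-means. `init` (hypothetical helper argument) fixes the
    starting centroids; otherwise k data points are drawn at random."""
    rng = random.Random(seed)
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


if __name__ == "__main__":
    # Toy data: two well-separated groups of 2-D points.
    pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
    print(kmeans(pts, 2, init=[pts[0], pts[3]]))
```

In practice the result depends on the starting centroids, which is one reason why the data preparation and profiling steps discussed in this part matter.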

The fundamental objective here is to produce “informed users” who are able not only to understand the underpinnings of the algorithms, but also to use them, and all accompanying tools, in a useful and sound fashion.

Reading List:

1.      Introduction to Unsupervised Classification, from Fernando Bação and Victor Lobo

2.      The fundamentals of clustering, from Statsoft Electronic Book

3.      Fundamentals of data preparation and pre-processing, from Fernando Bação and Victor Lobo (“Data Preprocessing for Supervised Learning” by S. B. Kotsiantis, D. Kanellopoulos and P. E. Pintelas) (Data Preparation Exercises)

4.      The k-means algorithm, from Statsoft Electronic Book

5.      The self-organizing map (SOM), by Fernando Bação and Victor Lobo

6.      Samuel Kaski and Teuvo Kohonen, “Exploratory data analysis by the self-organizing map: Structures of welfare and poverty in the world.” In Apostolos-Paul N. Refenes, Yaser Abu-Mostafa, John Moody, and Andreas Weigend, editors, Neural Networks in Financial Engineering, pages 498--507. World Scientific, Singapore, 1996.

7.      Result interpretation and profiling, by Fernando Bação and Victor Lobo

Demos and Applets:

·         SOFM – My favourite demo on the workings of a SOM. Every time I need to explain the SOM, this always seems the easiest way to do it.

·         Interactive Self-Organizing Map demonstrations – two applets from the HUT people in Finland.

·         Our own demos, developed by Roberto Henriques, one of our whiz programmers. This is only a movie; in the future we will develop an applet for the internet. The software was developed for geographic applications, but it is nevertheless a great tool for understanding the basics of the SOM.
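For those who prefer reading code to watching demos, the training loop of a SOM can be sketched in a few lines of Python. This is only an illustration under our own assumptions: it uses a one-dimensional chain of units, a Gaussian neighbourhood and a linear decay of the learning rate and radius. It is not the algorithm as implemented in SOM_PAK or in any of the demos above, and all names in it are hypothetical.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest to x."""
    return min(
        range(len(weights)),
        key=lambda j: sum((w - v) ** 2 for w, v in zip(weights[j], x)),
    )


def train_som(data, n_units, epochs=50, lr0=0.5, radius0=None, seed=0):
    """Train a 1-D SOM (a chain of n_units units) on a list of vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    if radius0 is None:
        radius0 = n_units / 2.0
    # Start every unit at a randomly chosen data vector.
    weights = [list(rng.choice(data)) for _ in range(n_units)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # one shuffled pass over the data
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                      # learning rate shrinks
            radius = max(radius0 * (1.0 - frac), 0.5)    # neighbourhood shrinks
            bmu = best_matching_unit(weights, x)
            for j in range(n_units):
                # Units close to the winner on the chain move the most.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * radius ** 2))
                for d in range(dim):
                    weights[j][d] += lr * h * (x[d] - weights[j][d])
            step += 1
    return weights


if __name__ == "__main__":
    sample = [(i / 10.0,) for i in range(11)]
    print(train_som(sample, n_units=5, epochs=20))
```

The neighbourhood term is what distinguishes the SOM from k-means: when the radius is very small only the winning unit moves, and the update behaves much like an online version of k-means.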

Additional reading

1.      Mitchell, T., (1997) Machine Learning, McGraw Hill.

2.      Hand, D. J., Mannila, H., Smyth, P. (2001) Principles of Data Mining (Adaptive Computation and Machine Learning), MIT Press.

 

SELF-TEST

 

Project 1:

 

Project 1 deals with the application of the concepts related to unsupervised classification presented in Part 2. The project consists of two exercises in which the student uses the tools addressed in Part 2 (the k-means algorithm and the self-organizing map) to classify data from satellite images. The first exercise is organized in a tutorial fashion and the student just has to follow the steps to achieve the desired result. The second exercise has no instructions and is meant to evaluate the level of understanding and autonomy of the student.

 

1.    Datasets – Exercise 1 (Lisbon shapefile), Exercise 2

2.    Software - SOM_PAK, a very good software package that is efficient and capable of processing very large datasets. It has no graphical interface, and all interaction takes place through the DOS command line. This may be frightening for some of you, but let me assure you that after the first shock (and some experiments) it is relatively easy to use. The manual is available here; you should read it in order to complete the proposed exercises.

3.    Instructions - Step by step tutorial, somlx2.bat

 

 

Part 3:

In Part 3 the objective is to study supervised classification methods. The student is introduced to the typical process involved in supervised classification tasks. Emphasis is given to two different types of tools: classification trees and feed-forward neural networks with backpropagation.

An overview of the purpose and application of supervised classification is presented, pointing out underlying assumptions, advantages, and possible pitfalls of the process. A review of the principles of statistical classification, namely Bayesian classifiers and related techniques such as naive Bayes, is also presented.

Classification trees are studied in some detail. Classification trees, induced from data, provide very powerful insights into the reasons why data is classified into a certain class. A detailed explanation of various induction processes is presented, covering aspects such as different decision criteria for splitting nodes and pruning techniques.
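As an illustration of what a “decision criterion for splitting nodes” looks like, the sketch below uses Gini impurity to choose a threshold on a single numeric attribute. Gini impurity is only one of several possible criteria (entropy-based information gain is another), and we do not claim it is the one used by the course software. The function names and toy data are our own.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Best binary split 'value <= threshold' on one numeric attribute.

    Returns (threshold, weighted_impurity). The threshold is None when
    no split improves on the impurity of the unsplit node."""
    n = len(values)
    best = (None, gini(labels))
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


if __name__ == "__main__":
    values = [1, 2, 3, 10, 11, 12]
    classes = ["a", "a", "a", "b", "b", "b"]
    print(best_split(values, classes))
```

A tree induction algorithm applies this search to every attribute at every node and recurses on the two resulting subsets; pruning then removes splits that do not generalise.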

The third chapter of this part deals with the use of neural networks for supervised classification. The classical perceptron and multilayer perceptron networks are studied and different training algorithms, including error backpropagation, are presented. Issues such as the type of activation function and network size and topology are discussed. Other neural networks, such as Radial Basis Function (RBF) networks and Learning Vector Quantization (LVQ), are also presented, although in less detail.
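The following Python sketch shows error backpropagation for a very small multilayer perceptron. It rests on our own assumptions: one hidden layer, sigmoid activations, a single output and a squared-error loss. It is meant as a reading aid, not as the implementation used in NNPred.xls or any other package, and all names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_in, n_hidden, seed=0):
    """Random weights; the last element of each weight list is the bias."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return w_hidden, w_out


def forward(net, x):
    """Return the hidden activations and the network output for input x."""
    w_hidden, w_out = net
    h = [sigmoid(sum(w * v for w, v in zip(ws, x)) + ws[-1]) for ws in w_hidden]
    y = sigmoid(sum(w * v for w, v in zip(w_out, h)) + w_out[-1])
    return h, y


def loss(net, x, target):
    return 0.5 * (forward(net, x)[1] - target) ** 2


def gradients(net, x, target):
    """Backpropagation: gradients of the loss with respect to all weights."""
    w_hidden, w_out = net
    h, y = forward(net, x)
    # Error signal at the output unit (derivative of sigmoid is y * (1 - y)).
    delta_out = (y - target) * y * (1.0 - y)
    grad_out = [delta_out * hj for hj in h] + [delta_out]
    grad_hidden = []
    for j, hj in enumerate(h):
        # The output error is propagated back through weight w_out[j].
        delta_h = delta_out * w_out[j] * hj * (1.0 - hj)
        grad_hidden.append([delta_h * xi for xi in x] + [delta_h])
    return grad_hidden, grad_out


def train(net, data, epochs=1000, lr=0.5):
    """Online gradient descent over (input, target) pairs, in place."""
    w_hidden, w_out = net
    for _ in range(epochs):
        for x, target in data:
            grad_hidden, grad_out = gradients(net, x, target)
            for j in range(len(w_hidden)):
                for i in range(len(w_hidden[j])):
                    w_hidden[j][i] -= lr * grad_hidden[j][i]
            for j in range(len(w_out)):
                w_out[j] -= lr * grad_out[j]
    return net
```

Changing the activation function or the number of hidden units only affects `forward` and the two derivative terms in `gradients`, which is why these choices are discussed together with the training algorithm.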

Reading List:

1.      Introduction to Supervised Classification, from Fernando Bação and Victor Lobo

2.      Classification Trees, from Statsoft Electronic Book

3.      Neural Networks, from Statsoft Electronic Book

4.      Additional topics on the use of Classification Trees, from Fernando Bação and Victor Lobo

Additional reading

1.      Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford: Oxford University Press.

2.      Haykin, S. (1998) Neural Networks - A Comprehensive Foundation. Prentice Hall, 2nd edition.

3.      Fausett, L. (1994). Fundamentals of Neural Networks. New York: Prentice Hall.

 

SELF-TEST

 

Project 2

 

Project 2 deals with the application of the concepts related to supervised classification presented in Part 3 of this course. The project consists of two exercises in which the student uses the tools addressed in Part 3 to build predictive models.

 

1.    Datasets – Exercise 1, Exercise 2

2.    Software:

·       CTree.xls – an Excel based program, from Angshuman Saha, which builds decision trees. This software package was developed essentially as a learning aid and the “performance is not too bad”. It is easy to use and it only requires the user to have Excel.

·       NNPred.xls (Neural Network Model for Prediction) – another Excel based program, from Angshuman Saha, which builds neural networks. In his own words “(it) is a very basic implementation of FeedForward - BackPropagation Neural Network, used for prediction and classification problems.”

3.    Instructions - Project 2

 

        Style

·         Problem-oriented approach with active knowledge acquisition

·         Theory and practical project

·         Asynchronous part: self study based on online materials, self-tests at the end of each unit, projects.

·         Synchronous part: discussion of problems and tasks in 3 synchronous sessions

·         Access to teacher via E-Mail

·         Students' interaction via forum

·         One exam at the end of the course

·         Student workload: 90 h (3x30 h), equivalent to 3 credit points (ECTS)

 

        Participants

·         Students from:

1.      International Institute for Geo-Information Science and Earth Observation (Enschede, The Netherlands);

2.      Institute for Geoinformatics - University of Münster (Germany);

3.      College of Geoinformatics - University of West Hungary.

·         Knowledge in geoinformatics and basic statistics is highly recommended

 

        Organization

·         Start and end: Oct 16 - Feb 10, 2006

·         Synchronous sessions:

1.      November 24, 17.00 – 18.00hrs Portuguese time

2.      December 14, 17.00 – 18.00hrs Portuguese time

3.      January 12, 17.00 – 18.00hrs Portuguese time

·         Max. number of participants: 20

·         Student online activity will be tracked by the platform; completion of self-tests and online questions will be used to assess progress through the course

 

        Successful participation

·         Grades range from 0 to 20; to pass you need at least 10

·         Complete the proposed projects

·         Send in project 1 (date to be announced)

·         Send in project 2 (date to be announced)

·         Attend synchronous sessions

·         Pass exam