|
Syllabus “Geospatial Data Mining” (E-Learning)

English e-learning course, corresponding to the contents of a course held at the

Version 5
Date: 05.09.2007

This document is the roadmap for your study of “Geospatial Data Mining”. Everything you need to know is here. It will be updated frequently as the course advances, so please check this page regularly for news.

Goals
Students who complete this course should be able to:
· Define Data Mining.
· Explain the characteristic features of Data Mining.
· Explain why Data Mining can be a valuable addition in the context of GIScience.
· Analyse the implications of the geo prefix in Geographic Data Mining.
· Understand the basic data preparation and pre-processing tasks.
· Understand what a Self-Organizing Map is and how it works.
· Use Self-Organizing Maps in unsupervised classification tasks.
· Understand what a Multi-layer Perceptron is and how the backpropagation training algorithm works.
· Understand what a Classification Tree is and how it works.
· Use Classification Trees and Multi-Layer Perceptron Neural Networks in supervised classification tasks.

Content
Part 1:

The first part of the course provides the basic concepts of data mining and knowledge discovery. The student is introduced to the different perspectives from which data mining can be viewed. Emphasis is placed on geospatial (or geographical) data mining and on what the geo prefix implies. Perspectives from different authors are presented so that students can produce their own synthesis.

Reading List:
1. Definition of data mining, “Knowledge Discovery and Data Mining: towards a unifying framework”, Fayyad, Piatetsky-Shapiro and Smyth.
2. Data Mining: Statistics and More?, by David J. Hand
3. Is inductive machine learning just another wild goose (or might it lay the golden egg)? – the perspective from Mark Gahegan
4. Geographical data mining: key design issues – the perspective from Stan Openshaw
5. Geospatial data mining – my own biased perspective.
6. Is spatial data special?, “On the Particular Characteristics of Spatial Data and its Similarities to Secondary Data Used in Data Mining”, Bação, F., Lobo, V., Painho, M., GIS PLANET 2005.

SELF-TEST

Part 2:

In the 2nd part we
deal with the concepts of unsupervised classification. The student starts by reading a document about unsupervised classification, followed by a short document on the fundamentals of data preparation and pre-processing. Next, two different tools are presented: the k-means algorithm and the self-organizing map. The part closes with a discussion of profiling and of the tools available to explore the unsupervised classification results.

The clustering/exploratory tools presented in this part are an important toolset in the context of GIScience. In many circumstances the researcher just wants to improve their knowledge of the problem at hand, with few, if any, assumptions. In these cases clustering/exploratory tools are an important resource for reducing the data and improving knowledge about the phenomena. K-means clustering is a well-known and heavily used tool with a long history within the human and social sciences. Its simplicity and efficiency have contributed to its long-standing popularity, and it remains one of the most widely used clustering tools. The Self-Organizing Map (SOM) is a more recent tool: a neural network which can be seen as a “visualization and analysis tool for high dimensional data”. Depending on how the SOM is used, it may overlap with k-means. Nevertheless, the range of applications of the SOM is substantially larger, enabling the user to apply it in different situations.

The fundamental objective here
is to produce “informed users” who are able not only to understand the underpinnings of the algorithms, but also to use them, and all accompanying tools, in a useful and sound fashion.

Reading List:
1. Introduction to Unsupervised Classification, from
2. The fundamentals of clustering, from Statsoft Electronic Book
3. Fundamentals of data preparation and pre-processing, from
4. The k-means algorithm, from Statsoft Electronic Book
5. The self-organizing map (SOM), by
6. Samuel Kaski and Teuvo Kohonen, “Exploratory data analysis by the self-organizing map: Structures of welfare and poverty in the world.” In Apostolos-Paul N. Refenes, Yaser Abu-Mostafa, John Moody, and Andreas Weigend, editors, Neural Networks in Financial Engineering, pages 498-507. World Scientific,
7. Result interpretation and profiling, by

Demos and Applets:
· SOFM – My favourite demo on the workings of a SOM. Every time I need to explain the SOM, this always seems the easiest way to do it.
· Interactive Self-Organizing Map demonstrations – two applets from the HUT people in
· Our own demos, developed by Roberto Henriques, one of our whiz programmers. For now this is only a movie; in the future we will develop an applet for the internet. The software was developed for geographic applications, but it is also a great tool for understanding the basics of the SOM.

Additional reading
1. Mitchell, T. (1997) Machine Learning, McGraw Hill.
2. Hand, D. J., Mannila, H., Smyth, P. (2001) Principles of Data Mining (Adaptive Computation and Machine Learning), MIT Press.

SELF-TEST

Project 1:

Project 1 deals
with the application of the concepts of unsupervised classification presented in Part 2. The project consists of two exercises in which the student uses the tools addressed in Part 2 (the k-means algorithm and the self-organizing map) to classify data from satellite images. The first exercise is organized as a tutorial: the student just has to follow the steps to achieve the desired result. The second exercise has no instructions and is meant to evaluate the student's level of understanding and autonomy.

1. Datasets – Exercise 1 (Lisbon shapefile), Exercise 2
2. Software - SOM_PAK, a very good software package, very efficient and capable of processing very large datasets. It doesn’t have a graphical interface, and all interaction is done through the DOS command line. This may be frightening for some of you, but let me assure you that after the first shock (and some experiments) it is relatively easy to use. The manual is available here; you should read it in order to be able to complete the proposed exercises.
3. Instructions - Step by step tutorial, somlx2.bat

Part 3:

In Part 3 the objective is
to study supervised classification methods. The student is introduced to the typical process involved in supervised classification tasks. Emphasis is given to two different types of tools: classification trees and feed-forward neural networks with backpropagation. An overview of the purpose and application of supervised classification is presented, pointing out the underlying assumptions, advantages, and possible pitfalls of the process. A review of the principles of statistical classification, namely Bayesian classifiers and related techniques such as naive Bayes, is also presented.

Classification trees are studied in some detail. Classification trees, induced from data, provide very powerful insights into the reasons why data is classified into a certain class. A detailed explanation of various induction processes is presented, covering aspects such as different decision criteria for splitting nodes and pruning techniques.

The third chapter of this part
deals with the use of neural networks for supervised classification. The classical perceptron and multilayer perceptron networks are studied, and different training algorithms, including error backpropagation, are presented. Issues such as the type of activation function and network size and topology are discussed. Other neural networks, such as Radial Basis Function (RBF) networks and Learning Vector Quantization (LVQ), are also presented, although in less detail.

Reading List:
1. Introduction to Supervised Classification, from
2. Classification Trees, from Statsoft Electronic Book
3. Neural Networks, from Statsoft Electronic Book
4. Additional topics on the use of Classification Trees, from

Additional reading
1. Bishop, C. (1995). Neural Networks for Pattern Recognition.
2. Haykin, S. (1998) Neural Networks - A Comprehensive Foundation. Prentice Hall, 2nd edition.
3. Fausett, L. (1994). Fundamentals of Neural Networks.

SELF-TEST

Project 2

Project 2 will deal
with the application of the concepts of supervised classification presented in Part 3 of this course. The project consists of two exercises in which the student uses the tools addressed in Part 3 to build predictive models.

1. Datasets – Exercise 1, Exercise 2
2. Software:
   · CTree.xls – an Excel-based program, from Angshuman Saha, which builds decision trees. This software package was developed essentially as a learning aid and the “performance is not too bad”. It is easy to use and only requires the user to have Excel.
   · NNPred.xls (Neural Network Model for Prediction) – another Excel-based program, from Angshuman Saha, which builds neural networks. In his own words “(it) is a very basic implementation of FeedForward - BackPropagation Neural Network, used for prediction and classification problems.”
3. Instructions - Project 2

Style
· Problem-oriented approach with active knowledge acquisition
· Theory and practical project
· Asynchronous part: self study based on online materials, self-tests at the end of each unit, projects.
· Synchronous part: discussion of problems and tasks in 3 synchronous sessions
· Access to teacher via E-Mail
· Students' interaction via forum
· One exam at the end of the course
· Student workload: 90 h (2x30 h), equivalent to 3 credit points (ECTS)

Participants
· Students from:
   1. International Institute for Geo-Information Science and Earth Observation (Enschede, The
   2. Institute for Geoinformatic -
   3.
· Knowledge in geoinformatics and basic statistics is highly recommended

Organization
· Start and end: Oct 16 - Feb 10, 2006
· Synchronous sessions:
   1. November 24, 17.00 – 18.00hrs Portuguese time
   2. December 14, 17.00 – 18.00hrs Portuguese time
   3. January 12, 17.00 – 18.00hrs Portuguese time
· Max. number of participants: 20
· Student online activity will be tracked by the platform; completion of self-tests and online questions will be used to assess the progress of the course

Successful participation
· Grades are between 0 and 20; to pass you need at least 10
· Complete the proposed projects
· Send in project 1 (date to be announced)
· Send in project 2 (date to be announced)
· Attend synchronous sessions
· Pass exam |