
COMP5331: Knowledge Discovery in Databases

Course Details

Instructor:
   
Prof. Raymond Chi-Wing Wong
   Office Hours: TBA

Time:
Monday and Wednesday (10:30am-11:50am)

Venue: Rm 2504 (LT 25/26)

Area: DB or AI (This course can count towards one of the areas only and cannot be double counted towards the required credits).

TA:
   Tianwen CHEN
   Email: tchenaj <AT> connect.ust.hk
   Office Hours: TBA

Course Description

Data mining has emerged as a major frontier field of study in recent years. Aimed at extracting useful and interesting patterns and knowledge from large data repositories such as databases and the Web, the field of data mining integrates techniques from databases, statistics and artificial intelligence. This course provides a broad overview of the field and prepares students to conduct research in it.

Topics

  1. Association
  2. Clustering
  3. Classification
  4. Data Warehouse
  5. Data Mining over Data Streams
  6. Web Databases

Reference Book/Materials

  • Papers
  • Data Mining: Concepts and Techniques. Jiawei Han, Micheline Kamber and Jian Pei. Morgan Kaufmann Publishers (3rd edition)
  • Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Boston: Pearson Addison Wesley (2006)

Grading Scheme

  • Assignment 30%
  • Project 30%
  • Final Exam 40%

Homework

NOTE: No late submissions are allowed.

  • HW1 (pdf) Solution (pdf)
  • HW2 (pdf) Solution (pdf)
  • HW3 (pdf) Solution (pdf)
    • If you submit a hard copy, hand it in at Raymond WONG's office (Rm 3541 (LT 25/26)) between 10:20am and 10:30am on 27 Nov, 2019.
    • If you submit a soft copy (DOC/PDF), submit it via CASS (log in with your CSE account).


      CSE Account Application Procedure (for non-CSE students)

      You need a CSE account (in addition to your current ITSC account).
      This account will be used to submit your HW3 via CASS, the file submission system developed by CSE.


      The following instructions show how you can apply for a CSE account.

      - Please go to this link to obtain a CSE account.
      (This link is accessible only from within the UST campus network. If you are currently off campus, you can reach it through the UST VPN or the UST Virtual Barn.)


Lecture Notes


No. Topic References
1 Overview (ppt) Data Mining: Concepts and Techniques. Jiawei Han, Micheline Kamber and Jian Pei. Morgan Kaufmann Publishers (3rd edition)


Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Boston: Pearson Addison Wesley (2006)

2 Association (ppt) R. Agrawal, R. Srikant, "Fast Algorithms for Mining Association Rules", VLDB 1994 (pdf)
3 FP-Tree (ppt) J. Han, J. Pei, Y. Yin, "Mining Frequent Patterns without Candidate Generation", SIGMOD 2000 (pdf)
4 Clustering (ppt) Data Mining: Concepts and Techniques. Jiawei Han, Micheline Kamber and Jian Pei. Morgan Kaufmann Publishers (3rd edition)


Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Boston: Pearson Addison Wesley (2006)

5 Other Clustering Techniques (ppt) A. P. Dempster, N. M. Laird, D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, Vol. 39, No. 1, 1977 (pdf)

M. Ester, H.-P. Kriegel, J. Sander, X. Xu, "A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", SIGKDD 1996 (pdf)

T. Zhang, R. Ramakrishnan, M. Livny, "BIRCH: An efficient data clustering method for very large databases", SIGMOD 1996 (pdf)
6 Outlier (ppt) Data Mining: Concepts and Techniques. Jiawei Han, Micheline Kamber and Jian Pei. Morgan Kaufmann Publishers (3rd edition)

M. M. Breunig, H.-P. Kriegel, R. T. Ng, J. Sander, "LOF: Identifying Density-Based Local Outliers", SIGMOD 2000 (pdf)

7 Subspace Clustering (ppt) K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft, "When is Nearest Neighbor Meaningful?", ICDT 1999 (pdf)

R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan, "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications", SIGMOD 1998 (pdf)

C.-H. Cheng, A. W.-C. Fu and Y. Zhang, "Entropy-based Subspace Clustering for Mining Numerical Data", SIGKDD 1999  (pdf)
8 Classification (ppt) Data Mining: Concepts and Techniques. Jiawei Han, Micheline Kamber and Jian Pei. Morgan Kaufmann Publishers (3rd edition)


Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Boston: Pearson Addison Wesley (2006)

9 Other Classification Model 1: Support Vector Machine (ppt) Data Mining: Concepts and Techniques. Jiawei Han, Micheline Kamber and Jian Pei. Morgan Kaufmann Publishers (3rd edition)
10 Other Classification Model 2: Neural Network (ppt) Data Mining: Concepts and Techniques. Jiawei Han, Micheline Kamber and Jian Pei. Morgan Kaufmann Publishers (3rd edition)
11 Other Classification Model 3: Recurrent Neural Network (pptx) S. Hochreiter, J. Schmidhuber, "Long Short-Term Memory", Neural Computation, 9(8), 1735–1780, 1997 (pdf)

K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation", arXiv 2014 (pdf)
12 Data Warehouse (ppt) V. Harinarayan, A. Rajaraman, J. Ullman, "Implementing Data Cubes Efficiently", SIGMOD 1996 (pdf)
13 Data Mining over Data Streams (ppt) G. S. Manku, R. Motwani, "Approximate Frequency Counts over Data Streams", VLDB 2002 (pdf)

A. Metwally, D. Agrawal, A. El Abbadi, "Efficient Computation of Frequent and Top-k Elements in Data Streams", ICDT 2005 (pdf)
14 Other Data Stream Models (ppt) P. Domingos and G. Hulten, "Mining High-Speed Data Streams", SIGKDD 2000 (pdf)
15 Web DB (ppt) J. M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM, 46:5, Sept. 1999, pp 604-632 (pdf)

L. Page, S. Brin, R. Motwani, T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", Manuscript, 1998 (pdf)
16 Multi-Criteria Decision Making (ppt) D. Papadias, Y. Tao, G. Fu, B. Seeger, "Progressive Skyline Computation in Database Systems", ACM Transactions on Database Systems (TODS), 30(1), 41-82, 2005 (pdf)
17 Advanced Topic (ppt) -

 

Exam

  • Online Final Exam

    Date: 13 Dec, 2019 (Fri) (HK Time)
    Time: 4:30pm-6:30pm (HK Time)
    Venue: A quiet place near you with good internet access (Original Exam Venue: LT L (CYT Building))

    Submission Site: Canvas ("Courses" --> "COMP5331 (L1) ..." --> "Assignments" --> "Final Exam")
    • Some of you may not be able to access Canvas directly from your normal desktop (e.g., when connecting from a country with a firewall). In this case, please follow the university guideline under the title "Accessing Canvas via a virtual desktop" at this link.

    Details of Online Final Exam:
    1. Exam Paper Delivery
      1. The instructor will send an email between 4:25pm and 4:30pm on 13 Dec (Fri) (HK Time) to all of you.
      2. This email contains the exam paper (PDF).

    2. Exam Paper Writing
      1. You should complete the exam paper within the exam period.
      2. This exam is held in a "trusted" environment: you should do the exam paper by yourself.
        Please do not discuss or communicate with other people while you are doing the exam.
        (Honestly, we cannot prevent you from communicating with others in this "online" setting.
        Honesty is an attitude that you should have as a university student.)
      3. You can write on a sheet of paper
        or type your answers electronically (in any form).

        Note: If you need to write some symbols for some questions, writing on a sheet of paper is faster.

      4. The final submission of your exam paper is a PDF file.

        If you write on a sheet of paper, you have to scan it (or take a picture) to generate a PDF file for submission.

        If you type it electronically (in a format such as DOC), you should convert it to PDF.

      5. If you have any questions about the exam paper during the exam period, please send an email to me. I will be ready to reply.

      6. During the exam period, please check for emails from me, in case there are clarifications about the questions in the exam paper.

    3. Exam Paper Submission
      1. Before the exam ending time (i.e., 6:30pm), please generate a PDF file and submit it.
        We allow a 10-minute buffer for submission (i.e., you can submit within 10 minutes after the exam ending time) since you may need time to generate the PDF file and the network may be slow.

      2. The submission site is Canvas.

        Canvas has a feature to detect plagiarism. This is the reason why we are using Canvas.

    Details of Trial Final Exam:
    1. Since this may be your first "online" exam and your first time using the Canvas system in our course, I will hold a "trial" final exam session (a short version of the real exam) as follows.

      Trial Exam Date: 12 Dec, 2019 (Thu) (HK time) (i.e., one day before the real exam)
      Trial Exam Time: 4:30pm-5:00pm (HK time)
      Venue: A quiet place near you with good internet access

    2. All procedures in this "trial" final exam are the same as those in the "real" final exam (only the duration is shorter).
    3. This "trial" final exam is only for you to "experience" the whole "online" exam format.
    4. No scores are counted for this "trial" final exam.
      Thus, it is "optional". However, you are encouraged to take it so that you can work through the procedures quickly in the "real" online exam.
    5. If you have other tasks (e.g., other exams) during this "trial" final exam period, it does not matter.
      The submission site will accept submissions after the trial exam ending time (to accommodate students who are not free during the trial exam time).

Project

The details of the course project can be found in this link
