RMBI4310/COMP4332 Big Data Mining (Spring 2017)

Course Info

This is a project-oriented course that exposes students to the practical issues of large-scale, real-world data mining. Data mining is the process of extracting implicit, previously unknown, and potentially useful knowledge from data, and it is a critical task in many applications. The course emphasizes applications of data mining in areas such as business intelligence, which aims to uncover facts and patterns in large volumes of data for decision support. Other application areas include science and engineering. The course builds on the basic knowledge gained in the introductory data-mining course and explores how to mine large volumes of real-world data more effectively. It introduces new algorithms that can more effectively find hidden and profitable patterns and knowledge in data. Working on real-world data sets, students will experience all steps of a data-mining project, beginning with problem definition and data selection, and continuing through data exploration, data transformation, sampling, partitioning, modeling, and assessment.

Course Learning Outcomes

Lectures:  WeFr 03:00PM - 04:20PM

Venue: 2464

Tutorial   Date/Time              Venue   TA
T1         Mo 03:00PM - 03:50PM   2465    TBA

Textbook:

· Introduction to Data Mining with Case Studies, Second Edition, G.K. Gupta. PHI Learning, 2011. [DM-CS]

· Luna Dong, Divesh Srivastava: Big Data Integration. Morgan & Claypool, 2015. [BI]

· Bing Liu, Web Data Mining, Springer, 2011. [BI]

Evaluation: Midterm Exam (30%)

                     Final Exam (30%)

                     Project: Report+Code (30%), Demo+Presentation (10%)

Midterm

      Midterm Score

 

PROJECT

Project Group Information

Project Presentation Signup Sheet

 

Project Topics Examples

1. Music Portal

Today no single company holds the copyright to every song, so users cannot rely on one music player to listen to all the music they want. This portal combines the song and singer listings from popular music websites so that users can check which music player offers the song they want to hear. In addition, users can find frequently played songs, the most popular singers, and the most popular songs.
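As a rough illustration of the idea, the portal's core lookup could be sketched as below. This is only a minimal sketch under assumptions: the record layout (song, singer, site, play count), the sample data, and the function names are all hypothetical and not part of the course materials.

```python
from collections import Counter, defaultdict

# Hypothetical crawled records: (song, singer, site, play_count).
RECORDS = [
    ("Song A", "Singer X", "SiteOne", 120),
    ("Song A", "Singer X", "SiteTwo", 80),
    ("Song B", "Singer Y", "SiteTwo", 300),
    ("Song C", "Singer X", "SiteOne", 50),
]


def build_availability(records):
    """Map each (song, singer) pair to the set of sites that carry it."""
    availability = defaultdict(set)
    for song, singer, site, _plays in records:
        availability[(song, singer)].add(site)
    return dict(availability)


def top_songs(records, n=3):
    """Rank songs by play count summed over all sites."""
    plays = Counter()
    for song, singer, _site, count in records:
        plays[(song, singer)] += count
    return plays.most_common(n)


def top_singers(records, n=3):
    """Rank singers by play count summed over all their songs and sites."""
    plays = Counter()
    for _song, singer, _site, count in records:
        plays[singer] += count
    return plays.most_common(n)
```

A user query such as "where can I listen to Song A?" then becomes a dictionary lookup in the result of `build_availability`, while the popularity lists come from `top_songs` and `top_singers`.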

2. Movie Portal

As movies become an increasingly popular form of entertainment for people of all ages, people are concerned about how to pick out a good one from a sea of choices. Because advertisements are often confusing, people may prefer to look up details about movies and compare them.

The movie data portal lets people search basic information about movies, including name, cast, director, country, language, duration, release date, and genre. More importantly, the portal will integrate users’ comments (ratings) with box office trends, which helps users judge whether positive (or negative) comments come from real users or from paid posters. With abundant data, the system could even classify whether a movie has hired posters (e.g. a movie with a low box office but a large total number of comments, or a large number of comments submitted within a short period). This is meaningful for people who check a movie’s ratings and comments before deciding what to watch.
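The two signals mentioned above (many comments relative to box office, or many comments in a short period) could be turned into a simple rule-based check. The following is a minimal sketch, not a tuned classifier: the `Movie` fields, the function names, and every threshold value are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Movie:
    title: str
    box_office: float              # total gross, in any fixed currency unit
    comment_times: List[datetime]  # submission time of each comment


def max_comments_in_window(times, window):
    """Largest number of comments that fall inside any span of length `window`."""
    times = sorted(times)
    best = 0
    start = 0
    for end in range(len(times)):
        # Shrink the window from the left until it fits within `window`.
        while times[end] - times[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


def looks_suspicious(movie, max_comments_per_unit=0.01, burst_share=0.5,
                     window=timedelta(hours=1), min_comments=20):
    """Flag a movie if its comments are heavy for its gross or arrive in a burst.

    All thresholds are hypothetical defaults, not values from the course.
    """
    n = len(movie.comment_times)
    if n < min_comments:
        return False  # too little data to judge
    heavy_for_gross = n > max_comments_per_unit * movie.box_office
    bursty = max_comments_in_window(movie.comment_times, window) >= burst_share * n
    return heavy_for_gross or bursty
```

In a real project the thresholds would have to be learned or calibrated from labeled data; the sliding window is used here because it finds the densest burst in one pass over the sorted timestamps.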

3. Real Estate Property Portal

The Real Estate Portal is a data portal designed to give users integrated information about real estate in Hong Kong. It will crawl data from various sources on the internet and store it in one large database, from which the portal can report the best price for a given property in a given location and the agency the user should approach to buy or rent it. Based on the collected data, users can also find the most popular properties or the price trend of a property in recent years.
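Once the listings are in one place, the "best price" and "price trend" queries reduce to simple aggregations. The sketch below assumes a hypothetical listing layout (estate, agency, year, price in thousands of HKD) with made-up sample data; a real portal would query its database instead of an in-memory list.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical integrated listings: (estate, agency, year, price in HKD thousands).
LISTINGS = [
    ("Estate A", "Agency 1", 2015, 5000),
    ("Estate A", "Agency 2", 2015, 5400),
    ("Estate A", "Agency 1", 2016, 5600),
    ("Estate A", "Agency 2", 2016, 5200),
    ("Estate B", "Agency 1", 2016, 8000),
]


def cheapest_offer(listings, estate, year):
    """Return (price, agency) of the lowest listing for an estate in a year, or None."""
    offers = [(price, agency)
              for e, agency, y, price in listings
              if e == estate and y == year]
    return min(offers) if offers else None


def yearly_average(listings, estate):
    """Average listed price per year for one estate, ordered by year."""
    by_year = defaultdict(list)
    for e, _agency, year, price in listings:
        if e == estate:
            by_year[year].append(price)
    return {year: mean(prices) for year, prices in sorted(by_year.items())}
```

`cheapest_offer` answers which agency to approach, and the dictionary returned by `yearly_average` can be plotted directly as a price trend.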

Schedule

Week

Lecture

WF 03:00PM - 04:20PM

 

Supplementary Materials

Tutorial

 Mo 03:00PM - 03:50PM

#1
Feb 3

Introduction

Overview of Big Data (PPT, PDF)

 

No tutorials in the first week!

#2
Feb 8

Web Crawling (PPT, PDF), Data Extraction (PPT)

Document Representation (PPT)

Bayesian Networks (PPT)

Dynamic Programming (PDF) (thanks to Prof. Yi Ke for the notes)

Regular Expression and Finite Automata (PPT)

#3
Feb 20

Data Integration and Schema Mapping (PPT, PDF)

Schema Matching Survey (PPT, PDF)

Tutorial 1

#4
Feb 22

 

Tutorial 2, bsTest.py

#5
Mar 1

Entity Resolution

Functional Dependencies

MinHashing

Tutorial 3, Stockcrawler.py

Stockcrawler_bs.py

Stock_withbackup.py

Stock_withoutbackup.py

Mysqltest.py

#6
Mar 8

Record Linkage for Big Data

 

Tutorial 4

#7
Mar 15


Data Fusion

Truth Discovery

Tutorial 5

#8
Mar 22

Text Mining (I)

Midterm Review

EM algorithm

Tutorial 6

Longestcommonsubstring.py

#9
Mar 29

Midterm Exam, Text Mining (II)

 

Tutorial 7

#10
Apr 5

Introduction to NLP and Final Review

Project Presentations (6 groups)

 

No tutorial, Questions/Answers

#11
Apr 12

Mid-Term Break (no class)

 

No tutorial, Questions/Answers

#12
Apr 19

Project Presentations (6 groups)

 

No tutorial, Questions/Answers

#13
Apr 26

Project Presentations (6 groups)

 

No tutorial, Questions/Answers

#14
May 5

Project Presentations (6 groups)

 

 

 

Presentation Bonus Marks

1. Participation and Marking Bonus: all students are strongly encouraged to attend the paper presentation sessions and to give marks to the presenter (the score sheet will be distributed at the beginning of the session). I will give you a bonus of 0.25 marks for each completed score sheet.

2. Question Bonus: all students are encouraged to ask questions during the Question/Answer session after each presentation. Each student may ask one question per paper presentation. For each question asked, I will give you a bonus of 0.5 marks. You can staple the bonus coupon, which we will give you, to your completed score sheet.

Please note that questions such as “Can you explain more?” or “I cannot understand, can you repeat?” will not be counted. At most 3 questions may be asked in the Q/A session of each paper presentation.

Policy

Instructor

Lei Chen (Web)
Room: 3509 (Lift 25/26)
Office hours: By appointment

Announcements