The 2nd HKUST-USC Joint Workshop on Big Data Applications

Recently, owing to advances in pervasive sensing and computing, heterogeneous data are being generated at unprecedented scale and complexity. Data originate from many different domains, such as sensor networks, medical images, scientific measurements, financial transactions, web interactions and social media. These data are everywhere and accumulate over time in different types of representations. Thus, the “Big Data” concept was introduced to describe these enormous data. So far, there is no clear definition of Big Data; according to Wikipedia, “Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics and visualizing.” Clearly, managing and analyzing Big Data pose many challenges. In this workshop, we mainly focus on discussing some killer applications (opportunities) brought by Big Data, such as Geo-Crowdsourcing, Product Recommendation, and Human Motion Tracking. The workshop will feature six talks and a panel discussion.

Venue: IAS Lecture Theater, HKUST Jockey Club Institute for Advanced Study, Lee Shau Kee Campus, HKUST

Time: 9:00am to 6:00pm

For directions to the IAS building at HKUST, check here.

Program:

Philip S. Yu

Jian Pei

Huan Liu

Irwin King

R. Kotagiri

Jie Tang

Cyrus Shahabi

Topic 1:
Title: On Mining Big Data and Social Network Analysis
Abstract: The problem of big data has become increasingly important in recent years. On the one hand, big data is an asset that can potentially offer tremendous value or reward to the data owner. On the other hand, it poses tremendous challenges to distil the value out of the big data. The very nature of big data poses challenges not only due to its volume and the velocity at which it is generated, but also due to its variety and veracity. Here variety means that the data collected from various sources can have different formats, from structured data to text to network/graph data to images. Veracity concerns the trustworthiness of the data, as the various data sources can have different reliability. The challenge is thus how to fuse the information from different sources with different formats and veracities together. One of the most critical big data applications is mining social networks. As social networks become increasingly popular, not only does the scale of the networks grow rapidly, with Facebook having more than 1 billion active users, but the complexity of the networks also increases over time. The vast scale of a network implies non-uniformity or heterogeneity, i.e., distinct behavior over different parts of the network, while the evolving network characteristics imply dynamic and non-stationary behavior of the network. In this talk, we will discuss these big data issues and approaches to address them, using social networks as the example.
Bio: Dr. Philip S. Yu is a Distinguished Professor and the Wexler Chair in Information Technology at the Department of Computer Science, University of Illinois at Chicago. Before joining UIC, he was at the IBM Watson Research Center, where he built a world-renowned data mining and database department. He is a Fellow of ACM and IEEE. Dr. Yu is the recipient of the IEEE Computer Society’s 2013 Technical Achievement Award for “pioneering and fundamentally innovative contributions to the scalable indexing, querying, searching, mining and anonymization of big data”. With more than 850 publications and 300 patents, cited more than 56,000 times with an H-index of 114, Dr. Yu is a leader in the data mining and data management community. Dr. Yu is the Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data. He is on the steering committees of the IEEE Conference on Data Mining and the ACM Conference on Information and Knowledge Management, and was a member of the IEEE Data Engineering steering committee. He was the Editor-in-Chief of IEEE Transactions on Knowledge and Data Engineering (2001-2004). He received a Research Contributions Award from the IEEE Intl. Conference on Data Mining (ICDM) in 2003, the ICDM 2013 10-year Highest-Impact Paper Award, and the EDBT Test of Time Award (2014). Dr. Yu received his PhD from Stanford University.
Topic 2:
Title: Finding Outstanding Aspects and Contrast Subspaces
Abstract: In our recent endeavor in computational health informatics/intelligence, we face a series of interesting problems of finding outstanding aspects that distinguish a specific object from its peers. In this talk, I will introduce the unsupervised and supervised versions. Specifically, given a set of objects and a query object, all in a multidimensional space, the unsupervised version finds the minimal subspaces where the query object is most outlying against the other objects in the set. The supervised version assumes class labels (either positive or negative) for the objects in the set and the query object, and finds the minimal subspaces where the query object is most dissimilar to the other objects in the same class and similar to those in the other class. I will discuss the subtle differences between this group of problems and the traditional outlier detection problem. Furthermore, I will review our preliminary progress and discuss the challenges that remain open.
Bio: Jian Pei is Canada Research Chair (Tier 1) in Big Data Science and a Professor of Computing Science at Simon Fraser University. He is a veteran of data mining research and his work has been embraced by industry and government. Since 2000, his research has focused on developing effective and efficient ways to analyze, and capitalize on, the vast stores of data housed in applications such as social networks, network security informatics, healthcare informatics, business intelligence, and web searches. A prolific and widely cited author, Professor Pei has received several prestigious awards, including induction as a Fellow of IEEE.
Topic 3:
Title: Large-Scale Retinal and Brain MRI Analysis for Early Detection of Cardiovascular Diseases
Abstract: Recent studies show that cerebral White Matter Lesion (WML) is related to cerebrovascular diseases, cardiovascular diseases, dementia and psychiatric disorders. There is also evidence that pathologies in the retina are closely related to lesions in brain MRI. The main goal is to develop a system that can be used for large-scale screening for early detection of cardiovascular disease (CVD). Manual segmentation of WML is not appropriate for long-term longitudinal studies because it is time consuming and shows high intra- and inter-rater variability. In this work, a fully automated segmentation method is utilized to segment WML from brain Magnetic Resonance Imaging (MRI). The segmentation method uses a combination of a global neighbourhood given contrast feature-based Random Forest (RF) classifier and a Markov Random Field (MRF) to segment WML. To remove false positive lesions, we use a rule-based morphological post-processing operation. Quantitative evaluation of the proposed method was performed on 24 subjects of the ENVIS-ion study. The segmentation results were validated against the manual segmentation performed by an experienced radiologist and were compared to a recently published WML segmentation method. The results show Dice similarity indices of 0.75 for high lesion load, 0.71 for medium lesion load and 0.60 for low lesion load.
Bio: Professor Ramamohanarao (Rao) Kotagiri received his PhD from Monash University. He was awarded the Alexander von Humboldt Fellowship in 1983. He has been at the University of Melbourne since 1980 and was appointed as a professor in computer science in 1989. Rao has held several senior positions, including Head of Computer Science and Software Engineering, Head of the School of Electrical Engineering and Computer Science at the University of Melbourne, and Research Director for the Cooperative Research Centre for Intelligent Decision Systems. He has served or is serving on the Editorial Boards of the Computer Journal, Universal Computer Science, IEEE TKDE, VLDB Journal and International Journal on Data Privacy. He was the program Co-Chair for the VLDB, PAKDD, DASFAA and DOOD conferences. He is a steering committee member of IEEE ICDM and PAKDD. He received a distinguished contribution award for Data Mining from PAKDD. Rao is a Fellow of the Institute of Engineers Australia, a Fellow of the Australian Academy of Technological Sciences and Engineering and a Fellow of the Australian Academy of Science. He was awarded the Distinguished Contribution Award in 2009 by the Computing Research and Education Association of Australasia. He has published more than 350 articles and supervised 48 PhD completions. He was the conference chair of ICDE 2013 and a conference co-chair of SIGMOD 2015.
Topic 4:
Title: Online Learning and Online Learning
Abstract: This talk will be in two parts. The first part relates to online learning as in machine learning, and the second part deals with topics in education analytics for online learning in Massive Open Online Courses (MOOC), University Open Online Courses (UOOC), Small Personal Online Courses (SPOC), flipped classrooms, etc. Online learning is a promising technique for big data analytics, especially for learning from streaming data. One important property of online learning is that it can adaptively update the parameters of learning models when a new sample appears, which avoids retraining from scratch. In the first part of the talk, I will give two novel online learning models on 1) how to adaptively update the weights of the models while selecting features among multiple tasks and 2) how to adaptively seek nonlinear classifiers when two classes of data are imbalanced. Our proposed online learning for multi-task feature selection and kernelized online imbalanced learning are two tools to solve these two issues, respectively. Formulation, algorithms, theory, and experimental results are presented accordingly. Big Education is the convergence of Big Data and education, two hot topics of intense research and discussion in recent years. In the second part of the talk, I will introduce a new project funded by the Hong Kong SAR Government, named the Knowledge and Education Exchange Platform (KEEP). The KEEP portal is a knowledge aggregator and technology integrator that provides access to online educational resources for producing positive teaching and learning experiences for educators and students.
Bio: Prof. King's research interests include machine learning, social computing, web intelligence, data mining, and multimedia information processing. In these research areas, he has over 200 technical publications in journals and conferences. He is an Associate Editor of the ACM Transactions on Knowledge Discovery from Data (ACM TKDD) and Journal of Neural Networks. He is both a member of the Board of Governors and Vice-President for INNS and APNNA. Moreover, he is the General Chair of WSDM2011, General Co-Chair of RecSys2013 and ACML2015, and has served in various capacities in a number of top conferences such as WWW, NIPS, ICML, IJCAI, AAAI, etc. Prof. King is Associate Dean (Education), Faculty of Engineering and Professor at the Department of Computer Science and Engineering, The Chinese University of Hong Kong. Recently, he was on leave with AT&T Labs Research, San Francisco and was also teaching Social Computing and Data Mining as a Visiting Professor at UC Berkeley. He received his B.Sc. degree in Engineering and Applied Science from California Institute of Technology, Pasadena and his M.Sc. and Ph.D. degrees in Computer Science from the University of Southern California, Los Angeles.
Topic 5:
Title: Mining Social Media: Looking Ahead
Abstract: Social media offers us a new way to connect and communicate and also serves as an innovative lens for researchers to understand people. Social media mining is one effective way to process social media at scale. Mining social media differs from traditional data mining in many ways and faces new challenges such as the “big data paradox” and the “evaluation dilemma”. We show how these challenges bring us unique opportunities to study the intricacies of social media data, to conduct interdisciplinary research to make new discoveries, and to develop original algorithms and advance data mining and machine learning with social media.
Bio: Dr. Huan Liu is a professor of Computer Science and Engineering at Arizona State University. He was recognized for excellence in teaching and research in Computer Science and Engineering at Arizona State University. His DMML lab focuses on research in data mining, machine learning, social computing, and artificial intelligence, investigating problems that arise in real-world applications with high-dimensional data of disparate forms. His well-cited publications include books, book chapters, encyclopedia entries, and conference and journal papers. He serves on journal editorial/advisory boards and numerous conference organization and program committees. He is a Fellow of IEEE.
Topic 6:
Title: Social Influence and Information Diffusion
Abstract: Social influence is the behavioral change of a person because of the perceived relationship with other people, organizations and society in general. Social influence has been a widely accepted phenomenon in social networks for decades. Many applications have been built around the implicit notion of social influence between people, such as marketing, advertisement and recommendations. With the exponential growth of online social network services such as Facebook and Twitter, social influence can for the first time be measured over a large population. In this talk, I will use Twitter and Weibo as examples to explain how we model social influence and how social influence affects the diffusion of online information (behaviors).
Bio: Jie Tang is an associate professor at the Department of Computer Science and Technology, Tsinghua University. His research interests include social network analysis, data mining, and machine learning. He has published more than 100 journal/conference papers (in major international journals and conferences) and holds 10 patents. He has served as PC Co-Chair of WSDM’15, ASONAM’15, ADMA’11 and SocInfo’12, KDD-CUP Co-Chair of KDD’15, Poster Co-Chair of KDD’14, Workshop Co-Chair of KDD’13, Local Chair of KDD’12, and Publication Co-Chair of KDD’11, and also serves as an (S)PC member of more than 50 international conferences. He is the principal investigator of a National High-tech R&D Program (863 Program) project, an NSFC project, Chinese Young Faculty Research Funding, National 985 funding, and international collaborative projects with the University of Minnesota, IBM, Google, Nokia, Sogou, etc. He is now leading the project Arnetminer.org for academic social network analysis and mining, which has attracted millions of independent IP accesses from 220 countries/regions in the world. He was honored with the CCF Young Scientist Award, NSFC Excellent Young Scholar, and IBM Innovation Faculty Award.
Topic 7:
Title: TransDec: A Big-Data Framework for Decision-Making in Transportation Systems
Abstract: The vast amounts of transportation datasets (traffic flow, incidents, etc.) collected by various federal and state agencies are extremely valuable in real-time decision-making, planning, and management of transportation systems. In this talk, I will argue that, considering the large volume of the transportation data, the variety of the data (different modalities and resolutions), and the velocity of the data arrival, developing a scalable system that allows for effective querying and analysis of both archived and real-time data is an intrinsically challenging Big Data problem. Subsequently, I will present our end-to-end prototype system, dubbed TransDec (short for Transportation Decision-Making), which enables real-time integration, visualization, querying, and analysis of these dynamic and archived transportation datasets. I will then discuss a GPS navigation application enabled by such a system and demonstrate its commercialization as a product called ClearPath (see http://myfastestpath.com). Motivated by ClearPath, we will look under the hood and focus on a route-planning problem where the weights on the road-network edges vary as a function of time due to the variability of traffic congestion. I will show that naïve approaches to this problem are either inaccurate or slow, leading to our new approach: a time-dependent A* algorithm.
Bio: Cyrus Shahabi is a Professor of Computer Science and Electrical Engineering and the Director of the Information Laboratory (InfoLAB) at the Computer Science Department and also the Director of the NSF's Integrated Media Systems Center (IMSC) at the University of Southern California (USC). He is also the director of the Informatics Program at USC’s Viterbi School of Engineering. He was the CTO and co-founder of a USC spin-off, Geosemble Technologies, which was acquired in July 2012. Since then, he has founded another company, ClearPath, focusing on predictive path-planning for car navigation systems. He received his B.S. in Computer Engineering from Sharif University of Technology in 1989 and then his M.S. and Ph.D. degrees in Computer Science from the University of Southern California in May 1993 and August 1996, respectively. He has authored two books and more than two hundred research papers in the areas of databases, GIS and multimedia, and holds more than 12 US patents. Dr. Shahabi has received funding from several agencies such as NSF, NIJ, NASA, NIH, DARPA, AFRL, and DHS, as well as from several companies such as Chevron, Google, HP, Intel, Microsoft, NCR, NGC and Oracle. He was an Associate Editor of IEEE Transactions on Parallel and Distributed Systems (TPDS) from 2004 to 2009 and IEEE Transactions on Knowledge and Data Engineering (TKDE) from 2010 to 2013. He is currently on the editorial board of the VLDB Journal, ACM Transactions on Spatial Algorithms and Systems (TSAS), and ACM Computers in Entertainment. He is the founding chair of the IEEE NetDB workshop and was the general co-chair of ACM GIS 2007, 2008 and 2009. He chaired the nomination committee of ACM SIGSPATIAL for the 2011-2014 terms. He is a general co-chair of SSTD’15 and PC co-chair of DASFAA 2015. He has been PC co-chair of IEEE MDM 2013 and IEEE BigData 2013, and regularly serves on the program committees of major conferences such as VLDB, ACM SIGMOD, IEEE ICDE, ACM SIGKDD, and ACM Multimedia. Dr. Shahabi is a fellow of IEEE, and a recipient of the ACM Distinguished Scientist award in 2009, the 2003 U.S. Presidential Early Career Awards for Scientists and Engineers (PECASE), the NSF CAREER award in 2002, and the 2001 Okawa Foundation Research Grant for Information and Telecommunications. He was also a recipient of the US Vietnam Education Foundation (VEF) faculty fellowship award in 2011 and 2012, an organizer of the 2011 National Academy of Engineering “Japan-America Frontiers of Engineering” program, an invited speaker in the 2010 National Research Council (of the National Academies) Committee on New Research Directions for the National Geospatial-Intelligence Agency, and a participant in the 2005 National Academy of Engineering “Frontiers of Engineering” program.