Speaker: | Professor Limsoon Wong |
Title: | Two Computational Biology Challenges |
Abstract |
1/ When we go for a medical check-up today, only a handful of
lipids---namely the lipoprotein cholesterols and triglycerides---are
measured. Yet there are over three thousand types of lipids in our body.
Furthermore, the critical role of lipids in cell, tissue and organ
physiology is demonstrated by a large number of genetic studies and by many
human diseases involving the disruption of lipid metabolic enzymes and
pathways. Examples of such diseases include cancer, diabetes,
neurodegenerative and infectious disorders.
2/ There is a critical need to address the emergence of drug-resistant varieties of pathogens for several infectious diseases. For example, drug-resistant tuberculosis has continued to spread internationally and is now approaching critical proportions. Approaches to counter drug resistance have so far achieved limited success. It has been proposed that this lack of success is due to a lack of understanding of how resistance emerges in bacteria upon drug treatment, and that a systems-level analysis of the proteins and interactions involved is essential to gaining insights into the routes required for drug resistance.
The implications above signal significant challenges and opportunities for computational biologists. I would like to make a “call to arms” on these two topics. |
Speaker: | Professor Wen-Lian Hsu |
Title: | Artificial Intelligence Techniques in Bioinformatics |
Abstract | Biological science is gradually becoming an information science. There is a tremendous amount of data in biological labs and databases that is left unattended. Bioinformatics emphasizes the development of informatics tools to support information-driven biological research. We shall discuss some popular AI techniques in bioinformatics, including machine learning, data mining, and natural language processing. We will also talk about their potential merits and limitations. We shall illustrate how these techniques are used in structural biology, proteomics, and biological literature mining. |
Speaker: | Professor Xuegong Zhang |
Title: | Computational prediction of MicroRNA-regulated pathways in the differentiation of histological grades in breast cancer |
Abstract: |
The histological grading of malignancy is an important factor in breast
cancer study. We studied the molecular features for the differentiation
between different grades of breast cancer with a DNA microarray dataset
using machine learning methods. Analysis of multiple gene expression datasets
indicated that microRNAs play a major role in the grade differentiation. We
further analyzed the regulation pathways that the microRNAs are involved in. A
putative pathway of microRNA regulation on the differentiation of breast
tumors was proposed based on the integrated investigation of gene
expression, microRNA target predictions, gene set analysis and pathway
analysis. The proposed pathway was partially validated in follow-up
biological experiments. This provides an example of how computational
approaches can help in predicting possible biological mechanisms from the integration of multiple data sources. |
Speaker: | Dr. Ruiqiang Li, Director of Bioinformatics, BGI Shenzhen |
Title: | Sequencing, sequencing and sequencing |
Abstract | BGI has built a next-generation sequencing platform which can generate more than 500 Gb of high-quality data per day. The significant increase in throughput has placed BGI among the top genome centers worldwide. By taking advantage of the new technologies, several big projects have been undertaken to broaden the range and pace of the applications, such as a human gut microbiome survey and profiling to understand its complex and dynamic influence on human health, genome-wide sequencing of thousands of individuals to detect genes associated with type 2 diabetes, very high depth sequencing of an Asian individual to provide a reference genome for the East Asian population and demonstrate personal genome sequencing, the International 1000 Genomes Project, the Giant Panda Genome Project, and numerous other large-scale initiatives. We have developed novel applications of the sequencing technologies for accelerating both biological and biomedical research. I will introduce the bioinformatics development and current research activities at BGI. |
Speaker: | Professor Hong Yan |
Title: | Analysis and Detection of Nucleosome Positions |
Abstract: |
Nucleosomes are the fundamental building blocks of the chromatin structure
of a genome and play an important role in gene regulation. In this talk, our
work on nucleosome positioning signal analysis will be presented. We detect
nucleosome positions in eukaryotic DNA sequences using the matched mirror
position filter (MMPF) and relaxation labeling. The MMPF searches for
regular dinucleotide patterns in the underlying DNA sequence and predicts
the probable nucleosomes based on the pattern matching scores. We carry out
a genome-wide analysis of the correlation between nucleosome positions and
the regulatory regions of several eukaryotic organisms. The results demonstrate the effectiveness of our method for the reliable analysis of the nucleosome landscape and its regulation in eukaryotic chromatin on a genomic scale. Our recent results on nucleosome sequence flexibility, characteristics of DNA step parameters and geometric features will also be discussed. |
Speaker: | Nelson Tang |
Title: | Population genetics of gene expression: application of linear mixed model |
Abstract: |
Applying linear mixed-effect modeling to the analysis of variance
components, for example in a gene expression dataset, has the advantage of
fitting a linear model while allowing for clustering of the data. Individuals
within an ethnic group are expected to be more alike than individuals across
ethnic groups. In the mixed-effect model, the outcome (transcript abundance level
of a particular gene) is considered as the sum of fixed and random effects.
Fixed effects are those that affect the population mean, such as the sex
difference. Random effects, on the other hand, are assumed to lead to
variance of the outcome and account for the consequence of random sampling
within a subgroup in the population. We examined the biological variations of gene expression among the HapMap lymphoblast cell lines and partitioned the variance of gene expression into components due to analytical error, inter-individual variation (CVg) and inter-ethnic-group variation (CVe). The results largely confirmed previous findings of a major component of within-ethnic-group variation, i.e. CVg > CVe. Although they represent a minority, genes with high population differentiation (CVe > CVg) may be important determinants of ethnic-specific disease profiles. Top of the list of genes with high population differentiation in terms of expression are three UDP-glucuronosyltransferases (UGTs). They are implicated in sex steroid metabolism and in differences in steady-state hormone levels in blood. |
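The variance partitioning described in this abstract can be illustrated with a much simpler model than a full linear mixed model. The sketch below is a stand-in under stated assumptions: a balanced one-way random-effects layout (one grouping factor, equal group sizes) with a method-of-moments estimator. The function name `variance_components` and the toy numbers are hypothetical and are not taken from the study.

```python
from statistics import mean


def variance_components(groups):
    """Split variance into between-group and within-group parts.

    groups: dict mapping a group label (e.g. an ethnic group) to a list of
    expression values for one gene. All groups must have the same size.
    Returns (between_group_variance, within_group_variance).
    """
    sizes = {len(values) for values in groups.values()}
    if len(sizes) != 1:
        raise ValueError("this sketch assumes a balanced design")
    n = sizes.pop()   # observations per group
    k = len(groups)   # number of groups
    if k < 2 or n < 2:
        raise ValueError("need at least 2 groups with at least 2 values each")

    grand_mean = mean(x for values in groups.values() for x in values)
    group_means = {g: mean(values) for g, values in groups.items()}

    # Mean squares from the classical one-way ANOVA table.
    ms_between = n * sum((m - grand_mean) ** 2
                         for m in group_means.values()) / (k - 1)
    ms_within = sum((x - group_means[g]) ** 2
                    for g, values in groups.items()
                    for x in values) / (k * (n - 1))

    # Method-of-moments estimate; negative estimates are truncated at zero.
    between = max(0.0, (ms_between - ms_within) / n)
    return between, ms_within


# Hypothetical toy data: three groups, three individuals each.
toy = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}
between, within = variance_components(toy)
print(between, within)
```

In this toy example the between-group component dominates, which corresponds to the minority case the abstract calls high population differentiation (CVe > CVg). A real analysis would also include fixed effects such as sex and an analytical-error term.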
Speaker: | Qiang Yang |
Title: | Transfer Learning in Bioinformatics |
Abstract: | A common difficulty in many bioinformatics tasks is the lack of sufficient labeled training examples to build high quality models for classification. One solution is to make use of the available labeled data from a related domain. However, these auxiliary data may follow a different distribution and feature representation. In this talk, I will give several examples of how these cross-domain classification problems can be solved using transfer learning techniques. Transfer learning aims at uncovering the similarity and relatedness of different domains to improve learning, and is beginning to be applied to bioinformatics problems. I will describe recent research done by my students and me in this area, on applying transfer learning methods to the problems of protein subcellular localization and protein-protein interaction network inference. |
Speaker: | Hannah Hong Xue |
Title: | A Bioinformatics Perspective on a Schizophrenia Candidate Gene |
Abstract: | TBA |
Speaker: | Xiaodan Fan |
Title: | Multiple Steady States of Gene Networks |
Abstract: |
Due to the statistical significance of the inconsistency between different
microarray studies, we hypothesize that gene regulatory networks may show multiple steady states. This phenomenon also exists in some stress response studies, where the lists of differentially expressed genes for the same stimulus vary significantly across different studies. A dynamical systems approach is proposed for studying this phenomenon. Cell cycle time series are used for the preliminary study. Limit cycles are used to explain the discrepancy between different cell cycle studies. |
Authors: |
Tak-Ming Chan (Dept. of CS&E, The Chinese University of Hong Kong),
|
Title: | GALF Series for Discovering Generic TFBS Motifs |
Abstract: |
Protein-DNA interactions, primarily binding between transcription factors (TFs)
and TF binding sites (TFBSs), are essential for gene regulation. Discovery
of TFBS motifs is thus a critical problem for deciphering gene regulation.
De novo motif discovery serves as a promising way to predict and better
understand TFBSs, and provides candidates for further biological
verification. The challenges include search or optimization difficulties,
problem modeling for appropriate evaluation functions, generalization
requirements such as flexible widths and multiple different motifs, and
integration with further evidence. We have proposed a series of novel genetic algorithm (GA) based algorithms for de novo motif discovery, and have been developing integrated methods as well as seeking insights into protein-DNA binding. In this paper we mainly present a series of de novo motif discovery algorithms, the Genetic Algorithm with Local Filtering (GALF) series: GALF and GALF-P (Post-processing) for effective and efficient searching; and GALF-G (Generalized) for modeling and generalization with flexible widths and discovery of multiple different motifs. The GALF series has shown outstanding performance compared with state-of-the-art algorithms on comprehensive real and benchmark data, and provides a promising platform for our integration methods in progress. |
Authors: |
Xi Yang (City University of Hong Kong) Hong Yan (City University of Hong Kong) |
Title: | Statistical Analysis of Conformational Properties of Periodic Dinucleotide Steps in Nucleosomes |
Abstract: | Deformability of DNA is important for its superhelical folding in the nucleosome and has long been thought to be facilitated by periodic occurrences of certain dinucleotides along the sequences, with the period close to 10.5 bases. This study statistically examines the conformational properties of dinucleotides containing the 10.5-base periodicity and those without that periodicity through scanning all nucleosome structures provided in PDB. By categorizing performances on the distribution of step parameter values, averaged net values, standard deviations and deformability based on step conformational energies, we give a detailed description as to the deformation preferences correlated with the periodicity for the 10 unique types of dinucleotides and summarize the possible roles of various steps in how they facilitate DNA bending. The results show that the structural properties of dinucleotide steps are influenced to various extents by the periodicity in nucleosomes and some periodic steps have shown a clear tendency to take specific bending or shearing patterns. |
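The notion of a dinucleotide recurring with a fixed period along a sequence can be made concrete with a small spectral measure. The sketch below is only an illustration of sequence periodicity and not the structural analysis of the abstract: it treats the positions of a chosen dinucleotide as a 0/1 signal and evaluates its (unnormalized, non-mean-centred) Fourier power at a given period. The function name `dinucleotide_power` and the synthetic sequence are hypothetical.

```python
import cmath
import math


def dinucleotide_power(seq, dinuc, period):
    """Fourier power of the occurrence signal of `dinuc` at `period` bases.

    The signal is 1 at every position where `dinuc` starts and 0 elsewhere.
    A large value means the occurrences are spaced in phase with `period`.
    """
    total = 0j
    for i in range(len(seq) - 1):
        if seq[i:i + 2] == dinuc:
            total += cmath.exp(-2j * math.pi * i / period)
    return abs(total) ** 2


# Synthetic sequence: "AA" placed exactly every 10 bases, 20 times.
synthetic = ("AA" + "CGCGCGCG") * 20
for period in (7.0, 10.0, 10.5):
    print(period, dinucleotide_power(synthetic, "AA", period))
```

Because the synthetic sequence uses an exact 10-base spacing, the power peaks at period 10 rather than 10.5. On real nucleosomal DNA one would scan a range of periods around 10 to 11 bases and compare against a shuffled-sequence background.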
Authors: |
Silva Daniel-Adriano (Department of Chemistry, HKUST) |
Title: | Markov State Models to Study the Ligand Binding Mechanism on the LAO protein |
Abstract: |
One of the principal challenges of current protein science is to understand
the molecular basis of protein conformational changes and structural
recognition. However, this is an extremely complex problem. For example, in the
case of periplasmic binding proteins (PBPs), despite the wide structural
information available (more than 100 PDB structures [1]) and the simple function
of these proteins (binding small ligands), the conformational dynamics behind
protein-ligand recognition remain unknown [2]. In this work, to advance
the understanding of the conformational dynamics of arginine binding to the LAO
PBP [3], we applied "Markov State Models" to Molecular Dynamics (MD)
simulations of this system. The first step in building a Markov State Model is
to use a set of reaction coordinates to cluster the huge amount of
conformational data present in MD trajectories. The LAO protein MD simulations
were analysed with an ad hoc clustering algorithm based on the following
structural reaction coordinates: a) opening and b) twisting angles of the
protein and c) the ligand position. Using this algorithm, the initial
520,000 MD frames were clustered into just a few thousand states
(microstates). To find the metastable states (macrostates), the microstates
were lumped and fitted to a Markov model with the aid of the "MSM-builder
toolkit" [4]. The results allowed us to propose a reaction mechanism for
the ligand binding, predict metastable populations and calculate transition
rates between macrostates. The analysis of the interactions present in the
binding pathway of LAO may reveal valuable information for advancing
the understanding of the molecular basis of macromolecular recognition.
References
1. Dwyer MA, Hellinga HW. Curr Opin Struct Biol, 2004, 14, 495-504.
2. Pang A, Arinaminpathy Y, Sansom MS, Biggin PC. Proteins, 2005, 61, 809-822.
3. Oh BH, Ames GF, Kim SH. J Biol Chem, 1994, 269, 26323-26330.
4. Bowman G, Huang X, Pande VS. Methods, 2009, 49(2), 197-201. |
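The core estimation step of a Markov State Model, once frames have been assigned to discrete states, can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: a single already-clustered trajectory, a simple row-normalized count estimator, and power iteration for the equilibrium populations. It does not use or imitate the MSM-builder toolkit's API; the function names and the toy trajectory are hypothetical.

```python
def transition_matrix(traj, n_states, lag=1):
    """Row-normalized transition counts between states at a given lag time.

    traj: list of integer state labels, one per MD frame.
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(traj, traj[lag:]):
        counts[a][b] += 1
    matrix = []
    for i, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            # State never left within the data: treat it as absorbing.
            matrix.append([1.0 if j == i else 0.0 for j in range(n_states)])
        else:
            matrix.append([c / total for c in row])
    return matrix


def stationary_distribution(matrix, iterations=1000):
    """Equilibrium populations by repeatedly applying pi <- pi * T."""
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical two-state trajectory (e.g. 0 = unbound, 1 = bound).
toy_traj = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
T = transition_matrix(toy_traj, n_states=2)
print(T)
print(stationary_distribution(T))
```

Production MSM estimators typically pool many trajectories, choose the lag time by checking that implied timescales have converged, and often enforce detailed balance. The transition rates mentioned in the abstract would be derived from such a matrix together with the lag time.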
Authors: |
Qiwei Li (The Chinese University of Hong Kong) |
Title: | Detection of Tandem Repeats in Multiple DNA Sequences via Probabilistic Approach |
Abstract: | For the problem of identifying repetitive patterns in long biological sequences, such as tandem repeats in DNA sequences, traditional methods have largely relied on the periodicity of a short segment in a single long sequence. In this poster, we introduce a fully probabilistic generative model to formulate this problem. Our model allows intra-unit mismatches and inter-unit insertions. It is capable of identifying the shared tandem repeats in multiple input DNA sequences. A Bayesian approach is used to infer the model in a de novo fashion. A collapsing technique is used to improve the computing efficiency. The experiments on both synthetic data and real data have demonstrated the effectiveness of the proposed algorithm. |
Authors: |
Yuk-Kwan Choi (Department of Biochemistry, Hong Kong University of Science
and Technology), Ka-Wing Fong (Department of Biochemistry, Hong Kong University of Science and Technology), Robert Z. Qi (Department of Biochemistry, Hong Kong University of Science and Technology) |
Title: | A conserved protein domain identified by computational alignment is important to cytoskeleton organization |
Abstract: | In animal cells, the microtubule cytoskeleton is essential for various fundamental processes, including material transport, cell division, polarity establishment, cell motility, and morphogenesis. As the principal nucleator to initiate de novo assembly of microtubule filaments, γ-tubulin exists in two differently sized complexes: the γ-tubulin small complex (γTuSC) and the γ-tubulin ring complex (γTuRC). Until now, it has been unknown how the microtubule-nucleating activity of the γ-tubulin complexes is regulated. CDK5RAP2 is a centrosomal protein that is involved in neurogenesis, as its mutations are associated with autosomal recessive primary microcephaly. We have found that CDK5RAP2 plays an essential role in the microtubule-organizing function of centrosomes. CDK5RAP2 interacts with the γTuRC through a short stretch highly conserved in γ-tubulin complex-targeting proteins of lower organisms, including Drosophila centrosomin and fission yeast Mto1p and Pcp1p. Loss of CDK5RAP2 function delocalizes γ-tubulin from centrosomes, thus causing disorganization of interphase microtubules and generation of anastral mitotic spindles. Therefore, CDK5RAP2 functions in the assembly of the γTuRC into centrosomes. Currently, we are characterizing the interaction between the γTuRC and its binding domain in CDK5RAP2. This work may provide insights into the function and regulation of the γTuRC. |
Authors: |
Lisheng He (HKUST), |
Title: | Involvement of Hsc70 and EBV Latent Membrane Protein-1 in the Regulation of Cell Mitosis |
Abstract: |
Epstein-Barr virus (EBV) is involved in many human malignancies, such as
nasopharyngeal carcinoma. The latent membrane protein-1 (LMP1) encoded by
EBV is believed to play an important role in tumorigenesis. It has been
shown that expression of LMP1 disrupts the cell cycle. However, the precise
contribution of this viral oncoprotein to cell cycle deregulation is poorly
understood. Hsc70 (heat shock cognate 70 kDa protein), an Hsp70 family
member, performs various functions related to cell proliferation and
tumorigenesis. We have found that LMP1 interacts with Hsc70 in mammalian
cells. In the present study, we set out to investigate the subcellular
localization of Hsc70 during the cell cycle and its function in mitosis.
This study may provide further insights into the action of LMP1 in the cell
cycle. |
Authors: |
Tsz Yan Tang (City University of Hong Kong), |
Title: | Analysis of Mouse Periodic Gene Expression Data Based on Singular Value Decomposition and Autoregressive Modeling |
Abstract: | Each DNA microarray experiment generates a large number of gene expression profiles, and it remains a challenge for biologists to robustly identify periodic gene expression profiles in the presence of noise in the data. In this paper, we propose a new scheme with a noise filtering technique to analyze the periodicity of gene expression based on singular value decomposition (SVD), singular spectrum analysis (SSA) and autoregressive (AR) model-based spectrum estimation. With the combination of these methods, noise can be filtered out and over 85% of periodic gene expression profiles can be identified in a mouse presomitic mesoderm transcriptome data set. |
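The AR-based part of such a scheme can be illustrated with the smallest model that can represent an oscillation. The sketch below is a simplified stand-in and not the authors' pipeline: it fits an AR(2) model by ordinary least squares (no SVD or SSA denoising step) and reads the dominant period from the complex roots of the fitted model. The function names and the noise-free test signal are hypothetical.

```python
import math


def fit_ar2(x):
    """Least-squares fit of x[t] = a1 * x[t-1] + a2 * x[t-2]."""
    s11 = s12 = s22 = b1 = b2 = 0.0
    for t in range(2, len(x)):
        u, v, y = x[t - 1], x[t - 2], x[t]
        s11 += u * u
        s12 += u * v
        s22 += v * v
        b1 += u * y
        b2 += v * y
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("series is too short or degenerate for an AR(2) fit")
    # Solve the 2x2 normal equations by Cramer's rule.
    a1 = (b1 * s22 - s12 * b2) / det
    a2 = (s11 * b2 - s12 * b1) / det
    return a1, a2


def dominant_period(a1, a2):
    """Period implied by the complex roots of z^2 - a1*z - a2, or None."""
    if a2 >= 0:
        return None  # roots are real: no oscillatory component
    cos_omega = a1 / (2.0 * math.sqrt(-a2))
    if abs(cos_omega) >= 1.0:
        return None
    return 2.0 * math.pi / math.acos(cos_omega)


# Hypothetical expression profile: a clean oscillation with period 8.
profile = [math.cos(2 * math.pi * t / 8) for t in range(40)]
a1, a2 = fit_ar2(profile)
print(a1, a2, dominant_period(a1, a2))
```

With noisy, short time series, the least-squares AR fit degrades quickly, which is the motivation for denoising the profiles before spectrum estimation as the abstract describes.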
Authors: |
Qian Xu (HKUST), |
Title: | Multitask Learning for Protein Subcellular Location Prediction |
Abstract: | Protein subcellular localization is concerned with predicting the location of a protein within a cell using computational methods. The location information can indicate key functionalities of proteins. Thus, accurate prediction of the subcellular localization of proteins can help the prediction of protein functions and genome annotation, as well as the identification of drug targets. Some machine learning methods have been proposed to solve this problem in the past, but have been shown to suffer from a lack of annotated training data in each species under study. To overcome this data sparsity problem, we observe that because some of the organisms may be related to each other, there may be some commonalities across different organisms that can be discovered and used to help boost the data in each localization task. Thus, we propose a framework to localize proteins of various organisms in cells jointly in a multi-task learning manner. We adapt and compare two specializations of multi-task learning algorithms on 20 different organisms. Our preliminary experimental results suggest that jointly learning models of all organisms does not lead to significant improvement compared to learning them individually. However, if the organisms learned jointly are closely related from a biological point of view, then the multi-task learning strategy can do much better than the individual learning strategy. The most significant improvement in terms of localization accuracy is about 25%. |
Authors: |
Xiang Wan, Can Yang (HKUST), Qiang Yang (HKUST), Hong Xue (HKUST), Nelson L.S. Tang, Weichuan Yu (HKUST) |
Title: | Predictive rule inference for epistatic interaction detection in genome-wide association studies |
Abstract: |
In the current era of genome-wide association studies (GWAS), finding
epistatic interactions in the large volume of SNP data is a challenging and
unsolved issue. Few previous studies could handle genome-wide data due to
the difficulties in searching the combinatorially explosive search space and
statistically evaluating high-order epistatic interactions given the limited
number of samples. In this work, we propose a novel learning approach
(SNPRuler) based on predictive rule inference to find disease-associated
epistatic interactions. Our extensive experiments on both simulated data and
real genome-wide data from the Wellcome Trust Case Control Consortium (WTCCC)
show that SNPRuler significantly outperforms its recent competitor. To our
knowledge, SNPRuler is the first method that is guaranteed to find the
epistatic interactions without exhaustive search. Our results indicate that
finding epistatic interactions in GWAS is computationally attainable in
practice. |
Authors: |
Can Yang (HKUST), Xiang Wan (HKUST), Qiang Yang (HKUST), Hong Xue (HKUST), Weichuan Yu (HKUST) |
Title: | Identifying Main Effects and Epistatic Interactions from Large-scale SNP Data via Adaptive Group Lasso |
Abstract: |
Single nucleotide polymorphism (SNP) based association studies aim at
identifying SNPs associated with phenotypes, for example, complex diseases.
The associated SNPs may influence the disease risk individually (main
effects) or behave jointly (epistatic interactions). For the analysis of
high throughput data, the main difficulty is that the number of SNPs far
exceeds the number of samples. This difficulty is amplified when identifying
interactions. In this paper, we propose an Adaptive Group Lasso (AGL) model
for large-scale association studies. Our model enables us to analyze SNPs
and their interactions simultaneously. We achieve this by introducing a
sparsity constraint in our model based on the fact that only a small
fraction of SNPs is disease-associated. In order to reduce the number of
false positive findings, we develop an adaptive reweighting scheme to
enhance sparsity. In addition, our method treats SNPs and their interactions
as factors, and identifies them in a grouped manner. Thus, it is flexible enough to
analyze various disease models, especially for interaction detection. |
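The two ingredients named in this abstract, group-wise sparsity and adaptive reweighting, can be illustrated with the standard building blocks of group lasso solvers. The sketch below is generic and is not the authors' AGL model: it shows the group soft-thresholding operator, which shrinks a whole coefficient group toward zero at once, and one common form of adaptive weight, which penalizes groups with small initial estimates more heavily. The function names, the weight formula, and the numbers are assumptions for illustration.

```python
import math


def group_soft_threshold(group, threshold):
    """Shrink a coefficient group as a unit; zero it if its norm is small.

    This is the proximal operator of threshold * ||group||_2.
    """
    norm = math.sqrt(sum(v * v for v in group))
    if norm <= threshold:
        return [0.0] * len(group)
    scale = 1.0 - threshold / norm
    return [scale * v for v in group]


def adaptive_weight(initial_group, gamma=1.0, eps=1e-8):
    """Penalty weight that grows as the initial group estimate shrinks."""
    norm = math.sqrt(sum(v * v for v in initial_group))
    return 1.0 / (norm ** gamma + eps)


# Hypothetical initial estimates for two factors (e.g. one SNP coded with two
# dummy variables, and one SNP-SNP interaction coded the same way).
strong = [3.0, 4.0]
weak = [0.3, 0.4]
base_penalty = 0.5
for name, group in (("strong", strong), ("weak", weak)):
    penalty = base_penalty * adaptive_weight(group)
    print(name, group_soft_threshold(group, penalty))
```

In this toy run the weak group is removed entirely while the strong group is only slightly shrunk, which is the sense in which reweighting can enhance sparsity and reduce false positives. A full solver would apply these steps inside an iterative optimization over all groups.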
Authors: |
Weiqiang Zhou (CITYU), Hong Yan (CITYU) |
Title: | Relationship between periodic dinucleotides and the nucleosome structure revealed by alpha shape modeling |
Abstract: | As the fundamental repeating units in eukaryotic chromatin, nucleosomes play an important role in many biological processes. For this reason, the study of the structure of nucleosomes may help to reveal some of the crucial principles of these processes. In our research, we have used alpha shapes to model nucleosome structure and discovered that the periodic DNA dinucleotides AA, TT and GC occupy special positions in the nucleosome structure, with one nucleotide inside and the other outside the nucleosome surface. This structural feature and other dinucleotide characteristics can provide useful information for the study of nucleosome positioning. |
Authors: |
Ling Sing Yung (HKUST), Chao Yang (HKUST), Mohammed Dakna (Mosaiques Diagnostics & Therapeutics), Harald Mischak (Mosaiques Diagnostics & Therapeutics), and Weichuan Yu (HKUST) |
Title: | Effective Visualization of LC/CE-MS data |
Abstract: | Quantitative comparison of proteomics data has become increasingly important in proteomics research. Visualization has attracted much attention and has become an emerging technique in the analysis. We present a visualization tool named SyncPro for differential analysis of multiple pre-processed proteomics data sets. It offers features to (i) compare different data sets using synchronization; (ii) facilitate data exploration through selection and extraction; (iii) quickly judge the quality of selected features. SyncPro can be downloaded at http://bioinformatics.ust.hk/SyncPro/SyncProContentPage.html |