Canonical correlation analysis as reduced rank regression

2023 – 2024, University of Toronto, Toronto, Canada

Description: High-dimensional datasets pose significant challenges for traditional canonical correlation analysis (CCA) frameworks, because the classical estimates of the canonical directions cease to be consistent. We propose a novel approach that attempts to narrow the gap between theory and practice in high-dimensional CCA in the case where one dataset is high-dimensional and the other is not. By incorporating concepts from reduced rank regression and sparse estimation, we establish theoretical bounds for our procedure, ensuring statistical rigor while maintaining computational efficiency, which makes it an attractive option for the analysis of large datasets.

Collaborators: Claire Donnat.

[Paper] [Slides]

Multi-period forecasting

2021 – 2022, Stanford University, Stanford, USA

Description: Forecasting methodologies have always attracted a lot of attention and have become an especially hot topic since the beginning of the COVID-19 pandemic. In this paper we consider the problem of multi-period forecasting, which aims to predict several horizons at once. We propose a novel approach that forces the predictions to be “smooth” across horizons and apply it to two tasks: point estimation via regression and interval prediction via quantile regression. This methodology was developed for real-time distributed COVID-19 forecasting. This project was sponsored by the Stanford Data Science Institute.

Collaborators: Trevor Hastie, Rob Tibshirani and the Delphi Research Group.

[Paper] [Slides] [Code]

Weighted low rank matrix approximation

2020 – present, Stanford University, Stanford, USA

Description: Low-rank matrix approximation (LRMA) is one of the central concepts in machine learning, with applications in dimension reduction, de-noising, multivariate statistical methodology, and many more. A recent extension of LRMA is low-rank matrix completion (LRMC), which solves the LRMA problem when some observations are missing and is especially useful for recommender systems. In this project, we consider an element-wise weighted generalization of LRMA (WLRMA). WLRMA has many applications. For example, it is an essential component of GLM optimization algorithms, where an exponential family is used to model the entries of a matrix and the matrix of natural parameters admits a low-rank structure.
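
For concreteness, the element-wise weighted problem can be written in its rank-constrained form as below. The symbols X (data matrix), W (non-negative weights), M (approximation) and k (rank bound) are notation assumed here for illustration and are not taken from the paper.

```latex
\min_{M \,:\, \operatorname{rank}(M) \le k} \; \sum_{i,j} W_{ij} \, (X_{ij} - M_{ij})^2
```

Setting every W_ij = 1 recovers plain LRMA, while binary weights that mark the observed entries give the LRMC setting.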

Collaborators: Trevor Hastie.

[Paper] [WLRMA package]

Canonical correlation analysis with structured regularization

2020 – 2022, Stanford University, Stanford, USA

Description: Canonical correlation analysis (CCA) is a technique for measuring the association between two multivariate sets of variables. Its regularized modification (RCCA) is widely used for the analysis of high-dimensional data. One limitation of RCCA is that it treats all features equally. In this project we introduce several modifications of RCCA that exploit the underlying data structure, and we suggest some computational tricks that avoid excessive computation when conducting CCA with regularization.
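
As a point of reference, here is a minimal sketch of plain ridge-regularized CCA, the baseline that treats all features equally. It is not the RCCA package and does not include the structured modifications from the project; the function name `rcca` and the parameters `lam_x`, `lam_y` and `n_components` are hypothetical names chosen for this illustration.

```python
import numpy as np


def rcca(X, Y, lam_x=0.1, lam_y=0.1, n_components=2):
    """Ridge-regularized CCA via an SVD of the whitened cross-covariance.

    Hypothetical helper for illustration; lam_x and lam_y must be positive.
    """
    # Center both data matrices (rows are observations).
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Covariances, with a ridge term added to the two within-set blocks.
    Sxx = X.T @ X / n + lam_x * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + lam_y * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n

    def inv_sqrt(S):
        # Inverse square root of a symmetric positive definite matrix.
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)

    # Singular values are the regularized canonical correlations.
    U, s, Vt = np.linalg.svd(Sxx_is @ Sxy @ Syy_is, full_matrices=False)

    # Map the singular vectors back to the original feature spaces.
    A = Sxx_is @ U[:, :n_components]
    B = Syy_is @ Vt.T[:, :n_components]
    return A, B, s[:n_components]
```

The ridge terms keep both covariance blocks invertible even when the number of features exceeds the number of observations, which is the high-dimensional setting the description refers to.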

Collaborators: Trevor Hastie and Leonardo Tozzi.

[Paper] [RCCA package] [Slides] [Poster] [Video]

Chromatin conformation reconstruction

2018 – present, Stanford University, Stanford, USA

Description: Three-dimensional (3D) genome spatial organization is critical for numerous cellular processes. Genome architecture had been notoriously difficult to elucidate, but the advent of the suite of chromatin conformation capture assays, notably Hi-C, has transformed our understanding of chromatin structure and provided downstream biological insights. In this project we aim to reconstruct the spatial conformation of chromatin. We exploit the fact that single-chromosome solutions constitute a one-dimensional (1D) curve in 3D. The resulting technique combines principal curve methodology with the metric scaling approach.

Collaborators: Trevor Hastie and Mark Segal.

[Paper 1] [Paper 2] [DBMS package] [PoisMS package] [PoisMS vignette] [Slides] [Poster] [Video]

The Human Connectome project

2018 – 2022, Stanford University, Stanford, USA

Description: Part of the Connectomes for Emotional Disorders Project, this work uses data collected at the Stanford School of Medicine. One of its main goals is to investigate the relationship between emotional disorders and brain functioning by means of statistical analysis.

Collaborators: Trevor Hastie, Stanford Williams PanLab.

[Paper] [Poster] [Code]

Course projects

Weighted low rank matrix approximation

2020, Stanford University, Stanford, USA

Description: The low-rank matrix-completion (LRC) problem has always attracted a lot of attention. One of its most famous applications is the Netflix problem, which aims to build a movie recommendation system based on users' partial movie ratings. The general LRC problem statement is: given an incomplete matrix, find a low-rank matrix that provides the best approximation to its observed entries. In this project we generalize the LRC problem by introducing weights into the problem set-up, which leads to the weighted low-rank matrix approximation (WLRA) problem. We suggest several optimization algorithms for solving this problem and compare them in terms of convergence.
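
To make the problem statement concrete, here is a minimal sketch of one classical heuristic for it: fill in the down-weighted entries with the current estimate, then truncate the SVD. This is not necessarily one of the algorithms compared in the report. The function name `weighted_low_rank` and its parameters are hypothetical, and the sketch assumes the weights lie in [0, 1], with binary weights giving the matrix-completion case.

```python
import numpy as np


def weighted_low_rank(X, W, rank, n_iter=200):
    """Fill-then-truncate iteration for weighted low-rank approximation.

    Hypothetical helper for illustration; assumes 0 <= W <= 1 element-wise.
    Entries of X with zero weight are ignored and may be NaN.
    """
    # Replace zero-weight (missing) entries so NaNs never enter the update.
    X = np.where(W > 0, X, 0.0)
    M = np.zeros_like(X, dtype=float)
    for _ in range(n_iter):
        # Blend the data and the current estimate according to the weights.
        filled = W * X + (1.0 - W) * M
        # Best rank-`rank` approximation of the blended matrix.
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        M = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return M
```

Each pass cannot increase the weighted squared error, although like other methods for this non-convex problem it may stop at a local optimum.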

Course: Convex Optimization II (EE364B).

[Report]

Early Detection of COVID-19 from Cough Sounds

2020, Stanford University, Stanford, USA

Description: As of June 12th 2020, there have been 7,739,944 cases and 428,337 deaths as a result of the COVID-19 pandemic, according to the World Health Organization COVID-19 Dashboard. Countries around the world have imposed lockdown restrictions to slow the spread and flatten the curve describing the number of patients admitted to hospital per day. Currently most governments are planning their post-lockdown reopening procedures and methods to monitor the spread of the disease. However, limited testing capabilities, varying quality of the tests, and a large number of asymptomatic carriers make this task particularly challenging. In this project we attempt to build a classifier that diagnoses a patient based on a cough recording and general patient information collected by the hospital staff. The data was provided by the Wadhwani Institute of Artificial Intelligence.

Course: Data science and AI for COVID-19 (CS472).

[Report]

Art Nouveau style transfer with face alignment

2019, Stanford University, Stanford, USA

Description: Having flourished throughout Europe and the United States at the turn of the 19th and 20th centuries, Art Nouveau remains one of the most beautiful decorative art movements. Promulgating the idea of art and design as part of everyday life and inspired by the natural forms and patterns of plants and flowers, it has influenced different aspects of art and architecture, such as interior, furnishings and glass design, as well as graphic work, posters, and illustration. This project, inspired by the works of Henri de Toulouse-Lautrec and Alphonse Mucha, aims to develop a deep learning tool that transforms otherwise boring photos into bright and bold Art Nouveau fine art posters.

Course: Deep Learning (CS230).

[Report] [Poster]

Monet is spot, Manet is people

2018, Stanford University, Stanford, USA

Description: It takes some time for a human being to learn to tell apart two of the most well-known impressionist painters, Claude Monet and Edouard Manet. Although quite cultivated, the French beau monde of the nineteenth century struggled with this problem as well. When Claude Monet made his debut at the salon in Paris in 1865, his landscapes were displayed next to the famous painting “Olympia” by Edouard Manet, who was already known at the time. Funnily enough, people could not recognize the difference between the paintings of these two artists. Could a machine do better?

Course: Machine Learning (CS229).

[Report] [Poster]

Imputing chromatin landscape from a single assay

2018, Stanford University, Stanford, USA

Description: Chromatin landscapes provide critical insight into the transcriptional regulation of the genome. Current approaches for profiling the chromatin landscape require multiple high-throughput sequencing assays, creating the desire for a single cost-effective assay. Here we assess the ability of nascent transcription assays, Global Run-On and sequencing (GRO-seq) and Precision Run-On and sequencing (PRO-seq), to impute H3K4me3, H3K27ac, H3K27me3, and DNase-seq using XGBoost, Dense Neural Network, and Convolutional Neural Network models.

Course: Deep Learning in Genetics (CS273B).

[Report]

Other projects

Dimension reduction methods in myeloma studies

2016 – 2017, Yandex School of Data Analysis, Moscow, Russia

Description: This work investigates the success of targeted therapy as an approach to the treatment of myeloma. The project aims to predict the outcome of the therapy suggested by the hospital using the patient's gene expression information.

Collaborators: Dmitry Rogachev National Research Center of Pediatric Hematology, Oncology and Immunology.

[Report]

Automatic organization of translation workflow

2015 – 2017, SmartCAT, Moscow, Russia

Description: The main goal of the project can be stated as follows. Given complete information about the translators in a marketplace, such as average translation pace (words per hour), translation price (cents per word), customer feedback and ratings, and the texts of translated projects along with the translation mistakes made (detected automatically by the system or manually by a reviewer or corrector), find the most efficient translator for a new project. By an efficient translator we mean one who meets the time and price restrictions imposed by the customer and produces a translation of the highest possible quality. This project was funded by a Skolkovo Innovation Center grant.

[Patent]

Amino acids spatial structures

2014 – 2017, Moscow State University, Moscow, Russia

Description: In this project the geometry of polypeptides is investigated by means of geometrical and statistical methods. One part of the project is devoted to the analysis of the information contained in the Protein Data Bank (PDB). In particular, we have tested several characteristics of PDB conformations (such as covalent bond lengths and angles, cis–trans isomerism, or the Pauling plane law) for consistency and agreement with general theory. The second part is focused on geometrical modeling of polypeptides. We consider several approaches to polypeptide modeling (e.g., by a polygonal chain with fixed edge lengths or by a smooth curve) and investigate the folding properties of the suggested models. This project was funded by a Russian Science Foundation grant.

Collaborators: Faculty of Biology of Moscow State University.

[Paper]