Logistics
For questions, concerns, or bug reports, please contact Alon Kipnis, Mahsa Lotfi, or David Donoho. This course meets Mondays 2:30-3:50 PM on Zoom. If you are a guest speaker for this course, please read the travel section to plan your visit.
Data Science News
The Revolution is Here!
David Donoho
XYZ Studies
Xiaoyan Han
Massive Computational Experiments, Painlessly
Vardan Papyan
University of Toronto
IT Infrastructure for Research
Mahsa Lotfi
Painless Data Pipelining with Kedro
According to its documentation, “Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code” that “borrows concepts from software engineering best-practice and applies them to machine-learning code; applied concepts include modularity, separation of concerns, and versioning”. In this lecture, we introduce the main features of Kedro and demonstrate how they can alleviate many of the pain points of data science experiments.
Alon Kipnis
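To make the modularity and separation-of-concerns ideas concrete, the sketch below mimics Kedro's node/pipeline pattern in plain Python, without installing Kedro itself: each node is a pure function declared with named inputs and outputs, and a runner resolves the data dependencies through a shared catalog. The function and dataset names here are illustrative, not Kedro's actual API.

```python
# Minimal sketch of Kedro's node/pipeline idea, in plain Python.
# Each "node" is a pure function plus named inputs/outputs; a runner
# reads inputs from and writes outputs to a shared catalog dict
# (the role Kedro's DataCatalog plays).

def clean(raw):
    # Node 1: drop records with missing values.
    return [r for r in raw if None not in r.values()]

def summarize(cleaned):
    # Node 2: compute a simple summary statistic.
    prices = [r["price"] for r in cleaned]
    return {"n": len(prices), "mean_price": sum(prices) / len(prices)}

# Wiring, declared separately from the functions themselves:
# (function, input names, output name), analogous to Kedro's node().
nodes = [
    (clean, ["raw_data"], "cleaned_data"),
    (summarize, ["cleaned_data"], "summary"),
]

def run(nodes, catalog):
    # Run nodes in order, passing intermediate datasets by name.
    for func, inputs, output in nodes:
        catalog[output] = func(*[catalog[name] for name in inputs])
    return catalog

catalog = {"raw_data": [
    {"item": "a", "price": 10.0},
    {"item": "b", "price": None},
    {"item": "c", "price": 20.0},
]}
result = run(nodes, catalog)
print(result["summary"])  # {'n': 2, 'mean_price': 15.0}
```

Because the nodes are ordinary functions and the wiring is data, each node can be tested in isolation and the pipeline reconfigured without touching the processing logic, which is the pain point Kedro addresses.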
Exploratory Data Analysis, Painlessly
One key aspect of analyzing and exploring datasets is the scientist's ability to quickly and painlessly clean, summarize, and visualize the data. Traditional solutions include code-based frameworks such as the Tidyverse in R or pandas/matplotlib in Python, or even spreadsheet-based approaches such as Microsoft Excel. However, these options are often cumbersome, with significant time overheads for even the simplest tasks, thereby putting a significant barrier between a dataset and the scientists who wish to understand it. Recent years have seen the rise of “Business Intelligence” tools such as Tableau or Microsoft Power BI, which let users clean and visualize data through simple drag-and-drop graphical interfaces that significantly cut down, if not eliminate, the overhead of working with code or spreadsheets. In this lecture, we will explore this new trend in data cleaning and visualization through some simple demonstrations in Tableau.
Xiaoyan Han
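To make the code-based side of this comparison concrete, here is a small sketch of the clean-and-summarize loop using only the Python standard library (in practice one would reach for pandas and matplotlib); the dataset and column names are made up for illustration.

```python
import csv
import io
import statistics

# Hypothetical CSV text, standing in for a file a scientist wants to explore.
raw = io.StringIO(
    "city,temp\n"
    "Palo Alto,18.2\n"
    "Toronto,\n"       # missing value, to be cleaned out
    "Stanford,19.1\n"
)

# Clean: drop rows with a missing temperature.
rows = [r for r in csv.DictReader(raw) if r["temp"]]

# Summarize: basic descriptive statistics.
temps = [float(r["temp"]) for r in rows]
summary = {
    "n": len(temps),
    "mean": round(statistics.mean(temps), 2),
    "min": min(temps),
    "max": max(temps),
}
print(summary)  # {'n': 2, 'mean': 18.65, 'min': 18.2, 'max': 19.1}
```

Even this trivial clean-summarize step takes a screenful of code, which is the overhead the abstract argues drag-and-drop BI tools remove.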
When are Data Science Results Reproducible?
A defining feature of science is the independent reproducibility of results. Since the 1600s, this requirement was satisfied through written-language (e.g., English) descriptions of the research steps in the final publication, intended to permit another researcher in the field to carry out the same experiment. In this talk I will discuss what reproducibility might mean in the modern context of massive data science experiments, and the state of the debate today. Reproducibility implies transparency and reliability, whose interpretation raises challenging new questions at computational scale, with massive datasets, and regarding bias, incentives, and public access to and trust in science. Reproducibility in computational science was first identified as a research area in the early 1990s by Stanford Professor Emeritus Jon Claerbout, who presented implementations and guiding principles. Since then, as the use of computation in scientific discovery has become ubiquitous, myriad approaches have emerged. I will trace this history to give a clear understanding of ongoing reproducibility discussions and solutions, and present recent contributions including the Whole Tale project (2020), AIM for reproducibility in ML tournaments (2018), and reproducibility standards development, including the National Academies report "Reproducibility and Replicability in Science" (2019), for which I was a committee member. I will motivate what I believe are the most pressing problems to be solved to ensure that computational and data-enabled scientific research is reproducible, and elucidate a vision for reproducible discovery in data science.
Victoria Stodden
University of Southern California