“A Federated and Cloud-Enabled System for Climate Data Analytics and Machine Learning”
Ghaleb Abdulla is a senior member of technical staff at Lawrence Livermore National Laboratory (LLNL), a Department of Energy research laboratory. Since joining LLNL in 2000, Dr. Abdulla has embraced projects that depend on teamwork and data sharing. His tenure includes establishing partnerships with universities seeking LLNL’s expertise in HPC and large-scale data analysis. He supported approximate queries over large-scale simulation data sets for the AQSim project and helped design a multi-petabyte database for the Large Synoptic Survey Telescope. Dr. Abdulla used machine learning (ML) to inspect and predict optics damage at the National Ignition Facility and leveraged data management and analytics to improve HPC energy efficiency. He served as the lead of the Operational Data Analytics team under the Energy Efficient HPC Working Group. He also led the Cancer Registry of Norway project, which developed personalized prevention and treatment strategies through pattern recognition, ML, and time-series statistical analysis of cervical cancer screening data, and he received the DOE Secretary’s Appreciation Award for his contributions to the Cancer Moonshot presidential initiative. Dr. Abdulla served as the PI of the Earth System Grid Federation (ESGF), an international collaboration that manages a global climate database for more than 25,000 users on six continents. He chaired the ESGF International Executive Committee and served as a member of the Working Group on Coupled Modelling (WGCM) Infrastructure Panel (WIP), which operates under the World Climate Research Programme (WCRP).
He also served as the director of the Institute for Scientific Computing Research (ISCR) at LLNL, where he established a data science internship program and helped several universities create data science curricula targeted at graduate students who can support scientific data analytics.
In 2020, Dr. Abdulla was selected for the “50 for 50” spotlight series, which highlights the most influential graduates of the Virginia Tech Department of Computer Science.
He currently serves as deputy program manager for WCI data infrastructure and leads the data infrastructure and analytics team.
Managing and analyzing scientific and simulation data is challenging because data sizes keep growing and because variables from different data sets must be subset and fused before an evaluation metric can be run. Computer and climate scientists are collaborating to enable large-scale data analytics, and while there are success stories, the new high-resolution models require more distributed computing resources to build complex data analysis and machine learning workflows. The high-resolution data produced by climate models is distributed by nature, and the scientists analyzing it work at different locations around the globe, yet they need to collaborate on the same data. A federated data management system enables data sharing and collaboration; however, the machinery and tools for large-scale, distributed scientific data analysis are still lacking. In the traditional workflow, a scientist queries a data archive using the indexed metadata, downloads the located files to a local machine, and performs the analysis there. Instead, we developed tools to support a distributed system for data analytics and machine learning, that is, server-side or edge computing.
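To make the subset-then-reduce pattern concrete, here is a minimal, self-contained sketch (not ESGF code; the field, grid, and function names are invented for illustration) of the kind of server-side analysis that replaces bulk downloads: select a lat/lon box from a gridded variable, then reduce it to a single area-weighted metric so that only a scalar, not the full field, crosses the network.

```python
import numpy as np

# Hypothetical global temperature field on a 1-degree lat/lon grid.
lat = np.arange(-89.5, 90, 1.0)   # 180 latitudes
lon = np.arange(0.5, 360, 1.0)    # 360 longitudes
rng = np.random.default_rng(0)
temperature = (280 + 40 * np.cos(np.deg2rad(lat))[:, None]
               + rng.normal(0, 2, (180, 360)))

def subset(field, lat, lon, lat_bounds, lon_bounds):
    """Select the grid cells inside a lat/lon box (the server-side subset step)."""
    li = (lat >= lat_bounds[0]) & (lat <= lat_bounds[1])
    lj = (lon >= lon_bounds[0]) & (lon <= lon_bounds[1])
    return field[np.ix_(li, lj)], lat[li]

def area_weighted_mean(field, lat):
    """Reduce the subset to one number, weighting each row by cos(latitude)."""
    w = np.cos(np.deg2rad(lat))
    return float((field.mean(axis=1) * w).sum() / w.sum())

# Subset a tropical box, then reduce: only this scalar leaves the server.
box, box_lat = subset(temperature, lat, lon, (-20, 20), (120, 280))
metric = area_weighted_mean(box, box_lat)
```

In a federated deployment, the subset and reduction would run on the compute node hosting the data, and only `metric` would be returned to the user.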
In this talk, I will describe the Earth System Grid Federation (ESGF) software stack, architecture, and hosted data, and discuss use cases. I will also describe the ESGF Compute Node (ECN), which allows scientists to build and run data analytics algorithms. The current architecture lets users log into the JupyterHub on our server with a GitHub account. From JupyterHub, the user can either spawn a Unix shell with access to the regular tools, such as Python and the climate analysis libraries, or spawn a Jupyter notebook that uses the ESGF user interface to search for relevant data sets and applies basic sub-setting, re-gridding, and data reduction constructs to minimize data movement.
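As one illustration of a data reduction construct of the kind mentioned above, the sketch below (illustrative only, not the ECN API; the `coarsen` helper and grid sizes are assumptions) regrids a high-resolution field to a coarser grid by block averaging before transfer, shrinking the payload by the square of the coarsening factor:

```python
import numpy as np

def coarsen(field, factor):
    """Regrid by block-averaging: reduce an (ny, nx) field to
    (ny // factor, nx // factor) before it is shipped to the user."""
    ny, nx = field.shape
    assert ny % factor == 0 and nx % factor == 0
    return field.reshape(ny // factor, factor,
                         nx // factor, factor).mean(axis=(1, 3))

# A hypothetical 0.25-degree field (720 x 1440) coarsened to 1 degree (180 x 360).
high_res = np.random.default_rng(1).normal(size=(720, 1440))
low_res = coarsen(high_res, 4)
ratio = high_res.nbytes // low_res.nbytes  # 16x smaller payload
```

Running the reduction server-side in this way is what lets a notebook user work with remote high-resolution output without first downloading it.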