Guidance for Trustworthy Data Management in Science Projects
Andrew Adams, Kay Avila, Jim Basney, Laura Christopherson, Melissa Cragin, Jeannette Dopheide, Terry Fleury, Calvin Frye, Florence Hudson, Manisha Kanodia, Jenna Kim, W. John MacMullen, Mats Rynge, Scott Sakai, Sandra Thompson, Karan Vahi, John Zage
This document, a follow-up to the survey analysis report, examines the key findings, explores the concerns reported, and recommends existing solutions to address survey participants' concerns regarding the trustworthiness of data.
@misc{adams_andrew_2020_4323403,
author = {Adams, Andrew and Avila, Kay and Basney, Jim and Christopherson, Laura and Cragin, Melissa and Dopheide, Jeannette and Fleury, Terry and Frye, Calvin and Hudson, Florence and Kanodia, Manisha and Kim, Jenna and MacMullen, W. John and Rynge, Mats and Sakai, Scott and Thompson, Sandra and Vahi, Karan and Zage, John},
title = {{Guidance for Trustworthy Data Management in Science Projects}},
year = 2020,
publisher = {Zenodo},
doi = {10.5281/zenodo.4323403},
url = {https://doi.org/10.5281/zenodo.4323403}
}
Scientific Data Security Concerns and Practices: A survey of the community by the Trustworthy Data Working Group
Andrew Adams, Kay Avila, Jim Basney, Melissa Cragin, Jeannette Dopheide, Terry Fleury, Florence Hudson, Jenna Kim, W. John MacMullen, Gary Motz, Sean Peisert, Mats Rynge, Scott Sakai, Sandra Thompson, Karan Vahi, Wendy Whitcup, John Zage
In April and May of 2020, the Trustworthy Data Working Group conducted a survey of scientific data security concerns and practices in the scientific community. This report provides an analysis of the survey results.
@misc{adams_andrew_2020_4323141,
author = {Adams, Andrew and Avila, Kay and Basney, Jim and Cragin, Melissa and Dopheide, Jeannette and Fleury, Terry and Hudson, Florence and Kim, Jenna and MacMullen, W. John and Motz, Gary and Peisert, Sean and Rynge, Mats and Sakai, Scott and Thompson, Sandra and Vahi, Karan and Whitcup, Wendy and Zage, John},
title = {{Scientific Data Security Concerns and Practices: A survey of the community by the Trustworthy Data Working Group}},
year = 2020,
publisher = {Zenodo},
doi = {10.5281/zenodo.4323141},
url = {https://doi.org/10.5281/zenodo.4323141}
}
Toward a Data Lifecycle Model for NSF Large Facilities
L. Christopherson, A. Mandal, E. Scott, I. Baldin
PEARC '20: Practice and Experience in Advanced Research Computing
National Science Foundation large facilities conduct large-scale physical and natural science research. They include telescopes that survey the entire sky, gravitational wave detectors that look deep into our universe’s past, sensor-driven field sites that collect a range of biological and environmental data, and more. The Cyberinfrastructure Center of Excellence (CI CoE) pilot project aims to develop a model for a center that facilitates community building, fosters knowledge sharing, and applies best practices in consulting with large facilities with regard to their cyberinfrastructure. To accomplish this goal, the pilot began an in-depth study of how large facilities manage their data during the course of their research. Large facilities are diverse and highly complex, from the types of data they capture, to the types of equipment they use, to the types of data processing and analysis they conduct, to their policies on data sharing and use. Because of this complexity, the pilot needed to find a single lens through which it could frame its growing understanding of large facilities and identify areas where it could best serve large facilities. As a result of the pilot’s research into large facilities, common themes have emerged which have enabled the creation of a data lifecycle model that successfully captures the data management practices of large facilities. This model has enabled the pilot to organize its thinking about large facilities, and frame its support and consultation efforts around the cyberinfrastructure used during lifecycle stages. This paper describes the model and discusses how it was applied to disaster recovery planning for a representative large facility, IceCube.
@inproceedings{10.1145/3311790.3396636,
author = {Christopherson, Laura and Mandal, Anirban and Scott, Erik and Baldin, Ilya},
title = {Toward a Data Lifecycle Model for NSF Large Facilities},
year = {2020},
isbn = {9781450366892},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3311790.3396636},
doi = {10.1145/3311790.3396636},
booktitle = {Practice and Experience in Advanced Research Computing},
pages = {168--175},
numpages = {8},
keywords = {data lifecycle, large facilities, disaster recovery, cyberinfrastructure, data management, research computing},
location = {Portland, OR, USA},
series = {PEARC '20}
}
NEON IdM Experiences - Working with the CI CoE Pilot to Solve Identity Management Challenges
R. Kiser, T. Fleury, C. Laney, J. Sampson, S. Sons
2019 NSF Cybersecurity Summit for Large Facilities and Cyberinfrastructure
This paper details the collaboration between the NSF Cyberinfrastructure Center of Excellence (CI CoE) Pilot's Identity Management Working Group and staff from the National Ecological Observatory Network (NEON) to develop improvements to the NEON Data Portal, as well as the products of that collaboration.
Cyberinfrastructure Center of Excellence Pilot: Connecting Large Facilities Cyberinfrastructure
E. Deelman, A. Mandal, V. Pascucci, S. Sons, J. Wyngaard, C. F. Vardeman II, S. Petruzza, I. Baldin, L. Christopherson, R. Mitchell, L. Pottier, M. Rynge, E. Scott, K. Vahi, M. Kogan, J. A. Mann, T. Gulbransen, D. Allen, D. Barlow, S. Bonarrigo, C. Clark, L. Goldman, T. Goulden, P. Harvey, D. Hulsander, S. Jacobs, C. Laney, I. Lobo-Padilla, J. Sampson, J. Staarmann, S. Stone
2019 15th International Conference on eScience (eScience)
The National Science Foundation's Large Facilities are major, multi-user research facilities that operate and manage sophisticated and diverse research instruments and platforms (e.g., large telescopes, interferometers, distributed sensor arrays) that serve a variety of scientific disciplines, from astronomy and physics to geology and biology and beyond. Large Facilities are increasingly dependent on advanced cyberinfrastructure (i.e., computing, data, and software systems; networking; and associated human capital) to enable the broad delivery and analysis of facility-generated data. These cyberinfrastructure tools enable scientists and the public to gain new insights into fundamental questions about the structure and history of the universe, the world we live in today, and how our environment may change in the coming decades. This paper describes a pilot project that aims to develop a model for a Cyberinfrastructure Center of Excellence (CI CoE) that facilitates community building and knowledge sharing and that disseminates and applies best practices and innovative solutions for facility CI.
@inproceedings{deelman2019cyberinfrastructure,
title={Cyberinfrastructure Center of Excellence Pilot: Connecting Large Facilities Cyberinfrastructure},
author={Deelman, Ewa and Mandal, Anirban and Pascucci, Valerio and Sons, Susan and Wyngaard, Jane and Vardeman, Charles and Petruzza, Steve and Baldin, Ilya and Christopherson, Laura and Mitchell, Ryan and others},
booktitle={2019 15th International Conference on eScience (eScience)},
pages={449--457},
year={2019},
organization={IEEE},
doi={10.1109/eScience.2019.00058}
}
Exploration of Workflow Management Systems Emerging Features from Users Perspectives
R. Mitchell, L. Pottier, S. Jacobs, R. Ferreira da Silva, M. Rynge, K. Vahi, E. Deelman
2019 IEEE International Conference on Big Data (Big Data)
There has been a recent emergence of new workflow applications focused on data analytics and machine learning. This emergence has precipitated a change in the workflow management landscape, causing the development of new data-oriented workflow management systems (WMSs) in addition to the earlier standard of task-oriented WMSs. In this paper, we summarize three general workflow use-cases and explore the unique requirements of each use-case in order to understand how WMSs from both workflow management models meet the requirements of each workflow use-case from the user's perspective. We analyze the applicability of the two models by carefully describing each model and by providing an examination of the different variations of WMSs that fall under the task-driven model. To illustrate the strengths and weaknesses of each workflow management model, we summarize the key features of four production-ready WMSs: Pegasus, Makeflow, Apache Airflow, and Pachyderm. To deepen our analysis of the four WMSs examined in this paper, we implement three real-world use-cases to highlight the specifications and features of each WMS. We present our final assessment of each WMS after considering the following factors: usability, performance, ease of deployment, and relevance. The purpose of this work is to offer insights from the user's perspective into the research challenges that WMSs currently face due to the evolving workflow landscape.
@inproceedings{mitchell2019exploration,
title={Exploration of Workflow Management Systems Emerging Features from Users Perspectives},
author={Mitchell, Ryan and Pottier, Loïc and Jacobs, Steve and Ferreira da Silva, Rafael and Rynge, Mats and Vahi, Karan and Deelman, Ewa},
booktitle={2019 IEEE International Conference on Big Data (Big Data)},
pages={4537--4544},
year={2019},
organization={IEEE},
doi={10.1109/BigData47090.2019.9005494}
}