Revision 8 (Tom Clegg, 04/08/2013 03:35 PM) → Revision 9/17 (Tom Clegg, 04/08/2013 04:19 PM)

h1. Introduction to Arvados 

 h2. Overview 

 Arvados is a platform for storing, organizing, processing, and sharing genomic and other biomedical big data. The platform is designed to make it easier for bioinformaticians to develop analyses, for developers to create web applications with genomic data, and for IT administrators to manage large-scale compute and storage resources for genomic data. The platform is designed to run on top of "cloud operating systems" such as Amazon Web Services and OpenStack. Currently, there are implementations that work on AWS and Xen+Debian/Ubuntu. 

 h2. Project 

 The core technology has been under development at Harvard Medical School for many years. We are now in the process of refactoring original code, refactoring the APIs, and developing significant new capabilities. (See the [[Arvados Project History]] and the Backlog for more information on the roadmap.)  

 h2. Why Arvados  

 A set of relatively low-level compute and data management functions is consistent across a wide range of analysis pipelines and applications that are being built for genomic data. Unfortunately, every organization working with these data has been forced to build its own custom systems for these low-level functions. At the same time, proprietary platforms are emerging that seek to solve these same problems. Arvados was created to provide a common solution that could be used across a wide range of applications and would be free and open source.  

 h2. Benefits 

 The Arvados platform seeks to solve a set of common problems that face informaticians and IT organizations:  

 Benefits to informaticians:  
 * Make authoring analyses and constructing pipelines in any language as efficient as possible 
 * Provide an environment that can run open source and commercial tools (e.g. Galaxy, GATK, etc.)  
 * Enable deep provenance and reproducibility across all pipelines  
 * Provide a way to flexibly organize data and ensure data integrity  
 * Make queries of variant and other compact genome data very high-performance 
 * Create a simple way to run distributed batch processing jobs  
 * Enable the secure sharing of data sets from small to very large  
 * Provide a set of common APIs that enable application and pipeline portability across systems 
 * Offer a reference environment for implementation of standards  
 * Standardize file format translation 

 Benefits to IT organizations:  
 * Low total cost of ownership  
 * Eliminate unnecessary data duplication  
 * Ability to create private, on-premise clouds 
 * Self-service provision of resources 
 * Ability to utilize low-cost off-the-shelf hardware 
 * Easy-to-manage horizontally scaling architecture 
 * Straightforward browser-based administration 
 * Provide facilities for hybrid (public and private) clouds  
 * Ensure full compliance with security and regulatory standards 
 * Support data sets from 10s of TB to exabytes 

 h2. Functional Capabilities  

 Functionally, Arvados has two major sets of capabilities: (a) data management and (b) compute management. 

 

 h3. Data Management 

 The data management services are designed to handle the challenges associated with storing and organizing large omic data sets. The heart of these services is the Data Manager, which brokers data storage. The data management system is designed to handle the following needs:  

 * Store files (e.g. BAM, FASTQ, VCF) reliably 
 * Store metadata about files for a wide variety of organizational schema 
 * Create collections (sets of files) that can be used in analyses  
 * Ensure files are not unnecessarily duplicated 
 * Maintain provenance (sources and methods used to produce data) 
 * Secure access to files 
 * Translate files between formats  
 * Make it easy to access files  
 * Leverage large arrays of distributed storage using inexpensive commodity disks 
 * Control storage redundancy based on importance of datasets 
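
The deduplication and integrity goals in the list above can be illustrated with content addressing, where each file is keyed by a hash of its bytes. This is only a minimal sketch of that general technique, not a description of how the Data Manager is implemented; the @ContentStore@ class, its method names, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


class ContentStore:
    """Toy in-memory store (hypothetical) that keys each blob by a hash of its content."""

    def __init__(self):
        self._blobs = {}  # digest -> bytes
        self._names = {}  # file name -> digest

    def put(self, name, data):
        """Store data under name and return its content digest."""
        digest = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same digest, so it is kept only once.
        self._blobs.setdefault(digest, data)
        self._names[name] = digest
        return digest

    def get(self, name):
        """Return the bytes for name, checking them against the recorded digest."""
        digest = self._names[name]
        data = self._blobs[digest]
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("stored content does not match its digest")
        return data

    def blob_count(self):
        """Number of distinct blobs actually held."""
        return len(self._blobs)


store = ContentStore()
first = store.put("sample_a.vcf", b"##fileformat=VCFv4.1\n")
second = store.put("copy_of_a.vcf", b"##fileformat=VCFv4.1\n")
print(first == second, store.blob_count())
```

Two names pointing at identical bytes share one stored blob, and re-hashing on read gives a simple integrity check.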


 

 h3. Compute Management  

 The compute management services are designed to handle the challenges associated with creating and running pipelines as large-scale distributed processing jobs.  

 * Enable a common way to represent pipelines (JSON)  
 * Support the use of any pipeline creation tool  
 * Keep all pipeline code in a Git repository  
 * Run pipelines as distributed computations using MapReduce 
 * Easily retrieve and store data from pipelines in the data management system  
 * Store a record of every pipeline that is run  
 * Eliminate the need to re-run pipeline jobs that have already been run 
 * Make it possible to easily and reliably re-run and verify any past pipeline  
 * Create a straightforward way to author web applications that use underlying data and pipelines  
 * Enable easy sharing of pipelines and applications between systems  
 * Be able to run distributed computations across clusters in different data centers to access very large data sets  
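
Two of the goals above, representing pipelines as JSON and avoiding re-runs of jobs that have already been run, fit together naturally. The sketch below shows one possible approach: serialize each step canonically, hash it, and reuse the stored output when the same hash is seen again. The field names (@script@, @script_version@, @parameters@), the @job_key@ and @run_step@ helpers, and the in-memory @completed@ table are hypothetical and are not taken from the Arvados API.

```python
import hashlib
import json

# Hypothetical pipeline description; the field names are illustrative only.
pipeline = {
    "name": "example-alignment",
    "steps": [
        {
            "script": "align.py",
            "script_version": "0123abcd",  # placeholder for a Git commit id
            "parameters": {"input": "reads.fastq", "reference": "hg19"},
        }
    ],
}


def job_key(step):
    """Return a digest that is identical for identical script, version, and parameters."""
    canonical = json.dumps(step, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


completed = {}  # job key -> previously recorded output


def run_step(step, runner):
    """Run a step only if no identical step has been recorded before."""
    key = job_key(step)
    if key not in completed:
        completed[key] = runner(step)
    return completed[key]


calls = []


def fake_runner(step):
    """Stand-in for real job execution; records that it was invoked."""
    calls.append(step["script"])
    return "output-of-" + step["script"]


for _ in range(2):
    result = run_step(pipeline["steps"][0], fake_runner)

print(json.dumps(pipeline, indent=2))
print(result, len(calls))
```

The second request for the same step is answered from the record, so the runner is invoked only once. Pinning the script to a specific commit is what makes the key meaningful for later re-runs and verification.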

 The compute management system also includes a sub-component for doing tertiary analysis. This component has not been built yet, but we envision that it will provide an in-memory database for very high-performance queries of a compact representation of a genome that includes variants and other relevant data needed for tertiary analysis. (Learn more about the [[compact genome]] here.) 

 h2. Virtual Machines  

 Arvados works best in an environment where informaticians receive access to virtual machines (VMs) on a private or public cloud. This approach eliminates the need for managing separate physical servers for different projects, significantly increasing the utilization of underlying hardware resources. It also gives informaticians a great deal of freedom to choose the best operating systems and tools for their work. With virtual machines, each informatician or project team can have full isolation, security, autonomy, and privacy for their work.  

 The Arvados platform provides shared common services that can be used from within a virtual machine. All of the Arvados services are accessible through APIs. 

 h2. APIs and SDKs  

 Arvados is designed so that all of the data management and compute management services can be accessed through a consistent set of APIs and interfaces. Most of the functionality is represented in a set of REST APIs. Some components use native interfaces (notably Keep and git). Arvados provides SDKs for popular languages (Python, Perl, Ruby, R, and Java) as well as a standalone tool for command line use. 
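
As a rough illustration of what calling a REST-style service from a VM might look like, the sketch below builds (but does not send) an HTTP request using only the Python standard library. The host name, the @collections@ resource path, the bearer-token header, and the @build_request@ helper are all assumptions made for this example; they are not the documented Arvados API, and the SDKs would normally hide these details.

```python
import json
import urllib.request

API_HOST = "arvados.example.org"  # hypothetical host name
API_TOKEN = "replace-with-a-real-token"  # hypothetical credential


def build_request(resource, params=None, method="GET"):
    """Build an HTTP request for a REST resource without sending it."""
    url = "https://{}/{}".format(API_HOST, resource)
    body = json.dumps(params).encode("utf-8") if params is not None else None
    request = urllib.request.Request(url, data=body, method=method)
    # The authentication scheme shown here is an assumption.
    request.add_header("Authorization", "Bearer " + API_TOKEN)
    request.add_header("Accept", "application/json")
    if body is not None:
        request.add_header("Content-Type", "application/json")
    return request


req = build_request("collections")
print(req.get_method(), req.full_url)
```

Passing the result to @urllib.request.urlopen@ would perform the call against a real server.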

 h2. Workbench 

 Arvados includes a browser-based UI, Workbench, which provides a convenient way to do common browsing and searching tasks. Workbench also serves as an application portal, providing a point of access to applications running on Arvados. 

 h2. Related Articles  

 [[Technical Architecture]] 
 [[Key Components]]