Introduction to Arvados » History » Version 11

Tom Clegg, 04/10/2013 09:37 PM

h1. Introduction to Arvados

h2. Overview

Arvados is a platform for storing, organizing, processing, and sharing genomic and other biomedical big data. The platform is designed to make it easier for bioinformaticians to develop analyses, for developers to create web applications that use genomic data, and for IT administrators to manage large-scale compute and storage resources for genomic data. It is designed to run on top of "cloud operating systems" such as Amazon Web Services and OpenStack. Currently, there are implementations that work on AWS and Xen+Debian/Ubuntu.

h2. Project

The core technology has been under development at Harvard Medical School for many years. We are now in the process of refactoring the original code, refactoring the APIs, and developing significant new capabilities. (See the [[Arvados Project History]] and the Backlog for more information on the roadmap.)

h2. Why Arvados

A set of relatively low-level compute and data management functions is common to a wide range of analysis pipelines and applications being built for genomic data. Unfortunately, every organization working with these data has been forced to build its own custom systems for these low-level functions. At the same time, proprietary platforms are emerging that seek to solve the same problems. Arvados was created to provide a common solution that can be used across a wide range of applications and that is free and open source.

h2. Benefits

The Arvados platform seeks to solve a set of common problems that face informaticians and IT organizations:

Benefits to informaticians:

* Make authoring analyses and constructing pipelines in any language as efficient as possible
* Provide an environment that can run open source and commercial tools (e.g. Galaxy, GATK)
* Enable deep provenance and reproducibility across all pipelines
* Provide a way to flexibly organize data and ensure data integrity
* Make queries of variant and other compact genome data very high-performance
* Create a simple way to run distributed batch processing jobs
* Enable the secure sharing of data sets from small to very large
* Provide a set of common APIs that enable application and pipeline portability across systems
* Offer a reference environment for implementation of standards
* Standardize file format translation

Benefits to IT organizations:

* Low total cost of ownership
* Eliminate unnecessary data duplication
* Ability to create private, on-premises clouds
* Self-service provisioning of resources
* Ability to use low-cost, off-the-shelf hardware
* Easy-to-manage, horizontally scaling architecture
* Straightforward browser-based administration
* Provide facilities for hybrid (public and private) clouds
* Ensure full compliance with security and regulatory standards
* Support data sets from tens of terabytes to exabytes

h2. Functional Capabilities

Functionally, Arvados has two major sets of capabilities: (a) data management and (b) compute management.

h3. Data Management

The data management services are designed to handle the challenges associated with storing and organizing large omics data sets. The heart of these services is the Data Manager, which brokers data storage. The data management system is designed to handle the following needs:

* Store files (e.g. BAM, FASTQ, VCF) reliably
* Store metadata about files for a wide variety of organizational schemas
* Create collections (sets of files) that can be used in analyses
* Ensure files are not unnecessarily duplicated
* Track provenance (sources and methods used to produce data)
* Control who can access which files
* Offer reliable distributed storage using inexpensive commodity disks
* Control storage redundancy based on importance of datasets
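
As an illustration of two of the needs listed above (avoiding unnecessary duplication and grouping files into collections), here is a minimal in-memory sketch. It assumes files are identified by a SHA-256 digest of their content, which is one common way to deduplicate data; this page does not say that Arvados works this way. The class @ContentStore@ and all of its method names are hypothetical and exist only for this example.

```python
import hashlib


class ContentStore:
    """Toy in-memory store that keeps one copy of each distinct byte string.

    Hypothetical illustration only; not the Arvados Data Manager.
    """

    def __init__(self):
        self._blobs = {}        # digest -> bytes
        self._collections = {}  # collection name -> {file name: digest}

    def put(self, data: bytes) -> str:
        """Store data and return its SHA-256 digest; identical data is kept once."""
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        """Return the stored bytes after re-checking them against the digest."""
        data = self._blobs[digest]
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("stored data does not match its digest")
        return data

    def create_collection(self, name: str, files: dict) -> None:
        """Record a named set of files as {file name: digest}."""
        missing = [d for d in files.values() if d not in self._blobs]
        if missing:
            raise KeyError(f"unknown digests: {missing}")
        self._collections[name] = dict(files)

    def collection(self, name: str) -> dict:
        """Return a copy of the {file name: digest} mapping for a collection."""
        return dict(self._collections[name])

    def blob_count(self) -> int:
        """Number of distinct byte strings currently stored."""
        return len(self._blobs)


store = ContentStore()
a = store.put(b"@read1\nACGT\n+\nIIII\n")
b = store.put(b"@read1\nACGT\n+\nIIII\n")  # same bytes uploaded a second time
c = store.put(b"@read2\nTTGA\n+\nIIII\n")
store.create_collection("run-1", {"sample_a.fastq": a, "sample_b.fastq": c})
```

Because the identifier is derived from the content, the second upload of the same bytes returns the same digest and adds no new data, and a collection is simply a small mapping from file names to digests.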

h3. Compute Management

The compute management services are designed to handle the challenges associated with creating and running pipelines as large-scale distributed processing jobs.

* Enable a common way to represent pipelines (JSON)
* Support the use of any pipeline creation tool
* Keep all pipeline code in a revision control system (git repository)
* Run pipelines as distributed computations using MapReduce
* Easily and reliably retrieve pipeline outputs
* Store a record of every pipeline that is run
* Eliminate the need to re-run pipeline components that have already been run
* Easily and reliably re-run and verify any past pipeline
* Create a straightforward way to author web applications that use underlying data and pipelines
* Easily share results, pipelines, and applications between systems
* Run distributed computations across clusters in different data centers to make use of very large data sets
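
Three of the goals above (a JSON pipeline representation, a record of every run, and not re-running components that have already been run) can be sketched together. The following is a minimal sketch under stated assumptions, not the Arvados pipeline format: the JSON field names, the functions @job_key@ and @run_pipeline@, and the idea of fingerprinting a job by its script, code version, and input are all hypothetical choices made for this example.

```python
import hashlib
import json

# Hypothetical pipeline description; the field names are illustrative only.
PIPELINE_JSON = """
{
  "name": "example-pipeline",
  "components": [
    {"name": "align", "script": "align.py", "script_version": "abc123", "input": "reads-v1"},
    {"name": "call", "script": "call.py", "script_version": "def456", "input_from": "align"}
  ]
}
"""


def job_key(script, script_version, job_input):
    """Fingerprint a job by its script, code version, and input."""
    canonical = json.dumps(
        {"script": script, "script_version": script_version, "input": job_input},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_pipeline(pipeline, job_records, execute):
    """Run components in order, reusing recorded outputs for identical jobs.

    job_records maps job key -> output and doubles as the run history.
    execute(script, script_version, job_input) performs the actual work.
    Returns (outputs by component name, names of components actually executed).
    """
    outputs, executed = {}, []
    for comp in pipeline["components"]:
        if "input_from" in comp:
            job_input = outputs[comp["input_from"]]
        else:
            job_input = comp["input"]
        key = job_key(comp["script"], comp["script_version"], job_input)
        if key not in job_records:
            job_records[key] = execute(comp["script"], comp["script_version"], job_input)
            executed.append(comp["name"])
        outputs[comp["name"]] = job_records[key]
    return outputs, executed


def fake_execute(script, script_version, job_input):
    """Stand-in for real work: derive a deterministic output label."""
    return f"{script}@{script_version}({job_input})"


pipeline = json.loads(PIPELINE_JSON)
records = {}
first_outputs, first_executed = run_pipeline(pipeline, records, fake_execute)
second_outputs, second_executed = run_pipeline(pipeline, records, fake_execute)
```

In this sketch the first run executes both components, while the second run finds both job keys in @records@ and executes nothing. Changing a component's @script_version@ or input changes its key, so only that component and anything downstream of it would run again.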

The compute management system also includes a subcomponent for tertiary analysis. This component provides an in-memory database for very high-performance queries of a compact representation of a genome that includes variants and other relevant data needed for tertiary analysis. (This component is in the design stage.)

h2. Virtual Machines

Arvados works best in an environment where informaticians receive access to virtual machines (VMs) on a private or public cloud. This approach eliminates the need to manage separate physical servers for different projects, significantly increasing the utilization of underlying hardware resources. It also gives informaticians a great deal of freedom to choose the best operating systems and tools for their work. With virtual machines, each informatician or project team has full isolation, security, autonomy, and privacy for their work.

The Arvados platform provides shared common services that can be used from within a virtual machine. All of the Arvados services are accessible through APIs.

h2. APIs and SDKs

Arvados is designed so that all of the data management and compute management services can be accessed through a set of consistent APIs and interfaces. Most of the functionality is exposed through a set of REST APIs. Some components use native interfaces (notably Keep and git). Arvados provides SDKs for popular languages (Python, Perl, Ruby, R, and Java) as well as a standalone tool for command line use.
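
To give a feel for what calling a REST API from a script looks like, here is a small sketch that builds, but does not send, an HTTP request using only the Python standard library. Everything specific in it is an assumption made for illustration: the host @arvados.example.org@, the @/api/v1/@ path, the @collections@ resource name, the @limit@ parameter, and the bearer-token header are placeholders, not the documented Arvados API.

```python
import urllib.parse
import urllib.request


def build_list_request(api_host, api_token, resource, limit=10):
    """Build (but do not send) a GET request that lists items of one resource type.

    The URL layout and authorization scheme are hypothetical placeholders.
    """
    query = urllib.parse.urlencode({"limit": limit})
    url = f"https://{api_host}/api/v1/{resource}?{query}"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {api_token}",
            "Accept": "application/json",
        },
        method="GET",
    )


request = build_list_request("arvados.example.org", "example-token", "collections", limit=5)
```

Passing the request to @urllib.request.urlopen@ would send it. An SDK typically wraps this kind of request construction and response parsing so that callers work with native objects rather than raw HTTP.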

h2. Workbench

Arvados includes a browser-based UI, called Workbench, that provides a convenient way to do common browsing and searching tasks. Workbench also serves as an application portal, providing a point of access to applications running on Arvados.

h2. Related Articles

[[Technical Architecture]] showing key components