Exploiting Parallel R in the Cloud with SPRINT

M. Piotrowski; G. A. McGilvary; T. M. Sloan; M. Mewissen; A. D. Lloyd; T. Forster; L. Mitchell; P. Ghazal; J. Hill

doi:10.3414/ME11-02-0039


Methods Inf Med 2013; 52(01): 80-90
DOI: 10.3414/ME11-02-0039

Original Articles

Schattauer GmbH

Authors

M. Piotrowski
¹EPCC, The University of Edinburgh, Edinburgh, United Kingdom

G. A. McGilvary
²Edinburgh Data-Intensive Research Group, School of Informatics, The University of Edinburgh, Edinburgh, United Kingdom

T. M. Sloan
¹EPCC, The University of Edinburgh, Edinburgh, United Kingdom

M. Mewissen
³Division of Pathway Medicine, The University of Edinburgh, Edinburgh, United Kingdom

A. D. Lloyd
⁴The University of Edinburgh Business School, Edinburgh, United Kingdom

T. Forster
³Division of Pathway Medicine, The University of Edinburgh, Edinburgh, United Kingdom

L. Mitchell
¹EPCC, The University of Edinburgh, Edinburgh, United Kingdom

P. Ghazal
³Division of Pathway Medicine, The University of Edinburgh, Edinburgh, United Kingdom

J. Hill
⁵Applied Modelling and Computation Group, Imperial College, London, United Kingdom

Publication History

Received: 31 October 2011

Accepted: 3 May 2012

Publication date: 24 January 2018 (online)

Summary

Background: Advances in DNA microarray devices and next-generation massively parallel DNA sequencing platforms have led to exponential growth in data availability, but exploiting the resulting opportunities requires adequate computing resources. High Performance Computing (HPC) in the Cloud offers an affordable way of meeting this need.

Objectives: Bioconductor, a popular tool for high-throughput genomic data analysis, is distributed as add-on modules for the R statistical programming language, but R has no native capabilities for exploiting multiprocessor architectures. SPRINT is an R package that enables easy access to HPC for genomics researchers. This paper investigates setting up and running SPRINT-enabled genomic analyses on Amazon’s Elastic Compute Cloud (EC2), the advantages of submitting applications to EC2 from different parts of the world, and whether resource underutilization can improve application performance.

Methods: The SPRINT parallel implementations of correlation, permutation testing, partitioning around medoids and the multi-purpose papply have been benchmarked on data sets of various sizes on Amazon EC2. Jobs have been submitted from both the UK and Thailand to investigate monetary differences.

Results: It is possible to obtain good, scalable performance, but the level of improvement depends on the nature of the algorithm. Resource underutilization can further improve the time to result. The end user’s location affects costs due to factors such as local taxation.

Conclusions: Although not designed to satisfy HPC requirements, Amazon EC2 and cloud computing in general provide an interesting alternative and open new possibilities for smaller organisations with limited funds.
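
To make the idea of a parallel correlation concrete, the following is a minimal sketch in Python of one common decomposition: the rows of the correlation matrix are divided among workers, each worker computes its rows independently, and the results are reassembled. This is an illustration under stated assumptions and not SPRINT's implementation, which is an R package built on MPI. The function names (`pearson`, `correlation_rows`, `parallel_correlation`), the round-robin row assignment and the toy `expression` data are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_rows(data, row_indices):
    """Compute the correlation-matrix rows assigned to one worker."""
    return [(i, [pearson(data[i], other) for other in data]) for i in row_indices]


def parallel_correlation(data, workers=2):
    """Split the rows of the correlation matrix across workers, then reassemble."""
    n = len(data)
    # Hypothetical round-robin assignment: worker w gets rows w, w + workers, ...
    chunks = [range(w, n, workers) for w in range(workers)]
    result = [None] * n
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for block in pool.map(lambda rows: correlation_rows(data, rows), chunks):
            for i, row in block:
                result[i] = row
    return result


# Hypothetical toy input: three "genes" measured over four samples.
expression = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
corr = parallel_correlation(expression, workers=2)
```

Threads are used here only to show the decomposition; in CPython they do not speed up pure-Python arithmetic, so a practical version would distribute the row blocks across processes or MPI ranks instead.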

Keywords

Genomics - computing methodologies

