"Online" Algorithms for Managing the Next Generation Sequencing Data Flood?
Next Generation Sequencing produces huge quantities of data, currently up to 60 million sequences per file. The algorithms used to analyse these data load all the information from a file into computer memory before processing it. As data volumes grow, these algorithms are beginning to slow down. The problem has been noted for algorithms that detect new forms of RNA and quantify them in RNA sequencing experiments.

In his talk at the 'High Throughput Sequencing Special Interest Group' (HitSIG), Adam Roberts from Berkeley, CA discussed his new online algorithm 'EXPRESS', designed to interpret RNA sequencing data (Roberts and Pachter, 2011 in press, Bioinformatics).

Online algorithms process data as it arrives, in real time. The models they generate are updated one sequence at a time. The amount of memory required therefore stays constant whatever the volume of data processed, and there is no need to save the data if it will not be analysed again later.

Online algorithms would fit very naturally into Pipeline Pilot data pipelines. They would also fit well with the new real-time sequencing technologies such as the Oxford Nanopore GridION system. The GridION system already uses Pipeline Pilot to control its 'Run until ..sufficient' workflows.

Bringing all three technologies together would allow data interpretation to be generated directly from the sequencing machine, and the flood of data could be directed straight into the most useful channels.

Fig 1: An RNA sequencing experiment showing a known and a newly discovered form of RNA, and the depth of the sequences used to identify them, along one region of the mouse genome.
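The constant-memory property of online algorithms described in this post can be illustrated with a small Python sketch. It is not the algorithm from the talk, which fits a far richer model of RNA abundance. It only shows the general pattern of updating a summary one sequence at a time and then discarding the sequence. The names `RunningStats`, `read_sequences` and `summarise` are hypothetical, and the input is assumed to be simple four-line FASTQ records.

```python
from typing import Iterable, Iterator


class RunningStats:
    """Running count, mean and variance, updated one value at a time
    (Welford's online algorithm). Memory use does not grow with the data."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; defined as 0.0 until two values have been seen.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def read_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each record, assuming four-line FASTQ
    records (header, sequence, '+', qualities). Nothing is stored."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def summarise(lines: Iterable[str]) -> RunningStats:
    """Consume a stream of FASTQ lines and return read-length statistics."""
    stats = RunningStats()
    for seq in read_sequences(lines):
        stats.update(len(seq))  # update the model, then forget the read
    return stats
```

Because `summarise` accepts any iterable of lines, it could be given an open file handle or `sys.stdin`, so reads arriving from an instrument could be summarised without ever being held in memory or written to disk.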