Sebastien Mondet,
Compose Conference,
Jan 31, 2015.
37_000 employees …
Within the Icahn Institute for Genomics and Multiscale Biology
Mission: Better software for biomedicine
Example: somatic variant calling
Example: Broad Institute's recommendations:
A gentle introduction to the world of bioinformatics software:
First encounter with the most used sequence aligner.
$ bwa -h
[main] unrecognized command '-h'
$ bwa --help
[main] unrecognized command '--help'
$ bwa -help
[main] unrecognized command '-help'
$ bwa help
[main] unrecognized command 'help'
$ bwa WTF!
[main] unrecognized command 'WTF!'
$ bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.10-r789
Contact: Heng Li <lh3@sanger.ac.uk>
Usage:   bwa <command> [options]
Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         ...
Don't assume anything:
$ samtools index some-non-existing-file.bam
[E::hts_open] fail to open file 'some-non-existing-file.bam'
$ echo $?
0
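When a tool reports success on failure, as above, one defensive pattern is to check for the expected output in addition to the exit status. A minimal sketch using only the OCaml standard library; `run_checked`, `describe`, and the error tags are hypothetical names introduced for illustration, not part of any tool shown in this talk:

```ocaml
(* Sketch: do not trust the exit status alone; also verify that the
   expected output file exists. [run_checked] is a hypothetical helper. *)
let run_checked ~expected_output command =
  match Sys.command command with
  | 0 when Sys.file_exists expected_output -> Ok ()
  | 0 -> Error (`Missing_output expected_output)
  | code -> Error (`Exit_code code)

(* Turn the outcome into a message for a log or a report. *)
let describe = function
  | Ok () -> "success"
  | Error (`Missing_output path) -> "exit status 0, but no output at " ^ path
  | Error (`Exit_code code) -> "failed with exit status " ^ string_of_int code
```

For an indexing step, `expected_output` would be whatever index file the tool is supposed to leave behind.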
GATK partying like it's Windows 3.1:
##### ERROR MESSAGE: Couldn't read file some_bam.target-intervals because
The interval file some_bam.target-intervals does not have one of the
supported extensions (.bed, .list, .picard, .interval_list, or .intervals).
Please rename your file with the appropriate extension. If
some_bam.target-intervals is NOT supposed to be a file, please move or
rename the file at location some_bam.target-intervals
This variant caller did something:
$ ./somaticsniper -h
./somaticsniper: invalid option -- 'h'
Unrecognizd option '-?'.
Need to run 100s of tools, make parameters vary, over 1000s of samples, and
A lot of them.
make, etc.)
So, yes, a new one, with these goals:
with a sane implementation language: OCaml.
Software kills people.
We are pathetically 40 years behind on the safety/security front.
OCaml is:
“Keep Track of Experimental Workflows”
github.com/hammerlab/ketrew
In 0.0.0:
nohup/setsid, or “Python” backends
Coming next:
Tested with bioinformatics pipelines, but also running backups, building documentation …
let run_command_with_lsf ~queue cmd =
  let open Ketrew.EDSL in
  let host =
    Host.parse "ssh://MyLSFCluster/pathto/ketrew-playground/?shell=bash" in
  target "run_command_with_lsf"
    ~make:(lsf (Program.sh cmd)
             ~queue ~wall_limit:"1:30" ~processors:(`Min_max (1,1)) ~host)
let fail_because_of_condition ~host =
  let open Ketrew.EDSL in
  let make_target ~cmd =
    target ~make:(daemonize ~using:`Python_daemon Program.(sh cmd) ~host) in
  let target_with_condition =
    let impossible_file = file ~host "/some-inexistent-file" in
    make_target "Failing-target"
      ~cmd:"ls /tmp" ~done_when:impossible_file#exists
  in
  make_target "Won't-run because of failed dependency"
    ~cmd:"ls /tmp" ~dependencies:[ target_with_condition ]
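The semantics of the example above (a target fails when its condition does not hold, and dependents then do not run) can be sketched with a toy model. This is an illustration under simplifying assumptions, not Ketrew's implementation; the `target` record, `outcome` type, and `execute` function are hypothetical:

```ocaml
(* Toy model, unrelated to Ketrew's real implementation: a target succeeds
   only if its dependencies succeeded, its command exits with 0, and its
   [done_when] condition holds afterwards. *)
type target = {
  name : string;
  run : unit -> int;          (* returns an exit status *)
  done_when : unit -> bool;   (* post-condition to verify *)
  dependencies : target list;
}

type outcome = Succeeded | Failed of string

let rec execute target =
  (* Stop at the first dependency that does not succeed. *)
  let failed_dependency =
    List.find_opt (fun dep -> execute dep <> Succeeded) target.dependencies
  in
  match failed_dependency with
  | Some dep -> Failed ("dependency failed: " ^ dep.name)
  | None ->
    if target.run () <> 0 then Failed "command failed"
    else if not (target.done_when ()) then Failed "condition not met"
    else Succeeded
```

A real engine would also avoid re-running shared dependencies; this sketch does not.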
State machine encoded with sub-typing
module Starting : sig
  type t = [
    | `Starting of Building.t History.t
    | `Tried_to_start of (t History.t * Run_bookkeeping.t)
  ]
  ...
module Running : sig
  type t = [
    | `Started_running of (Starting.t History.t * Run_bookkeeping.t)
    | `Still_running of (t History.t * Run_bookkeeping.t)
  ]
  ...
module Killable_state : sig
  type t = [
    | `Passive of Log.t
    | `Active of (Passive.t History.t * ...)
    | `Building of Active.t History.t
    | `Still_building of Building.t History.t
    | `Starting of (Building.t) History.t
    ...
  ]
  ...
⇒ Pure/total/exceptionless
val transition: state -> (do_some_io * some_io_result update_state)
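A pure transition function of this shape can be sketched on a much smaller state machine. The states, `io_request`, `io_result`, and `step` below are hypothetical simplifications, not Ketrew's actual types; the point is only that `transition` describes the I/O to perform and how to update the state from its result, without performing anything itself:

```ocaml
(* Toy model, not Ketrew's actual types. [transition] is pure and total:
   it returns the I/O to perform plus a function from the I/O result to
   the next state. *)
type state = [
  | `Passive
  | `Active
  | `Running of int   (* number of status checks so far *)
  | `Finished
  | `Failed of string
]

type io_request = [ `Nothing | `Start_process | `Check_process ]
type io_result = [ `Ok | `Still_going | `Error of string ]

let transition (s : state) : io_request * (io_result -> state) =
  match s with
  | `Passive -> `Nothing, (fun _ -> `Passive)
  | `Active ->
    `Start_process,
    (function `Ok | `Still_going -> `Running 0 | `Error e -> `Failed e)
  | `Running n ->
    `Check_process,
    (function
      | `Ok -> `Finished
      | `Still_going -> `Running (n + 1)
      | `Error e -> `Failed e)
  | (`Finished | `Failed _) as final -> `Nothing, (fun _ -> final)

(* The impure part lives in [perform], supplied by the caller. *)
let step ~perform s =
  let request, update = transition s in
  update (perform request)
```

Keeping `perform` outside makes the state machine testable with a fake I/O function.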
We have open-sourced hammerlab/biokepi.
type _ t =
  | Fastq_gz: File.t -> fastq_gz t
  | Fastq: File.t -> fastq t
  | Paired_end_sample: string * fastq t * fastq t -> fastq_sample t
  | Single_end_sample: string * fastq t -> fastq_sample t
  | Gunzip_concat: fastq_gz t list -> fastq t
  | Concat_text: fastq t list -> fastq t
  | Bwa: bwa_params * fastq_sample t -> bam t
  | Gatk_indel_realigner: bam t -> bam t
  | Picard_mark_duplicates: bam t -> bam t
  | Gatk_bqsr: bam t -> bam t
  | Bam_pair: bam t * bam t -> bam_pair t
  | Mutect: bam_pair t -> vcf t
  | Somaticsniper: [ `S of float ] * [ `T of float ] * bam_pair t -> vcf t
  | Varscan: [`Adjust_mapq of int option] * bam_pair t -> vcf t
Typed bioinformatics pipelines!
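To see what the type index buys, here is a self-contained, much reduced sketch of the same idea. The constructors and the `to_string` function are hypothetical simplifications and not Biokepi's API; each node carries the kind of file it produces, so a variant caller applied to the wrong input is a compile-time error:

```ocaml
(* Simplified, hypothetical cut of the idea: the type parameter tags each
   node with the kind of file it produces. *)
type fastq = Fastq_tag
type bam = Bam_tag
type vcf = Vcf_tag

type _ t =
  | Fastq : string -> fastq t
  | Bwa : fastq t -> bam t
  | Mark_duplicates : bam t -> bam t
  | Bam_pair : bam t * bam t -> (bam * bam) t
  | Mutect : (bam * bam) t -> vcf t

(* Render any node as a readable description of the steps leading to it. *)
let rec to_string : type a. a t -> string = function
  | Fastq path -> path
  | Bwa input -> "bwa(" ^ to_string input ^ ")"
  | Mark_duplicates bam -> "mark_dups(" ^ to_string bam ^ ")"
  | Bam_pair (normal, tumor) ->
    "pair(" ^ to_string normal ^ ", " ^ to_string tumor ^ ")"
  | Mutect pair -> "mutect(" ^ to_string pair ^ ")"

let example : vcf t =
  let align path = Mark_duplicates (Bwa (Fastq path)) in
  Mutect (Bam_pair (align "normal.fastq", align "tumor.fastq"))

(* Rejected at compile time: Mutect expects a pair of BAMs, not a FASTQ.
   let wrong = Mutect (Fastq "normal.fastq") *)
```

A compiler from such a tree to workflow targets would follow the same recursive shape as `to_string`.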
let pipeline_example ~normal_fastqs ~tumor_fastqs ~dataset =
  let open Biokepi_pipeline.Construct in
  let normal = input_fastq ~dataset normal_fastqs in
  let tumor = input_fastq ~dataset tumor_fastqs in
  let bam_pair ?gap_open_penalty ?gap_extension_penalty () =
    let normal =
      bwa ?gap_open_penalty ?gap_extension_penalty normal
      |> gatk_indel_realigner in
    let tumor =
      bwa ?gap_open_penalty ?gap_extension_penalty tumor
      |> gatk_indel_realigner in
    pair ~normal ~tumor in
  let bam_pairs = [
    bam_pair ();
    bam_pair ~gap_open_penalty:10 ~gap_extension_penalty:7 ();
  ] in
  let vcfs =
    List.concat_map bam_pairs ~f:(fun bam_pair -> [
        mutect bam_pair;
        somaticsniper bam_pair;
        somaticsniper ~prior_probability:0.001 ~theta:0.95 bam_pair;
        varscan bam_pair;
      ]) in
  vcfs
Compiled to more than 3000 Ketrew targets, JSON-pipeline, …
Runs for days, then posts to hammerlab/cycledash
Thanks! Questions?
Will tweet link to the slides: @smondet.
Next Lucrative Jacket gig: March 18th, 8pm, at the Trash Bar in Brooklyn.
utop> IO.read_file;;
- : string ->
    (string, [> `IO of [> `Read_file_exn of string * exn ] ]) Deferred_result.t
utop> IO.write_file;;
- : string -> content:string ->
    (unit, [> `IO of [> `Write_file_exn of string * exn ] ]) Deferred_result.t
utop> System.with_timeout;;
- : float ->
    f:(unit ->
       ('a,
        [> `System of [> `With_timeout of float ] * [> `Exn of exn ]
         | `Timeout of float ] as 'error) Deferred_result.t) ->
    ('a, 'error) Deferred_result.t
utop> let dumb_copy_with_timeout ~seconds ~src ~dest =
  System.with_timeout seconds ~f:(fun () ->
    IO.read_file src
    >>= fun content ->
    IO.write_file dest ~content
  );;
val dumb_copy_with_timeout :
  seconds:float -> src:string -> dest:string ->
  (unit,
   [> `IO of [> `Read_file_exn of string * exn
              | `Write_file_exn of string * exn ]
    | `System of [> `With_timeout of float ] * [> `Exn of exn ]
    | `Timeout of float ])
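The way the error cases accumulate in the inferred type can be reproduced without any deferred/asynchronous machinery. The sketch below uses the standard `result` type and `Result.bind` (OCaml 4.08 or later) in place of `Deferred_result`; `find_setting`, `parse_port`, `port_of`, and the error tags are hypothetical names chosen for the example:

```ocaml
(* Sketch with the standard [result] type instead of the deferred one on
   the slides. Each function declares only its own error case. *)
let ( >>= ) = Result.bind

let find_setting key settings =
  match List.assoc_opt key settings with
  | Some value -> Ok value
  | None -> Error (`Missing_setting key)

let parse_port text =
  match int_of_string_opt text with
  | Some port -> Ok port
  | None -> Error (`Not_a_port text)

(* The compiler infers the union of both error cases:
   [> `Missing_setting of string | `Not_a_port of string ] *)
let port_of settings = find_setting "port" settings >>= parse_port

(* Handling the errors: omitting a case triggers an exhaustiveness
   warning for that error tag. *)
let explain = function
  | Ok port -> "port " ^ string_of_int port
  | Error (`Missing_setting key) -> "no setting named " ^ key
  | Error (`Not_a_port text) -> text ^ " is not a port number"
```

No exception is raised anywhere in this chain; every failure mode shows up in the type.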
Pain:
Pleasure:
Big Windows-oriented runtime but most importantly: null pointers
PS: funny that F# is at Compose but not Scala …
Haskell: unsafePerformIO anyways + less “hackable”
It uploads reports to AWS without your consent.
-et,--phone_home Run reporting mode (NO_ET|AWS| STDOUT)
-K,--gatk_key GATK key file required to run with -et NO_ET
Done with Omd and MPP generating an HTML + Reveal.js presentation.
Sat, 31 Jan 2015 11:22:12 -0500
451763c-dirty