Genomic Variant Data Ingest and Manipulation using Glow

– Hi, everybody. My name is Denny Lee, and welcome to our Data and AI online meetup, right after Labor Day. Today's session is Genomic Variant Data Ingest and Manipulation using Glow. Glow is an open source toolkit for population genetics, and the project is an industry collaboration between Databricks and the Regeneron Genetics Center. In today's talk we'll focus on how biobank-scale genomic variant and annotation data can be easily ingested from legacy flat file formats into Spark DataFrames and stored as Delta tables. We will then show that Glow provides multiple functions and transformers to easily perform variant QC and data manipulation at scale. We're going to have a pretty full session today, so please ask all of your questions in the Q&A panel and we'll do our best to answer them live. If you're on YouTube Live, ask your questions in the live chat, and again we'll do our best to answer them. So, my name is Denny Lee, and I'll be the host for today's Data and AI online meetup. I'm a developer advocate at Databricks and have been working with Apache Spark since 2007. Prior to Databricks, I was at Microsoft and SAP Concur, and I also happen to have a background in healthcare and life sciences, with a masters in biomedical informatics. I'm really happy to have Amir and Kiavash presenting today's session, and I'll let them introduce themselves. Let's start with Amir, as he's going to be presenting the first section of the session.
– Thank you so much, Denny. Hi, everyone. This is Amir Kermany. I'm a solution architect here at Databricks; I've been here for two years now, to date. Prior to Databricks I worked as a senior data scientist in the big data and e-commerce space, but before that I was among the first data scientists at AncestryDNA when we were just reaching a million samples. My training is in mathematical biology and statistical genetics, and before that I was a postdoc at the Howard Hughes Medical Institute focusing on population genetics. I'm really happy to be here.
– Hi everyone. I'm Kiavash Kianfar. I'm a senior software engineer at Databricks, working in the health and life sciences team and focusing on the genomics runtime at Databricks and project Glow. Along with my colleagues on the engineering team, we have developed project Glow in collaboration with the Regeneron Genetics Center. Before joining Databricks I was an associate professor at Texas A&M University, where my teaching and research focused on algorithm development and optimization related to bioinformatics, and I joined Databricks to continue on that path in an industrial setting. Happy to be with you, and I guess back to Amir to start the presentation.
– Thank you, Kiavash. Okay, so let me just start sharing my screen; that's the fun part of this, starting the presentation. Right. What we are going to cover today: in the beginning, I will start with a refresher for those who are not familiar with project Glow, provide some background about where it comes from, and explain our philosophy in developing yet another bioinformatics and computational biology tool. I hope that at the end of it you have an appreciation of why this is actually a very different approach to doing genomics and working with genomics data. Then for a few minutes I will do a live demo, and then I'll hand it over to Kiavash, who is the person who has been most heavily involved

in developing the actual library and the tools, and he will go into more of a deep dive. So, first things first: you might have seen this graph about why we're talking about genomics at scale and why scale is an issue now. The number one thing, just looking at the history of the cost of sequencing from the first draft of the human genome to the current day: we see a dramatic drop in cost, and it is now hovering around a thousand dollars per genome. As a result, you have a lot of initiatives to sequence individuals, because there's a lot of health-related value in this data, and we'll talk about what that value is. But from a purely technical perspective, when you look at the size of the data, it's pretty fascinating. Just as a ballpark, we think that by 2025 the amount of genomic data sequenced each year will be on the order of 20 times more than all the video uploaded to YouTube. Purely from a technical perspective, this signals that we have a scale and big data problem. You're also seeing a lot of biobank initiatives around the world. Famously, the UK Biobank has more than half a million whole exome sequences along with a lot of phenotypic data, more than 20,000 phenotypes derived from EHR and EMR systems. Regeneron, a couple of months ago, surpassed a million samples. You have the FinnGen project, and on the commercial side, my old company AncestryDNA; I haven't really followed up, but the last time I heard they had more than 16 million samples, and similarly 23andMe has more than 10 million. Although these are array data, they also have the capability to switch to whole genome sequencing, as AncestryDNA is actually doing with the assets it holds. And why do we have all of these data sets, all of these initiatives? Well, you now have the heavy hitters of genomics heading up these efforts and building these amazing research centers. And this is something I copied from Twitter: 23andMe just recently published a really interesting association study looking at genetic and non-genetic factors influencing COVID. All of this is possible because, for complex diseases in particular, you need a lot of data to get anything significant. For example, when you're running association studies, you need a lot of samples to get any significant results. And we anticipate this will keep growing; you will have a lot more data sets such as this one. So, as I mentioned, this is just the beginning: we have a problem with scale. But scale can be viewed from two sides. One side of solving the scale problem is the machines side, right?
We have big data, and you can address it with better hardware, faster algorithms, and distribution. This is where Spark shines, distributing the work and abstracting away all of these algorithmic complexities from the user. The other side of scale is the people side. As I mentioned, with the UK Biobank, for example, you have the genomics side, with a lot of people working on sequencing and genomics data using specialized tools historically developed within the bioinformatics community. But you also have other data sets. For example, I mentioned electronic medical records, where you run a lot of natural language processing workloads; you also have image data, extracting features from images, whether that is medical imaging or text recognition from a doctor's handwritten note. You can extract a lot of phenotype data that way, but it falls within what we traditionally call data science or ML workloads. So in order to facilitate collaboration between teams,

we need a platform that addresses both the technical challenge of scale and the people side, so that people can work together with the same tools. This is true regardless of comp bio; when you look at the history of Spark, it came about as a response to exactly this. It is a unified analytics engine designed for big data processing, supporting streaming, SQL, machine learning, graph processing, and more. It originally started as a research project within UC Berkeley's AMPLab, and AMP stands for algorithms, machines, and people; that is the idea, looking at scale from both the machines side and the people side. In 2009 the team developed Spark, and they later donated it to the Apache Software Foundation, so it is an open source project. I was just looking at some reports: it's second only to React in terms of popularity among open source projects. There is a huge community around Spark, predominantly because of its performance and its support for multiple languages, so you can do end-to-end work within the Spark framework. As I mentioned, it supports Scala, Java, Python, R, and SQL, so different personas can work in the same space; collaboration is a reason as well. In terms of contributors to the project, there are more than 1,500 (Denny has the up-to-date stats on this), and more than 200 companies are contributing. So we identified that this is a really good ecosystem; it is basically the de facto platform for doing big data and machine learning. So, where this is going: now that we are looking at the bioinformatics space, as I mentioned, we have a big data issue, a problem of scale. What is the best tool out there for addressing scale, both in terms of being an adopted ecosystem that people are already working with, and being something that can address machines and scalability?
We look at Spark. So, the Glow project, which we announced last year at ASHG, is developed in partnership with the Regeneron Genetics Center. It is built on top of Spark and supports the native Spark APIs, so you don't have to go beyond what is built on Spark DataFrames. The same way that you would query any table, and as you will see in the ingest part, the same way that you would query any data set, you query genomics data sets. And because this is all within the DataFrame API, it's really easy to merge genomics and phenotypic data: you can use Spark to do feature extraction from your EHR, EMR, or image data, and then merge that with the genomic data. Also, as Kiavash will go through, we have considered use cases where you want to bring your own functionality and just use the distributed power of Spark for it. So that is what we are going to talk about. In terms of ingest, how do I start working with Glow? Suppose that we are starting from a VCF file, the traditional genomics format, or from BGEN or PLINK; your data sets arrive as flat files in those formats, and you want to ingest them into a Spark DataFrame. This can be done easily: the same way that you would say spark.read.format with, let's say, CSV or XML, you can just write VCF, BGEN, or PLINK and specify the path where your data is. Keep in mind that we have codecs built in, so you don't have to specify anything about the compression; it can be gzip, bgzip, or just a flat VCF.
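As a rough sketch of what that ingest looks like in PySpark (the file paths are placeholders, and glow.register is assumed to have been called on the Spark session first):

```python
import glow
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
glow.register(spark)  # registers Glow's data sources and SQL functions with this session

# Hypothetical paths; gzip- and bgzip-compressed files are handled automatically
vcf_df = spark.read.format("vcf").load("/data/genomics/my_study.vcf.gz")
bgen_df = spark.read.format("bgen").load("/data/genomics/my_study.bgen")
plink_df = spark.read.format("plink").load("/data/genomics/my_study.bed")

vcf_df.printSchema()  # one row per variant, with a genotypes array column
```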

Now that you've ingested the data, the next thing is the schema that gets assigned to it. The idea of a schema is familiar from any data work: you have a table with some columns, and you know the types of those columns, whether they are strings, arrays, and so on. When you ingest a VCF file, in the DataFrame that you get, every row corresponds to a variant, be it an indel or a SNP, and you have all the attributes of that variant: the physical location (contig, start, and end), the reference and alternate alleles, and all the INFO fields that you see in a VCF file, ingested as structured, separate columns. You also have a column corresponding to the genotypes. The genotypes column is an array of structs; you can think of them like dictionaries in Python. For example, when you ingest the 1000 Genomes Project data, the genotypes array in every row has about 2,500 elements, each corresponding to a sample, which includes the sample ID and the called genotypes, and if the data is phased, the phasing information is preserved as well. This is an example of ingesting the data: you have the chromosome name, or contigName, also start and end, the allele arrays and IDs, and you can go further and look at the genotypes values, which, as I mentioned, include your sample IDs and calls. We have a lot of options; you can, for example, exclude sample IDs if you don't need them (the order is preserved), and there are other options you can look at in the docs, which Kiavash will talk about in more detail. The other part is, now that you have ingested the data, you want to do some analysis on it. The way we look at the analysis of genomics data is like any other data analysis: you have a grouping of some data points and then you want to run an aggregation. This grouping can be, in this case, per variant. We have some basic summary statistics functionality included in Glow that runs per variant, in this case call_summary_stats. You call that on the genotypes column, and what you get as an output is, for every row, the allele frequencies, the number of homozygotes, the number of heterozygotes, and so on. It's a similar pattern for other functions. For example, if you want to compute the Hardy-Weinberg equilibrium p-values and exclude the variants that are not in Hardy-Weinberg equilibrium, you can just apply that function. This is functionality we are constantly adding to, but the whole idea is that you call a statistical genetics routine the same way that you would call any aggregate function in Spark SQL. Similarly, we have sample-level summary stats: for example, for each individual you can count homozygous or heterozygous calls, or calculate inbreeding coefficients. This is done with the sample-level summary statistics, and then you can do your own analysis the same way. By the way, keep in mind that if you are familiar with pandas for data engineering, you can do some initial analysis using the Spark APIs but also convert your data set into a pandas-like DataFrame using the Koalas API, a related open source project. The whole idea is that you can still write pandas code, but under the hood it runs on Spark. Okay?
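To make that concrete, here is a minimal sketch of per-variant QC on an ingested VCF DataFrame, using the SQL function names that glow.register exposes (equivalent Python bindings exist as well). The struct field names such as nHet and pValueHwe, and the thresholds, are illustrative and worth verifying with printSchema on your Glow version:

```python
from pyspark.sql.functions import col, expr

# Per-variant summary statistics and Hardy-Weinberg test, expanded into flat columns
variant_qc = vcf_df.select(
    "contigName", "start", "end",
    expr("expand_struct(call_summary_stats(genotypes))"),  # allele frequencies, het/hom counts, ...
    expr("expand_struct(hardy_weinberg(genotypes))")        # Hardy-Weinberg stats, e.g. pValueHwe
)

# Illustrative filter: keep variants with enough heterozygotes and a reasonable HWE p-value
filtered = variant_qc.where((col("nHet") >= 10) & (col("pValueHwe") >= 1e-6))
filtered.show(5)
```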
So now that you have done all this analysis, the next thing you want to do is write it out. Say you have ingested the 1000 Genomes data or the UK Biobank, you did some QC, some filtering, took a subset of the data, and you want to write it back out as, let's say, a VCF file; maybe you ingested BGEN and you want to write VCF. If it's a smaller data set that fits in one file, you can use the bigvcf format, which will output a single, traditionally formatted VCF file.

Alternatively, if it's a bigger data set, you can shard it, the same way that you get sharded Parquet or CSV output, just by writing with the vcf format. However, one of the big advantages of Glow, in my opinion, is its native integration with the Spark ecosystem. If you've ingested the data set into a Spark DataFrame, you can easily write it into Delta Lake. Delta Lake is something Kiavash will talk about more, but the whole idea is that it's the next-generation storage layer on Spark that brings a lot of performance and, more importantly, the reliability of data warehousing, with ACID transactions and all of that good stuff, to your data lake. Scientists very often have all their data in a data lake but without much governance around it: for example, there is no data versioning, and it's not easy to add. If you write your data in Delta, versioning comes with it. You have Delta versions and you can go back and query a specific snapshot of your data set. So you don't worry about it: you constantly get updates from the UK Biobank and write into the same Delta table, but you can still query any specific freeze of the data and do your analysis. This is super important for reproducibility of results, because it becomes very easy to know exactly which data was used for, say, this GWAS that we ran. Finally, I encourage you to check these out: we have a Slack channel for the Glow project, we have a Google group, and this is the GitHub page. So, I've quickly talked about what Glow is about; now let me quickly go through a live demo. When you have live demos there's always a risk, but I'll take my chances. I'm running this on the Community Edition of Databricks. Keep in mind, you don't have to run Glow on Databricks; it's open source and the documentation is geared towards your own installation. But there is the ease of using Databricks, and the Community Edition is accessible and free, so you can just sign up and start playing around.
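Before the demo, here is a rough sketch of those write paths (output locations are placeholders, and the Delta time-travel read assumes the table already has a version 0):

```python
# Write the ingested (or QC-filtered) DataFrame as a single, traditional VCF file;
# compression is inferred from the extension
vcf_df.write.format("bigvcf").save("/output/study.vcf.bgz")

# Sharded VCF, one file per partition, analogous to sharded Parquet or CSV output
vcf_df.write.format("vcf").save("/output/study_sharded_vcf")

# Delta Lake table: ACID transactions plus versioning for reproducibility
vcf_df.write.format("delta").mode("overwrite").save("/delta/study_variants")

# Later, reproduce an analysis against a specific snapshot ("freeze") of the table
snapshot = spark.read.format("delta").option("versionAsOf", 0).load("/delta/study_variants")
```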
So, the first thing I do: I'm running Glow on the Genomics Runtime. This is a Python library, so I import glow and register it. As you can see on my cluster, I'm running the Genomics Runtime, the latest runtime on Databricks; we are at 7.2. The Databricks deployment actually has databricks-datasets included, so we have VCF files from 1000 Genomes, just chromosome 22, to play around with and look at the data. So what you're seeing here is that I give it the path to this first file, which is a VCF file in gzip format. Then, maybe I'd better turn on the magnifier, I ingest it into a Spark DataFrame: I just use spark.read.format, say vcf, and then load the path. Now that the path is loaded, you see it gives the schema that I want: every row corresponds to a contig or chromosome, a start, and all of that. If you want to visualize it within Databricks, I use display, which Databricks provides for any Spark DataFrame, or you can just say df.show, and it shows you a snapshot of your data. As you see, we have all of these INFO fields expanded into columns; what I include here, for example, are the African allele frequencies and other values of interest that you may want to see.

So next, now that I have this, say I want to do some basic analysis. As I mentioned, you can just call select, which is a Spark SQL command. I'm going to select some of the columns; for simplicity, let's say I take the start position, and I also want to calculate the allele frequencies. To do that I use a utility function within Glow called expand_struct, which opens up a struct so that every key appears as its own column, and then I use call_summary_stats on the genotypes column. Now this is defined and we can take a look at it. By the way, if you're new to Spark, sometimes you see that a command like this finishes in a split second and you think, "wow, this is really fast." It's not, actually; what you're seeing is that Spark just lays out the execution plan, and later, when you call an action on this DataFrame, that is when it actually executes. So now I say display, and this is where the execution actually happens and the Spark jobs run. Now you see that I have start and end and all these values, for example the number of heterozygotes, that I want. From this I can just go ahead and write a query; actually, filter is an easier way of doing it, selecting all rows where the number of heterozygotes is less than 10, for example, right? These are the rows that come out. This is to give you an idea of how you can incorporate your traditional genomics workloads into your analysis alongside anything else. The last thing I want to talk about: the same way that you look at genomics data sets, let's say I want to get some phenotype data. (Amir speaking indistinctly) I have that, okay. Maybe I'm not looking at the right thing; let me just try it again here. That should be right. Okay, here you go. Within the same databricks-datasets, there are simulated electronic health records. Let's say I want to get the conditions of patients, and hypothetically imagine that these are the same patients that we have sequence data for. I can do the same thing, spark.read.format("csv"); this time the format is CSV. For your information, Spark actually has a dedicated shorthand API for CSV, but to give you the idea, it's the same as reading a VCF. I set "header" to True and then load the path. So now what I have is all the conditions, right? If I want to combine genotype and phenotype data, assuming the sample IDs here were the same as the 1000 Genomes sample IDs, I would simply call join on the EHR DataFrame with the other one, joining on the sample ID column; basically, it's a SQL join. Okay. So if you are interested, you can download this notebook, import it into your own Community Edition, and start playing with it; this will be the link that you can use to get the data.
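A minimal sketch of that genotype/phenotype join, assuming a hypothetical conditions CSV whose patient identifier matches the VCF sample IDs; the path and the PATIENT column name are illustrative, while sampleId is the field name in Glow's genotypes struct:

```python
from pyspark.sql.functions import col, explode

# Hypothetical simulated-EHR conditions table; the header row supplies column names such as PATIENT
ehr_df = (spark.read.format("csv")
          .option("header", True)
          .load("/data/ehr/conditions.csv"))

# Explode the genotypes array so each row is one (variant, sample) pair, then join on the sample ID
per_sample = vcf_df.select("contigName", "start", explode("genotypes").alias("gt"))
joined = per_sample.join(ehr_df, col("gt.sampleId") == col("PATIENT"))
```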

On that note, I'm going to pass this to Kiavash. Thank you very much.
– Thank you, Amir. I think, Amir, you have to unshare your screen.
– Yes, sorry about that.
– [Kiavash] Okay, great. Awesome. So I'm going to pick up where Amir left off and talk a bit more about genomic variant annotation ingest into Spark DataFrames using Glow, and then follow up with an overview of the genomic variant manipulation capabilities that we have designed into Glow. If you're familiar with bioinformatics, I think many of the capabilities we have incorporated here will be of interest to you because, as Amir mentioned, our motivation is to bring virtually unlimited scalability to these operations, so that they can be done easily on biobank-scale variant and patient data sets. Just to complete our data ingest picture: Amir talked about ingesting variant data from VCF, BGEN, or PLINK format into Spark DataFrames, and here I want to mention that in Glow we also have genomic annotation data ingest from the GFF3 format. If you have worked with genomic annotations, you know that GFF3 is one of the most common formats for storing them, and we have a data source in Glow that can similarly be used to ingest the annotation data describing different parts of a genomic sequence into a Spark DataFrame. Just to emphasize again, our philosophy in Glow is that we support data ingest from the traditional flat file formats, like VCF, BGEN, and PLINK, get the information out of these formats, and put it into the modern data engineering and database structures that data scientists and data engineers use, so that it can be easily queried and manipulated. When you ingest this data into a Spark DataFrame, you can work with it as easily as any other data set using Spark commands, and at the same time you can store your DataFrames in Delta tables. Delta is a project open-sourced by Databricks; you can refer to it at this link and learn more, but essentially it brings a lot of optimizations, performance, and new features to the Spark ecosystem for the storage, retrieval, and querying of data, including ACID transactions and data versioning. So you can have different versions of your variant or annotation data over time and go back and forth between them, and this can be done very seamlessly with very efficient APIs. Today I wanted to do a bit more of a deep dive into the features that Glow has for the ingest of annotation data and for variant data manipulation. Annotation data can be ingested into Spark DataFrames, and the details are given in a blog post

that is linked here, and you can find it on the Glow documentation website if you go to projectglow.io. Once I have the variants and I have the annotations, the job of annotating my variants is just as easy as doing a join between two database tables, and those details are explained in the blog. I just want to talk a bit about the ingest of annotations before moving on to the manipulations. The GFF3 format, if you're familiar with it, is a simple tab-delimited text file with nine columns. You can read about the definitions of the first eight columns on the GFF3 specification website, but the last column, attributes, contains the gist of the annotations. For a particular sequence, it describes the different properties that the sequence has, like its ID, its parents, and many other things you can annotate; gene ID and transcript ID are examples, and so on. We have designed this GFF reader to be quite versatile. Essentially, it loads the file as a Spark DataFrame with a schema as follows: the first eight columns appear as regular fields in your DataFrame, but for the last column, instead of taking that semicolon-separated text string into a single column, we parse it and generate one column for each of the tags that appear in the attributes column. This way we create a table that has all of these tags as separate columns, and every feature that carries a particular tag is populated accordingly. Some of these tags are defined in the GFF3 standard as the so-called official fields, and we retrieve them with the proper types; even if they are array-valued, we retrieve them as arrays. You can have your own fields as well, which we simply parse as string columns and add to the DataFrame. I can quickly show this to you in a demo notebook. I don't want to go into all the details, but here I have defined the path to a particular GFF; this is the genomic annotation for the human genome, downloaded from the RefSeq website, and loading it is as easy as saying spark.read.format("gff") and loading the path. Once you load it, you can look at the schema of the DataFrame with the printSchema command, and you can see that the base fields are here, any of the official fields that are parsed and exist in the file are here, and then there are many other fields specific to this particular GFF that are also parsed, with the string type, and added here. And if you look at the DataFrame, you can see all the fields that have been parsed. What is good about this? The good thing is that you can easily query this with any of the Spark APIs you like, on any of these columns, at scale, and you have it right at hand. You can join this DataFrame with your variant DataFrame and annotate your variants as well.
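A minimal sketch of the annotation ingest and a variant-to-gene range join, assuming a hypothetical local copy of a RefSeq GFF3 file and the vcf_df from earlier. The column names follow Glow's GFF schema as I recall it (check printSchema on your data), and contig naming conventions may need harmonizing before the join matches anything:

```python
from pyspark.sql.functions import col

# Load the annotation file; compressed GFF3 is handled by the data source as well
gff_df = spark.read.format("gff").load("/data/annotations/GRCh38_latest_genomic.gff.gz")

# Keep gene features, then annotate variants by overlapping coordinates on the same contig
genes = gff_df.where(col("type") == "gene").select("seqId", "start", "end")
annotated = vcf_df.alias("v").join(
    genes.alias("g"),
    (col("v.contigName") == col("g.seqId")) &
    (col("v.start") >= col("g.start")) & (col("v.end") <= col("g.end")),
    "left"
)
```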

So, back to our presentation: Glow, in addition to data ingest, offers several APIs to manipulate variant data. Amir talked a bit about the variant quality control APIs that we have; we also have sample quality control APIs. We support coordinate and variant liftOver with just a single transformer or function, we can do variant normalization at scale using Glow, and we can do the splitting of multiallelic variants, which is normally done as a precursor to many downstream genomic analyses. And we even have a very cool feature: if you have a custom bioinformatics command-line tool, we support parallelizing it using Glow with something we call the pipe transformer, which I will talk about briefly. One thing I want to mention is that most of the APIs in Glow come in one of two forms. They are either designed as Spark SQL functions, which are applied to columns of a Spark DataFrame (you take particular columns and give them as input to the function), or they are transformers. The concept of a transformer is that you provide a transformer name, an input DataFrame, and some options, and what you get is an output DataFrame that is the result of applying that transformation to the input DataFrame. Please keep these two forms in mind as we talk about these features in the rest of this presentation. Let's talk a little about the quality control APIs in Glow. These were covered by Amir, so I'm not going to spend much time on them, but they are of the SQL function type. If you ingest a VCF with Glow, you get a DataFrame with these columns (there are many other fields as well, typically the INFO fields of the VCF), and each of these functions computes statistics for each variant. They are usually applied to the genotypes column, which holds the bulk of the information in the VCF: the sample IDs, the calls, and the allele depths for each sample. These functions are applied across all the samples of a particular variant, and that's why we call them variant quality control. Similarly, we have sample quality control functions, which include sample_call_summary_stats, sample_dp_summary_stats, and sample_gp_summary_stats; you can read more about them in the documentation. Here I have just summarized the inputs: you can see that this one takes three columns of your DataFrame, the genotypes, the reference allele, and the alternate alleles, as input, and produces another column as output, a struct column that includes all of these statistics. Sample quality control is applied across all variants for a single sample, so these functions work on sample 1, for example, across all variants and give you its statistics, then the same for sample 2, sample 3, sample 4, and so on. There are example notebooks for this on the Glow documentation website, which you can refer to, and they are similar to what Amir showed you a few minutes ago.
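A short sketch of the sample-level QC call described above, in its SQL function form; referenceAllele and alternateAlleles are the column names in Glow's VCF schema, and the exact shape of the output struct is best confirmed with printSchema:

```python
from pyspark.sql.functions import expr

# Aggregates across all variants and returns one array of per-sample statistic structs
sample_qc = vcf_df.select(
    expr("sample_call_summary_stats(genotypes, referenceAllele, alternateAlleles) as stats")
)

# Explode the array to get one row per sample for inspection
sample_qc.selectExpr("explode(stats) as per_sample").show(truncate=False)
```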

Now, liftOver is another operation that bioinformaticians usually perform before starting to work on the data, if the data is not on the right genome assembly. LiftOver, as you know, is the process of converting genomic coordinates from one reference assembly to another, for example, for the human genome, from assembly GRCh37 to GRCh38. We have parallelized liftOver functionality embedded in Glow, so you can do this with a single line of code; in fact, we have both a function and a transformer for it. In the function form, you have a Spark DataFrame and the function is lift_over_coordinates, which is based on the UCSC LiftOver tool, the standard tool usually used for liftOver. The output is a struct of the lifted coordinates: contigName, start, and end. We also have a lift_over_variants transformer, which is based on the Picard LiftoverVcf tool, if you have worked with that. What's the difference? The difference is that it's scalable: the algorithm is based on that tool, but you can apply it to gigabytes and terabytes of variant data. It is used with the transform paradigm: lift_over_variants is the name of the transformer you call, you give it an input DataFrame, and some options. The options here are a chain file and a reference file. If you have done liftOver you're familiar with chain files: a chain file maps coordinates from one particular genome assembly to another. The reference file is the target reference genome, and you have to provide the paths to both of these files. This transformer does a bit more than a simple liftOver: it does reverse-complementing and left-aligning of variants, and reference/alternate allele swapping, just like the Picard tool does. The cool thing is that it also adds an extra column to your DataFrame that shows the status of the liftOver operation for each variant. Variant normalization is another thing we support, also implemented as both a function and a transformer. Variant normalization, if you're familiar with it, just makes sure that identical variants are represented identically in your data set. Any variant can have different representations, but by convention you want your variants in a particular representation that is parsimonious and left-aligned, meaning they are shifted to the left-most possible position in the genome. You can read more about this in the blog we have on the Glow website. Again, this is done using a function or a transformer; here I'll talk about the transformer for the sake of time. So you call normalize_variants on an input DataFrame, you have to provide the reference genome path, and there are options for whether you want to replace the current columns with the normalized ones or add them as new columns.
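In code, the two transformers look roughly like this; the option names follow the Glow documentation, while the chain and reference file paths are placeholders:

```python
import glow

# LiftOver variants between assemblies; a status column is added to report per-variant results
lifted_df = glow.transform(
    "lift_over_variants",
    vcf_df,
    chain_file="/data/liftover/b37ToHg38.over.chain",
    reference_file="/data/ref/Homo_sapiens_assembly38.fasta"
)

# Normalize variants to a parsimonious, left-aligned representation against the target reference
normalized_df = glow.transform(
    "normalize_variants",
    lifted_df,
    reference_genome_path="/data/ref/Homo_sapiens_assembly38.fasta"
)
```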

I can show you an example of this in a notebook we have here. We import Glow, register it, and with the reference genome and VCF paths that we have, we load the VCF; you can have a look at the DataFrame that is generated. Then we call the transformer, which is essentially glow.transform, the name of the transformer, normalize_variants, and the original variant DataFrame as the input, with reference_genome_path set to the reference genome path we defined, which is the GRCh38 genome, because this VCF is on the GRCh38 assembly. As a result, you can see that the columns are normalized: you have the new coordinates here, and the reference allele and alternate alleles are replaced accordingly. You can also run it without column replacement, which means the original columns are kept and an additional column, the normalization results, is added, showing the normalized coordinates and alleles as well as whether the normalization was successful or not. Back to the presentation: we have the splitting of multiallelic variants. This is a rather complex operation as well, and we have implemented the best-of-breed algorithm available for it, which comes from the decompose tool of the vt package. I don't want to go into the details, but essentially, if you have a multiallelic variant, this transformer splits it into biallelic variants. The use of the transformer is very similar: the name of the transformer and the input DataFrame. What it does is split all the multiallelic attributes, the INFO attributes as well as what you have in your genotypes fields, properly between the alleles. This happens both for arrays, like the three and two here that are split between the two resulting variants in the correct way, and for genotype-field attributes such as the probabilities, which come in colex order and are split correctly between the two. So this complex operation is done for you, and you can do it at scale. I will wrap up the presentation by talking about the cool pipe transformer that we have in Glow. So what is the pipe transformer? Some of the functionality we have designed into Glow has its own APIs, but let's say you bring your own bioinformatics command-line tool, say bedtools, and you want to scale it so that it can run in a parallelized fashion on a cluster. You can easily do that with our pipe transformer. Again, it's a transformer, so the name is pipe and it gets an input DataFrame. Your DataFrame is partitioned over the cluster and can represent a very large data set. The pipe transformer takes it and, based on an input formatter, converts it into the format that your command-line tool understands. Say you have a bedtools command that takes a VCF as input and produces a VCF as output: the pipe transformer converts your DataFrame to VCF, pipes it through your bedtools command, gets the output VCF, and then, based on an output formatter, converts that output back into a DataFrame. This happens in parallel; the same thing happens on every partition of your DataFrame. As a result, you can apply a bedtools command at very large scale instead of on a single node, so an operation that might take several hours with bedtools, or might never finish because of memory issues, can easily be done over a cluster. The input and output formatters that we currently support are VCF, CSV, and text, so if your tool can understand VCF, comma-separated or tab-separated format, or plain text as input and output, you are good to go with this tool.
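Below is a condensed sketch of the two transformers just described. The bedtools invocation mirrors the demo that follows, the script and BED path are illustrative, and bedtools is assumed to be installed on every worker node:

```python
import json
import glow

# Split multiallelic variants into biallelic records; INFO and genotype fields are split accordingly
biallelic_df = glow.transform("split_multiallelics", vcf_df)

# Pipe every partition through a command-line tool, here bedtools intersect against a BED file
script = "bedtools intersect -a stdin -b /data/regions/targets.bed -header -wa"
piped_df = glow.transform(
    "pipe",
    biallelic_df,
    cmd=json.dumps(["bash", "-c", script]),  # the command is passed as a JSON-formatted array
    input_formatter="vcf",
    in_vcf_header="infer",
    output_formatter="vcf"
)
```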

Let me quickly show a demo of this as well. To be able to run the pipe transformer on your Spark cluster, you have to make sure that the command-line tool you are using is installed on every node of the cluster. Here I'm going to use the bedtools intersect command to find the intersection of my variants with a particular BED file, so bedtools has been installed on every node of this cluster that I have running. I import glow and the other functions I may use, and here you can see that I load a VCF of chromosome 22 of the 1000 Genomes Project; for demonstration purposes, I've limited it to 1,000 rows. Just looking at three columns of this VCF, we can see that everything is on chromosome 22, and the start and end are here. Now, for the intersection with a BED file, I created a small synthetic BED file here: three rows and three columns, chromosome 22 and some start and end ranges that I want to intersect with, and I've saved it at this path using this command. Next, I create the command that I want the pipe transformer to use. That command is given as a JSON-formatted array. For that command, I've written a script string here; essentially, this is the command I would run on my file if I were on the command line: bedtools intersect, with the piped standard input as the input file and my BED file specified here, and the output being the original rows of the input that overlap. This script is passed to bash, so my command array is essentially bash, "-c", and this script. Then I call my transformer, and after calling it you can see that 427 of those 1,000 rows have intersections with the regions I defined here, and you can look at them by just displaying the data. So, that wraps up my presentation. Again, make sure you check out the Glow website and look at the documentation, join our Slack channel, or contribute to our code base, because this is an open project: projectglow.io. Thank you very much. Thank you, Denny.
– Thanks very much, Kiavash. Yeah, stop your presentation and I'll present mine. So thanks very much to everybody who's asked all these great questions. I do want to show my screen just a bit, so let's go ahead and do that. Some of you were wondering where you're going to see the notebooks. For the notebooks, we actually have a Databricks tech talk location, github.com/databricks/techtalks, and we're going to put the notebooks up there so that you can go ahead and take a look at them. Also do note that the session is on our Databricks YouTube channel; right now it's actually being streamed directly there, so we're going to make sure to put up the link to the notebooks there as well. And also, if you want to continue

following up with the conversation, because we're going to keep the video up on YouTube, by all means go ahead and engage with us there. And finally, I did want to call out, because this is the Data and AI online meetup, that we have some other exciting sessions coming up, including this one, "Encoding multi-layered Vega-Lite COVID-19 Geodata visualizations." This is from our friends at the Seattle Children's Research Institute, where they're working with COVID-19 data, and what's really cool is they're also using MLflow as a visualization library in addition to everything else that they're doing. So, pretty cool stuff. Again, that's it for today's session. I apologize for running a bit over time. Kiavash, Amir, thank you very much for presenting today's session. I'll leave you with the last set of words to say bye, but otherwise, with that, I hope to see you all next week. Amir, Kiavash, anything that you'd like to add before we go?
– I just want to mention that the notebooks we showed here, and even more extensive versions of them, are also available on our project Glow documentation website, which you can easily reach from projectglow.io.
– Yeah.
– Perfect.
– I'm glad that you made it and joined, and I encourage you, if you wish, to join our Slack channel and the Google group. Don't be shy, come in, create your PR, and contribute.
– Yeah. Great. Have a nice evening.
– All right, awesome, guys. Thanks very much, everybody. We will see you next week. Take care, everybody.
– [Kiavash] Okay. Bye.