Part 12 – The NIEHS Exposure Science and the Exposome Webinar Series – Dr. Carolyn Mattingly

>> Male Speaker: Today’s speaker is Dr. Carolyn Mattingly, an associate professor in the Department of Biological Sciences at North Carolina State University. Carolyn is the principal investigator of the Comparative Toxicogenomics Database, an effort to curate literature on the associations between chemicals, genes, and diseases. She also leads a research effort using zebrafish as a model organism to investigate the mechanisms of toxicity of endocrine-disrupting compounds. This morning, she will be focusing on their recent efforts to integrate information on exposures into the CTD. With a final reminder to submit your questions via email during the presentation to exposome@niehs.nih.gov, I will hand the podium over to Dr. Mattingly. Carolyn?

>> Carolyn Mattingly: Thanks, David. It’s a little weird not to be able to see everybody, so thank you for joining us. I’m happy to be here to talk to you today about CTD, the Comparative Toxicogenomics Database, and in particular a separate initiative that we started a few years ago, now released and integrated within CTD, that is specific to exposure science. I’m hoping to give you a little bit of background about CTD in case some of you are not familiar with it, tell you a little bit more about this new dataset that we’ve integrated, and give you some examples of how one might use this data in trying to better understand how the environment is affecting human health.

So CTD: we’ve been working on this project since about 2001, and it was first publicly released in 2004. The impetus for it was really knowing that most chronic diseases have some environmental component, but that we really don’t know what most of the chemicals in the environment are doing. At the time, a lot of databases were popping up, but none of them were really focusing on the environmental piece, so we tried to address that gap. And we approached this
through a number of different mechanisms: looking at public datasets that we could bring into an integrated environment, but also bringing together information specific to the environment that was coming from the literature. Although PubMed is an incredible resource, as we all know, it’s also very, very large, so trying to pull out relationships between small entities within the published literature landscape can be a little bit challenging. So what we did was focus our efforts on curation of the literature, pulling out specific pieces of information that we thought would help provide more information about what chemicals were doing, particularly the mechanisms by which they may be modulating environmentally related diseases.

So we looked at the literature and focused on interactions between chemicals and genes or proteins. When I refer to genes in this talk, I’m also talking about proteins and various forms of the gene; we don’t discriminate on that level. We are also interested in gene-disease relationships and chemical-disease relationships, and in seeing where the data for these types of binary relationships fell. These binary relationships are really the focus of individual papers. There are very few papers that address all of these relationships; instead, a paper focuses on one aspect and may be disconnected from another, potentially related aspect. So we focused on that.

The way we had to do that was in the most structured way possible, so that we knew we would be pulling out the data in a very consistent fashion, and so that once those data were put into a database, they would be available and helpful to the user who is looking for information. You can put in one term, someone else can put in a synonym of that term, and everybody comes up with the same results. So we chose to use a number of different controlled vocabularies to address these different aspects of data relationships. For chemicals, we currently use the National Library of Medicine’s MeSH vocabulary. For genes, we use the Entrez Gene vocabulary, and as I mentioned, genes also comprise mRNA, protein, promoter regions of genes, et cetera. For diseases, we use MeSH. For our chemical-gene interactions, we looked around in the public domain for ontologies that would address this, and actually ended up coming up with our own. And then for species: I should mention that we curate chemical-gene interactions across vertebrates and invertebrates, because the majority of toxicology data does not come from humans. So we use the species vocabulary from NCBI for that, and this will become a little more significant later.

So why do we use these semantic standards? Well, as a lot of us know, language and data in the literature are not very well standardized. You can see in this first column that all these symbols describe the exact same gene. You can find all these references within the literature, but unless this is your field you would not necessarily know that those are the same gene. Similarly, all of these terms in the right column describe the same chemical. Using standards that account for all of these synonyms helps us from a curation perspective, so that our curators can go into the literature and identify what gene, chemical, or disease is being talked about, and from a user perspective, so that you can put in any of these terms and identify the data associated with it.

The other piece of that is that if we’re using standards to describe each of these entities, whether it’s chemicals, genes, or diseases, that leverages a whole set of other data in the public domain that we can then bring into CTD. Genes are a good example of that. By using these standard gene symbols, we’re able to incorporate into CTD Gene Ontology annotations, which describe the function of proteins, Reactome and KEGG,
which are pathway databases, and then, for example, BioGRID, which describes gene-gene interactions and regulatory mechanisms. So these standards, although they are not the most exciting things to talk about, are actually the bread and butter if you’re trying to integrate a lot of different datasets that users increasingly expect to access seamlessly in public resources. Integrating not only provides access, it also provides other functionality for analysis, and I’ll show you some examples of that a little bit later.

Our vision, then, was that by curating this information from the literature, we would revive literature that may have really significant findings about chemical effects on diseases, bring it into a different context, and allow people to make connections. For example, if you were interested in the heavy metal cadmium, you could come to CTD and you would find about 2,400 genes or proteins for which there’s evidence in the literature that cadmium interacts with them or affects their expression or function. These genes are associated with many different diseases, and by bringing all of this information together, you might be able to hypothesize that cadmium is affecting a number of these diseases. And you can get even more specific. If you’re interested in diabetes, for instance, and whether or not cadmium has some connection to it, you would be able to identify 17 different genes. These 17 genes have curated relationships to cadmium, and they also have curated relationships to diabetes. They are known to be enriched in osteogenic and adipogenic pathways, which could be important in diabetes in terms of whether stem cell populations are being pushed in adipogenic or osteogenic directions. So the goal of this database is essentially to pull together this information and allow people to make connections that they might not otherwise be able to make.

The way that we do this is to go into the literature. We use PubMed, and we have pretty extensive mechanisms by which we query it. We also have some automated ranking techniques, which we have extensively documented. One of our software developers, Tom Wiegers, has been very involved in the BioCreative initiative, which some of you may be familiar with. It is an international consortium that brings together people who are interested in text mining and natural language processing to figure out how we can better mine the biological literature. We have a number of papers in which CTD data has been really important in some of the challenges that this group puts out.

Once we triage the literature, we have some automated web-based tools that our curators use to go into the literature. We have ways of highlighting some of the salient pieces of information, yet at the end of the day, the curators are manually curating this information. We have a shorthand that we show here, and this tool is quite nifty. In this interaction, for example, C1 describes a chemical that is increasing the expression of G1, which in this case is an ABC transporter, at the mRNA level. These can get quite extensive and complicated, and the tool responds in real time to whatever string the curator puts in. Then we capture a number of other bits of information about the paper. We also have some QC measures that interact directly with our curation application, and once the data pass through those, they get loaded monthly into the database.

Once that happens, you have access to it. This is just a quick view of where what we now refer to as core CTD stands. We have over
a million chemical-gene interactions. The data reflect over 500 different organisms, and you can read the numbers: pretty extensive gene-disease and chemical-disease associations. The bottom line is that this is where the landscape is as far as curated data that we have for chemicals, diseases, and genes. And again, we have 39,000 genes because of the diversity of organisms that are represented in the database.

Okay. So this is the URL for the database, which is completely open and free. You have a number of different ways to access the information. Most people go in through a keyword search. You can put in chemicals, genes, diseases, GO terms, or pathways, or you can go to the more advanced search queries, which allow you to build more complex searches. But most people, as I mentioned, use the keyword search.

Once you’re in the database, I’ll give you a view from a chemical perspective, because this is typically how our users enter the database, but as you can imagine, there are comparable pages for genes and diseases, as well as Gene Ontology and pathway data. For the cadmium page, we bring in some data from other public databases. From MeSH, for instance, we provide all of the information you would expect for different chemicals, such as CAS IDs. We provide a brief definition and a structure. We give you a quick view of the top genes that we have curated data for, and of course we provide all kinds of links to other databases that have additional information about the chemicals you might be interested in.

The curated data that we provide is presented along the top here in a tabbed format. For gene interactions, you would go to a page such as this. You see cadmium, you see your interacting gene, and then this is the output from that curation application that I showed you. While the curators put in a shorthand, you actually get a readable sentence. Again, these can get very complex, or they can be as simple as binding or affecting other types of reactions. You always have access to the references that are associated with that curated interaction, as well as the organisms. By and large, the data is mostly from rats and mice, as you would expect, but there is an increasing amount of data for other organisms and model systems: Drosophila, zebrafish, and C. elegans.

A lot of people really like the disease page. I think this is one of the most distinctive aspects of CTD, that we can bring together the information about potential chemical influence on diseases, and at the end of the day, that’s what we’re trying to help elucidate. We have a pretty extensive disease vocabulary, and I should mention that our disease vocabulary and our chemical vocabulary are both hierarchical, which allows users to ask questions about very broad categories of chemicals or diseases, or very specific ones. What we show you down here is at the specific level, but these are also mapped into categories. So if you want a quick view of what sorts of diseases seem to be associated with cadmium, you can click on that, and what you would see is a table such as this. We distinguish between diseases that have curated relationships and those that are inferred, and I’ll explain that in a moment. But this gives you a quick view: nervous system diseases seem to be the most highly correlated with cadmium, and so on.

If you look at the main part of that page, we have this Direct Evidence column. What this is telling you is that there is an association between cadmium and, in this case, prostate cancer, that is founded in
one or more pieces of literature. This M indicates that the relationship may be mechanistic, and that cadmium may contribute to prostate cancer. You can also have a T here. We do have many drugs in the database as well, so if you see a T, that indicates that the chemical is used as a therapy for that particular disease. So if there is some icon in this Direct Evidence column, this would be considered a curated relationship.

However, we also create what we call inferred relationships, based on this inference network. In this particular case, even in the absence of this direct evidence, we would have created a relationship between cadmium and prostate cancer, because we have curated information between cadmium and this set of 149 genes, as well as between prostate cancer and this same 149-gene set. That would be an inferred relationship. We have developed an inference score to try to rank these inferences. This is a point of a lot of discussion, and we’re always interested in people’s feedback on how to do this. We developed it initially because people would ask us, well, is this inference network true? Of course, we don’t know that. It’s intended to be hypothesis-driving. What this inference score reflects is the set of genes and essentially its uniqueness with respect to its relationship to cadmium and, in this case, prostate cancer. So inference scores typically will track higher with larger sets of genes, as you might expect. We’re doing some analysis now, which we were just talking about this morning, that is evaluating whether those inference scores really correlate with relationships that have direct evidence in the literature, and it looks like the score is actually highly predictive of those. I don’t have the data in this presentation to show you at the moment, but hopefully we will be putting something together soon for the community to get a better handle on what these inference scores mean. And you can always access the references that underlie these curated relationships.

If we look at one particular example, just to show you what other information we provide, these icons in here provide enrichment analyses, and these are based on these genes. If you look at a set of 81 genes, that’s not particularly helpful on its own, but you can get a better idea of what that set of genes might be associated with by doing a GO enrichment. This is just a snapshot of what that might look like, for example response to lipids. In this case we’re looking at cadmium and hypertension, and you have significant associations with things like blood circulation that you might expect in hypertension. This icon here gives you an idea of enriched pathways that are associated with this set of genes. And this icon here is based on gene-gene interactions associated with these genes. It’s basically telling you, among this set of 81 genes, what we know about whether these genes actually talk to one another in a cellular context. All this information is hyperlinked to the details. These circles are the particular genes in here, and their shading and size reflect the amount of data that underlies those relationships.

So that was the backdrop to the resource that we were hoping to build on to address some of the needs in the exposure community. What I just described to you, as I mentioned, we now call core CTD. A number of years ago, we were involved in an exposure meeting and were approached and asked if we thought we would be able to do something similar for exposure
science data. None of us on the team are exposure scientists, so we spent some time trying to figure out what that actually meant. Since then, as we know, interest in this has been increasing in the community through the exposome project. What we did at the time was try to understand what the need of the exposure community was. What was communicated to us was that there was really a need to centralize exposure data, that this information was being published but was getting lost in the landscape of PubMed. It was very difficult to look across exposure studies, or to identify them at all, and many of these exposure studies were also in isolation relative to a broader biological framework: epidemiological studies that may correlate a chemical and a disease outcome but in the absence of molecular data, or measurements of chemicals in the environment that may be separate from actual disease outcomes. We felt that CTD might have some of that broader framework, which could contribute to the interpretation of exposure science studies or expand on their implications.

On the flip side, the majority of the data in CTD was based on experimental studies. For those of us who do studies in the laboratory, there’s always that question of whether what we’re doing is really reflective of the real world. So we saw this as an opportunity to bring in some of the community- and population-based data more deliberately, and bring that real-world context to the experimental data on which we had previously been focusing.

That raised the issue of what exactly we need to curate and how to do it, because there’s some commonality, but these are very different pieces of data from what we had been curating. So we brought together a working group of exposure scientists, as well as ontology folks, and underwent an iterative process for what amounted to about a year and a half of work. We had to assess what the landscape of exposure science data looked like, and then started looking at what these papers were actually reporting, and what kinds of terms and categories of data could be identified from what turned out to be an incredibly diverse set of information. We had the exposure scientists who were working with us, who were not curators, take on the task of trying to curate these data, which was a lot of fun for us, because I think it helped them gain some appreciation for the fact that this is not an easy task. And it was really helpful for us, because they are the exposure scientists, and they could advise us on the most important pieces of information that should come out of these papers.

So we went around and around with this, and kept expanding on terms and categories that needed to be reflected in the curation process, and eventually came out with the fundamental structure. It consists of an exposure stressor, which in our world is primarily chemicals. What we came up with was very much a skeleton that could be expanded and to which a lot of detail could be added. Obviously stressors could be more than chemicals, and there’s room within the structure to add things like demographic information, psychosocial stressors, things like that. The exposure receptor consists of population-based information, for instance if you’re looking at a study of workers in a factory. The exposure event describes the types of measurements that these papers were providing information about: levels of chemicals in blood and urine, biomarkers that were looked at, the geography of where the exposure took place, et cetera. And then there are exposure outcomes. These could be phenotypes or diseases, and again, they could be expanded out for other
uses. This became what we called the structure for the Exposure Ontology. We put this out to the community through a manuscript and made the ontology public, with the hope that people would expand it as needed.

So this helped us figure out the how. It framed the type of information that we needed to provide, but at the end of the day, we have to come up with a paradigm that allows us to curate data in a reasonable way and make it useful to the community, and it can’t be so onerous that days are spent on a particular paper, because that wouldn’t be efficient. And the landscape of exposure data is super diverse. This is an example of a paper, and again, we have our stressor, receptor, event, and outcome. We have papers where all of these aspects are reflected and studied, but we have others where maybe only a stressor and a receptor are included: a particular chemical is measured in participants, but there’s no outcome data. And there are yet other examples where particular pollutants are measured, but not in the context of a population. We wanted a structure that would allow us to capture all of these different scenarios, and that took some time, too.

Eventually, we figured out how to incorporate this structure into CTD. The stressors meshed really nicely with the chemicals that we were already curating in core CTD, and the exposure outcomes overlapped really nicely with diseases that we were already curating in CTD, yet they added all of this new information. As we started curating, of course, a lot of detail was added to each of these high-level categories, so ExO has now expanded quite a bit, just based on the process of our curation. At the end of the day, we’re now capturing 34 different data points.

I’ll give you a little view of what that looks like, starting with the administrative information. We provide information about the stressors: what was being measured, and what the source of it was. In this case, I’m going to focus on chemicals, because that’s really the focus of our curation at the moment. We have details about the stressor sources, which can be factories, for example. For our exposure receptors, we have a lot of different information about the population. Some of that information is free text, but for the most part, we try to capture all of this with very structured terms, and hopefully you’ll appreciate why when I show you some examples. The biggest component is the exposure event, where we capture all of the actual levels of chemicals and biomarkers that are measured, and you can see the extent of the information here. Finally, there is the exposure outcome. In addition to diseases, we also include anatomical sites, a level of specificity that we hadn’t previously been capturing.

Although we developed the ExO structure, we’re using a lot of terms that overlap with existing ontologies, so we’re leveraging those where possible to reduce redundancy. This updated version of ExO should be available soon, both on our database and on a number of other public resources that put ontologies out to the community. So what does this look like in CTD now?
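Editor's sketch: the four-part structure described above (stressor, receptor, event, outcome) can be pictured as a simple record type. This is only an illustration under stated assumptions. The class names, field names, and placeholder reference below are hypothetical and are not the actual ExO terms or the CTD schema; the point is just that a record may carry any subset of the four components.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# All class and field names here are hypothetical, for illustration only.
@dataclass
class Stressor:
    name: str                       # what was measured, e.g. a chemical
    source: Optional[str] = None    # where it came from, e.g. "factory"


@dataclass
class Receptor:
    population: str                 # who was studied, e.g. "workers"
    location: Optional[str] = None  # where the study took place


@dataclass
class Event:
    medium: str                     # e.g. "blood" or "urine"
    level: Optional[float] = None   # measured level, if reported
    units: Optional[str] = None


@dataclass
class Outcome:
    term: str                          # disease or phenotype
    correlation: Optional[str] = None  # "positive" or "negative"


@dataclass
class ExposureRecord:
    reference: str
    stressors: List[Stressor] = field(default_factory=list)
    receptors: List[Receptor] = field(default_factory=list)
    events: List[Event] = field(default_factory=list)
    outcomes: List[Outcome] = field(default_factory=list)

    def components_present(self) -> List[str]:
        """List which of the four components this record actually has."""
        parts = {
            "stressor": self.stressors,
            "receptor": self.receptors,
            "event": self.events,
            "outcome": self.outcomes,
        }
        return [name for name, items in parts.items() if items]


# A paper that measured a chemical in participants but reported no outcome.
partial = ExposureRecord(
    reference="placeholder-reference",
    stressors=[Stressor(name="cadmium")],
    receptors=[Receptor(population="residents")],
    events=[Event(medium="blood")],
)

# A paper that measured a pollutant outside any population context.
pollutant_only = ExposureRecord(
    reference="placeholder-reference",
    stressors=[Stressor(name="particulate matter")],
)
```

Keeping every component optional is one way to accommodate the three kinds of papers mentioned in the talk: those reporting all four aspects, those with no outcome data, and those with measurements but no population.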
So again, you can still search for chemicals. Here’s a high-level example. If you were interested in metals, you would get your typical view as you would in core CTD. All of these data, our first phase of curated data, are fully integrated with CTD, so you would come to CTD and see no difference until you start digging into the data. The query mechanisms are all the same. We’ve added a new icon, which indicates that there is exposure data associated with these particular terms. If you were to focus more specifically on heavy metals, what you would see is a chemical page similar to what I showed you before, and a new tab that we call Exposure Studies. What that shows you is that for this particular page, for heavy metals, we have 198 exposure studies that have been curated. We provide the stressor agent, meaning the chemical that the group was actually trying to identify, and again, whatever your query term was would be highlighted.

Okay. If you looked at a particular example, such as this one, we’re showing you the receptor. This is the high-level view of what was studied in this paper. So for residents and workers, you get information about the study location, what was actually measured and in what medium, and then any disease or phenotype outcomes. We provide an author summary. This is a help in cases where you might have something like 198 results and want a quick view of what the goal of a particular paper was. And as always, you have access to the original reference. So you have this one study per view, and again, here are all the details of the information you can get. Now, if you want actual details of the measurements that this particular study provided, you can click on that particular paper, or you can go to Details, and what you would see is something like this.

This page is now specific to that study, and this is the Exposure Details tab for that study. You have the number of receptors, in different categories. In this case, you have residents in a non-polluted area versus residents in a moderately polluted area. You can see the actual levels of cadmium that were measured in this particular population, and various statistics that were used to contextualize these groups of individuals. We also provide the outcomes, diseases or phenotypes, and we tell you whether there was a positive or negative correlation to a particular disease, such as kidney disease or albuminuria, or to phenotypes, renal system processes in this case.

As throughout CTD, everything is reciprocally curated. If you go to a disease page or a phenotype page, you would be able to access this information, so you can always get to it through a number of different mechanisms. In this case, if you were interested in renal system processes, you could again go to Exposure Studies. You would find that there are five studies that have been curated for exposure data. Here’s the study we had been looking at, and here are some other studies in which renal system process was a phenotype of focus. Okay. You can also go to particular disease pages in the same way and find other studies that look at the same disease.

Now, by virtue of its integration, core CTD lets us add more information, as I said. What I have shown so far is exposure data providing real-world context to core CTD, but core CTD can also provide information for the exposure data. In this example, we can bring together the previously curated information for this group of exposure stressors. For exposure studies, we may have 53 diseases that are associated with particulate matter. In core or
more experimentally based data, there are more diseases, so we provide additional context for that. In some cases, these are highly related; they may be more granular diseases that were looked at in an experimental context. Similarly, we can provide genes. In many of the exposure studies, there’s very little molecular data. They tend to be epidemiological studies, or studies in which a chemical is measured in household dust, for example. That’s great for providing real-world context, but it doesn’t tell you a whole lot about mechanism. Once you bring these two pieces together, you can take the mechanism data from core CTD and add that to the exposure context. And again, a lot of these data are coming from cross-species studies. So our goal, too, is to help figure out which of these model systems are corroborated by exposure data in the literature, or to help inform which model systems may be the best ones for more mechanistic studies following exposure reports.

I wanted to show you a couple of use cases for how we envision this information being used. Again, these data have only been integrated for about two months, and we’re continuing to curate. As the datasets become more robust, there will be a lot more we can do with this. This is an interesting paper that was published by Lyle Burgoon, who some of you probably know. He was looking at race and socioeconomic factors associated with particular diseases, in this case Type 2 diabetes. They found that there were different genetic susceptibilities to Type 2 diabetes that were connected to a particular SNP in a transporter, SLC30A8. This was associated with particular subsets of the population, specifically Mexican Americans. They then looked at where these populations may be enriched in a particular geographic area. Focusing on California, their conclusion at the end of the day was that there may be a higher percentage of individuals susceptible to Type 2 diabetes living in the Los Angeles area than in San Francisco.

I should step back: the interesting thing was that they combined a lot of public datasets to come up with this analysis, which seemed like a great idea. There’s an amazing amount of information out there that, when integrated, can give us different views of health disparities or susceptibilities. What they suggested at the end of the paper was that a natural extension would be the piece that was not yet incorporated, which was really the environmental piece. So are there datasets out there that can tell us something about this? If this population in Los Angeles does have a higher susceptibility to diabetes, is there environmental information that might support that and indicate that these people have an even higher risk due to environmental exposures that they may be experiencing?
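Editor's sketch: the kind of cross-referencing this question calls for (select exposure studies by location, collect their stressors, then check which stressors share curated genes with a disease of interest) might look roughly like the following. Everything here is a hypothetical placeholder: the study records, gene sets, identifiers, and function names are invented for illustration and are not CTD data or a CTD API. The shared-gene step mirrors the inference idea from earlier in the talk, where a chemical and a disease are linked through genes curated to both.

```python
from typing import Dict, List, Set

# Hypothetical curated exposure studies (placeholder values only).
studies: List[dict] = [
    {"id": "study-1", "location": "City A", "stressors": {"stressor-1", "stressor-2"}},
    {"id": "study-2", "location": "City B", "stressors": {"stressor-3"}},
    {"id": "study-3", "location": "City A", "stressors": {"stressor-2", "stressor-4"}},
]

# Hypothetical curated chemical-gene and disease-gene sets.
chemical_genes: Dict[str, Set[str]] = {
    "stressor-1": {"GENE1", "GENE3"},
    "stressor-2": {"GENE2", "GENE5"},
    "stressor-3": {"GENE1"},
    "stressor-4": {"GENE4"},
}
disease_genes: Dict[str, Set[str]] = {
    "disease-X": {"GENE1", "GENE2", "GENE5"},
}


def stressors_in(location: str) -> Set[str]:
    """Collect every stressor reported by studies at the given location."""
    found: Set[str] = set()
    for study in studies:
        if study["location"] == location:
            found |= study["stressors"]
    return found


def shared_genes(chemical: str, disease: str) -> Set[str]:
    """Genes curated to both the chemical and the disease."""
    return chemical_genes.get(chemical, set()) & disease_genes.get(disease, set())


def candidate_links(location: str, disease: str) -> Dict[str, Set[str]]:
    """Stressors at a location that share at least one gene with the disease."""
    links: Dict[str, Set[str]] = {}
    for chemical in stressors_in(location):
        genes = shared_genes(chemical, disease)
        if genes:
            links[chemical] = genes
    return links
```

A call such as `candidate_links("City A", "disease-X")` returns only the stressors with a non-empty gene overlap. Any such link is a hypothesis to follow up, not evidence of a causal relationship.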
So Allen Davis, who’s our lead curator, put together this analysis, which I thought was very clever And just looking at this data, the exposure data we’ve curated so far, there are 63 exposures studies that focus in California These papers have curated relationships between about 100 and 90 exposure stressors in almost 100 different counties and towns And these stressors have been associated with a number of different diseases So if you focus particularly on studies in the Los Angeles area, this included eight different articles with 27 different stressors And here are a number of those particular stressors And what he did was he took these stressors then and said, what do we know about these based on core CTD data, and found that in fact, there were quite a few correlations between these particular environmental stressors and diabetes-related conditions consistent with what Lyle had shown as being potentially a higher susceptibility population in this area So it could be at the end of the day that we could take some of this geographic information and exposure information and connect it to try to understand then what the mechanisms might be So core CTD can take that even farther and say, okay, what do we know about these particular relationships, such as between particulate matter and diabetes, and go back into our core dataset, where we have, again, here’s the particulate matter page and core CTD on your disease page You would see that there was a correlation between particulate matter and diabetes, there is evidence in the literature for that And then we provide you with this inference network So these are 44 genes that may help to explain that connection between particulate matter and diabetes And then what we can do for that is look at what those genes are — and in fact, there are more transporters, particularly glucose transporters as well as people, that we know to be involved in lipid balance — and conduct some further analyses and identify some potential 
pathways that might be worth exploring further, in thinking about how particulate matter may be influencing diabetes incidence. So, and then further, you can go back into, here, I jumped quickly, so this is just particulate matter and diabetes. If you go to particulate matter and the exposure studies tab up here, then you get a whole list, and this is just a partial list, of other disorders that may be associated with particulate matter.

So another example, and we just submitted a paper to EHP that’s under review at the moment, talking about this exposure module in CTD. And in it, we included this example. So one of the studies that we have focused on heavily is the Agricultural Health Study, which many of you may be familiar with. And Jane Hoppin, who is now in our toxicology program here at State and was critical in this study and continues to be, helped us a lot with this. So we set out to curate the whole of the Ag Health Study. There are now 111 publications associated with this study, yet to our surprise, there hadn’t been a meta-analysis done. And so we wondered, with the information that we had curated, what we could actually do with it at this initial stage. So among the 111 publications, we reviewed them all. Ninety-nine of them contained eligible data, eligible in the CTD world, which meant that it had to have some sort of measurable chemical, and have specific chemicals or diseases implicated. So these 99 publications had 62 chemical stressors that we’re focused on, 46 disease outcomes, and these exposure statements. So these are things akin to those measurements I showed you on an earlier page. And so we took those data and said, well, what does the Ag Health Study data actually look like, if you looked at it from, you know, a 1,000-foot view? 
So we wanted to just take a simple approach and use something like a heat map. And we looked across the data, and of course, in some cases there are positive correlations; in some cases there are negative. In some cases, end points hadn’t been looked at with certain chemicals but had been in others. So there’s a range of types of different interactions that have been curated. And in some cases, there were conflicting data. And so we developed a metric, and I’m just showing you what these colors mean for the heat map here. And I realize this is probably impossible to read, depending on what screen you have. But essentially, it’s a heat map that is showing you chemicals on the bottom, diseases and phenotypes on the right. And we’ve categorized them here. So this upper category is neurological disorders. In yellow here, we have cancers, respiratory disorders, thyroid-related, metabolic-related disorders, and then some other sort of one-offs. And this starts to get pretty interesting if you start looking at, well, where are you seeing overlaps? And you get a set of chemicals here that might not be too surprising that they’re clustering together, but their associations range from diabetes to respiratory diseases to prostate cancer. And then again, some of these overlap between prostate cancer and Parkinson’s disease. And so this we felt was kind of a neat way of sort of getting a high-level view of where the data are. And the goal is to have this sort of functionality eventually in CTD, where you could take multiple studies and look at how the data are falling out with respect to disease and phenotype associations. So we also took this a little bit farther and said, okay, we have these pesticides that were associated with prostate cancer here, and you can look across this row. And we leveraged the inference networks that we have in CTD, because again, there’s not a lot in the studies that are underlying these connections. There’s not a lot of molecular data. So we can go to those 
inference networks, such as this in CTD, and say which genes are associated with each of those chemical-disease associations. And so what we came up with was a set of about 200 unique genes that were associated with 16 different pesticides. And in order to keep this a little more manageable, then, we restricted the number of genes to those that interacted with at least three of these chemicals that showed up in the Ag Health Study.
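
The filtering step just described (keep only genes that interact with at least three of the chemicals) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not CTD's own code: the chemical and gene names are placeholders, the (chemical, gene) pair layout is assumed, and `genes_shared_by` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical (chemical, gene) interaction pairs. These are placeholder
# names for illustration only, not curated CTD data.
interactions = [
    ("chem_A", "GENE1"), ("chem_B", "GENE1"), ("chem_C", "GENE1"),
    ("chem_A", "GENE2"), ("chem_B", "GENE2"),
    ("chem_C", "GENE3"),
    ("chem_A", "GENE1"),  # a repeated pair should not be counted twice
]


def genes_shared_by(pairs, min_chemicals=3):
    """Return genes that interact with at least `min_chemicals` distinct chemicals."""
    chemicals_by_gene = defaultdict(set)
    for chemical, gene in pairs:
        # A set keeps each chemical once per gene, so duplicates are ignored.
        chemicals_by_gene[gene].add(chemical)
    return {
        gene: sorted(chemicals)
        for gene, chemicals in chemicals_by_gene.items()
        if len(chemicals) >= min_chemicals
    }


shared = genes_shared_by(interactions)
print(shared)
```

Counting distinct chemicals per gene, rather than raw rows, is the design choice that matters here: the same chemical-gene interaction may be reported by several papers, and it should count only once toward the threshold.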

And that whittled our list down to about 21 genes, which we show here. And in another sort of graphic, saying, well, which of these chemicals were these genes associated with, and using CTD functionality, we can look at which of those genes are actually interacting with one another. So here is one view of what that pathway could look like. So again, trying to show how we can take exposure data from a set of epidemiological studies and get to a point where we have some potential or hypothetical mechanisms that might help to explain some of the chemical-disease connections. So those are just two examples. Again, this is early in the project with respect to, we have quite a bit of data. We have about 2,000 curated papers for a landscape that we estimate to be about 4,000 or 5,000 papers. And a lot of this time has been establishing our curation protocol. So now that the data is in CTD, what we’re focusing on now is figuring out ways to provide better access to the data. So for example, we will be adding in more specific query mechanisms. So you could, for example, look at just the data associated with the Ag Health Study, or just an AINS [phonetic sp] data. Or if you wanted to look at particular attributes of a population across many different studies, we’ll give you the capability to do that. We also have a lot of additional data that we’re curating that just get too big to look at in a single page. And so we’ll be implementing some mechanisms by which you can filter the data that you want to look at for a particular study. And then finally, just some visualization tools that we’re working on where you can get an idea of, for a particular chemical, for instance, where you might be able to find higher incidences of diseases or exposure, either in the U.S. or globally. So those are just a few examples of things that we’re working on at the moment, and hopefully will be released in the coming months for users to access. And then I want to just end with one plug for semantic challenges. And 
again, I know this isn’t the most exciting thing, but going through this process of trying to figure out how to capture these really important studies, and as the exposome becomes a bigger piece of environmental health research, I think these epidemiological studies are going to become really critical parts of the equation. Yet their information is very, very challenging to fit into a computable form. And a lot of that is because of the lack of semantic standards that are being used in papers and studies that are being published, and then the lack of standards or consensus among the community about how to maybe capture this information. And I just want to give you a couple of examples of challenges that we’ve had to face in terms of figuring out how to capture this information. So just the diversity of the study objectives: as I’ve mentioned, we go from epidemiological studies to measuring compounds in house dust. Dose measurements, these can be as diverse as being defined as distance from an exposure source, or the time exposed, estimated consumption of a contaminated food source, particles per hand wipe. So if you try to normalize all that and look at results for a particular compound across studies that are using these very, very different types of measurements, it gets pretty hairy fast. Biomarker measurements are very, very diverse. Statistics have been incredibly difficult. And while everybody wants us to include that information, it’s pretty challenging, because we don’t want to get in the business of re-evaluating people’s data. We don’t think it’s productive. We don’t necessarily have the expertise to do that. We defer to the authors of these studies. Yet when you look at the range of statistical approaches that people are using, it’s quite challenging. And related to that is determination of statistical significance.

So some groups use P values; other groups use odds ratios. And within those odds ratios, different people define significance differently. Things like smoking status, that you might think would be pretty straightforward, can be very diversely described, and things like age. So if you want to look at studies of children, well, how do you define children? Well, some people call them children; other people give age ranges or means. And so I think these are really important issues that the community really needs to address if we want these datasets to be incorporated into emerging data and integrated into resources going forward, so that we can actually do cross-study analyses. And we had a workshop that NIEHS supported that we hosted here at NC State a year ago, a little more than a year ago, where we tried to start the conversation about how to think forward about where the standards are for environmental health issues and how we want to address this so that the community can be on board and report their data in a more standardized way, if possible. And to date, we have a listserv, which I invite you to join if you’re interested in being a part of that conversation. So with that, I will stop. And I just want to acknowledge the CTD team, which is an amazing group of scientists and software and statistician folks who do all of the hard work in curating and putting the database together, and collaborators that we have here at NC State and outside of NC State. We’ve been able to really leverage growing epidemiological expertise here, particularly with Jane Hoppin, and bioinformatics expertise, and of course, NIEHS has been incredibly supportive of this project, for which we’re grateful. So I will stop there, and I would be happy to entertain any questions. Thank you.