Proof of Principle for Shotgun DNA Mapping (Redux)

You might have noticed all the recent activity here without much context. Well, it’s Spring Break here at UNM, and Steve and I are going into overdrive to finalize some papers that were left in preprint limbo. We have three projects to work on, of varying priority: (1) Shotgun DNA Mapping (what this article is about), (2) a paper that models Kinesin motility, and (3) a paper about the Repeating Crumley experiment that will be self-published via Google Docs.

So all the stuff that has been appearing in my notebook from both Steve and me is about the Shotgun DNA Mapping (SDM) paper. Our goal is to finish the paper, including some extra experiments that show how useful the SDM software is, by the end of the week (I’m guessing by Sunday, because that’s when my parents arrive).

Right now we are refamiliarizing ourselves with the software. The original code and resulting paper were written in September of 2008 (yikes I’ve been in grad school for a while). The code was originally written by Larry Herskowitz and back then he had talents as a programmer, but he had no talents as an organized human being. So getting familiar with his mindset and looking for important programs and files is no easy task. There may be a lot we need to do and that’s what we are trying to figure out. Hopefully tomorrow we’ll be able to move on from this step.

But what code am I talking about? You probably thought I was just an experimentalist. Well you’d be mostly correct, but I have dabbled in programming some.

The code in question is the heart of Shotgun DNA Mapping. The SDM project was a two-step experiment:

  1. Generate clones of random yeast genomic DNA sequences and unzip those sequences using our optical tweezers.
  2. Compare the force curves from the unzipped DNA to a library of simulated force vs extension curves. The library is generated from the yeast genome.

The paper that we are working on was a proof of principle for step 2. The results from step 1 would be published once we had working tweezers and unzippable DNA. Unfortunately we couldn’t get unzippable DNA, but maybe this summer I’ll be able to try again. In this paper we discuss how the simulation software works, how we match genetic information, and we present results using some old data Steve had from grad school.

Aside: It just occurred to me that this could be a success of open science if other groups had DNA unzipping data that they shared online, but alas the rest of the world is closed 🙁

Let’s discuss the software briefly so there is some context as to what Steve and I may write here for the rest of the week, and so you can understand the basic premise of the paper.

To get a brief understanding of how the tweezers work and how we are able to unzip DNA check out my intro to SDM here.

Now that you understand all that stuff let’s get into the software:

  1. The first step to SDM is to create a library of unzipping force vs extension curves for the yeast genome. We chose yeast because its DNA is bundled as chromatin, which could be used for the next level of SDM, Shotgun Chromatin Mapping (SCM), and because our collaborators are expert yeast geneticists.
    1. We downloaded the yeast genome sequence from yeastgenome.org (back in 2008) and did a simulated restriction digest. By this I mean we looked for the XhoI recognition sequence (CTCGAG) in the yeast genome. From each recognition site, we created “fragments” that are 2000bp in length to use for the unzipping simulation.
    2. The unzipping simulation used a very simple equation to calculate the energy contained in a double-stranded DNA sequence (dsDNA). The Hamiltonian (the physics term for the total energy of the system) contained the energy of the freely-jointed chain (which is a model that describes a chain of paperclips, google it) and the base-pairing energy (i.e. A-T/G-C bond energies). That’s it! Remarkably, that worked exceptionally well. The reasoning is that when you are unzipping you have two strands of single-stranded DNA held together by the DNA that is still zipped (the remaining base-paired bonds).
    3. After the energies contained in an unzipping sequence are calculated, we needed to extract force information. We did this by numerically solving an integral of x(F’)dF’, where x(F’) is the extension given by the freely-jointed chain model. I don’t expect that to mean too much right now and I may have incorrectly explained it myself, but this will be made more clear later.
  2. Once we have simulated unzipping data we can begin to match data to the library we just created.
    1. I don’t understand the mathematics behind the matching algorithm all that much myself (right now), but from what I understand the matching uses a normalization that is rooted in the difference between a polyA strand and a polyG strand (polyA has forces of ~9pN and polyG has forces of ~19pN, everything else is in between).
    2. Because of the nature of the simulation, at low extensions there is a lot of unreadable data, so we use a window between bases 1200 and 1700 (we call this number the index) to get better analysis. We chose a 500-base window above index 1000 because of the unzipping data that we were analyzing.
    3. The algorithm generates a score based on the difference between the real and simulated force profiles in the window stated above. In our system great matches are close to 0 and bad matches are close to 1.
  3. In graduate school, Steve had unzipped a plasmid known as pBR322 (purchasable from NEB). Using the software, Larry simulated the force profile for this sequence and hid it in the library of simulated yeast genomic data. The matching algorithm managed to match the real unzipping data to the correct simulated data every time.
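
To make the simulated restriction digest from step 1.1 a bit more concrete, here is a minimal Python sketch. This is not Larry's original code. The function names are made up for illustration, and it assumes each fragment starts at the first base of the XhoI site and runs downstream on the given strand only; the real software may well have handled cut positions and directions differently.

```python
XHOI_SITE = "CTCGAG"      # XhoI recognition sequence (from the post)
FRAGMENT_LENGTH = 2000    # fragment length in bp (from the post)


def find_sites(sequence, site=XHOI_SITE):
    """Return the 0-based start index of every occurrence of `site`."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        # Step forward by one base so overlapping sites are not skipped.
        start = sequence.find(site, start + 1)
    return positions


def simulated_digest(sequence, site=XHOI_SITE, length=FRAGMENT_LENGTH):
    """Return (position, fragment) pairs, one per recognition site.

    Assumption (hypothetical): a fragment starts at the first base of
    the site and is simply truncated if the sequence ends early.
    """
    sequence = sequence.upper()
    return [(pos, sequence[pos:pos + length])
            for pos in find_sites(sequence, site)]


# Toy example with a short fragment length so the result is readable.
toy = "AAAA" + "CTCGAG" + "GGGG" + "ctcgag" + "TT"
print(simulated_digest(toy, length=8))
# prints [(4, 'CTCGAGGG'), (14, 'CTCGAGTT')]
```

For the real library, `sequence` would be one yeast chromosome at a time, and each 2000 bp fragment would then be fed to the unzipping simulation.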

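The matching in step 2 can be sketched in a similar way. To be clear, this is only a guess at the general shape of the idea, not the actual algorithm from the paper: it assumes each force curve is a plain list with one force value (in pN) per unzipping index, takes the mean absolute difference inside the 1200 to 1700 window, and divides by the polyG minus polyA force difference so that scores land between 0 (great match) and 1 (bad match). The names `match_score` and `best_match` are hypothetical.

```python
POLY_A_FORCE_PN = 9.0     # approximate polyA unzipping force (from the post)
POLY_G_FORCE_PN = 19.0    # approximate polyG unzipping force (from the post)
WINDOW = (1200, 1700)     # index window used for matching (from the post)


def match_score(measured, simulated, window=WINDOW):
    """Mean absolute force difference in the window, scaled to 0..1.

    Assumption (hypothetical): both curves are lists indexed by
    unzipping index, and the score is normalized by the polyG - polyA
    force difference. The real normalization may differ.
    """
    start, stop = window
    a = measured[start:stop]
    b = simulated[start:stop]
    if len(a) == 0 or len(a) != len(b):
        raise ValueError("both curves must cover the whole window")
    span = POLY_G_FORCE_PN - POLY_A_FORCE_PN
    mean_diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    # Clip at 1 so noisy data outside the 9-19 pN range cannot exceed it.
    return min(mean_diff / span, 1.0)


def best_match(measured, library, window=WINDOW):
    """Return (name, score) of the library curve with the lowest score."""
    scores = {name: match_score(measured, curve, window)
              for name, curve in library.items()}
    name = min(scores, key=scores.get)
    return name, scores[name]


# Toy library of flat curves; real curves vary with sequence.
library = {
    "polyA": [9.0] * 2000,
    "polyG": [19.0] * 2000,
    "mixed": [14.0] * 2000,
}
measured = [14.5] * 2000
name, score = best_match(measured, library)
print(name, round(score, 3))
# prints mixed 0.05
```

In the pBR322 test described above, `library` would hold the simulated yeast fragments plus the simulated pBR322 curve, and `measured` would be one of Steve's real unzipping curves.
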
Like I’ve said a lot, I don’t expect all of this to make sense, but as we rewrite the paper I’ll add thoughts about the project to the notebook. On top of that, I’m sure we’ll have a lot of supplemental information to add to the paper via the notebook (so we can just cite it). So while all this is new and confusing (if it isn’t, then that’s awesome), be aware that the confusion will subside by the end of the week. We have a lot of exciting things coming in the next few days and you’ll all be the third to know about them (after Steve and me, of course).