Vomeronasal gene project
Selected email correspondence, latest first:
4/4/02 from Christina
Hi everyone,

Just to suggest more random ideas quickly -- there are a number of machine learning possibilities for the datasets you describe. To search for new members of a gene family and predict gene structure, we could try to use available gene-finding models and train them on the known members of the gene family. I think people haven't done gene-finding + structure prediction on particular gene families because it takes some work to train the model -- so instead, the experts train a generic model on standard training sets and put it up on the web. Also, there's the problem that most of the gene-finders are not open source. Bill Noble is working on a gene-finder called G-Known, but I talked to him yesterday and he says it doesn't work yet. There's also a new extension of GENSCAN (from the Brent Lab at Washington University) called TWINSCAN (uses 2 homologous sequences from different organisms, e.g. human and mouse, to predict genes) -- they are supposedly making the code "freely available for academic use" soon. The models themselves aren't that complicated, but it would take some work to implement them from scratch (and Bill says that the details are "a morass").

For less elaborate approaches, we can continue with computational signal finding, intron-vs-exon classification/boundary definition using SVMs and other methods, and various splice site and binding site recognition approaches. I'm also interested in alternative splicing, though I don't have any specific learning approach to propose.

Michael Pearce is in my Computational Genomics class & is doing his group project on this dataset, which will help me learn more about the problem! (I understand that the final report can't be posted due to the agreement with Celera.) If possible, maybe I can talk quickly with one of you to understand more about the biology for this problem, in order to offer better suggestions to the student group.

Anyway, I'm looking forward to meeting everyone and discussing the project.
Best,
Christina

Prof. Christina Leslie
Department of Computer Science
Columbia University
Email: cleslie@cs.columbia.edu
Tel: 212.939.7043
Fax: 212.666.0140

On Wed, 3 Apr 2002, Lawrence Chasin wrote:
Hi all-

Stuart cannot make a meeting next Wednesday, and he is central to this. He will be quite busy until April 29. Since we are thinking of an October 1 submission, we have time, so we will wait until early May. I'll contact you all about this in late April.

Just to give you something to think about in the meantime, here are some random thoughts of mine and others at this point:

The mouse odor receptor family has some 1300 members (the world champion) (Zhang X, Firestein S. The olfactory receptor gene superfamily of the mouse. Nat Neurosci. 2002 5:124-33. <http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=11802173&dopt=Abstract>). They are found both dispersed across many chromosomes and in clusters. Their gene structure consists of a 3' exon that includes the entire protein coding region and a variable number of exons that comprise a 5' UTR. The size of this intron-riddled 5' UTR ranges from about 2 to 10 kb, as I remember it from the results from the 10 or so genes that have been worked out (B. Trask is a common author). It is difficult to find (predict) these exons since they are not protein-coding. Xinmin tried the first-exon-finding program of Michael Zhang on these genes without success. Because of this undefined gene structure, this family presents problems for a bioinformatics approach despite its large size.

A second type of chemical receptor lies in the vomeronasal organ, also somewhere in the nose. Whereas the odor receptors bind to volatile small molecules, the VM receptors bind to proteins or peptides. These ligands are pheromones of various sorts that signal the presence of a suitable mate or of a territory that may be defended, and perhaps many other things yet to be discovered.
These genes are represented by two families of about 150 to 200 genes each, according to Stuart's report of Xinmin's latest findings mining the Celera mouse genome database. The VM receptors have a structure similar to the odor receptors and other G-protein coupled receptors (GPCRs), except for an extra ~500 amino acids at the N-terminus. That part is extracellular and binds the pheromone. That part is also riddled with introns, and because its exons code for conserved protein sequences, they can be ferreted out. Mike Pearce is working with Xinmin now to collect these genes, or at least pieces of these genes.

These genes are a good family for looking at intron conservation (our top goal) and may be better for looking at first exons and thus transcription signals. I don't know if the first exons here are also in the 5' UTR; it may not be known. Since 40% of human genes do have introns in their 5' UTRs, those are the odds, and they are not great. So any attempt to go after transcription signals will probably require some wet lab work to map a few transcription start sites. That task could perhaps go attractively into a grant as work for Stuart and/or me.

Harman and Stuart are both (independently) excited about alternative splicing. Harman is already starting to think about or work on this in general. Stuart thinks that alternative splicing may underlie a great diversification of the VM genes beyond the mere 350. He thinks that this pheromone business could get very fancy, with mice recognizing individual other mice this way. So this is pretty interesting biology, although as far as I know there is as yet no evidence that alternative splicing is involved here.

If we are shooting for an October submission, then it would be nice to have some preliminary data.
Gathering the genes and then looking for homologies for splicing signals will be going along, but we could also consider trying to get evidence for alternative splicing and mapping transcriptional start sites, focusing on those genes already described. Although I have experience in both of these areas, the task may be technically non-trivial because of the heterogeneity of gene expression in the organ as a whole.

I am wondering if machine learning has been applied to looking at gene families before, for instance to distinguish genes from pseudogenes (too easy), or to find transcription signals by comparing genes (the positive set) with pseudogenes (the negative set), making the dangerous (but testable) assumption that the pseudogenes are no longer transcribed.

Feel free to answer to the group if you have anything to say, as we await our first meeting.

Larry