Vomeronasal gene project

Selected email correspondence, latest first:

4/4/02 from Christina
Hi everyone,

Just to suggest more random ideas quickly -- there are a number of machine
learning possibilities for the datasets you describe.  To search for new
members of a gene family and predict gene structure, we could try to use
available gene-finding models and train them on the known members of the
gene family.  I think people haven't done gene-finding + structure
prediction on particular gene families because it takes some work to train
the model -- so instead, the experts train a generic model on standard
training sets and put it up on the web.  Also, there's the problem that
most of the gene-finders are not open source.  Bill Noble is working on a
gene-finder called G-Known, but I talked to him yesterday and he says it
doesn't work yet.  There's also a new extension of GENSCAN (from the Brent
Lab at Washington University) called TWINSCAN (uses 2 homologous sequences
from different organisms, e.g. human and mouse, to predict genes) -- they
are supposedly making the code "freely available for academic use" soon.
The models themselves aren't that complicated, but it would take some work
to implement from scratch (and Bill says that the details are "a
morass").

For less elaborate approaches, we can continue with computational signal
finding, intron-vs-exon classification/boundary definition using SVMs and
other methods, various splice site and binding site recognition
approaches.  I'm also interested in alternative splicing, though I don't
have any specific learning approach to propose.

Michael Pearce is in my Computational Genomics class & is doing his group
project on this dataset, which will help me learn more about the problem!
(I understand that the final report can't be posted due to the agreement
with Celera.)  If possible, maybe I can talk quickly with one of you to
understand more about the biology for this problem, in order to offer
better suggestions to the student group.  Anyway, I'm looking forward to
meeting everyone and discussing the project.

Best,
Christina

Prof. Christina Leslie                       Email: cleslie@cs.columbia.edu
Department of Computer Science               Tel: 212.939.7043
Columbia University                          Fax: 212.666.0140

On Wed, 3 Apr 2002, Lawrence Chasin wrote:
> Hi all-
> Stuart cannot make a meeting next Wednesday, and he is central to this.
>  He will be quite busy until April 29.  Since we are thinking of an
> October 1 submission, we have time, so we will wait until early M.  I'll
> contact you all about this in late April.
>
> Just to give you something to think about in the meantime, here are some
> random thoughts of mine and others at this point:
>
> The mouse odor receptor family has some 1300 members (the world
> champion) (Zhang X, Firestein S. The olfactory receptor gene
> superfamily of the mouse. Nat Neurosci. 2002 5:124-33.
> <http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=11802173&dopt=Abstract>).
> They are dispersed over many chromosomes and also occur in clusters
> at the sites where they are found.  Their gene structure consists of a 3'
> exon that includes the entire protein coding region and a variable number
> of exons that comprise a 5' UTR.  The size of this intron-riddled 5' UTR
> ranges from about 2 to 10 kb, as I remember it from the results from the
> 10 or so genes that have been worked out (B. Trask is a common author).
> It is difficult to find (predict) these exons since they are not
> protein-coding.  Xinmin tried the first-exon-finding program of Michael
> Zhang on these genes without success.  Because of this undefined gene
> structure, this family, despite its large size, presents problems for a
> bioinformatics approach.
>
> A second type of chemical receptor lies in the vomeronasal organ, also
> somewhere in the nose. Whereas the odor receptors bind to volatile small
> molecules, the VM receptors bind to proteins or peptides.  Their ligands
> are pheromones of various sorts that signal the presence of a suitable
> mate or of a territory that may be defended, and perhaps many other
> things yet to be discovered.  These genes are represented by two
> families of about 150 to 200 genes each, according to Stuart's report of
> Xinmin's latest findings mining the Celera mouse genome database.
>
> The VM receptors have a structure similar to the odor receptors and
> other G-protein coupled receptors (GPCRs), except for an extra ~500
> amino acids at the N-terminus.  That part is extracellular and binds the
> pheromone.  That part is also riddled with introns, and because its
> exons code for conserved protein sequences, they can be ferreted out.  Mike
> Pearce is working with Xinmin now to collect these genes, or at least
> pieces of these genes.  These genes are a good family for looking at
> intron conservation (our top goal) and may be better for looking at
> first exons and thus transcription signals. I don't know if the first
> exons here are also in the 5' UTR; it may not be known.  Since 40% of
> human genes do have introns in their 5' UTRs, those are the odds, not
> great.  So any attempt to go after transcription signals will
> probably require some wet lab work to map a few transcription start
> sites.  That task could perhaps go attractively into a grant as work for
> Stuart and/or me.
>
> Harman and Stuart are both (independently) excited about alternative
> splicing. Harman is already starting to think about or work on this in
> general.  Stuart thinks that alternative splicing may underlie a great
> diversification of the VM genes beyond the mere 350.  He thinks that
> this pheromone business could get very fancy, with mice recognizing
> individual other mice this way.  So this is pretty interesting biology,
> although as far as I know there is as yet no evidence that alternative
> splicing is involved here.
>
> If we are shooting for an October submission, then it would be nice to
> have some preliminary data.  Gathering the genes and then looking for
> homologies for splicing signals will be going along, but we could also
> consider trying to get evidence for alternative splicing and mapping
> transcriptional start sites, focusing on those genes already described.
>  Although I have experience in both of these areas, the task may be
> technically non-trivial because of the heterogeneity of gene expression
> in the organ as a whole.
>
>
> I am wondering if machine learning has been applied to looking at gene
> families before, for instance to distinguish genes from pseudogenes
> (too easy), or to find transcription signals by comparing genes (the
> positive set) with pseudogenes (the negative set), making the dangerous
> (but testable) assumption that the pseudogenes are no longer transcribed.
>
> Feel free to answer to the group if you have anything to say, as we
> await our first meeting.
>
> Larry