Abstract:
Transcription is a complex biological phenomenon, whereby RNA is transcribed from
single-stranded template DNA by assembling targeted regulatory inputs at the promoter region.
Transcription is regulated through many hierarchically organised mechanisms, including
chromosome positioning and organisation, the binding of transcription factors, and DNA’s
secondary and tertiary structures at the region of transcription initiation. The core promoter is
the distinct functional unit of DNA overlapping the transcription start site, which possesses
linear regulatory capacity and renders DNA permissive to transcription. In plants, core
promoter and enhancer studies are of particularly high impact for those traits that are under
strong transcriptional control. Cellulose biosynthesis in immature xylem, the tissue which
forms wood, is one such trait, and is studied extensively in the herbaceous model plant
organism, Arabidopsis thaliana, and the economically important woody perennial,
Eucalyptus grandis. The release of the E. grandis genome sequence has provided a much-needed
reference to study transcriptional control, not only for those traits that make it a
dominant fibre crop, but genome-wide. We aimed to use empirical transcript evidence to
perform a high-throughput genome-wide curation of the 5’ UTR annotations and empirically
infer transcription start sites (TSSs) of the nascent E. grandis genome annotation. We then
aimed to use the curated TSSs to define core promoter classes based on their sequence
composition and to determine the putative expression profiles and functional associations of each.
We used deep E. grandis mRNA sequencing data across seven diverse tissues and
PASA-assembled E. grandis ESTs to empirically curate 5’ UTR annotations. We improved 17,085
annotations, added 7,596 for which there was no previous annotation and retained 3,675 that
possessed only a predicted TSS without empirical evidence. These complementary data were
used to define distal transcription start sites (dTSS) by a novel, prioritising, rule-based
computational method. From these dTSS annotations, we extracted the core promoters (from
-100 to +50) and described the core promoter landscape by hexamer positional
over-representation analysis. We found three types of hexamer over-representation in the core
promoter, namely broad, spiked and low. Broad hexamers were classified into five distinct
core promoter classes: TA, CT, GA, W and S. These were further assessed for
putative expression profiles (specificity and level) and functional associations. TA resembles
the conserved TATA-box core promoter, although it displays a bimodal distribution, low
expression levels and the greatest tissue specificity. CT and GA are over-represented both
upstream and downstream of the dTSS and show narrow windows of greater enrichment with phasic
constraint. W and S occur in close proximity to the dTSS, with S displaying the most
constitutive and highest expression profile. Spiked hexamers occur in close proximity to the
dTSS and low hexamers are enriched for those pyrimidine-rich hexamers found in
Arabidopsis thaliana and Oryza sativa core promoters as the Y Patch. We found that E.
grandis core promoters include the TATA-box class, which is conserved across
kingdoms, the CT and GA classes, which are conserved in Arabidopsis, and a number of
classes which, thus far, appear unique to Eucalyptus. We postulate possible underlying
mechanisms of each core promoter class based on their sequence composition and suggest
regulation by TBP binding (TA), nucleosome positioning (W), DNA stability (S), and non-B-DNA
conformation (CT and GA). This research provides a basal understanding of cis-transcriptional
regulation at the core promoter in this economically important woody plant
species and provides insight into the mechanisms of permissive transcription across plant
species.
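
The core promoter extraction (-100 to +50 around each dTSS) and the positional counting of hexamers described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names, the 0-based TSS coordinate, the convention that +1 is the TSS base with no position 0, and the peak-to-mean ratio used as an enrichment score are all assumptions made for illustration.

```python
from collections import defaultdict

# Complement table for upper- and lower-case nucleotides.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def core_promoter(seq, tss, strand, upstream=100, downstream=50):
    """Extract the -upstream..+downstream window around a TSS.

    Assumptions (hypothetical, for illustration): `tss` is a 0-based index
    into `seq` on the forward strand, and the TSS base itself is position +1.
    The result is always oriented 5'->3' relative to the transcript.
    Returns None if the window runs off either end of the sequence.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    else:
        start, end = tss - downstream + 1, tss + upstream + 1
    if start < 0 or end > len(seq):
        return None
    window = seq[start:end]
    return window if strand == "+" else reverse_complement(window)


def positional_kmer_counts(promoters, k=6):
    """Count, for every k-mer, how often it starts at each window offset."""
    counts = defaultdict(lambda: defaultdict(int))
    for promoter in promoters:
        for i in range(len(promoter) - k + 1):
            counts[promoter[i:i + k]][i] += 1
    return counts


def peak_enrichment(position_counts, n_positions):
    """Return (peak offset, peak count / mean count over all offsets).

    This ratio is a simple stand-in score, not the statistic of the study.
    """
    mean = sum(position_counts.values()) / n_positions
    peak = max(position_counts, key=position_counts.get)
    return peak, position_counts[peak] / mean
```

With 150 bp windows there are 145 possible hexamer start offsets, and offset 100 corresponds to the TSS base under the coordinate convention assumed here. A hexamer whose counts are concentrated at a few offsets would yield a high ratio, whereas one spread evenly across the window would yield a ratio near 1.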