There are two main ways to use the Ensembler command-line interface. The
quickmodel function performs the entire modeling pipeline in one go, and is
designed to work with a single target and a small number of templates. For
generating larger numbers of models (such as entire protein families), the main
pipeline functions should be used. These perform each stage of the modeling
process individually, and the most computationally intensive stages can be run
in parallel to increase performance.
For further details on their usage, see the main command-line interface documentation.
Example using the quickmodel function
$ ensembler quickmodel --target_uniprot_entry_name EGFR_HUMAN --uniprot_domain_regex '^Protein kinase' --template_pdbids 1M14,4AF3 --no-loopmodel
Models human EGFR onto two templates selected by PDB ID. As described above, the quickmodel function runs the entire modeling pipeline in one go and is intended for a single target and a small number of templates. For generating larger numbers of models (such as entire protein families), use the main pipeline functions instead.
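When the quickmodel command is launched from a script rather than typed at a prompt, the regex argument (which contains a space and a `^`) is easy to mis-quote. The sketch below builds the same command as an argument list, using only the flags shown in the example above. The helper name `quickmodel_argv` is hypothetical and not part of Ensembler.

```python
import shlex


def quickmodel_argv(target_entry_name, domain_regex, template_pdbids, loopmodel=True):
    """Build an argument list for `ensembler quickmodel` (hypothetical helper)."""
    argv = [
        "ensembler", "quickmodel",
        "--target_uniprot_entry_name", target_entry_name,
        "--uniprot_domain_regex", domain_regex,
        "--template_pdbids", ",".join(template_pdbids),
    ]
    if not loopmodel:
        # Flag taken from the example command above.
        argv.append("--no-loopmodel")
    return argv


argv = quickmodel_argv("EGFR_HUMAN", "^Protein kinase", ["1M14", "4AF3"], loopmodel=False)

# shlex.join (Python 3.8+) shows the shell-quoted form of the command.
print(shlex.join(argv))
```

Passing `argv` as a list to `subprocess.run(argv, check=True)` avoids shell quoting altogether, since each element is delivered to the program as a single argument.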
Example using the main pipeline functions
$ ensembler init
This sets up an Ensembler project in the current working directory. It creates a number of directories and a metadata file (meta0.yaml).
$ ensembler gather_targets --gather_from uniprot --query 'domain:"Protein kinase" AND taxonomy:9606 AND reviewed:yes' --uniprot_domain_regex '^Protein kinase(?!; truncated)(?!; inactive)'
Queries UniProt for all human protein kinases, and selects the domains of interest, as specified by the regular expression (“regex”) passed to the final flag. At the time this documentation was written, the UniProt search returned five types of protein kinase domain, annotated as “Protein kinase”, “Protein kinase; 1”, “Protein kinase; 2”, “Protein kinase; truncated”, and “Protein kinase; inactive”. The above regex selects the first three types of domain and excludes the last two. Sequences are written to a FASTA file:
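The effect of the domain regex can be checked independently of Ensembler. The sketch below applies the same pattern to the five annotation strings listed above using Python's `re` module. That Ensembler applies the pattern with Python regex semantics is an assumption here, although the negative lookahead syntax `(?!...)` is consistent with it.

```python
import re

# Pattern copied from the gather_targets command above.
DOMAIN_REGEX = re.compile(r"^Protein kinase(?!; truncated)(?!; inactive)")

# The five domain annotations mentioned in the text.
annotations = [
    "Protein kinase",
    "Protein kinase; 1",
    "Protein kinase; 2",
    "Protein kinase; truncated",
    "Protein kinase; inactive",
]

# Each negative lookahead rejects a match when the named suffix follows
# immediately after "Protein kinase".
selected = [a for a in annotations if DOMAIN_REGEX.search(a)]
rejected = [a for a in annotations if not DOMAIN_REGEX.search(a)]

print("selected:", selected)
print("rejected:", rejected)
```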
Targets are given IDs of the form
[UniProt mnemonic]_D[domain id], which consists of the UniProt name for the target and an identifier for the domain (since a single target protein may contain multiple domains of interest). Example:
$ ensembler gather_templates --gather_from uniprot --query 'domain:"Protein kinase" AND reviewed:yes' --uniprot_domain_regex '^Protein kinase(?!; truncated)(?!; inactive)'
Queries UniProt for all protein kinases (of any species), selects the relevant domains, and retrieves sequence data and a list of associated PDB structures (X-ray and NMR only), which are then downloaded from the PDB. Template sequences are written in two forms: the first contains only residues resolved in the structure (
templates/templates-resolved-seq.fa); the second contains the complete UniProt sequence contained within the span of the structure, including unresolved residues (
templates/templates-full-seq.fa). Template structures (containing only resolved residues) are extracted and written to the directory
templates/structures-resolved. Templates containing the full sequences can optionally be generated with a subsequent step - the
Templates are given IDs of the form
[UniProt mnemonic]_D[domain id]_[PDB id]_[chain id], where the final two elements represent the PDB ID and chain identifier. Example:
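To make the two ID formats concrete, the sketch below splits a target ID and a template ID into their components. The IDs used (`EGFR_HUMAN_D0` and `EGFR_HUMAN_D0_1M14_A`) are illustrative strings built from the format descriptions above, not output copied from an Ensembler run. The pattern also assumes that the mnemonic has the usual `NAME_SPECIES` shape and that PDB IDs are four characters long.

```python
import re

# Optional trailing group distinguishes template IDs from target IDs.
ID_PATTERN = re.compile(
    r"^(?P<mnemonic>[A-Z0-9]+_[A-Z0-9]+)"   # UniProt mnemonic, e.g. EGFR_HUMAN
    r"_D(?P<domain>\d+)"                     # domain identifier
    r"(?:_(?P<pdb>[0-9A-Za-z]{4})_(?P<chain>[0-9A-Za-z]+))?$"  # PDB ID and chain
)


def parse_id(identifier):
    """Split an Ensembler-style target or template ID into its parts."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognized ID format: {identifier!r}")
    return {
        "mnemonic": match.group("mnemonic"),
        "domain_id": int(match.group("domain")),
        "pdb_id": match.group("pdb"),      # None for target IDs
        "chain_id": match.group("chain"),  # None for target IDs
    }


target = parse_id("EGFR_HUMAN_D0")
template = parse_id("EGFR_HUMAN_D0_1M14_A")
print(target)
print(template)
```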
$ ensembler loopmodel
Reconstructs template loops that were not resolved in the original PDB structure, using
Rosetta loopmodel. This tends to result in higher quality models. The reconstructed template structures are written to the directory
$ ensembler align
Conducts pairwise alignments of target sequences against template sequences. These alignments are used to guide the subsequent modeling step, and are stored in directories of the form
models/[target id]/[template id]/alignment.pir. The
.pir alignment format is an ASCII-based format required by Modeller. If the
loopmodel function was used previously, then templates that have been successfully remodeled are selected for this alignment and the subsequent modeling steps. Otherwise, Ensembler defaults to using the template structures that contain only resolved residues.
$ ensembler build_models
Creates models by mapping each target sequence onto each template structure, using the
Modeller automodel function.
$ ensembler cluster
Filters out non-unique models by clustering on RMSD. A default cutoff of 0.06 nm is used. Unique models are given an empty file
unique_by_clustering in their model directory.
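The idea of filtering on an RMSD cutoff can be illustrated with a small example. The sketch below uses a simple greedy scheme: a model is kept as unique only if it lies at least the cutoff distance from every model already kept. This is an illustrative stand-in, not necessarily the clustering algorithm Ensembler uses, and the pairwise RMSD values are made-up numbers.

```python
def unique_by_rmsd(rmsd, cutoff_nm=0.06):
    """Return indices of models kept as unique under a greedy RMSD filter.

    rmsd is a symmetric matrix (list of lists) of pairwise RMSDs in nm.
    """
    unique = []
    for i in range(len(rmsd)):
        # Keep model i only if it is not within the cutoff of any kept model.
        if all(rmsd[i][j] >= cutoff_nm for j in unique):
            unique.append(i)
    return unique


# Toy pairwise RMSD matrix (nm) for four hypothetical models:
# models 0 and 1 are near-duplicates, as are models 2 and 3.
rmsd = [
    [0.00, 0.02, 0.30, 0.31],
    [0.02, 0.00, 0.29, 0.30],
    [0.30, 0.29, 0.00, 0.05],
    [0.31, 0.30, 0.05, 0.00],
]

kept = unique_by_rmsd(rmsd)
print(kept)
```

With the default 0.06 nm cutoff, one representative of each near-duplicate pair survives. Raising the cutoff merges more models, and lowering it keeps more of them.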
$ ensembler refine_implicit
Refines models by performing an energy minimization followed by a short molecular dynamics simulation (default: 100 ps) with implicit solvent (Generalized Born surface area), using
OpenMM. The final structure is written to the compressed PDB file
$ ensembler solvate
Determines the number of waters to add when solvating models with explicit water molecules. The models for each target are given the same number of waters. The function proceeds by first solvating each model individually, given a padding distance (default: 1 nm). A list of the number of waters added for each model is written to a file
nwaters.txt in the
models/[target_id] directory. A percentile value from the distribution of the number of waters is selected as the number to use for all models, and this number is written to the file
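The percentile selection step can be sketched in a few lines. The example below picks a water count from a list of per-model counts using the nearest-rank definition of a percentile. Both the percentile definition and the value 68 are assumptions chosen for illustration, since the text above does not state which Ensembler uses, and the water counts are invented.

```python
import math


def waters_at_percentile(nwaters, percentile):
    """Return the nearest-rank percentile of a list of water counts."""
    if not nwaters:
        raise ValueError("nwaters must not be empty")
    ordered = sorted(nwaters)
    # Nearest-rank: smallest value with at least `percentile` percent
    # of the data at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-model water counts, as might be listed in nwaters.txt.
nwaters = [10400, 9500, 11000, 9800, 10100]

chosen = waters_at_percentile(nwaters, 68)
print(chosen)
```

Using one count for every model of a target means all of its solvated systems contain the same number of water molecules, which is the stated purpose of this stage.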
$ ensembler refine_explicit
Solvates models using the number of waters determined in the previous step, then performs a short molecular dynamics simulation (default: 100 ps), using
OpenMM. The final structure is written to the compressed PDB file:
explicit-refined.pdb.gz, as well as serialized versions of the OpenMM System, State and Integrator objects.
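Because the refined structure is gzip-compressed, it has to be decompressed before most text-based tools can read it. The sketch below reads such a file with Python's standard `gzip` module and counts the atom records. It is demonstrated on a tiny stand-in file written to a temporary directory, and the coordinates are placeholders rather than real model output.

```python
import gzip
import tempfile
from pathlib import Path


def count_atoms(pdb_gz_path):
    """Count ATOM/HETATM records in a gzip-compressed PDB file."""
    with gzip.open(pdb_gz_path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(("ATOM", "HETATM")))


# Write a two-atom stand-in file so the example runs without Ensembler output.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "explicit-refined.pdb.gz"
    with gzip.open(path, "wt") as handle:
        handle.write("REMARK   placeholder structure\n")
        handle.write("ATOM      1  N   MET A   1      11.104   6.134  -6.504  1.00  0.00           N\n")
        handle.write("ATOM      2  CA  MET A   1      11.639   6.071  -5.147  1.00  0.00           C\n")
        handle.write("END\n")
    n_atoms = count_atoms(path)

print(n_atoms)
```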
$ ensembler validate
(Optional; requires MolProbity command-line tools) Validates model quality using MolProbity, which uses criteria such as Ramachandran angles, backbone distortions, and atom clashes. The
package_models command can filter models based on validation score, using the
$ ensembler package_models --package_for FAH --nfahclones 3
Packages models in the necessary directory and file structure to be run as Folding@Home projects. Files are written in the directory tree
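The stages walked through above can be chained from a single script. The sketch below lists the commands in the order they appear in this document, using only the flags shown here, and prints them in dry-run mode. Setting `dry_run=False` would execute them, which assumes the `ensembler` executable is on the PATH. The function name `run_pipeline` is hypothetical, and the validate stage can be dropped from the list if MolProbity is not installed.

```python
import shlex
import subprocess

KINASE_REGEX = "^Protein kinase(?!; truncated)(?!; inactive)"

# Stage order and flags follow the walkthrough above.
STAGES = [
    ["init"],
    ["gather_targets", "--gather_from", "uniprot",
     "--query", 'domain:"Protein kinase" AND taxonomy:9606 AND reviewed:yes',
     "--uniprot_domain_regex", KINASE_REGEX],
    ["gather_templates", "--gather_from", "uniprot",
     "--query", 'domain:"Protein kinase" AND reviewed:yes',
     "--uniprot_domain_regex", KINASE_REGEX],
    ["loopmodel"],
    ["align"],
    ["build_models"],
    ["cluster"],
    ["refine_implicit"],
    ["solvate"],
    ["refine_explicit"],
    ["validate"],  # optional; requires MolProbity command-line tools
    ["package_models", "--package_for", "FAH", "--nfahclones", "3"],
]


def run_pipeline(dry_run=True):
    """Print each pipeline command and, unless dry_run is set, execute it."""
    commands = [["ensembler", *stage] for stage in STAGES]
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing stage.
            subprocess.run(argv, check=True)
    return commands


commands = run_pipeline(dry_run=True)
```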