We need to decide how the chemical matter in the system to be simulated will be specified. Given this information, we can establish software tools, pipelines, and best practices that try to "do the right thing" by making use of available structural data to determine good initial positions, protonation states, etc. Deciding how this information will be represented, whether it is output from a helper utility or specified directly by a user, is the first step.
We also think it will be useful to give expert users a way to supply "hints" or extra information that help the pipeline make choices (or constrain the choices it can make in order to achieve desired outcomes), but that is a separate question.
Essential information
The essential information we need to capture is:
- What biopolymers are present in the system? How many copies of each?
- What small molecules are present in the system, and how many (or at what concentration)?
- What additional salts/cofactors/buffers are present?
- What are the relevant thermodynamic parameters? (e.g. temperature, pressure, pH, redox potential)
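Where a component is given as a concentration rather than a count, the pipeline has to turn it into a number of molecules for a given simulation volume. A minimal sketch of that arithmetic follows. The helper name `molecules_from_concentration` and the choice of cubic nanometers for the box volume are assumptions made for illustration, not part of any existing tool.

```python
AVOGADRO = 6.02214076e23   # molecules per mole (exact SI value)
LITERS_PER_NM3 = 1.0e-24   # 1 nm^3 = 1e-27 m^3 = 1e-24 L


def molecules_from_concentration(molar, volume_nm3):
    """Hypothetical helper: molecule count for `molar` (mol/L) in `volume_nm3`."""
    if molar < 0 or volume_nm3 <= 0:
        raise ValueError("concentration must be >= 0 and volume must be > 0")
    # N = c * V * N_A, rounded to the nearest whole molecule
    return round(molar * volume_nm3 * LITERS_PER_NM3 * AVOGADRO)


# 150 mM salt in a cubic box with 10 nm edges
n_salt = molecules_from_concentration(0.150, 10.0 ** 3)
print(n_salt)
```

Because the count is rounded to a whole number, the concentration actually realized in a small box can differ slightly from the one requested.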
Biopolymers
For biopolymers, there are a multitude of ways to specify the system, but fundamentally we need to capture the following information:
- It's critical we know exactly what construct (sequence of amino acids or nucleotides) is used
- Any post-translational modifications must be known
- If there are non-natural or synthetic residues, we need some way of specifying these
- We need a way of specifying that more than one biomolecule is present in the system
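One possible way to hold the items above is a small record per biopolymer. The sketch below is only illustrative: the class names `Biopolymer` and `Modification`, their fields, and the use of three-letter chemical component codes for modified residues are all assumptions, not a decided format.

```python
from dataclasses import dataclass, field


@dataclass
class Modification:
    """Hypothetical record for a modified or non-natural residue."""
    position: int    # 1-based index into the sequence
    component: str   # three-letter component code, e.g. "SEP" (phosphoserine)


@dataclass
class Biopolymer:
    """Hypothetical record for one biopolymer in the system."""
    name: str
    kind: str        # "protein", "dna", or "rna"
    sequence: str    # one-letter codes for the exact construct
    copies: int = 1
    modifications: list = field(default_factory=list)

    def validate(self):
        """Check that the copy count and modification positions make sense."""
        if self.copies < 1:
            raise ValueError("copies must be at least 1")
        for mod in self.modifications:
            if not 1 <= mod.position <= len(self.sequence):
                raise ValueError(f"position {mod.position} is outside the sequence")


# Several biomolecules are expressed as a list of records.
system = [
    Biopolymer("peptide", "protein", "GSHMS", copies=2,
               modifications=[Modification(position=5, component="SEP")]),
    Biopolymer("oligo", "dna", "ACGT"),
]
for polymer in system:
    polymer.validate()
```

Keeping the copy count on each record, rather than repeating the record, keeps the description short when many copies of the same construct are present.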
Some thoughts on specifying this information:
- One-letter codes are convenient but restricted to the 20 standard amino acids
- Three-letter codes in principle give access to all of the residue components in the PDB, made available via Ligand Expo, but may present challenges in describing branched topologies or chemically modified amino acids, for example where a residue in the chain is connected to two natural amino acids along the backbone and to a HETATM modified residue via its sidechain. We also can't necessarily encode protonation states this way, and may need some other way to describe protomer/tautomer variants.
- A Topology-like object that has atomic elements, connectivity, and bond orders or formal charges would be more flexible, but harder to produce.
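To make the last option concrete, here is a minimal sketch of what a Topology-like object could carry: elements, connectivity, bond orders, and formal charges. The names `MiniTopology`, `Atom`, and `Bond` are hypothetical and do not correspond to the API of any existing library.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    element: str
    formal_charge: int = 0


@dataclass
class Bond:
    first: int     # index of the first atom
    second: int    # index of the second atom
    order: int = 1


@dataclass
class MiniTopology:
    """Hypothetical container for atoms and the bonds between them."""
    atoms: list = field(default_factory=list)
    bonds: list = field(default_factory=list)

    def add_atom(self, element, formal_charge=0):
        """Append an atom and return its index."""
        self.atoms.append(Atom(element, formal_charge))
        return len(self.atoms) - 1

    def add_bond(self, first, second, order=1):
        """Connect two existing, distinct atoms."""
        for index in (first, second):
            if not 0 <= index < len(self.atoms):
                raise IndexError(f"no atom with index {index}")
        if first == second:
            raise ValueError("an atom cannot be bonded to itself")
        self.bonds.append(Bond(first, second, order))

    def net_charge(self):
        """Sum of formal charges over all atoms."""
        return sum(atom.formal_charge for atom in self.atoms)

    def neighbors(self, index):
        """Indices of atoms bonded to the atom at `index`."""
        found = []
        for bond in self.bonds:
            if bond.first == index:
                found.append(bond.second)
            elif bond.second == index:
                found.append(bond.first)
        return found


# Heavy atoms of a deprotonated carboxylate group: C(=O)[O-]
top = MiniTopology()
carbon = top.add_atom("C")
oxo = top.add_atom("O")
oxide = top.add_atom("O", formal_charge=-1)
top.add_bond(carbon, oxo, order=2)
top.add_bond(carbon, oxide, order=1)
```

Unlike residue codes, a representation like this can state a protonation state directly (here through the formal charge on one oxygen), at the cost of needing every atom and bond spelled out.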
Format
Something that could be converted to a Python dict may be useful.
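As a rough illustration of what such a dict might look like, the sketch below covers the categories of essential information listed earlier and round-trips through JSON. Every key name, the use of SMILES for the small molecule, and the example values are assumptions chosen for illustration, not a proposed schema.

```python
import json

# Hypothetical system specification; all key names are illustrative only.
spec = {
    "biopolymers": [
        {
            "name": "example_protein",
            "kind": "protein",
            "sequence": "GSHMKT",
            "copies": 1,
            # position is 1-based; "TPO" is the PDB component code for phosphothreonine
            "modifications": [{"position": 6, "component": "TPO"}],
        }
    ],
    "small_molecules": [
        {"name": "aspirin", "smiles": "CC(=O)Oc1ccccc1C(=O)O", "copies": 1}
    ],
    "additives": [
        {"name": "NaCl", "concentration_molar": 0.150}
    ],
    "conditions": {
        "temperature_kelvin": 300.0,
        "pressure_bar": 1.0,
        "pH": 7.4,
    },
}

# Restricting values to strings, numbers, lists, and dicts means the
# specification survives a round trip through JSON (or a similar text format).
text = json.dumps(spec, indent=2)
restored = json.loads(text)
```

Sticking to plain built-in types keeps the format independent of any particular toolkit, so a helper utility could write it and a user could edit it by hand.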