IDSM/MS User Manual

Introduction

Used ontologies

To fully represent the data, we opted for a general upper-level ontology, namely the Semanticscience Integrated Ontology (SIO). This ontology, which specializes in biomedical research and knowledge discovery, provides users with general descriptions of objects, processes, and their attributes. Using SIO, each entry of a mass spectrum database is represented as an experiment, which generates a mass spectrum from an input compound.

Attributes in the SIO ontology represent independent entities, enabling types from other ontologies to be assigned. Specifically, we assigned types from the PSI–MS controlled vocabulary for attributes related to mass spectrometry, and from the Chemical Information Ontology (CHEMINF) (Hastings et al. 2011) for attributes related to compound properties.

Another advantage of the SIO ontology is that it is used by the PubChemRDF and ChEMBL datasets, so representing the selected mass spectrum databases in this way seamlessly integrates with the overarching data model used in IDSM.

In addition to the SIO ontology, the following ontologies are employed to represent selected datasets: the Units of Measurement Ontology (UO) (Rijgersberg et al. 2011) for units of measured values; Dublin Core Metadata Initiative Metadata Terms (DCMI Usage Board 2020) to express basic information about mass spectrum libraries and experiments; the vCard ontology (W3C 2014b) to express information about submitters; and the Simple Knowledge Organization System (SKOS) ontology (W3C 2009) to cross-link entities from different datasets.

The prefixes related to the ontologies used are as follows:

prefix   namespace
rdf      http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs     http://www.w3.org/2000/01/rdf-schema#
xsd      http://www.w3.org/2001/XMLSchema#
sio      http://semanticscience.org/resource/
obo      http://purl.obolibrary.org/obo/
dcterms  http://purl.org/dc/terms/
vcard    http://www.w3.org/2006/vcard/ns#
skos     http://www.w3.org/2004/02/skos/core#

Data model

A record from a source mass spectrometry dataset is represented as a mass spectrometry experiment (class sio:SIO_001180). The measured compound (class sio:SIO_011125) is related to the experiment as its input (property sio:SIO_000230). Similarly, the mass spectrum (class obo:MS_1000294) is related to the experiment as its output (property sio:SIO_000229). The mass spectrum entity refers (via property sio:SIO_000300) to the mass spectrum literal containing the measured data, that is, the intensities at the individual mass-to-charge ratios.

The experiment, input compound, and output mass spectrum contain attributes encoded as separate entities. Each attribute has a type (property rdf:type) and a value (property sio:SIO_000300). Where necessary, the attribute also has a unit (property sio:SIO_000221) in addition to the value.

The experiment parameters, such as ion mode or collision energy, are encoded as attributes of appropriate types derived from the PSI–MS vocabulary and linked to the experiment (property sio:SIO_000553). Attributes representing chemical qualities and compound identifiers are categorized based on appropriate types from the CHEMINF ontology and linked to a compound (properties sio:SIO_000011 and sio:SIO_000672 respectively). In the case of the mass spectrum, its SPLASH identifier is represented in a similar way.

If the original record contains annotations of peaks, each annotated peak is represented as a separate entity (class obo:MS_1000231) and connected to the spectrum as its component part (property sio:SIO_000313) on a given mass-to-charge ratio position (property sio:SIO_000056).

Annotations of experiments (called tags in the MoNA database) and peaks are encoded as attributes of the type annotation (class sio:SIO_001166) and related to the corresponding entity (property sio:SIO_000254).

To preserve information about the origins of the records, the experiments are organized (property sio:SIO_001278) into datasets (class sio:SIO_000089) based on their original sources. Each experiment is also connected (property sio:SIO_000066) to the person (class vcard:Individual) who submitted the corresponding record to the original database.

The basic relationships between main entities are expressed in the following graph:

Entities

Experiment

The experiment entity describes a mass spectrometry experiment (process) that generates a mass spectrum (process output) for a given compound (process input). Each experiment entity can have the following (outgoing) properties:
property              name             value
rdf:type              type             sio:SIO_001180 (mass spectrometry experiment)
sio:SIO_000230        has input        instance of compound
sio:SIO_000229        has output       instance of spectrum
sio:SIO_000066        has provider     instance of submitter
sio:SIO_001278        is data item in  instance of library
sio:SIO_000255        has annotation   instance of annotation
sio:SIO_000552        has parameter    instance of parameter
dcterms:created       created          xsd:date literal
dcterms:dateAccepted  curated          xsd:date literal
dcterms:modified      updated          xsd:date literal
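
The outgoing properties listed above can be combined in a single query. The following is a minimal sketch that lists experiments together with the title of their source library and their creation date; it assumes that the creation date may be missing for some records, so that part of the pattern is optional. The LIMIT value is arbitrary.

```sparql
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms: <http://purl.org/dc/terms/>

# list experiments with their source library and creation date
SELECT ?EXPERIMENT ?LIBRARY_TITLE ?CREATED WHERE
{
  ?EXPERIMENT rdf:type sio:SIO_001180. # mass spectrometry experiment
  ?EXPERIMENT sio:SIO_001278 ?LIBRARY. # is data item in
  ?LIBRARY dcterms:title ?LIBRARY_TITLE.

  # assumption: not every record carries a creation date
  OPTIONAL { ?EXPERIMENT dcterms:created ?CREATED. }
}
LIMIT 20
```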

Experiment parameters

Experiment parameters are encoded as separate entities. Each parameter has a type and, depending on that type, may have a value and a unit for that value. The possible properties are:
property        name
rdf:type        type
sio:SIO_000300  has value
sio:SIO_000221  has unit
Below are the allowable parameter types, the corresponding value types, and possible units, if applicable:
parameter                         type                       value class  unit class
ms level                          obo:MS_1000511             xsd:int      no unit
ionization mode                   obo:MS_1000129 (negative)  no value     no unit
                                  obo:MS_1000130 (positive)
ionization type                   obo:MS_1000070 (APCI)      no value     no unit
                                  obo:MS_1000071 (CI)
                                  obo:MS_1000073 (ESI)
                                  obo:MS_1000074 (FAB)
                                  obo:MS_1000075 (MALDI)
                                  obo:MS_1000258 (FI)
                                  obo:MS_1000389 (EI)
                                  obo:MS_1000398 (nanoES)
precursor type                    obo:MS_1002813             xsd:string   no unit
precursor mz                      obo:MS_1000744             xsd:float    obo:MS_1000040 (mz)
instrument type                   obo:MS_1000463             xsd:string   no unit
instrument model                  obo:MS_1000031             xsd:string   no unit
retention time                    obo:MS_1000894             xsd:float    obo:UO_0000010 (second)
                                                                          obo:UO_0000031 (minute)
                                                                          or none
collision energy                  obo:MS_1000045             xsd:float    obo:UO_0000218 (volt)
                                                                          obo:UO_0000248 (kilovolt)
                                                                          obo:UO_0000266 (electronvolt)
                                                                          or none
collision energy ramp             obo:MS_1002013 (start)     xsd:float    obo:UO_0000218 (volt)
                                  obo:MS_1002014 (end)                    obo:UO_0000266 (electronvolt)
                                                                          or none
normalized collision energy       obo:MS_1000138             xsd:float    obo:UO_0000190 (ratio unit)
normalized collision energy ramp  obo:MS_1002218 (start)     xsd:float    obo:UO_0000190 (ratio unit)
                                  obo:MS_1002219 (end)
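
Because every parameter is a separate entity, selecting experiments by their parameters means matching the parameter type first and then reading its value and unit. The following sketch selects experiments measured in positive ionization mode and returns their collision energies. The unit is read in an OPTIONAL block because the table above allows a collision energy without a unit; the LIMIT value is arbitrary.

```sparql
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# select collision energies of experiments measured in positive ionization mode
SELECT ?EXPERIMENT ?ENERGY ?UNIT WHERE
{
  # the ionization mode is a parameter without a value; only its type matters
  ?EXPERIMENT sio:SIO_000552 ?MODE. # has parameter
  ?MODE rdf:type obo:MS_1000130. # positive

  # the collision energy is a parameter with a value and possibly a unit
  ?EXPERIMENT sio:SIO_000552 ?PARAMETER. # has parameter
  ?PARAMETER rdf:type obo:MS_1000045. # collision energy
  ?PARAMETER sio:SIO_000300 ?ENERGY. # has value
  OPTIONAL { ?PARAMETER sio:SIO_000221 ?UNIT. } # has unit
}
LIMIT 20
```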

Compound

Each compound is represented by a standalone entity. The following (outgoing) properties are possible:
property         name            value
rdf:type         type            sio:SIO_011125 (molecule)
                                 ClassyFire class
                                 ChEBI class
skos:closeMatch  related to      MeSH class
sio:SIO_000231   is input in     instance of experiment
sio:SIO_000008   has attribute   instance of descriptor
sio:SIO_000671   has identifier  instance of identifier

Compound descriptors

Various properties of a compound (arising from its structure) are encoded as its descriptors. Each descriptor has a type and a value and, optionally, a unit for that value. The possible properties are:
property        name
rdf:type        type
sio:SIO_000300  has value
sio:SIO_000221  has unit
Below are the allowable descriptor types, the corresponding value types, and possible units, if applicable:
descriptor name    descriptor class    descriptor value class  descriptor unit class
molfile            sio:SIO_011120      xsd:string              no unit
name               sio:CHEMINF_000043  xsd:string              no unit
InChI              sio:CHEMINF_000113  xsd:string              no unit
InChIKey           sio:CHEMINF_000059  xsd:string              no unit
molecular formula  sio:CHEMINF_000042  xsd:string              no unit
SMILES             sio:CHEMINF_000018  xsd:string              no unit
exact mass         sio:CHEMINF_000217  xsd:float               obo:UO_0000055 (molar mass)
monoisotopic mass  sio:CHEMINF_000218  xsd:float               obo:UO_0000055 (molar mass)
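
As an illustration of the descriptor pattern, the following sketch lists all descriptors of compounds that have a given InChIKey. The InChIKey shown is that of aspirin and serves only as an example. Comparing through STR() is a defensive assumption, since the exact datatype of the stored literal may differ between endpoints.

```sparql
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# list all descriptors of compounds with a given InChIKey
SELECT ?COMPOUND ?TYPE ?VALUE ?UNIT WHERE
{
  # find compounds by their InChIKey descriptor
  ?COMPOUND sio:SIO_000008 ?KEY. # has attribute
  ?KEY rdf:type sio:CHEMINF_000059. # InChIKey
  ?KEY sio:SIO_000300 ?KEYVALUE. # has value
  FILTER(STR(?KEYVALUE) = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N")

  # read the type, value, and optional unit of every descriptor
  ?COMPOUND sio:SIO_000008 ?DESCRIPTOR. # has attribute
  ?DESCRIPTOR rdf:type ?TYPE.
  ?DESCRIPTOR sio:SIO_000300 ?VALUE. # has value
  OPTIONAL { ?DESCRIPTOR sio:SIO_000221 ?UNIT. } # has unit
}
```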

Compound identifiers

A compound identifier is represented as a separate entity having its own type and string value:
property        name
rdf:type        type
sio:SIO_000300  has value
The following types of compound identifiers are possible:
identifier name                     identifier class
CAS registry number                 sio:CHEMINF_000446
HMDB identifier                     sio:CHEMINF_000408
ChEBI identifier                    sio:CHEMINF_000407
ChemSpider identifier               sio:CHEMINF_000405
KEGG identifier                     sio:CHEMINF_000409
LipidMaps identifier                sio:CHEMINF_000564
PubChem compound identifier (CID)   sio:CHEMINF_000140
PubChem substance identifier (SID)  sio:CHEMINF_000141
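
Identifiers make it possible to look up spectra by a well-known external key. The following sketch selects spectra of compounds with a given CAS registry number; the number 50-78-2 (aspirin) is only an example, and the STR() comparison is an assumption made to avoid depending on the literal datatype.

```sparql
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# select spectra of compounds with a given CAS registry number
SELECT ?COMPOUND ?SPECTRUM WHERE
{
  # find the compound by its identifier entity
  ?COMPOUND sio:SIO_000671 ?IDENTIFIER. # has identifier
  ?IDENTIFIER rdf:type sio:CHEMINF_000446. # CAS registry number
  ?IDENTIFIER sio:SIO_000300 ?CAS. # has value
  FILTER(STR(?CAS) = "50-78-2")

  # select mass spectra of the compounds found
  ?EXPERIMENT sio:SIO_000230 ?COMPOUND. # has input
  ?EXPERIMENT sio:SIO_000229 ?SPECTRUM. # has output
}
```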

Spectrum

Each mass spectrum is represented as a separate entity, which has the measured spectrum as its literal value. If the spectrum contains annotated peaks, these are represented as component parts of the spectrum.
property        name                value
rdf:type        type                obo:MS_1000294 (mass spectrum)
                                    obo:MS_1000579 (MS1 spectrum)
                                    obo:MS_1000580 (MSn spectrum)
sio:SIO_000232  is output of        instance of experiment
sio:SIO_000671  has identifier      instance of identifier
sio:SIO_000369  has component part  instance of peak
sio:SIO_000300  has value           ms:spectrum literal

Spectrum identifiers

A spectrum identifier is represented as a separate entity having its own type and string value:
property        name
rdf:type        type
sio:SIO_000300  has value
The following types of spectrum identifiers are possible:
identifier name  identifier class
SPLASH key       obo:MS_1002599
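
A short sketch of how the spectrum types and the SPLASH identifier fit together: the query below lists MSn spectra with their SPLASH keys. It assumes that MSn spectra are typed directly with obo:MS_1000580, as the spectrum property table suggests; the LIMIT value is arbitrary.

```sparql
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# list MSn spectra together with their SPLASH keys
SELECT ?SPECTRUM ?SPLASH WHERE
{
  ?SPECTRUM rdf:type obo:MS_1000580. # MSn spectrum
  ?SPECTRUM sio:SIO_000671 ?IDENTIFIER. # has identifier
  ?IDENTIFIER rdf:type obo:MS_1002599. # SPLASH key
  ?IDENTIFIER sio:SIO_000300 ?SPLASH. # has value
}
LIMIT 20
```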

Annotated peak

If a spectrum contains an annotation of a peak, this peak is represented as a standalone entity, which allows the annotation to be attached to the peak. The m/z value of the peak represents its position in the spectrum.
property        name                  value
rdf:type        type                  obo:MS_1000231 (peak)
sio:SIO_000313  is component part of  instance of spectrum
sio:SIO_000255  has annotation        instance of annotation
sio:SIO_000056  position              xsd:float literal

Annotation

Annotations are represented as separate entities that can be attached either to a peak or to an experiment (in this case, they are equivalent to MoNA tags).
property        name              value
rdf:type        type              sio:SIO_001166 (annotation)
sio:SIO_000254  is annotation of  instance of experiment or peak
sio:SIO_000300  has value         xsd:string literal
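
Peaks and annotations can be read together by following the path from a spectrum to its annotated peaks and on to the annotation text. The query below is a minimal sketch of that traversal; the ordering and the LIMIT value are arbitrary choices.

```sparql
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# list annotated peaks with their positions and annotation texts
SELECT ?SPECTRUM ?MZ ?TEXT WHERE
{
  ?SPECTRUM sio:SIO_000369 ?PEAK. # has component part
  ?PEAK rdf:type obo:MS_1000231. # peak
  ?PEAK sio:SIO_000056 ?MZ. # position

  ?PEAK sio:SIO_000255 ?ANNOTATION. # has annotation
  ?ANNOTATION rdf:type sio:SIO_001166. # annotation
  ?ANNOTATION sio:SIO_000300 ?TEXT. # has value
}
ORDER BY ?SPECTRUM ?MZ
LIMIT 50
```

Replacing ?PEAK with an experiment variable in the last three patterns would, under the same model, return the experiment tags instead.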

Library

Individual experiments (spectra) are grouped into datasets based on their origin. These datasets are represented as (data) libraries.
property             name           value
rdf:type             type           sio:SIO_000089 (dataset)
sio:SIO_001277       has data item  instance of experiment
dcterms:title        title          xsd:string literal
dcterms:description  description    xsd:string literal
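
To see which libraries are available and how large they are, the experiments can be counted per library. The following is a sketch using standard SPARQL 1.1 aggregation; it assumes that every library has a title.

```sparql
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms: <http://purl.org/dc/terms/>

# count experiments in each library
SELECT ?LIBRARY ?TITLE (COUNT(?EXPERIMENT) AS ?EXPERIMENTS) WHERE
{
  ?LIBRARY rdf:type sio:SIO_000089. # dataset
  ?LIBRARY dcterms:title ?TITLE.
  ?LIBRARY sio:SIO_001277 ?EXPERIMENT. # has data item
}
GROUP BY ?LIBRARY ?TITLE
ORDER BY DESC(?EXPERIMENTS)
```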

Submitter

This entity describes information about the person who submitted the record to the original dataset.
property                 name            value
rdf:type                 type            vcard:Individual
sio:SIO_000064           is provider of  instance of experiment
vcard:given-name         first name      xsd:string literal
vcard:family-name        last name       xsd:string literal
vcard:hasEmail           email address   email IRI
vcard:organization-name  institution     xsd:string literal

Mass spectrometry functions

Greedy Cosine

Calculate ‘cosine similarity score’ between two spectra.

ms:cosineGreedy(spectrum 1, spectrum 2, tolerance, mz power, intensity power)

The cosine score quantifies the similarity between two mass spectra. It is calculated by finding the best possible matches between the peaks of the two spectra. Two peaks are considered a potential match if their m/z ratios lie within the given ‘tolerance’. The underlying peak assignment problem is solved here in a ‘greedy’ way. This can be notably faster, but the result occasionally deviates slightly from the fully correct solution obtained with the Hungarian algorithm (see Hungarian Cosine). In practice, this rarely affects similarity scores noticeably, particularly for smaller tolerances. The implementation of this function (as well as this description) was taken from the matchms project.

The function has these optional parameters:

tolerance
    Two peaks are considered a match when their m/z values differ by at most the tolerance. The default is 0.1.
mz power
    The power to which m/z is raised in the cosine function. The default is 0, in which case the peak intensity products do not depend on the m/z ratios.
intensity power
    The power to which the intensity is raised in the cosine function. The default is 1.
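
The optional parameters can be supplied after the two spectra. The following sketch assumes that they are passed positionally in the order given in the signature above, and that plain numeric literals are accepted; neither detail is stated explicitly in this manual. The chosen values (tolerance 0.05, intensity power 0.5) are illustrative only.

```sparql
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX ms: <http://bioinfo.uochb.cas.cz/rdf/v1.0/ms#>

# greedy cosine search with explicit optional parameters
SELECT ?SPECTRUM ?SCORE WHERE
{
  # bind the spectrum query into variable ?MSQUERY
  BIND("""39:5 40:6 44:1 63:6 76:6 91:7 113:6
          114:2 115:5 142:3 170:3 171:7 182:6
          210:100 212:2"""^^ms:spectrum AS ?MSQUERY)

  # select all spectra produced by experiments
  ?EXPERIMENT sio:SIO_000229 ?SPECTRUM. # has output
  ?SPECTRUM sio:SIO_000300 ?MSVALUE. # has value

  # assumed argument order: tolerance, mz power, intensity power
  BIND(ms:cosineGreedy(?MSQUERY, ?MSVALUE, 0.05, 0, 0.5) AS ?SCORE)

  # keep only results with similarity scores of at least 0.85
  FILTER(?SCORE >= 0.85)
}
ORDER BY DESC(?SCORE)
```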

Hungarian Cosine

Calculate ‘cosine similarity score’ between two spectra (using Hungarian algorithm).

ms:cosineHungarian(spectrum 1, spectrum 2, tolerance, mz power, intensity power)

The cosine score quantifies the similarity between two mass spectra. It is calculated by finding the best possible matches between the peaks of the two spectra. Two peaks are considered a potential match if their m/z ratios lie within the given ‘tolerance’. The underlying peak assignment problem is solved here using the Hungarian algorithm. This can be notably slower than the ‘greedy’ implementation in Greedy Cosine, but it yields a mathematically proper solution to the problem. The implementation of this function (as well as this description) was taken from the matchms project.

The function has these optional parameters:

tolerance
    Two peaks are considered a match when their m/z values differ by at most the tolerance. The default is 0.1.
mz power
    The power to which m/z is raised in the cosine function. The default is 0, in which case the peak intensity products do not depend on the m/z ratios.
intensity power
    The power to which the intensity is raised in the cosine function. The default is 1.
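
For reference, the score computed by both cosine variants can be written as follows. This is a sketch that assumes the implementation follows matchms, from which it was taken. The notation is introduced here and is not part of the original description: peaks $(m_i, I_i)$ of the first spectrum, peaks $(m'_j, I'_j)$ of the second spectrum, $a$ for mz power, $b$ for intensity power, $t$ for tolerance, and $M$ for the set of matched peak pairs.

```latex
\[
w_i = m_i^{\,a}\, I_i^{\,b}, \qquad
w'_j = {m'_j}^{\,a}\, {I'_j}^{\,b}
\]
\[
\mathrm{score} =
\frac{\sum_{(i,j) \in M} w_i\, w'_j}
     {\sqrt{\sum_i w_i^2}\; \sqrt{\sum_j {w'_j}^2}},
\qquad
|m_i - m'_j| \le t \ \text{ for all } (i,j) \in M
\]
```

Each peak appears in at most one pair of $M$. With the defaults $a = 0$ and $b = 1$, the weights reduce to the peak intensities. The greedy and Hungarian variants differ only in how $M$ is chosen.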

Modified Cosine

Calculate ‘modified cosine score’ between mass spectra.

ms:modifiedCosine(spectrum 1, spectrum 2, shift, tolerance, mz power, intensity power)

The modified cosine score quantifies the similarity between two mass spectra. It is calculated by finding the best possible matches between the peaks of the two spectra. Two peaks are considered a potential match if their m/z ratios lie within the given ‘tolerance’, or if they lie within the tolerance once a mass shift is applied. The mass shift, passed as the shift argument, is the difference in precursor m/z between the two spectra. See Watrous et al. (PNAS, 2012) for further details. The implementation of this function (as well as this description) was taken from the matchms project.

The function has these optional parameters:

tolerance
    Two peaks are considered a match when their m/z values differ by at most the tolerance. The default is 0.1.
mz power
    The power to which m/z is raised in the cosine function. The default is 0, in which case the peak intensity products do not depend on the m/z ratios.
intensity power
    The power to which the intensity is raised in the cosine function. The default is 1.
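
Since the mass shift is the difference in precursor m/z, it can be derived from the precursor mz parameter stored with each experiment. The following sketch makes several assumptions that this manual does not confirm: that the shift is passed as the third argument, that it is computed as the query precursor m/z minus the library precursor m/z (the sign convention may be the opposite), and that a computed numeric value is accepted. The query precursor m/z of 211.0 is a hypothetical value chosen for illustration.

```sparql
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ms: <http://bioinfo.uochb.cas.cz/rdf/v1.0/ms#>

# modified cosine search with the shift derived from precursor m/z values
SELECT ?SPECTRUM ?SCORE WHERE
{
  # bind the spectrum query and its (hypothetical) precursor m/z
  BIND("""39:5 40:6 44:1 63:6 76:6 91:7 113:6
          114:2 115:5 142:3 170:3 171:7 182:6
          210:100 212:2"""^^ms:spectrum AS ?MSQUERY)
  BIND(211.0 AS ?QUERY_MZ)

  # select spectra together with the precursor m/z of their experiments
  ?EXPERIMENT sio:SIO_000229 ?SPECTRUM. # has output
  ?SPECTRUM sio:SIO_000300 ?MSVALUE. # has value
  ?EXPERIMENT sio:SIO_000552 ?PARAMETER. # has parameter
  ?PARAMETER rdf:type obo:MS_1000744. # precursor mz
  ?PARAMETER sio:SIO_000300 ?MZ. # has value

  # assumed sign convention: query precursor m/z minus library precursor m/z
  BIND(?QUERY_MZ - ?MZ AS ?SHIFT)
  BIND(ms:modifiedCosine(?MSQUERY, ?MSVALUE, ?SHIFT) AS ?SCORE)

  # keep only results with similarity scores of at least 0.85
  FILTER(?SCORE >= 0.85)
}
ORDER BY DESC(?SCORE)
```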

Examples

The following examples of SPARQL queries illustrate how mass spectra stored in IDSM can be searched and combined with other data stored in IDSM or in other databases (available via a SPARQL endpoint).

Query 1

The example query selects all compounds whose spectra are similar to a given query spectrum. To calculate similarities between mass spectra, the Hungarian variant of the cosine similarity (ms:cosineHungarian) is used.
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX ms: <http://bioinfo.uochb.cas.cz/rdf/v1.0/ms#>

# select compounds whose spectra are similar to a given spectrum
SELECT ?COMPOUND ?SCORE WHERE
{
  # bind the spectrum query into variable ?MSQUERY
  BIND("""39:5 40:6 44:1 63:6 76:6 91:7 113:6
          114:2 115:5 142:3 170:3 171:7 182:6
          210:100 212:2"""^^ms:spectrum AS ?MSQUERY)

  # select all spectra and corresponding input compounds
  ?EXPERIMENT sio:SIO_000230 ?COMPOUND. # has input 
  ?EXPERIMENT sio:SIO_000229 ?SPECTRUM. # has output
  ?SPECTRUM sio:SIO_000300 ?MSVALUE. # has value

  # compute similarities between the spectrum query and the selected spectra
  BIND(ms:cosineHungarian(?MSQUERY, ?MSVALUE) as ?SCORE)

  # filter out results with similarity scores less than 0.85
  FILTER(?SCORE >= 0.85)
}
ORDER BY DESC(?SCORE)

Query 2

The example query selects mass spectra of compounds that contain the structure of aspirin as their substructure. The query structure is encoded by SMILES.
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX sachem: <http://bioinfo.uochb.cas.cz/rdf/v1.0/sachem#>

# select spectra of compounds that contain a given substructure
SELECT ?COMPOUND ?SPECTRUM ?MSVALUE WHERE 
{
  # search for compounds containing the substructure
  ?COMPOUND sachem:substructureSearch [
      sachem:query "CC(=O)OC1=CC=CC=C1C(=O)O";
      sachem:tautomerMode sachem:inchiTautomers ].

  # select mass spectra of the compounds found
  ?EXPERIMENT sio:SIO_000230 ?COMPOUND. # has input 
  ?EXPERIMENT sio:SIO_000229 ?SPECTRUM. # has output
  ?SPECTRUM sio:SIO_000300 ?MSVALUE. # has value
}

Query 3

The example query selects mass spectra of compounds that were tested positive against the target protein:ACCQ9UPY5 (cystine/glutamate transporter).
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX bao: <http://www.bioassayontology.org/bao#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX protein: <http://rdf.ncbi.nlm.nih.gov/pubchem/protein/>
PREFIX vocab: <http://rdf.ncbi.nlm.nih.gov/pubchem/vocabulary#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# select spectra of compounds that were tested positive against a given target
SELECT DISTINCT ?COMPOUND ?SPECTRUM ?MSVALUE WHERE 
{
  # select mass spectra of compounds
  ?EXPERIMENT sio:SIO_000230 ?COMPOUND. # has input 
  ?EXPERIMENT sio:SIO_000229 ?SPECTRUM. # has output
  ?SPECTRUM sio:SIO_000300 ?MSVALUE. # has value  

  # select pubchem equivalents of the compounds
  ?COMPOUND skos:closeMatch ?PUBCHEM_COMPOUND.

  # limit them to those that were tested positive against the target
  ?BIOASSAY bao:BAO_0000209 ?MEASUREGROUP. # has measure group
  ?MEASUREGROUP obo:RO_0000057 protein:ACCQ9UPY5. # has participant
  ?MEASUREGROUP obo:OBI_0000299 ?ENDPOINT. # has specified output
  ?ENDPOINT vocab:PubChemAssayOutcome vocab:active.
  ?ENDPOINT obo:IAO_0000136 ?SUBSTANCE. # is about
  ?SUBSTANCE sio:CHEMINF_000477 ?PUBCHEM_COMPOUND. # has normalized counterpart
}

Query 4

The example query selects mass spectra that are similar to the mass spectra of compounds that were tested positive against the target protein:ACCQ9UPY5 (cystine/glutamate transporter).
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX bao: <http://www.bioassayontology.org/bao#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX protein: <http://rdf.ncbi.nlm.nih.gov/pubchem/protein/>
PREFIX vocab: <http://rdf.ncbi.nlm.nih.gov/pubchem/vocabulary#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX ms: <http://bioinfo.uochb.cas.cz/rdf/v1.0/ms#>

# select spectra similar to spectra of compounds tested positive against a given target
SELECT DISTINCT ?SPECTRUM ?MSVALUE WHERE
{
  # select spectra of compounds that were tested positive against the target
  {
    SELECT DISTINCT ?MSQUERY WHERE
    {
      # select mass spectra of compounds
      ?EXPERIMENT sio:SIO_000229 ?SPECTRUM. # has output
      ?EXPERIMENT sio:SIO_000230 ?COMPOUND. # has input
      ?SPECTRUM sio:SIO_000300 ?MSQUERY. # has value

      # select pubchem equivalents of the compounds
      ?COMPOUND skos:closeMatch ?PUBCHEM_COMPOUND. # close match

      # limit them to those that were tested positive against the target
      ?BIOASSAY bao:BAO_0000209 ?MEASUREGROUP. # has measure group
      ?MEASUREGROUP obo:RO_0000057 protein:ACCQ9UPY5. # has participant
      ?MEASUREGROUP obo:OBI_0000299 ?ENDPOINT. # has specified output
      ?ENDPOINT vocab:PubChemAssayOutcome vocab:active.
      ?ENDPOINT obo:IAO_0000136 ?SUBSTANCE. # is about
      ?SUBSTANCE sio:CHEMINF_000477 ?PUBCHEM_COMPOUND. # has normalized counterpart
    }
  }

  # select all mass spectra
  ?SPECTRUM sio:SIO_000300 ?MSVALUE.

  # compute similarities between the spectrum queries and the selected spectra
  BIND(ms:cosineGreedy(?MSQUERY, ?MSVALUE) as ?SCORE)

  # filter out results with similarity scores less than 0.85
  FILTER(?SCORE >= 0.85)
}

Query 5

The example query selects mass spectra of compounds that occur in reactions that involve L-glutamate (CHEBI:29985). To select compounds based on their occurrence in reactions, the Rhea service is used.
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX chebi: <http://purl.obolibrary.org/obo/chebi/>
PREFIX sachem: <http://bioinfo.uochb.cas.cz/rdf/v1.0/sachem#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rh: <http://rdf.rhea-db.org/>
PREFIX CHEBI: <http://purl.obolibrary.org/obo/CHEBI_>

# select mass spectra of compounds that occur in reactions that involve CHEBI:29985
SELECT ?COMPOUND ?SPECTRUM WHERE
{
  # use Rhea to select compounds occurring in reactions in which CHEBI:29985 occurs
  SERVICE <https://sparql.rhea-db.org/sparql>
  {
    SELECT DISTINCT ?CHEBI WHERE
    {
      ?rhea rdfs:subClassOf rh:Reaction .
      ?rhea rh:side/rh:contains/rh:compound ?compound .
      ?rhea rh:side/rh:contains/rh:compound ?compound1 .
      ?compound (rh:chebi|(rh:reactivePart/rh:chebi)|(rh:underlyingChebi/rh:chebi)) CHEBI:29985.
      ?compound1 rh:chebi ?CHEBI
    }
  }

  # select InChI descriptors of the ChEBI compounds found
  ?CHEBI chebi:inchi ?INCHI.

  # select all spectra and corresponding input compounds
  ?EXPERIMENT sio:SIO_000230 ?COMPOUND. # has input 
  ?EXPERIMENT sio:SIO_000229 ?SPECTRUM. # has output
  ?SPECTRUM sio:SIO_000300 ?MSVALUE. # has value

  # keep only compounds that have a matching InChI descriptor
  ?COMPOUND sio:SIO_000008 ?DESCRIPTOR. # has attribute
  ?DESCRIPTOR rdf:type sio:CHEMINF_000113. # InChI descriptor
  ?DESCRIPTOR sio:SIO_000300 ?INCHI. # has value
}