standard.md: 74 additions & 21 deletions
@@ -201,28 +201,40 @@ See the machine-readable table with precise data type definitions here: [ssf_col
201
201
202
202
<!-- tabs:start -->
203
203
204
-
#### **v2.7.1**
204
+
#### **v3.0.0**
205
205
206
-
# The Poseidon Standard
206
+
Poseidon is a solution for archaeogenetic genotype data organisation. It is geared towards human data, but is to a large extent species-agnostic and can also be used to track archaeogenetic data of non-human species.
207
207
208
-
Poseidon is a solution for archaeogenetic genotype data organisation. This standard defines the core components of the Poseidon package.
208
+
This standard defines a data structure: the **Poseidon package**. A Poseidon package stores genotype data with meta- and context information.
209
209
210
210
A .pdf version of the latest instance of this document can be downloaded [here](https://github.com/poseidon-framework/poseidon-schema/blob/master/poseidon_package_specification.pdf).
211
211
212
212
Further details on [genotype data](https://poseidon-framework.github.io/#/genotype_data), the [.janno file](https://poseidon-framework.github.io/#/janno_details) and the [.ssf file](https://poseidon-framework.github.io/#/ssf_details) are documented on the Poseidon website.
213
213
214
-
The website also features a changelog documenting the changes across different schema versions [here](https://poseidon-framework.github.io/#/changelog).
214
+
A changelog documents the changes across different schema versions [here](https://github.com/poseidon-framework/poseidon-schema/blob/master/changelog.md).
215
215
216
216
The key words *MUST*, *MUST NOT*, *REQUIRED*, *SHALL*, *SHALL NOT*, *SHOULD*, *SHOULD NOT*, *RECOMMENDED*, *MAY*, and *OPTIONAL* in this document are to be interpreted as described in [RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119).
217
217
218
+
### Primary entities of a Poseidon package
219
+
220
+
The main operational entities in a Poseidon package are discrete sets of genotype data attributed to a single human or non-human individual, scientifically generated for archaeogenetic research questions. Within a Poseidon package each of these sets is assigned a unique identifier: the `Poseidon_ID`.
221
+
222
+
Generally, archaeogenetics operates on depositional contexts, e.g. graves, with one or multiple (ancient) human or non-human individuals. Usually, it is possible to attribute the (skeletal) remains within these contexts to individuals based on archaeological evidence and physical-anthropological analysis. Each individual can get sampled one or multiple times, either by directly probing their preserved tissue, or by sampling any reagent that contains their DNA (through whatever pathway or taphonomic process). From one such sample one or multiple extracts can be derived, which can be transformed into one or multiple libraries, which may or may not be subjected to a DNA capture protocol and then sequenced one or multiple times. The raw sequencing data can undergo various different forms of computational processing and eventually genotyping to produce the data relevant for most derived analyses and thus stored in a Poseidon package.
223
+
224
+
While the wetlab-processes yield a relatively predictable tree of separate physical and digital products for any given sample, the computational data-processing breaks the conceptual tree-ness by allowing for arbitrary conflation of sequencing data obtained through potentially separate means: Data from different libraries, for example, may be merged if they are from the same individual, even if they are not from the same sample.
225
+
226
+
`Poseidon_ID`s therefore represent one consciously selected end-point in the complex data preparation graph laid out above. Typically this end-point corresponds to an optimal result for a given individual, research question and publication.
227
+
228
+
For the sake of convenience and despite the lack of conceptual clarity, below we sometimes use the term *sample* to denote `Poseidon_ID` entities. Data aggregation on the level of physical samples is often sensible, and the term is conventionally used for analysis endpoints in the community of practice.
229
+
218
230
### The Poseidon package structure
219
231
220
-
A Poseidon package stores genotype data with context information for DNA samples from (ancient) (human) individuals. Packages are defined by the POSEIDON.yml file, which holds relative paths to all other files in a package.
232
+
A Poseidon package is defined by the POSEIDON.yml file, which holds relative paths to all other files in the package.
221
233
222
234
A package therefore MUST contain:
223
235
224
236
- A `POSEIDON.yml` file to formally define the package
225
-
- Genotype data in PLINKor EIGENSTRAT format
237
+
- Genotype data in PLINK, EIGENSTRAT or VCF format
226
238
227
239
It SHOULD additionally contain:
228
240
@@ -237,7 +249,7 @@ It MAY also contain:
237
249
238
250
Here is an example of a package `Switzerland_LNBA_Roswita` in one directory:
All text files in the package MUST be UTF-8 encoded.
264
+
### Text encoding
265
+
266
+
All text files in the package MUST be UTF-8 encoded. They SHOULD use Unix-style line endings, i.e. a single Line Feed (LF, `\n`) character rather than the Carriage Return and Line Feed (CRLF, `\r\n`) pair used by MS-DOS and Windows.
267
+
268
+
`Poseidon_ID`s and `Group_Name`s, i.e. the primary sample and group identifiers across `.janno`, `.ssf`, and genotype data files, SHOULD contain only characters from a subset of the 7-bit ASCII character set: the alphanumeric characters `A-Z`, `a-z`, `0-9`, and the symbols `_` (underscore), `-` (hyphen-minus), and `.` (period, dot or full stop).
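The recommended character set for identifiers can be checked mechanically. The following is a minimal sketch in Python; the names `ID_PATTERN` and `is_recommended_id` are hypothetical and not part of any Poseidon tool, and the pattern is only a direct reading of the character list above.

```python
import re

# Assumption: this pattern simply mirrors the recommended character set
# (A-Z, a-z, 0-9, underscore, hyphen-minus, period). It is illustrative only.
ID_PATTERN = re.compile(r"[A-Za-z0-9_.\-]+")


def is_recommended_id(identifier):
    """Return True if the identifier is non-empty and uses only recommended characters."""
    return ID_PATTERN.fullmatch(identifier) is not None
```

Because the rule is a SHOULD rather than a MUST, a check like this is better suited to raising a warning than to rejecting a package.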
253
269
254
270
### The `POSEIDON.yml` file
255
271
@@ -260,7 +276,7 @@ The `POSEIDON.yml` file defines Poseidon packages by listing metainformation and
260
276
261
277
Here is an example for a `POSEIDON.yml` file:
262
278
263
-
```
279
+
```yml
264
280
poseidonVersion: 2.7.1
265
281
title: Switzerland_LNBA_Roswita
266
282
description: LNBA Switzerland genetic data not yet published
@@ -272,6 +288,10 @@ contributor:
272
288
email: paul.panther@example.edu
273
289
packageVersion: 1.1.2
274
290
lastModified: 2021-01-28
291
+
license:
292
+
name: CC BY 4.0
293
+
url: https://creativecommons.org/licenses/by/4.0/
294
+
file: license.md
275
295
genotypeData:
276
296
format: PLINK
277
297
genoFile: Switzerland_LNBA_Roswita.bed
@@ -322,27 +342,44 @@ When the `packageVersion` is changed, then the `lastModified` date MUST be updat
322
342
323
343
Packages SHOULD start at `packageVersion` `0.1.0`.
324
344
345
+
### Data licensing and the license.md file
346
+
347
+
Data licences are a common way to grant the public permission to use a dataset under copyright law.
348
+
349
+
Poseidon packages MAY specify a license, and if so, SHOULD use [Creative Commons licences](https://creativecommons.org/share-your-work/cclicenses).
350
+
351
+
Licences are documented in the `POSEIDON.yml` file in the `license` section, either with just the `name`, or with a licence `file`, or with both the `name` and a `file`. `name` SHOULD include a short string with the name and version of the licence, e.g. `CC BY 4.0`. The `file`, typically named `license.md`, MAY include the full text of a licence, or a short note further contextualizing the entry in the `name` field. For example:
Both PLINK and EIGENSTRAT formats require three files to be specified. In PLINK, the genotype file is binary (with 2 bits per genotype), while in EIGENSTRAT, the genotype file is text-based (with 8 bits per genotype). The SNP and individual files are text-based for both formats (see links behind the file endings in the table above). The EIGENSTRAT format specifically is common within archaeogenetics and compatible with many important tools, e.g. [EIGENSOFT](https://github.com/DReichLab/EIG) and [ADMIXTOOLS](https://github.com/DReichLab/AdmixTools). Finally, the VCF format is the most formally specified format, with properly versioned specifications being released regularly. VCF is well established in the wider genetics community and is the de-facto standard for storing variants in the field of medical genetics.
334
368
335
-
In addition to these files (and optionally their checksums), the POSEIDON.yml file SHOULD also provide a `snpSet` entry which determines the shape of the genotype file.
369
+
VCF files, as well as genotype and SNP files in PLINK and EIGENSTRAT, can be stored in gzipped form, signified by an additional file ending (`*.gz`).
370
+
371
+
To make VCF files fully convertible to PLINK and EIGENSTRAT, they MUST be biallelic and contain only genotypes coded as `0/0`, `0/1`, `1/1`, `./.`. Furthermore, they MAY encode group names and genetic sex for all samples through special header fields `##group_names=name1,name2,...` and `##genetic_sex=F,U,M,...`, respectively. If these fields are not present, then group names are assumed to be "unknown" and genetic sex "U" (unknown) for all samples.
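The VCF constraints above can be illustrated with a minimal Python sketch. The function names are hypothetical, and two assumptions are made: the genotype strings passed in are already the isolated `GT` values of each sample, and the header lines are given as plain strings. This is not how any Poseidon tool necessarily implements the check.

```python
# Genotype codes permitted by the rule above.
ALLOWED_GENOTYPES = {"0/0", "0/1", "1/1", "./."}


def genotypes_are_convertible(genotypes):
    """Return True if every GT value is one of the four permitted codes."""
    return all(gt in ALLOWED_GENOTYPES for gt in genotypes)


def sample_annotations(header_lines, n_samples):
    """Read group names and genetic sex from the special header fields.

    Falls back to "unknown" and "U" for all samples if a field is absent.
    """
    groups = ["unknown"] * n_samples
    sexes = ["U"] * n_samples
    for line in header_lines:
        line = line.strip()
        if line.startswith("##group_names="):
            groups = line.split("=", 1)[1].split(",")
        elif line.startswith("##genetic_sex="):
            sexes = line.split("=", 1)[1].split(",")
    return groups, sexes
```

A fuller check would also confirm that each header field lists exactly one value per sample; that step is left out here to keep the sketch short.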
336
372
337
373
### The `.janno` file
338
374
339
375
The `.janno` file is a tab-separated text file with a header line. It holds context information (variables/columns) for each sample (objects/rows) in a package.
340
376
341
377
- A set of strictly defined core variables (defined by column name) and their possible content are documented here: [janno_columns.tsv](https://github.com/poseidon-framework/poseidon-schema/blob/master/janno_columns.tsv)
342
378
- A `.janno` file MAY have all of these core variables, or only a subset of them.
343
-
- Only three columns MUST be present to make the file valid: **Poseidon_ID**, **Group_Name** and **Genetic_Sex**
379
+
- Only three columns MUST be present to make the file valid: **Poseidon_ID**, **Group_Name** and **Genetic_Sex**.
344
380
- Arbitrary columns not defined here MAY be added as long as their column names do not clash with the defined ones.
345
-
- The column order is irrelevant.
381
+
- Arbitrary, additional free-text information directly related to a column **<Column_Name>** from the set of specified core variables in [janno_columns.tsv](https://github.com/poseidon-framework/poseidon-schema/blob/master/janno_columns.tsv) SHOULD be added in a column whose name has the form **<Column_Name>_Note**. Example: `Contamination_Note`.
382
+
- The column order is not fixed, but MAY follow the order in [janno_columns.tsv](https://github.com/poseidon-framework/poseidon-schema/blob/master/janno_columns.tsv). **<Column_Name>_Note** columns SHOULD be placed directly after the column they refer to.
346
383
- If information is unknown or a variable does not apply for a certain sample, then the respective cell(s) MAY be filled with `n/a` or simply an empty string.
347
384
- The order of the samples (rows) in the `.janno` file MUST be equal to the order in the genetic data files (`.ind`, `.fam`) in the package.
348
385
- The values in the columns **Poseidon_ID**, **Group_Name** and **Genetic_Sex** MUST be equal to the terms used in the genetic data files (`.ind`, `.fam`).
@@ -355,7 +392,7 @@ A [BibTeX](http://www.bibtex.org/) file with all references listed in the `.jann
@@ -379,7 +416,7 @@ A simple [markdown](https://daringfireball.net/projects/markdown) file with info
379
416
380
417
Example:
381
418
382
-
```
419
+
```default
383
420
This package contains a rather interesting set of samples relevant for the peopling of the Territory of Christmas Island in the Indian Ocean. We consider this especially relevant, because ...
384
421
```
385
422
@@ -389,7 +426,7 @@ A markdown file to document changes in the history of a package.
389
426
390
427
Example:
391
428
392
-
```
429
+
```default
393
430
- V 1.1.1: Fixed a spelling mistake in one site name: "Hosenacker" -> "Rosenacker"
394
431
- V 1.1.0: Added mtDNA contamination estimation to the .janno file
395
432
- V 1.0.0: Added spatial coordinates and age information to the .janno file and finalized a first stable version of the package
@@ -412,6 +449,22 @@ The `.ssf` file is another tab-separated text file with a header line. It stores
412
449
- Multiple predefined columns of the `.ssf` file are list columns that can hold multiple values (either strings or numerics) separated by `;`.
413
450
- The decimal separator for all floating point numbers MUST be `.`.
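The two rules on list columns and decimal separators can be sketched in Python as follows. The helper names are hypothetical, and treating an empty cell as an empty list is an assumption made for this example only.

```python
def parse_list_cell(cell):
    """Split a list-column cell on ';' and trim whitespace around each value."""
    cell = cell.strip()
    if cell == "":
        # Assumption: an empty cell is read as "no values".
        return []
    return [value.strip() for value in cell.split(";")]


def parse_numeric_list_cell(cell):
    """Parse a numeric list column; only '.' is accepted as decimal separator.

    float() raises ValueError for values such as "1,5", which matches the
    requirement that the decimal separator MUST be '.'.
    """
    return [float(value) for value in parse_list_cell(cell)]
```

For instance, `parse_list_cell("a;b")` yields two string values, while `parse_numeric_list_cell("1,5")` fails with a `ValueError` instead of silently misreading the number.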
414
451
452
+
### Details
453
+
454
+
#### The `Capture_Type` .janno column
455
+
456
+
The following protocols are specified:
457
+
458
+
- `Shotgun`: Sequencing without any enrichment (whole genome sequencing, screening etc.).
459
+
- `1240K`: Target enrichment with hybridization capture optimised for sequences covering the 1240k SNP array, see [@Fu2015](https://doi.org/10.1038/nature14558), [@Haak2015](https://doi.org/10.1038/nature14317), [@Mathieson2015](https://doi.org/10.1038/nature16152).
460
+
- `ArborComplete`, `ArborPrimePlus`, `ArborAncestralPlus`: Target enrichment with hybridization capture as provided by Arbor Biosciences in three different kits branded [myBaits Expert Human Affinities](https://arborbiosci.com/genomics/targeted-sequencing/mybaits/mybaits-expert/mybaits-expert-human-affinities).
461
+
- `TwistAncientDNA`: Target enrichment with hybridization capture as provided by Twist Bioscience [@Rohland2022](https://doi.org/10.1101/gr.276728.122).
462
+
- `WISC2013`: Whole genome capture as described by [@Carpenter2013](https://doi.org/10.1016/j.ajhg.2013.10.002).
463
+
- `OtherCapture`: Target enrichment with hybridization capture for any other set of sequences.