Data harmonization

This section provides instructions for harmonizing your individual HBM data and accompanying data, thereby making them more FAIR. Data harmonization improves the comparability of your data with data from other HBM studies and makes your data interoperable with tools such as MCRA (Monte Carlo Risk Assessment software) or the tool for calculating summary statistics of HBM data, which can be made available via IPCHEM and/or integrated into the European HBM dashboard.

User manual

Several files are provided to guide the data harmonization process. The files referred to in the steps below are:

BasicCodebook_v2.3.xlsx
BiomarkerList_v2.26.xlsx
PersonalizedEmptyDataTemplate_BasicCodebook_vx.xlsx
ExampleData_BasicCodebook_v2.3.xlsx

The following steps are required to put your data in the right format:

Step 1: Fill the biomarker list
Step 2: Fill the empty data template using instructions of the codebook
Step 3: Data validation

Step 1. Fill the biomarker list:

In the Excel file BiomarkerList, use the sheet BiomarkerList to indicate which biomarkers were measured in which matrix for your data collection: put a “1” in the cell where the row of the measured biomarker meets the column of the matrix in which it was measured. Send the completed biomarker list to the VITO HBM Data management team: PARC.DATAMANAGEMENT@vito.be. Upon receipt, the VITO Data management team will provide you with a personalized data template for your data collection (Excel format). For each indicated biomarker, the template contains a column for the actual measurement value and two additional columns, one for the LOD and one for the LOQ. If preferred, you can also request the R version of the data template.
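
As an illustration of the marking scheme, the minimal Python sketch below represents a filled biomarker list as a biomarker-by-matrix grid and derives the value, LOD and LOQ columns that a personalized template could contain. The biomarker codes, the column naming pattern and the function name are hypothetical; the actual names are defined by the biomarker list and the codebook.

```python
# Hypothetical biomarker codes. The matrix codes US (urine spot) and
# BS (blood serum) are the ones used in the example data template.
biomarker_list = {
    "biomarker_a": {"US": 1, "BS": None},
    "biomarker_b": {"US": 1, "BS": 1},
    "biomarker_c": {"US": None, "BS": None},  # not measured, no columns
}


def template_columns(marks):
    """Return, per matrix, the columns for every biomarker marked with 1."""
    columns = {}
    for biomarker, matrices in marks.items():
        for matrix, mark in matrices.items():
            if mark == 1:
                # One value column plus one LOD and one LOQ column
                # (the "_lod" / "_loq" suffixes are an assumption).
                columns.setdefault(matrix, []).extend(
                    [biomarker, f"{biomarker}_lod", f"{biomarker}_loq"]
                )
    return columns


columns_by_matrix = template_columns(biomarker_list)
for matrix, cols in sorted(columns_by_matrix.items()):
    print(matrix, cols)
```

Each marked cell therefore translates into three columns in the sheet of the corresponding matrix, while unmarked biomarkers do not appear in the template at all.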

In addition, the data owner must supply extra information in the Information sheet. Please fill in the institute and the data collection ID, name, description and acronyms in the blue boxes, and mark with an “x” the format in which you would like to receive your data template (both formats can be indicated).

Note: If you have exposure data for biomarkers that are not included in the biomarker list, please fill in the PARC input form for additional biomarkers and send it to PARC.DATAMANAGEMENT@vito.be. The additional biomarkers will be evaluated and included in a new biomarker list.

Step 2. Fill the empty data template using instructions of the basic codebook:

As indicated in step 1, VITO will provide the data owner/provider with a personalized empty data template. Fill this template with your data, following the coding instructions of the codebook, which describes how each variable should be encoded.

The data must be entered in different sheets of the Excel template. This structure was chosen to avoid replicating data and treats the sample as the main unique identifier in the dataset.

Information of the study: In the sheet “STUDYINFO”, the fields on study ID, study name and study description will automatically be filled based on the information you provided in the Biomarker list.

Information on the sample is provided in the sheet “SAMPLE”. Each sample must be mapped to a time point and to a subject. TIMEPOINT indicates which samples must be considered together, which is relevant, for example, in a repeated sampling design. In the sheet “TIMEPOINT”, provide a short description of each time at which samples were collected. A timepoint description must also be included when only one sampling time is considered.

Multiple samples may be mapped to the same subject, e.g. when both blood and urine samples are available for a given subject. In the codebook, the sheet with the corresponding name describes the variables and the format in which they need to be provided.
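
The mapping of samples to subjects and time points can be sketched in Python as follows. This is only an illustration under stated assumptions: id_sample and id_timepoint are named in this manual, whereas id_subject, matrix, description and all row values are hypothetical stand-ins for what the codebook defines.

```python
from collections import defaultdict

# Rows as they might appear in the TIMEPOINT and SAMPLE sheets (invented values).
timepoints = [
    {"id_timepoint": "T1", "description": "first sampling"},
    {"id_timepoint": "T2", "description": "second sampling"},
]
samples = [
    {"id_sample": "S1", "id_subject": "P1", "id_timepoint": "T1", "matrix": "US"},
    {"id_sample": "S2", "id_subject": "P1", "id_timepoint": "T1", "matrix": "BS"},
    {"id_sample": "S3", "id_subject": "P1", "id_timepoint": "T2", "matrix": "US"},
    {"id_sample": "S4", "id_subject": "P2", "id_timepoint": "T3", "matrix": "US"},
]


def check_samples(samples, timepoints):
    """List samples with a duplicate id or an id_timepoint missing from TIMEPOINT."""
    known_timepoints = {t["id_timepoint"] for t in timepoints}
    seen_ids = set()
    problems = []
    for s in samples:
        if s["id_sample"] in seen_ids:
            problems.append(f"{s['id_sample']}: duplicate id_sample")
        seen_ids.add(s["id_sample"])
        if s["id_timepoint"] not in known_timepoints:
            problems.append(
                f"{s['id_sample']}: unknown id_timepoint {s['id_timepoint']}"
            )
    return problems


problems = check_samples(samples, timepoints)
print(problems)

# Several samples can belong to one subject (here: urine and serum for P1).
samples_by_subject = defaultdict(list)
for s in samples:
    samples_by_subject[s["id_subject"]].append(s["id_sample"])
print(dict(samples_by_subject))
```

Sample S4 is reported because it refers to a time point that is not described in TIMEPOINT, which is the kind of inconsistency worth catching before submitting the template for validation.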

Information on the subject is provided in two sheets, “SUBJECTUNIQUE” and “SUBJECTTIMEPOINT”. The information is split to capture values that may change over time, e.g. the BMI of an individual at the first versus the second sampling when some time has passed between the measurements.

All information that is considered unique to the person from the first sampling onwards, e.g. the sex of the subject, is included in the sheet “SUBJECTUNIQUE”.

All information that may change is included in the sheet “SUBJECTTIMEPOINT”. The link with time points is indicated in id_timepoint, which refers to the TIMEPOINT sheet.

Please note that the sheet “SUBJECTUNIQUE” also records links between subjects. This makes it possible to identify, for example, mother-child pairs.
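
A minimal Python sketch of this split is given below. It assumes hypothetical column names (sex, bmi, id_subject and a linking column id_mother) and invented values; the actual variables and the way links between subjects are encoded are specified in the codebook.

```python
# SUBJECTUNIQUE: one entry per subject, values fixed from the first sampling on.
subject_unique = {
    "P1": {"sex": "F", "id_mother": None},
    "P2": {"sex": "M", "id_mother": "P1"},  # hypothetical link to subject P1
}

# SUBJECTTIMEPOINT: one row per subject and time point, for values that may change.
subject_timepoint = [
    {"id_subject": "P2", "id_timepoint": "T1", "bmi": 18.2},
    {"id_subject": "P2", "id_timepoint": "T2", "bmi": 19.0},
]


def subject_view(id_subject, id_timepoint):
    """Combine the fixed and the time-dependent values for one subject and time point."""
    fixed = subject_unique[id_subject]
    for row in subject_timepoint:
        if row["id_subject"] == id_subject and row["id_timepoint"] == id_timepoint:
            return {**fixed, **row}
    return None  # no time-dependent record for this combination


# Pairs derived from the linking column, as (mother, child).
mother_child_pairs = [
    (attrs["id_mother"], id_subject)
    for id_subject, attrs in subject_unique.items()
    if attrs["id_mother"] is not None
]

print(subject_view("P2", "T2"))
print(mother_child_pairs)
```

Storing the fixed values once and the changing values per time point is what avoids repeating, for instance, the sex of a subject on every sampling occasion.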

The data obtained from chemical measurements of the samples must be provided in the sheets with the prefix SAMPLETIMEPOINT_. A separate sheet is provided for each matrix. Please note that the id_sample values in each of these sheets must be taken from the sheet “SAMPLE”, using only the samples of the specified matrix.
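
The following Python sketch illustrates how the id_sample values for one matrix could be extracted from SAMPLE and used to spot rows entered in the wrong measurement sheet. The column name matrix, the biomarker columns and all values are hypothetical, and the sheet is assumed to be the one for urine spot (US) samples.

```python
# Rows of the SAMPLE sheet (invented values; "matrix" is an assumed column name).
samples = [
    {"id_sample": "S1", "matrix": "US"},
    {"id_sample": "S2", "matrix": "BS"},
    {"id_sample": "S3", "matrix": "US"},
]


def sample_ids_for_matrix(samples, matrix):
    """Return the id_sample values registered in SAMPLE for one matrix."""
    return [s["id_sample"] for s in samples if s["matrix"] == matrix]


# Rows of a hypothetical SAMPLETIMEPOINT_ sheet for urine spot samples.
sampletimepoint_us = [
    {"id_sample": "S1", "biomarker_a": 0.42, "biomarker_a_lod": 0.05, "biomarker_a_loq": 0.15},
    {"id_sample": "S2", "biomarker_a": 0.31, "biomarker_a_lod": 0.05, "biomarker_a_loq": 0.15},
]

allowed = set(sample_ids_for_matrix(samples, "US"))

# Samples that appear in the urine sheet but are not urine samples in SAMPLE.
misplaced = [row["id_sample"] for row in sampletimepoint_us if row["id_sample"] not in allowed]

# Urine samples registered in SAMPLE that have no measurement row yet.
entered = {row["id_sample"] for row in sampletimepoint_us}
missing = [sid for sid in sample_ids_for_matrix(samples, "US") if sid not in entered]

print(misplaced, missing)
```

Here S2 is flagged as misplaced because it is registered as a serum sample, and S3 is flagged as missing because it has no measurement row in the urine sheet.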

We did not include all biomarkers and matrices in the example data template, since the list of possible biomarkers is quite long. Instead, we ask the data provider to fill out the biomarker list (step 1).

Filled out example template with simulated data: To guide the data providers, we compiled an example data template filled out with simulated data of teenagers over two sampling points and in two matrices (urine spot (US) and blood serum (BS)).

Step 3. Data validation

See the section Tool for data validation.

The webpage and its content have been developed within the context of H2020 project HBM4EU (2017-2022; Grant Agreement No 733032) and Horizon Europe partnership PARC (2022-2029; Grant Agreement No 101057014).