Preparing the data

The following objects help you prepare your data before generating VEnCodes.

This step is optional, depending on your starting data. Nonetheless, the objects listed here provide easy methods to prepare your data, and they work well with the objects in the next section.

Objects for general data

internals.py: Classes module for the VEnCode project

class VEnCode.internals.DataTpm(inputs, files_path=None, sep=';', nrows=None, **kwargs)

An object representing a data set from which to retrieve VEnCodes. Contains optional filtering methods and other tools. Create this object to help prepare the data for VEnCode generation. The essential and recommended methods to call before feeding this data set to a VEnCode object are shown in the Methods section. All the other methods are helper functions that facilitate the preparation of the data.

data

This object is a pandas DataFrame representation of the initial input data set.

Type

pd.DataFrame

target

The celltype or celltypes that are going to be the target of the VEnCode search algorithms. By calling the method make_data_celltype_specific(), the user can define this object and then apply activity, inactivity, sparseness filters, and other methods.

Type

str

target_replicates

The target celltype/s replicates in the data.

shape

Parameters
  • inputs (str, pd.DataFrame) – The input containing the data. It can be one of two types: (1) a file containing the data set to convert into a DataTpm object. This can be a complete path to the file, or just the file name, provided the path is given in the argument files_path. Supported file formats are .csv, .txt, .tsv, or any format supported by the pandas read_csv function. (2) A pandas DataFrame object supplied in this parameter instead of a file.

  • sep (str) – The column separator used in the input file. Default is ‘;’.

  • nrows (int, None) – The number of rows to open in the file. Default is ‘None’, which will open the entire file.

  • files_path (str, None) – If the inputs argument does not contain a complete path, supply that path here. This argument is also useful to access the module’s test files by inputting ‘test’. Default is ‘None’.

  • kwargs – Optional keyword arguments; any accepted by the pandas DataFrame object can be used. Refer to the pandas DataFrame documentation for specific details.
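To make the roles of sep and nrows concrete, here is a minimal sketch in plain Python. It is not how DataTpm loads data (the class relies on pandas); the helper read_table, the layout of the sample text, and the column names are hypothetical and only illustrate what the two parameters control.

```python
import csv
import io


def read_table(handle, sep=";", nrows=None):
    """Read a delimited table into {row_name: {column: value}}.

    Hypothetical helper: the first column is assumed to hold the
    regulatory element names and the first line the celltype names.
    """
    reader = csv.reader(handle, delimiter=sep)
    header = next(reader)[1:]  # skip the corner cell
    table = {}
    for count, row in enumerate(reader):
        if nrows is not None and count >= nrows:
            break  # stop once nrows data rows have been read
        table[row[0]] = {col: float(val) for col, val in zip(header, row[1:])}
    return table


raw = "re;celltypeA;celltypeB\nre1;0;5.2\nre2;3.1;0\nre3;0;0\n"
full = read_table(io.StringIO(raw), sep=";")
partial = read_table(io.StringIO(raw), sep=";", nrows=2)
```

With nrows=None the whole table is read; with nrows=2 only the first two regulatory elements are kept, which is handy for quick tests on large files.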

load_data()

Essential method to call after creating a DataTpm object. Data is not loaded automatically at object creation, which makes the class easier to subclass.

make_data_celltype_specific(target_celltype, replicates=True)

Recommended method that tells the VEnCode object which celltype is the target.

add_celltype(data_from, celltypes=False, **kwargs)

Adds expression data for celltypes from other data sets (with similar regulatory element information). Examples include adding data from a cancer celltype to a primary celltype data set.

Parameters
  • data_from (str, DataTpm) – Data containing the celltypes to add. Can be either another DataTpm object or the path to a file eligible to be converted into a DataTpm object.

  • celltypes (str, list, dict) – Celltypes to merge with the DataTpm data. If False, all provided data is added.

  • kwargs – Used to create a new DataTpm object from data_from when data_from is an incomplete file path. In that case, see the DataTpm documentation.

binarize_data(threshold=0)

Converts all data to 0 and 1, where 1 is any value above threshold.

Parameters

threshold (int) – Maximum expression value for a RE to be considered inactive.
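The thresholding rule described above can be sketched in plain Python as follows. This is only an illustration of the rule (values above the threshold become 1, everything else 0), not the library's pandas-based implementation; the function binarize and the sample data are hypothetical.

```python
def binarize(table, threshold=0):
    """Return a copy where values above threshold become 1 and all others 0."""
    return {
        element: {col: int(value > threshold) for col, value in row.items()}
        for element, row in table.items()
    }


expression = {
    "re1": {"celltypeA": 0.0, "celltypeB": 5.2},
    "re2": {"celltypeA": 0.4, "celltypeB": 0.0},
}
binary = binarize(expression, threshold=0)
strict = binarize(expression, threshold=0.5)
```

Raising the threshold from 0 to 0.5 turns the weakly expressed value 0.4 into an inactive 0, while 5.2 stays active.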

copy(deep=True)

Generates a shallow or deep copy of the DataTpm object.

Parameters

deep (bool) – True if deep copy.

define_non_target_celltypes_inactivity(threshold=0)

Converts the non-target celltypes’ data to binary (0: inactive; 1: active) given a threshold.

Parameters

threshold (int, float) – Maximum TPM that non-target celltypes can have to be considered inactive.
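As a rough illustration, the sketch below binarizes only the non-target columns and leaves the target columns untouched. It is written in plain Python under the assumption that the target is identified by a list of column names; the function binarize_non_target and the column names are hypothetical, and the library itself works on a pandas DataFrame.

```python
def binarize_non_target(table, target_columns, threshold=0):
    """Binarize every column except the target ones (left as raw values)."""
    result = {}
    for element, row in table.items():
        result[element] = {
            # Target columns keep their expression value; all other columns
            # become 1 (active) above the threshold and 0 (inactive) otherwise.
            col: value if col in target_columns else int(value > threshold)
            for col, value in row.items()
        }
    return result


expression = {
    "re1": {"target_rep1": 12.5, "other1": 0.3, "other2": 0.0},
    "re2": {"target_rep1": 0.0, "other1": 4.0, "other2": 0.0},
}
converted = binarize_non_target(expression, ["target_rep1"], threshold=0.5)
```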

drop_target_ctp(inplace=True)

Shortcut function to drop the target celltypes from the data set. It handles the fact that the data may have been merged or not.

Parameters

inplace (bool) – True modifies the class attribute data in place, and the function returns None. False tells the function to return the data after dropping the target celltypes.

Returns

The data without the target celltypes if inplace is False. If inplace is True, the data is modified in place and None is returned.

Return type

DataTpm

filter_by_reg_element_sparseness(threshold=90, min_re=50, exclude_target=True)

Applies a filter to the data, retaining only the regulatory elements whose xth percentile value (x being the threshold variable) is 0, that is, not expressed. By default, the target celltype is excluded from the calculation, so the filter retains only the REs with 0 TPM in most non-target celltypes. The data must be made celltype specific first.

Parameters
  • threshold (int) – Percentile value used to filter the data.

  • min_re (int) – Minimum number of regulatory elements (RE) to keep in the data.

  • exclude_target (bool) – Usually the target should be expressed rather than sparse, so it is left out of the calculation. Set this parameter to False to include the target in the sparseness filter.
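The percentile criterion can be illustrated with a small plain-Python sketch. It assumes a nearest-rank percentile, which may differ from the percentile definition the library uses, and it does not model the min_re parameter. The names percentile_nearest_rank and sparse_elements, and the sample data, are hypothetical.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (an assumption, not necessarily the library's)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def sparse_elements(table, target_columns, threshold=90):
    """Keep REs whose threshold-th percentile over non-target columns is 0."""
    kept = []
    for element, row in table.items():
        non_target = [v for col, v in row.items() if col not in target_columns]
        if percentile_nearest_rank(non_target, threshold) == 0:
            kept.append(element)
    return kept


table = {
    "re1": {"target": 12.0, "a": 0.0, "b": 0.0, "c": 0.0, "d": 5.0},
    "re2": {"target": 8.0, "a": 3.0, "b": 0.0, "c": 4.0, "d": 2.0},
    "re3": {"target": 0.0, "a": 0.0, "b": 0.0, "c": 0.0, "d": 0.0},
}
lenient = sparse_elements(table, ["target"], threshold=50)
strict = sparse_elements(table, ["target"], threshold=90)
```

With only four non-target columns, a threshold of 90 requires every non-target value to be 0, so re1 (expressed in column d) passes at threshold 50 but not at 90.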

filter_by_target_celltype_activity(threshold=1, replicates='all', binarize=True)

Applies a filter to the Data, retaining only the regulatory elements that are expressed in the celltype of interest at >= x TPM, x being the threshold variable.

Parameters
  • threshold (int, float) – TPM value used to filter the data.

  • replicates (list) – Used to select only a few replicates from all the target celltype replicates. Can be the full name of the replicates to use in the filter, or their column index numbers relative to all that celltype’s replicates.

  • binarize (bool) – Convert target cell type expression to 0 and 1, for values below or above the threshold, respectively.
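A minimal plain-Python sketch of this filter is given below. It assumes that every checked replicate must reach the threshold for an RE to be kept and that values equal to the threshold count as active; neither detail is stated above, so treat both as assumptions. The function name and the column names are hypothetical.

```python
def filter_by_target_activity(table, target_columns, threshold=1,
                              replicates=None, binarize=True):
    """Keep REs expressed at >= threshold in the checked target replicates."""
    # Assumption: replicates=None means "check all target replicates".
    checked = target_columns if replicates is None else replicates
    kept = {}
    for element, row in table.items():
        if all(row[col] >= threshold for col in checked):
            new_row = dict(row)
            if binarize:
                # Only the target columns are converted to 0/1.
                for col in target_columns:
                    new_row[col] = int(row[col] >= threshold)
            kept[element] = new_row
    return kept


table = {
    "re1": {"t_rep1": 3.0, "t_rep2": 1.0, "other": 0.0},
    "re2": {"t_rep1": 2.0, "t_rep2": 0.2, "other": 4.0},
    "re3": {"t_rep1": 0.0, "t_rep2": 0.0, "other": 0.0},
}
targets = ["t_rep1", "t_rep2"]
all_reps = filter_by_target_activity(table, targets, threshold=1)
first_only = filter_by_target_activity(table, targets, threshold=1,
                                       replicates=["t_rep1"])
```

Checking only t_rep1 also keeps re2, whose second replicate falls below the threshold and is therefore binarized to 0.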

load_data()

Opens the data file with the previously provided arguments, storing the data set into the class attribute data. This method is not called during initialization to allow the DataTpm object to be easily extended by users.

make_data_celltype_specific(target_celltype, replicates=True)

Determines celltype/replicate (columns) of interest to analyse later.

Parameters
  • target_celltype (str, dict) – The celltype to target for analysis, as a string. If the celltype has replicates in the data, either supply target_celltype with a dictionary in the shape dict[celltype] = [replicates], or let the function guess the replicates by supplying the argument replicates as True.

  • replicates (bool) – If the celltype to target has replicates in the data, use True. Else, use False. Default is True.

merge_reg_elements(validate_with, splits=(':', '-'))

Main method to filter the REs in the data, leaving in the data only those that match the external data set.
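The default splits=(':', '-') suggests RE names of the form chromosome:start-end. The sketch below shows one way such names could be parsed and matched against an external list. The matching rule used here (coordinate overlap on the same chromosome) is an assumption, since the exact criterion is not described above; parse_element, keep_matching and the coordinates are hypothetical.

```python
def parse_element(name, splits=(":", "-")):
    """Split a name such as 'chr1:100-200' into (chromosome, start, end)."""
    chromosome, interval = name.split(splits[0])
    start, end = interval.split(splits[1])
    return chromosome, int(start), int(end)


def keep_matching(elements, external, splits=(":", "-")):
    """Keep elements overlapping any external interval (assumed criterion)."""
    external_parsed = [parse_element(name, splits) for name in external]
    kept = []
    for name in elements:
        chrom, start, end = parse_element(name, splits)
        if any(chrom == c and start <= e and s <= end
               for c, s, e in external_parsed):
            kept.append(name)
    return kept


elements = ["chr1:100-200", "chr1:500-600", "chr2:100-200"]
external = ["chr1:150-250", "chr2:300-400"]
matched = keep_matching(elements, external)
```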

merge_replicates(replicate_suffix=None, celltype_list=None, replicate_dict=None, exclude_target=False, not_include=None)

Merges replicate samples into one celltype. A more conservative, but faster approach to data set mining. Cell type columns are created by merging all replicates for that cell type. The value for the merged column corresponds to the average of all donors.

Parameters
  • replicate_suffix (str, None) – If the replicates have a defined suffix, this parameter helps the algorithm find the correct replicates. For example, if the samples are named in the format celltype_rep1, use replicate_suffix=’_rep’. Note that the suffix must be followed by the unique number for that replicate.

  • celltype_list (list) – Alternatively, provide the common characters in each group of replicates to merge and the function will try to merge by inference. Make sure to provide the list of characters as a complete list of columns to merge.

  • replicate_dict (dict) –

    As a last alternative, provide a dictionary with the names for the merged celltypes and their corresponding replicates. Use full names for the replicates here. e.g.:

    rep_dict = {celltype1: [rep1, rep2, rep3], celltype2: [rep1, rep2]}
    

  • exclude_target (bool) – True if the target celltype replicates are not to be merged. Otherwise, False.

  • not_include (dict) – Dictionary of key:value pairs where the keys are celltype names (as provided in the first arguments) and the values are partial- or complete-matching strings for columns that must not be merged with the others for that celltype, but could otherwise be caught by the algorithm. For example, celltype “adipocyte” could end up merging all replicates for the pre-adipocytes. In this case, supplying the dictionary {“adipocyte”: [“pre”]} would exclude all pre-adipocyte replicates from merging with the adipocyte replicates. Default is None.
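The three ways of identifying replicates, and the averaging step, can be sketched in plain Python. This is an illustration under stated assumptions rather than the library's implementation: inference from celltype_list is assumed to be a substring match, columns that belong to no group are simply dropped, and all function and column names are hypothetical.

```python
import re
from statistics import mean


def group_by_suffix(columns, replicate_suffix="_rep"):
    """Group columns named <celltype><suffix><number> by celltype."""
    pattern = re.compile(r"^(.*)" + re.escape(replicate_suffix) + r"\d+$")
    groups = {}
    for col in columns:
        match = pattern.match(col)
        if match:
            groups.setdefault(match.group(1), []).append(col)
    return groups


def infer_replicates(columns, celltype, not_include=()):
    """Substring match (an assumption), minus columns hit by not_include."""
    return [col for col in columns
            if celltype in col and not any(text in col for text in not_include)]


def merge_replicates(table, replicate_dict):
    """Average the replicate columns of each celltype into one column."""
    return {
        element: {celltype: mean(row[rep] for rep in reps)
                  for celltype, reps in replicate_dict.items()}
        for element, row in table.items()
    }


table = {"re1": {"adipocyte_rep1": 2.0, "adipocyte_rep2": 4.0,
                 "pre-adipocyte_rep1": 9.0, "hepatocyte_rep1": 1.0}}
columns = list(table["re1"])
by_suffix = group_by_suffix(columns, "_rep")
adipocyte_only = infer_replicates(columns, "adipocyte", not_include=["pre"])
merged = merge_replicates(table, {"adipocyte": adipocyte_only})
```

Without not_include, the substring "adipocyte" would also match pre-adipocyte_rep1 and pull its value into the adipocyte average.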

remove_celltype(celltypes)

Removes a specific celltype (column) from data.

Parameters

celltypes (str, list) – celltype(s) to remove (columns).

remove_element(elements)

Removes a specific regulatory element (row) from data.

Parameters

elements (str, list) – Regulatory element/s to remove (rows).

property shape

Outputs the shape of the data’s data frame.

Returns

The shape of the data in (rows, cols).

Return type

list

sort_columns(col_to_shift=None, pos_to_move=None)

Sorts columns alphabetically.

sort_sparseness(exclude_target=True, descending=True)

Sorts the data by sparsest RE.

Parameters
  • exclude_target (bool) – Usually only the non-target celltypes are used to sort by sparseness. Set this parameter to False to include the target in the sorting.

  • descending (bool) – True if the data is to be sorted in descending sparseness (most sparse appear on top). False otherwise.
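A small plain-Python sketch of sorting by sparseness follows. It assumes that sparseness is measured as the number of celltypes with a value of 0, which is not stated above; the function sort_by_sparseness and the sample data are hypothetical.

```python
def sort_by_sparseness(table, target_columns=(), exclude_target=True,
                       descending=True):
    """Order REs by how many (non-target) celltypes have zero expression."""

    def zero_count(row):
        return sum(
            1 for col, value in row.items()
            if value == 0 and not (exclude_target and col in target_columns)
        )

    ordered = sorted(table, key=lambda element: zero_count(table[element]),
                     reverse=descending)
    return {element: table[element] for element in ordered}


table = {
    "re1": {"target": 0.0, "a": 3.0, "b": 0.0, "c": 2.0},
    "re2": {"target": 5.0, "a": 0.0, "b": 0.0, "c": 0.0},
    "re3": {"target": 0.0, "a": 0.0, "b": 0.0, "c": 1.0},
}
most_sparse_first = list(sort_by_sparseness(table, ["target"]))
least_sparse_first = list(sort_by_sparseness(table, ["target"],
                                             descending=False))
```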

to_csv(*args, **kwargs)

Generates a csv file. The args and kwargs passed must be compatible with pandas DataFrame.to_csv().

Parameters
  • args – Arguments to be passed on to pandas DataFrame.to_csv()

  • kwargs – Keyword arguments to be passed on to pandas DataFrame.to_csv()

Objects for FANTOM5 data

class VEnCode.internals.DataTpmFantom5(inputs, sample_types='primary cells', data_type='promoters', keep_raw=False, nrows=None, files_path='test', *args, **kwargs)

An Object specifically representing the initial FANTOM5 CAGE-seq data set with some universal data treatment and with optional filtering methods. Create this object to help prepare the FANTOM5 CAGE-seq data for VEnCode generation. The recommended method to call before feeding this data set to a VEnCode object is shown in the Methods section. All the other methods are helper functions that facilitate the preparation of the data.

data

This object is a pandas DataFrame representation of the initial input data set.

Type

pd.DataFrame

target

The celltype or celltypes that are going to be the target of the VEnCode search algorithms. By calling the method make_data_celltype_specific(), the user can define this object and then apply activity, inactivity, sparseness filters, and other methods.

Type

str

target_replicates

The target celltype/s replicates in the data.

sample_type

The origin/type of samples to be analysed from the CAGE-seq data.

Type

str

data_type

The type of RE that comprises the data.

Type

str

shape
Type

tuple

Parameters
  • inputs (str, pd.DataFrame) – The input containing the data. It can be one of two types: (1) the file containing the data set to convert into a DataTpm object. This can be a complete path to the file, or just the file name, provided the path is given in the argument files_path. Supported file formats are .csv, .txt, .tsv, or any format supported by the pandas read_csv function. (2) A pandas DataFrame object supplied in this parameter instead of a file.

  • sep (str) – The column separator used in the input file. Default is ‘,’.

  • nrows (int, None) – The number of rows to open in the file. Default is ‘None’, which will open the entire file.

  • sample_types ({'primary cells', 'cell lines', 'time courses'}, optional) – The origin/type of samples to be analysed from the CAGE-seq data. Currently offering full support for primary cells and cell lines.

  • data_type ({'promoters', 'enhancers'}, optional) – The type of RE that comprises the data.

  • files_path (str, None) – If the inputs argument does not contain a complete path, supply that path here. This argument is also useful to access the module’s test files by inputting ‘test’. Default is ‘test’.

  • kwargs – Optional keyword arguments; any accepted by the pandas DataFrame object can be used. Refer to the pandas DataFrame documentation for specific details.

make_data_celltype_specific(target_celltype, replicates=True)

Method to provide the VEnCode object with the information on which celltype is the target.

add_celltype(data_from, celltypes=False, sample_types='cell lines', fantom=True, **kwargs)

Adds expression data for celltypes from other data sets (with similar regulatory element information). Examples include adding data from a cancer cell type to a primary cell type data set.

Parameters
  • data_from (str, DataTpm) – Data containing the celltypes to add. Can be either another DataTpm object or the path to a file eligible to be converted into a DataTpm object.

  • celltypes (str, list, dict) – Celltypes to merge with the DataTpmFantom5 data. If False, all provided data is added.

  • sample_types ({'primary cells', 'cell lines', 'time courses'}, optional) – Sample type of the data set to add.

  • fantom (bool) – True if the data to add comes from FANTOM5 CAGE-seq. Otherwise, False.

  • kwargs – Used to create a new DataTpmFantom5 object from data_from, when needed, to add to the data set. In that case, see the DataTpmFantom5 documentation.

make_data_celltype_specific(target_celltype, supersets={'CD14+ CD16+ Monocytes': 'CD14+ Monocytes', 'CD14+ CD16- Monocytes': 'CD14+ Monocytes', 'CD4+CD25+CD45RA+ naive regulatory T cells': 'CD4+ T Cells', 'CD4+CD25+CD45RA- memory regulatory T cells': 'CD4+ T Cells', 'CD4+CD25-CD45RA+ naive conventional T cells': 'CD4+ T Cells', 'CD4+CD25-CD45RA- memory conventional T cells': 'CD4+ T Cells'})

Determines celltype/donors (columns) of interest to analyse later. For previously parsed files, opens the specific file for that celltype.

Parameters
  • target_celltype (str, dict) – The celltype to target for analysis.

  • supersets (dict) – When a celltype is a subset of another, the superset celltype must be removed in order to analyse the subset.
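The role of supersets can be sketched as follows, using two entries from the default dictionary in the signature above. The sketch assumes that superset columns are recognised by a substring match on the column name; drop_superset and the column names (including the donor labels) are hypothetical.

```python
# Two entries copied from the default supersets argument shown above.
SUPERSETS = {
    "CD14+ CD16+ Monocytes": "CD14+ Monocytes",
    "CD4+CD25+CD45RA+ naive regulatory T cells": "CD4+ T Cells",
}


def drop_superset(columns, target_celltype, supersets=SUPERSETS):
    """Remove the columns of the target's superset celltype, if it has one."""
    superset = supersets.get(target_celltype)
    if superset is None:
        return list(columns)  # the target is not a known subset
    # Assumption: a column belongs to the superset if its name contains it.
    return [col for col in columns if superset not in col]


columns = [
    "CD14+ CD16+ Monocytes, donor1",
    "CD14+ Monocytes, donor1",
    "Hepatocyte, donor1",
]
for_subset = drop_superset(columns, "CD14+ CD16+ Monocytes")
for_other = drop_superset(columns, "Hepatocyte")
```

Removing the superset avoids comparing the subset against a celltype that contains it, which would otherwise make the subset's REs look non-specific.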

merge_donors_primary(exclude_target=True)

Merges replicate samples into one celltype. Specific method to use when dealing with FANTOM5 primary celltypes.

A more conservative, but faster approach to data set mining. Celltype columns are created by merging all replicates/donors for that celltype. The value for the merged column corresponds to the average of all replicates/donors.

Parameters

exclude_target (bool) – True if the target celltype replicates are not to be merged. Otherwise, False.

remove_celltype(celltypes, merged=True)

Removes a specific celltype from the data.

Parameters
  • celltypes (str, list) – Celltype(s) to remove.

  • merged (bool) – If the data has been previously merged into celltypes, True. If columns represent replicates/donors, False.