CLIC Help Topics
What types of gene names/identifiers does CLIC recognize?
The online version of CLIC compares the user's input genes to a lookup table of gene symbols and aliases downloaded from NCBI Gene (ftp://ftp.ncbi.nih.gov/gene/). EnsemblIDs and UniProtIDs can also be used. The alias lookup table is static (initially downloaded July 2015) but will be updated at regular intervals. Input gene symbols and aliases are case-insensitive, and we discard a gene alias if it conflicts with another gene's official symbol. Genes that are not mapped (or are not on the microarray platform) are removed from the input list, and you will have the option to run with the smaller list that excludes all unmapped genes. If your gene ID or symbol is not recognized, please try the Entrez Gene ID or the current official gene symbol according to NCBI Gene (http://www.ncbi.nlm.nih.gov/gene). Orthologous genes between human and mouse are mapped using Best Reciprocal Hit (Blastp, Expect <1e-3), so that users can use one species' symbols as input to the other species' platform. Below are examples of gene names and symbols:
| Platform | Organism | Gene Symbol | Entrez ID | Aliases |
| --- | --- | --- | --- | --- |
| HG-U133_Plus_2 (GPL570) | Human | MICU1 | 10367 | CALC; EFHA3; CBARA1 |
| Mouse430_2 (GPL1261) | Mouse | Micu1 | 216001 | Calc; Cbara1; C730016L05Rik |
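The lookup rules described above (case-insensitive matching, aliases dropped when they collide with an official symbol, unmapped names removed) can be sketched roughly as follows. This is an illustrative sketch, not CLIC's actual code: the function names `build_lookup` and `map_genes` are hypothetical, and the handling of an alias shared by two genes (first one wins) is an assumption, since the text does not cover that case.

```python
def build_lookup(genes):
    """Build a case-insensitive name -> Entrez ID table.

    genes: list of (official_symbol, entrez_id, aliases) tuples.
    """
    # Official symbols always take priority.
    official = {symbol.upper(): entrez_id for symbol, entrez_id, _ in genes}
    lookup = dict(official)
    for _, entrez_id, aliases in genes:
        for alias in aliases:
            key = alias.upper()
            if key in official:
                continue  # alias conflicts with an official symbol: discard it
            # Assumption: if two genes share an alias, keep the first one seen.
            lookup.setdefault(key, entrez_id)
    return lookup


def map_genes(names, lookup):
    """Split input names into mapped (name -> Entrez ID) and unmapped lists."""
    mapped, unmapped = {}, []
    for name in names:
        entrez_id = lookup.get(name.strip().upper())
        if entrez_id is None:
            unmapped.append(name)
        else:
            mapped[name] = entrez_id
    return mapped, unmapped


# Example using the human row of the table above.
GENES = [("MICU1", 10367, ["CALC", "EFHA3", "CBARA1"])]
lookup = build_lookup(GENES)
mapped, unmapped = map_genes(["micu1", "Efha3", "BADNAME"], lookup)
print(mapped)    # {'micu1': 10367, 'Efha3': 10367}
print(unmapped)  # ['BADNAME']
```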
What does CEM Strength in the Co-expression Summary page mean?
The CEM strength summarizes how well the genes in the CEM are co-expressed with each other compared to the null model, and how many datasets support their co-expression. The CEM strength ψ_k for CEM k is defined as the average of Bayes factors over the datasets, weighted by the probability that the CEM is selected for each dataset. For dataset d, the Bayes factor is calculated using the foreground model H1 [genes in CEM k are in the same CEM, thus their pairwise correlations are normally distributed with mean θ_{d,k} and standard deviation σ_{d,k}] and the background model H0 [genes are in the null CEM, thus their pairwise correlations have mean θ_{d,0} and standard deviation σ_{d,0}].
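To make the definition more concrete, here is a simplified sketch. It rests on several assumptions that the text above does not confirm: the per-dataset score is a plug-in log likelihood ratio with fixed foreground and background parameters (a stand-in for the true Bayes factor), and the weighted average is normalized by the sum of the selection probabilities. The function names are hypothetical.

```python
import math


def log_normal_pdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def log_likelihood_ratio(correlations, foreground, background):
    """Plug-in stand-in for a per-dataset log Bayes factor (assumption).

    correlations: pairwise correlations of the CEM genes in one dataset.
    foreground, background: (mean, sd) pairs for H1 and H0.
    """
    return sum(
        log_normal_pdf(z, *foreground) - log_normal_pdf(z, *background)
        for z in correlations
    )


def cem_strength(per_dataset_scores, selection_probs):
    """Average of per-dataset scores weighted by selection probability.

    Assumption: weights are normalized by their sum.
    """
    total_weight = sum(selection_probs)
    if total_weight == 0:
        return 0.0
    weighted = sum(w * s for w, s in zip(selection_probs, per_dataset_scores))
    return weighted / total_weight
```

Under this sketch, a dataset with a selection probability near zero contributes almost nothing to the strength, which matches the idea that only supporting datasets count.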
How are the output CEMs ordered?
CEMs are ordered by CEM Strength (see above).
How long should it take for my gene set to run?
CLIC takes about 10 minutes for small gene sets, about 30 minutes for medium-sized gene sets (<25 genes), and several hours for larger gene sets (50-250 genes). However, the online tool launches jobs on a compute cluster, and if the cluster is busy there may be a delay of several minutes to several hours before the CLIC job is launched. Jobs should complete within 24 hours.
Is there a limit on input gene set size?
The online version of CLIC limits gene sets to 250 genes. Larger sets must be run using the command-line version of the program downloaded onto your computer.
How can I run CLIC on more than 250 genes?
You need to download one of the CLIC command-line executables (Mac OSX or Linux) onto your computer, and then run it from the command line on your input gene set. You can also download the C++ source code and compile it locally. CLIC requires the GNU Scientific Library (http://www.gnu.org/software/gsl/) to be installed on your computer. For your convenience, the software package includes the 1774 pre-processed datasets on the Mouse430_2 (GPL1261) platform and the 1887 pre-processed datasets on the human HG-U133_Plus_2 (GPL570) platform, as well as a couple of example gene sets.
Why did we choose the Mouse430_2 (GPL1261) and HG-U133_Plus_2 (GPL570) platforms?
These are the most popular mouse and human microarray platforms. After filtering out low-quality datasets, we retained 1774 datasets for Mouse430_2 and 1887 datasets for HG-U133_Plus_2, which is enough data to provide strong power to predict co-expressed genes. In the future, we plan to add a few more popular platforms for other model organisms, as well as RNA-seq datasets for mouse and human.
How did we preprocess the datasets?
The Series Matrix File for each GEO series dataset was downloaded via FTP from the GEO website in August 2014. We mapped Affymetrix probesets to NCBI Entrez Gene IDs using Affymetrix Mouse430_2 and HG-U133_Plus_2 annotation files downloaded in August 2014. When multiple Affymetrix probesets map to one gene, we chose the probeset with the least potential for cross-hybridization according to Affymetrix probeset annotations. Specifically, we used the following Affymetrix probeset suffix hierarchy (at > a_at > s_at > x_at). In case of ties, we chose the lower-numbered probeset to represent the gene. Each dataset is a matrix in which each row is a gene and each column is a sample for that GEO series.
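The probeset selection rule can be sketched as a sort key: first by suffix rank, then by the numeric part of the probeset ID. This is only an illustration. The regular expression assumes IDs of the form `<number>_at`, `<number>_a_at`, and so on, the example IDs are made up, and IDs that do not fit the pattern (for example control probesets) are simply ranked last here, which is an assumption rather than documented CLIC behavior.

```python
import re

# Suffix hierarchy from the text: at > a_at > s_at > x_at (lower rank is preferred).
SUFFIX_RANK = {"_at": 0, "_a_at": 1, "_s_at": 2, "_x_at": 3}
PROBESET_RE = re.compile(r"^(\d+)((?:_[asx])?_at)$")


def probeset_key(probeset_id):
    """Sort key: (suffix rank, probeset number, id)."""
    match = PROBESET_RE.match(probeset_id)
    if match is None:
        # Assumption: unrecognized IDs are least preferred.
        return (len(SUFFIX_RANK), float("inf"), probeset_id)
    number, suffix = match.groups()
    return (SUFFIX_RANK[suffix], int(number), probeset_id)


def choose_probeset(probeset_ids):
    """Pick one representative probeset for a gene."""
    return min(probeset_ids, key=probeset_key)


# Hypothetical probeset IDs, for illustration only.
print(choose_probeset(["1000003_x_at", "1000002_s_at", "1000001_a_at"]))  # 1000001_a_at
print(choose_probeset(["1000005_at", "1000004_at"]))                      # 1000004_at
```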
We then pre-processed the GEO series matrices by the following protocol:
- Remove duplicated datasets and sub-datasets, i.e. remove a GEO series if it is a subset of another GEO series.
- Remove datasets with fewer than 6 samples, since a small sample size leads to large uncertainty in estimating pairwise gene-gene co-expression correlations.
- Find datasets whose matrices are in log scale, and unlog them (exponentiate with base 2), so that all matrices are on the same scale.
- Low signal filtering: remove datasets with maximum expression value < 1000.
- Normalization: scale each column to have the same mean.
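The per-dataset steps in this protocol can be sketched as follows. This is a minimal illustration under stated assumptions: the caller supplies the `is_log2` flag (how CLIC detects log-scale matrices is not described here), columns are rescaled multiplicatively to the average of the column means, and duplicate removal is left out because it needs the whole compendium. The function name `preprocess` is hypothetical.

```python
MIN_SAMPLES = 6            # datasets with fewer samples are removed
MIN_MAX_EXPRESSION = 1000  # low signal filter threshold


def preprocess(matrix, is_log2):
    """Return the normalized matrix, or None if the dataset is discarded.

    matrix: list of rows (genes), each a list of per-sample expression values.
    is_log2: True if the values are on a log2 scale (supplied by the caller).
    """
    if not matrix or len(matrix[0]) < MIN_SAMPLES:
        return None
    if is_log2:
        # Unlog: exponentiate with base 2.
        matrix = [[2.0 ** value for value in row] for row in matrix]
    if max(max(row) for row in matrix) < MIN_MAX_EXPRESSION:
        return None
    n_genes, n_samples = len(matrix), len(matrix[0])
    col_means = [sum(row[j] for row in matrix) / n_genes for j in range(n_samples)]
    # Assumption: the common target is the average of the column means.
    target = sum(col_means) / n_samples
    return [
        [row[j] * target / col_means[j] for j in range(n_samples)]
        for row in matrix
    ]
```

After this step every column (sample) has the same mean, so samples hybridized at different overall intensities become comparable.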
How did we filter out the low quality datasets?
We define the quality of a dataset using the distribution of the Fisher z-transformed expression correlation between two randomly selected genes (termed the background null distribution for a dataset). Dataset quality varies widely. In some datasets the null distribution is “well-shaped”: two random genes usually have a small correlation, close to zero. In other datasets, however, a large proportion of gene pairs have moderate or high correlations. Possible underlying reasons include: (1) small sample size; (2) multiple tissue types in the dataset; (3) poor data normalization. Each of these can make a large proportion of genes in the dataset trivially correlated, so we filter out low-quality datasets in the preprocessing stage to avoid false findings.
In CLIC, we quantify the quality of GEO datasets with a criterion based on this probability distribution. Specifically, we found that a high-quality dataset has a Gaussian null distribution with small variance, whereas a low-quality dataset has a non-Gaussian null distribution with large variance. We fit the null distribution with both a Gaussian model and a non-parametric kernel model, and calculate the L1 (total variation) distance between the two fits. We remove datasets with L1 distance > 0.1 or null distribution standard deviation > 1.
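A rough sketch of this quality criterion is shown below. Several details are assumptions: the kernel model is a Gaussian kernel density estimate with a normal-reference bandwidth, the distance is computed as the integral of the absolute difference between the two densities on a grid (whether the 0.1 threshold applies to this quantity or to half of it is not stated), and the function names are hypothetical.

```python
import math


def fisher_z(r):
    """Fisher z-transform of a correlation coefficient."""
    return math.atanh(r)


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def null_distribution_l1(z_values, grid_size=400):
    """Return (L1 distance between Gaussian fit and KDE, sample sd)."""
    n = len(z_values)
    mean = sum(z_values) / n
    sd = math.sqrt(sum((z - mean) ** 2 for z in z_values) / (n - 1))
    # Assumption: normal-reference rule-of-thumb bandwidth for the kernel fit.
    bandwidth = 1.06 * sd * n ** (-0.2)
    lo = min(min(z_values) - 4 * bandwidth, mean - 5 * sd)
    hi = max(max(z_values) + 4 * bandwidth, mean + 5 * sd)
    step = (hi - lo) / grid_size
    l1 = 0.0
    for i in range(grid_size):
        x = lo + (i + 0.5) * step  # midpoint rule
        gauss = normal_pdf(x, mean, sd)
        kde = sum(normal_pdf(x, z, bandwidth) for z in z_values) / n
        l1 += abs(gauss - kde) * step
    return l1, sd


def is_low_quality(z_values, max_l1=0.1, max_sd=1.0):
    """Apply the two thresholds from the text to one dataset's null sample."""
    l1, sd = null_distribution_l1(z_values)
    return l1 > max_l1 or sd > max_sd
```

Here `z_values` would be the Fisher z-transformed correlations of randomly sampled gene pairs from one dataset. A bimodal or heavy-shouldered sample gives a large distance between the two fits, while a narrow bell-shaped sample gives a small one.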
Can I use my own expression datasets?
In the command line version of CLIC, users can input their own expression dataset matrices. The dataset matrix format is described in the README.txt file in the command-line software package.
Why is the expression profile for my gene incorrect?
In our data preprocessing, we chose one probeset if multiple probesets map to the same gene (see above). Even though we chose the probeset with the least potential for cross-hybridization according to Affymetrix probeset annotations, it is still possible that the selected probeset is not the optimal one for some genes. In our experience with real data, most genes are mapped to the correct probesets. If you expect a gene to be strongly co-expressed with some other genes but the CLIC results do not support that, please email us (email@example.com).
Why do I get different results when I add/remove one gene to an input gene set?
CLIC partitions the entire input gene set into CEMs using a Bayesian method to perform clustering. Addition or deletion of a single gene from a gene set can alter all of the clustering results. Due to the stochastic nature of the underlying Markov chain Monte Carlo (MCMC) algorithm, for large gene sets CLIC could actually return slightly different partitioning results just from running the program multiple times. However, the default behavior of CLIC is to use a fixed seed so that all results are reproducible (although the command-line software user can instead specify the random seed in the parameter file).
Why are there no CEM+ results for my gene(s)?
By default, CLIC returns genes with LLR scores greater than 0. For inferred CEMs with a small number of selected datasets, the CEM+ gene LLRs can be very low, so no genes may be returned as part of the CEM+. Users of the command-line software can specify the LLR score threshold in the parameter file.
How can I tell if a particular CEM is meaningful?
- The CEM Strength score indicates the coherence of the member genes. High scores indicate CEMs whose member genes are highly co-expressed with each other in a large number of datasets. Strength greater than 1 indicates co-expression and strength greater than 10 indicates strong co-expression.
- In addition, a larger number of selected datasets gives more confidence in the CEM, especially when the selected datasets are microarray experiments related to (targeting) the input query gene set.
In our experience, a CEM is very informative if it satisfies the following two conditions: (1) CEM strength > 5. (2) More than 20 datasets are selected with posterior probability > 50%. Furthermore, CEM+ genes with LLR > 50 are confident predictions of co-expression with the CEM. In general, we do not value predictions with LLR < 10.
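These rules of thumb can be written down as a small helper. The function names and the labels returned by `classify_llr` are hypothetical and chosen only for illustration. The thresholds are the ones quoted above.

```python
def is_informative_cem(strength, dataset_posteriors,
                       min_strength=5.0, min_datasets=20, min_posterior=0.5):
    """True if CEM strength > 5 and more than 20 datasets have posterior > 50%."""
    n_selected = sum(1 for p in dataset_posteriors if p > min_posterior)
    return strength > min_strength and n_selected > min_datasets


def classify_llr(llr):
    """Label a CEM+ prediction by its LLR score (labels are illustrative)."""
    if llr > 50:
        return "confident"
    if llr < 10:
        return "weak"
    return "intermediate"


# Example: strength 6.2 with 25 well-supported datasets and 100 poorly supported ones.
print(is_informative_cem(6.2, [0.9] * 25 + [0.1] * 100))  # True
print(classify_llr(72.0))                                  # confident
```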
What is the Summary view?
The Summary view shows you the results for CEM1 by default. To see other CEMs, use the drop-down menu at the top right.
- The large blue/green matrix in the Summary view shows you all the genes in CEM1 (blue text) as well as CEM1+ predictions that have similar co-expression (green text). The rows are CEM1 genes present on the microarray chip. The columns are the top 200 datasets in which your CEM genes are co-expressed. The color intensity of blue/green in each matrix cell indicates the strength with which this gene is co-expressed with the CEM1 genes in this dataset.
- The upper panel in the Summary view shows the scores of all the >1000 datasets. Strongly co-expressed CEMs (e.g. OXPHOS, ribosome) are strongly co-expressed in hundreds of datasets. Other input gene sets may be co-expressed in only a handful of datasets. This upper panel lets you see how many datasets are relevant for your CEM.
What is the Top Datasets view?
By default, the Top Datasets view shows you the results for the top dataset for CEM1. To see other CEMs use the left-hand drop-down menu. To see other co-expressed datasets use the right-hand drop-down menu.
- The red/green matrix shows the Z-score expression values for your CEM and CEM+ genes (rows) against all the samples in this dataset (columns). Red color indicates high expression (max Z-score 3), green color indicates low expression (minimum Z-score -3), black color indicates average expression (Z-score 0).
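The heatmap values can be thought of as per-gene (row) Z-scores limited to the displayed range. The sketch below assumes a population standard deviation and simple clipping at ±3. The text above only states the displayed minimum and maximum, so both choices are assumptions, and `row_zscores` is a hypothetical name.

```python
import math


def row_zscores(values, clip=3.0):
    """Z-score one gene's expression across samples, clipped to [-clip, clip]."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        return [0.0] * n  # a flat row has average expression everywhere
    return [max(-clip, min(clip, (v - mean) / sd)) for v in values]
```

With this convention, a value of 3 (red) marks samples where the gene is far above its own average in that dataset, -3 (green) marks samples far below, and 0 (black) marks average expression.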
Where can I download C++ Source code?
From the main CLIC page, scroll down and click the button labeled "C++ Source".
Where can I download CLIC executables?
From the main CLIC page, scroll down and you will find buttons to download the Mac OSX or Linux executable.
Where can I download preprocessed GEO microarray datasets?
Use the executable links from the main CLIC page. These download the executable software along with the preprocessed GEO microarray datasets and examples of how to run the tool.
How do I run the downloaded tool?
Use the executable links from the main CLIC page to download the executable software and the preprocessed GEO microarray datasets. The download includes a user_manual.txt file with additional instructions on how to run the tool.
CLIC input / output
Interpreting the results
You have a pathway of interest, say the heme biosynthesis pathway, and you want to know:
- Are the heme biosynthesis genes tightly co-regulated? Are they all co-expressed or do some show different expression?
- What tissues or conditions are the heme genes varying in? What might be good cell models for studying this system?
- What other genes co-express with the heme biosynthesis pathway?
A compendium of RNA expression profiles. By default we have 4 RNA expression compendia from GEO to choose from. In general, we get best results from mouse GPL1261, however in some cases all the signal comes from multi-tissue datasets and in these cases we find it useful to use the mouse GPL1261_single_tissue subset. Here are the 4 RNA expression compendia:
- mouse GPL1261: 1774 datasets (total of 28628 samples) performed on the Mouse430_2 Affymetrix microarray
- mouse GPL1261_single_tissue: subset of the above mouse compendium excluding multi-tissue datasets
- human GPL570: 1887 datasets (total of 45158 samples) performed on the HG-U133_Plus_2 Affymetrix microarray
- human GPL570_single_tissue: subset of the above human compendium excluding multi-tissue datasets
- A user gene set of interest (in this case the heme biosynthesis genes)
- Go to www.gene-clic.org.
- Click on the big green button "Submit a CLIC job"
Now we will submit a job:
- In the Email field, enter your email address. The results will be emailed to you there.
- In the Job description field, enter: HemeMouse
- In the Gene List field, enter your genes of interest. These can be gene symbols, NCBI Entrez IDs, Ensembl GeneIDs, Ensembl TranscriptIDs, or UniProtIDs. The names can be either human or mouse names, and genes will be mapped across species using best-bidirectional matches. Try cutting and pasting these IDs:
Hmbs Urod Uros Alad Alas2 Ppox Cpox Fech BADNAME
- Click “I’m not a robot” and then click "Submit CLIC Job"
- An error message appears indicating that BADNAME cannot be mapped and that this identifier has been removed
- Click "Submit CLIC Job" again (now without BADNAME)
- You will get a message that your job has been submitted. Now you wait. CLIC is computationally intensive, so it runs on a cloud cluster. Expect it to take from a few minutes to a few hours to get your results back (depending on the size of your input set and how busy the compute cluster is). When I tried this, it took ~10 minutes for the jobs to complete.
You will receive an email with the results including:
- PDF complete results: large PDF file containing expression profiles for input genes and predictions across top datasets, including heat map of top genes across the top datasets.
- PDF brief results: smaller PDF file containing results for input genes and predictions; does not include heatmaps showing expression for top datasets.
- Text results: small text file containing the assignment of input genes to “Co-expression modules” (CEMs) and predictions
- Input gene lists: list of genes that you input to CLIC
First let us look at the Text results
- This text file contains information on (i) the number of Co-Expression Modules (CEMs), (ii) the strength of each CEM, and (iii) top gene predictions for each CEM.
- The Text results file contains one line for every input gene assigned to a CEM (the first lines), and one line for every strong CEM+ prediction (following lines)
- For HemeMouse, 7/8 input genes were clustered into 1 CEM with strength 3.538. In general, CEM strength >1 is okay co-expression and CEM strength >10 is very good co-expression. In addition to these 7 input genes, there were 35 top predictions with similar co-expression, ranked by LLR (log likelihood ratio of co-expression with the input genes versus a background null model). LLR>10 is a good prediction, and LLR>100 shows very good co-expression. You can see here many genes co-expressed with LLR>100.
- You can see there is only 1 CEM output because the 3rd column “CEM ID” is always 1
Now let us look at the PDF Complete Results.
- The PDF has 3 sections: Overview, CEM Coexpression Summary, and top GEO datasets.
The PDF starts with an Overview page which shows a heatmap of the input genes clustered into Co-Expression Modules (CEMs).
- For HemeMouse, we see the first 7 genes were combined into a single CEM (Hmbs, Urod, Uros, Ppox, Alas2, Cpox, Alad) while Fech was not assigned a Co-Expression module.
- Next is the CEM Coexpression Summary for each CEM
The final section contains details on the top GEO datasets.
- This shows one page per top GEO dataset. Top datasets are the datasets in which the input CEM genes are the most co-expressed and varying. Each page displays a description of the dataset, a histogram showing correlations between all pairs of genes (blue and green histograms), and, below in red, tick marks for the CEM input genes. These tick marks typically fall on the right-hand side, showing that these genes are far more co-expressed with each other than random pairs of genes. At the bottom, a heat map shows the CEM genes (rows) and the samples in this GEO dataset (columns), with red indicating high expression and green indicating low expression. These heatmaps show in which samples the CEM genes are highly or lowly expressed, and hopefully replicates will show similar patterns.
For HemeMouse, the top dataset is GSE42548 TH-MYCN Mice with Caspase-8 Deficiency Develop Advanced Neuroblastoma with Bone Marrow Metastasis.
- This dataset contains 29 microarray samples (columns). The expression of the 7 heme CEM genes is shown in the first 7 rows, where red indicates high relative expression and green indicates low relative expression (based on z-score). You can see that the 7 heme CEM genes are highly expressed in only 5 samples, all corresponding to total bone marrow.
- Next you can look at the microarray expression for the CEM+ genes, which co-express with the input CEM genes over potentially hundreds of datasets. In this case, the CEM+ genes, such as Rhd, Kel, Kif1 and Trim10 show the exact same pattern of high expression in the same 5 microarray samples.
Next go to the second dataset: GSE20604 Downstream Targets of HOXB4 in the primitive hematopoietic progenitor EML Cell line.
- This dataset has 6 microarray samples. The 7 input CEM genes are highly expressed in 3 samples (Control Clone) and expressed at low levels in 3 samples (HOXB4 Overexpressing). This dataset may tell you something interesting about when your input genes are varying together.
- Again in this dataset, the top CEM1+ genes show the same pattern as the CEM genes.
Each CLIC query creates a PDF output (along with a text file output described below). This PDF has three sections described below.
CEMs summary: Overview of Co-Expression Modules with Dataset Weighting
- Heatmap shows pairwise correlations between all genes in the input query gene set G across all datasets X, where each dataset is weighted by CLIC’s posterior dataset probability. Each row shows one gene, and the brightness of squares indicates degree of correlation.
- Red boxes indicate the partition of input genes G into CEMs, ordered by CEM strength.
CEM: Details of each CEM and its expansion CEM+
- Top panel shows the posterior selection probability (dataset weights) for top 200 GEO series datasets.
- Bottom panel shows the CEM genes (blue rows) as well as expanded CEM+ genes (green rows).
- Each column is one GEO series dataset, sorted by their posterior selection probability (dataset weight).
- The brightness of squares indicates the gene’s correlations with CEM genes in the given dataset.
- CEM+ includes genes that co-express with CEM genes in high-weight datasets, measured by LLR score.
GEO Series: Details of each GEO series dataset and its expression profile:
- Top panel shows the detailed information (e.g. title, summary) for the GEO series dataset.
- Middle panel shows the background null distribution of expression correlation between two randomly selected genes in this dataset, which is estimated in the preprocessing stage of CLIC.
- Bottom panel shows the expression profiles for all input genes G in this dataset. Red indicates high relative expression and green indicates low relative expression measured by row Z-score.
The text file contains the same information as the PDF file. This text file can be useful to view in Excel and to link gene symbols to other annotations.