CLIC Help Topics
- What types of gene name/identifiers does CLIC recognize?
The online version of CLIC compares the user's input genes to a lookup table of gene symbols and aliases downloaded from NCBI Gene (ftp://ftp.ncbi.nih.gov/gene/). EnsemblIDs and UniProtIDs can also be used. The alias lookup table is static (initially downloaded July 2015) but will be updated at regular intervals. Input gene symbols and aliases are case-insensitive, and we discard a gene alias if it conflicts with another gene's official symbol. Genes that cannot be mapped (or are not on the microarray platform) are removed from the input list, and you will have the option to run with the smaller list that excludes all unmapped genes. If your gene ID or symbol is not recognized, please try the Entrez Gene ID or the current official gene symbol according to NCBI Gene (http://www.ncbi.nlm.nih.gov/gene). Orthologous genes between human and mouse are mapped using Best Reciprocal Hit (Blastp, Expect <1e-3), so that users can use one species' symbols as input to the other species' platform. Below are examples of gene names and symbols:
| Platform | Organism | Gene Symbol | Entrez ID | Aliases |
|---|---|---|---|---|
| HG-U133_Plus_2 (GPL570) | Human | MICU1 | 10367 | CALC; EFHA3; CBARA1 |
| Mouse430_2 (GPL1261) | Mouse | Micu1 | 216001 | Calc; Cbara1; C730016L05Rik |
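As a rough illustration of the matching rules described above (case-insensitive lookup, aliases discarded when they collide with an official symbol), the following Python sketch builds a small lookup table from the human MICU1 example. The function names `build_lookup` and `map_genes` are hypothetical, and this is not CLIC's actual code. How CLIC resolves an alias shared by two genes is not stated here, so the sketch simply keeps the first gene seen.

```python
def build_lookup(records):
    """Build a lower-cased symbol/alias -> Entrez ID table.

    records: iterable of (official_symbol, entrez_id, aliases).
    """
    # Official symbols always win over aliases.
    official = {symbol.lower(): entrez for symbol, entrez, _ in records}
    lookup = dict(official)
    for _, entrez, aliases in records:
        for alias in aliases:
            key = alias.lower()
            if key in official:
                continue  # alias conflicts with an official symbol: discard it
            # Assumption: if two genes share an alias, keep the first one seen.
            lookup.setdefault(key, entrez)
    return lookup


def map_genes(user_genes, lookup):
    """Split user input into mapped Entrez IDs and unmapped names."""
    mapped, unmapped = [], []
    for gene in user_genes:
        entrez = lookup.get(gene.strip().lower())
        if entrez is None:
            unmapped.append(gene)
        else:
            mapped.append(entrez)
    return mapped, unmapped


# Human example row from the table above.
HUMAN_RECORDS = [("MICU1", 10367, ["CALC", "EFHA3", "CBARA1"])]
lookup = build_lookup(HUMAN_RECORDS)
mapped, unmapped = map_genes(["micu1", "Cbara1", "BADNAME"], lookup)
```

Here `unmapped` collects the names that would be removed from the input list before the job is run.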
- What does CEM Strength in the Co-expression Summary page mean?
The CEM strength summarizes how well the genes in the CEM are co-expressed with each other compared to the null model, and how many datasets support their co-expression. CEM strength ψk for CEM k is defined as the average of Bayes factors over the datasets, weighted by the probability that the CEM is selected for each dataset. For dataset d, the Bayes factor is calculated using the foreground model H1 [genes in CEM k are in the same CEM, thus their pairwise correlations are normally distributed with mean θd,k and standard deviation σd,k] and the background model H0 [genes are in the null CEM, thus their pairwise correlations have mean θd,0 and standard deviation σd,0].
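The definition above can be made concrete with a small Python sketch. It rests on several assumptions that the text does not confirm: that pairwise correlations are treated as independent draws from each normal model, that the Bayes factor is taken on the log scale, and that the weighted average is normalized by the sum of the selection probabilities. The function names are hypothetical and this is not CLIC's implementation.

```python
import math


def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def log_bayes_factor(correlations, foreground, background):
    """Log Bayes factor of H1 (foreground) versus H0 (background) for one dataset.

    foreground, background: (mean, sd) tuples, i.e. (theta_dk, sigma_dk)
    and (theta_d0, sigma_d0). Assumes independent pairwise correlations.
    """
    return sum(
        normal_logpdf(r, *foreground) - normal_logpdf(r, *background)
        for r in correlations
    )


def cem_strength(log_bfs, selection_probs):
    """Average of per-dataset (log) Bayes factors, weighted by selection probability.

    Assumption: the weights are normalized by their sum.
    """
    total = sum(selection_probs)
    if total == 0:
        return 0.0
    return sum(p * b for p, b in zip(selection_probs, log_bfs)) / total
```

Under these assumptions, a dataset with a high selection probability and a large Bayes factor pulls the strength up, while datasets that are rarely selected contribute little.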
- How are the output CEMs ordered?
CEMs are ordered by CEM Strength (see above).
- How long should it take for my gene set to run?
CLIC takes about 10 minutes for small gene sets, about 30 minutes for medium-sized gene sets (<25 genes), and several hours for larger gene sets (50-250 genes). However, the online tool launches jobs on a compute cluster, and if the cluster is busy there may be a delay of several minutes to several hours before the CLIC job is launched. Jobs should complete within 24 hr.
- Is there a limit on input gene set size?
The online version of CLIC is limited to gene sets of at most 250 genes. Larger sets must be run using the command-line version of the program downloaded onto your computer.
- How can I run CLIC on more than 250 genes?
You need to download one of the CLIC command-line executables to your computer and then run it from the command line on your input gene set. You can also download the C++ source code and compile it locally. CLIC requires the GNU Scientific Library (http://www.gnu.org/software/gsl/) to be installed on the computer. The software package includes the 1774 pre-processed datasets on the Mouse430_2 (GPL1261) platform and the 1887 pre-processed datasets on the human HG-U133_Plus_2 (GPL570) platform, as well as a couple of example gene sets, for your convenience.
- Why did we choose the Mouse430_2 (GPL1261) and HG-U133_Plus_2 (GPL570) platforms?
These are the most popular mouse and human microarray platforms. After filtering out bad quality datasets, we end up having 1774 datasets for Mouse430_2 and 1887 datasets for HG-U133_Plus_2, which contain a large enough quantity of data to provide strong power to predict co-expressed genes. In future, we plan to add a few more popular platforms for other model organisms, as well as RNA-seq datasets for mouse and human.
- How did we preprocess the datasets?
The Series Matrix File for each GEO series dataset was downloaded via FTP from the GEO website in August 2014. We mapped Affymetrix probesets to NCBI Entrez IDs using the Affymetrix Mouse430_2 and HG-U133_Plus_2 annotation files downloaded in August 2014. For cases in which multiple Affymetrix probesets map to one gene, we chose the probeset with the least potential for cross-hybridization according to Affymetrix probeset annotations. Specifically, we used the following Affymetrix probeset suffix hierarchy: _at > _a_at > _s_at > _x_at. In cases where there were ties, we chose the lower-numbered probeset to represent the gene. Each dataset is a matrix where each row is a gene and each column is a sample for that GEO series.
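The probeset choice described above can be sketched in Python as follows. The function names are hypothetical, the probeset IDs in the usage note are made up for illustration, and the handling of IDs without a recognized suffix or numeric prefix is an assumption (they are given the lowest priority).

```python
# Suffixes ordered so that the longer, more specific ones are tested first;
# a lower rank means less potential for cross-hybridization.
SUFFIX_RANK = [("_x_at", 3), ("_s_at", 2), ("_a_at", 1), ("_at", 0)]


def suffix_rank(probeset):
    for suffix, rank in SUFFIX_RANK:
        if probeset.endswith(suffix):
            return rank
    return len(SUFFIX_RANK)  # assumption: unknown suffixes rank last


def probeset_number(probeset):
    """Numeric prefix of the probeset ID, used to break ties."""
    prefix = probeset.split("_", 1)[0]
    return int(prefix) if prefix.isdigit() else float("inf")


def pick_probeset(probesets):
    """Choose one probeset to represent a gene."""
    return min(probesets, key=lambda p: (suffix_rank(p), probeset_number(p), p))
```

For example, given the illustrative IDs `1417500_a_at`, `1448100_at` and `1422000_at` for one gene, `pick_probeset` prefers the two `_at` probesets over the `_a_at` one and then takes the lower-numbered of the two.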
We then pre-processed the GEO series matrices by the following protocol:
- Remove duplicated datasets and sub-datasets, i.e. remove a GEO series if it is a subset of another GEO series.
- Remove datasets with fewer than 6 samples, since a small sample size leads to huge uncertainty in estimating pairwise gene-gene co-expression correlations.
- Find datasets whose matrices are in log scale and unlog them (exponentiate with base 2), so that all matrices are on the same scale.
- Low signal filtering: remove datasets with maximum expression value < 1000.
- Normalization: scale each column to have the same mean.
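The per-dataset steps of this protocol (all except duplicate removal) can be sketched in Python as below. Two details are assumptions, because the protocol does not state them: a matrix is guessed to be in log scale when its maximum value is below the hypothetical cutoff `LOG_SCALE_CUTOFF`, and columns are rescaled to the average of the original column means. Expression values are assumed to be positive.

```python
MIN_SAMPLES = 6
MIN_MAX_SIGNAL = 1000
LOG_SCALE_CUTOFF = 100  # hypothetical heuristic, not taken from CLIC


def preprocess(matrix):
    """Apply the sample-size, unlog, low-signal and normalization steps.

    matrix: list of gene rows, each a list of per-sample expression values.
    Returns the normalized matrix, or None if the dataset is discarded.
    """
    n_samples = len(matrix[0])
    if n_samples < MIN_SAMPLES:
        return None  # too few samples to estimate correlations

    max_value = max(max(row) for row in matrix)
    if max_value < LOG_SCALE_CUTOFF:
        # Looks log-transformed: undo the log2 so all datasets share one scale.
        matrix = [[2.0 ** value for value in row] for row in matrix]
        max_value = 2.0 ** max_value

    if max_value < MIN_MAX_SIGNAL:
        return None  # low-signal dataset

    # Scale every column (sample) so that all columns have the same mean.
    n_genes = len(matrix)
    col_means = [sum(row[j] for row in matrix) / n_genes for j in range(n_samples)]
    target = sum(col_means) / n_samples
    return [
        [row[j] * target / col_means[j] for j in range(n_samples)]
        for row in matrix
    ]
```

The log-scale check runs before the low-signal filter so that the threshold of 1000 is always applied to unlogged values.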
- How did we filter out the low quality datasets?
We define the quality of a dataset using the distribution of the Fisher z-transformed expression correlation between two randomly selected genes (termed the background null distribution for a dataset). Dataset quality varies widely. Some datasets are relatively “well-shaped”, meaning that two random genes usually have a small correlation around zero. In other datasets, however, a large proportion of gene pairs have moderate or high correlations. There are several underlying reasons for this: (1) small sample size; (2) existence of multiple tissue types in the dataset; (3) poor data normalization. Each of these can make a large proportion of genes in the dataset trivially correlated, so we filter out low-quality datasets in the preprocessing stage to avoid false findings.
In CLIC, we quantify the quality of GEO datasets with a criterion based on probability distributions. Specifically, we found that a high-quality dataset has a Gaussian null distribution with small variance, whereas a low-quality dataset has a non-Gaussian null distribution with large variance. We fit the null distribution under both a Gaussian model and a non-parametric kernel model, and calculate their L1 (total variation) distance. We remove datasets with L1 distance > 0.1 or null distribution standard deviation > 1.
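A minimal Python sketch of this criterion is given below, using only the standard library. Several choices are assumptions rather than CLIC's documented method: the kernel model is a Gaussian kernel density estimate with a Silverman-style bandwidth, the L1 distance is taken as the integral of the absolute difference between the two densities (approximated on a grid), and the function names are hypothetical.

```python
import math
import statistics


def gaussian_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def null_quality(correlations, grid_points=400):
    """Return (L1 distance between Gaussian fit and KDE, null standard deviation).

    correlations: correlations of randomly selected gene pairs, strictly
    inside (-1, 1).
    """
    z = [math.atanh(r) for r in correlations]  # Fisher z-transform
    mean = statistics.fmean(z)
    sd = statistics.stdev(z)
    bandwidth = 1.06 * sd * len(z) ** -0.2  # assumed bandwidth rule
    lo, hi = min(z) - 4 * sd, max(z) + 4 * sd
    step = (hi - lo) / grid_points
    l1 = 0.0
    for i in range(grid_points):
        x = lo + (i + 0.5) * step  # midpoint rule
        kde = sum(gaussian_pdf(x, zi, bandwidth) for zi in z) / len(z)
        l1 += abs(gaussian_pdf(x, mean, sd) - kde) * step
    return l1, sd


def passes_quality_filter(correlations, max_l1=0.1, max_sd=1.0):
    l1, sd = null_quality(correlations)
    return l1 <= max_l1 and sd <= max_sd


# Deterministic toy inputs built from normal quantiles (not real data).
unit = statistics.NormalDist(0.0, 1.0)
quantiles = [unit.inv_cdf((i + 0.5) / 500) for i in range(500)]
well_shaped = [math.tanh(0.2 * q) for q in quantiles]
two_modes = [math.tanh(-0.5 + 0.1 * q) for q in quantiles] + [
    math.tanh(0.5 + 0.1 * q) for q in quantiles
]
```

The `well_shaped` input mimics a narrow Gaussian null centered at zero, while `two_modes` mimics a dataset in which many gene pairs are strongly correlated or anti-correlated, as can happen with mixed tissue types.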
- Can I use my own expression datasets?
In the command line version of CLIC, users can input their own expression dataset matrices. The dataset matrix format is described in the README.txt file in the command-line software package.
- Why is the expression profile for my gene incorrect?
In our data preprocessing, we chose one probeset if multiple probesets map to the same gene (see above). Even though we chose the probeset with the least potential for cross-hybridization according to Affymetrix probeset annotations, it is still possible that the mapped probeset is not the optimal one for some genes. In our experience with real data, most genes are mapped to the correct probesets. If you believe a gene should be strongly co-expressed with some other genes but the CLIC results do not support that, please send us an email (email@example.com).
- Why do I get different results when I add/remove one gene to an input gene set?
CLIC partitions the entire input gene set into CEMs using a Bayesian method to perform clustering. Addition or deletion of a single gene from a gene set can alter all of the clustering results. Due to the stochastic nature of the underlying Markov chain Monte Carlo (MCMC) algorithm, for large gene sets CLIC could actually return slightly different partitioning results just from running the program multiple times. However, the default behavior of CLIC is to use a fixed seed so that all results are reproducible (although the command-line software user can instead specify the random seed in the parameter file).
- Why are there no CEM+ results for my gene(s)?
By default, CLIC returns genes with LLR scores greater than 0. For inferred CEMs with small number of selected datasets, the CEM+ gene LLRs can be very low, and thus no results will be returned as part of the CEM+. The command-line software user can specify the LLR score threshold in the parameter file.
- How can I tell if a particular CEM is meaningful?
- The CEM Strength score indicates the coherence of the different member genes. High scores indicate CEMs where member genes are highly co-expressed with each other in a large number of datasets. Strength greater than 1 indicates co-expression and strength greater than 10 indicates strong co-expression.
- In addition, a larger number of selected datasets gives more confidence in the CEM, especially when the selected datasets are microarray experiments related to (targeting) the input query gene set.
In our experience, a CEM is very informative if it satisfies the following two conditions: (1) CEM strength > 5; (2) more than 20 datasets are selected with posterior probability > 50%. Furthermore, CEM+ genes are confident predictions of co-expression with the CEM if LLR > 50. In general, we do not value predictions with LLR < 10.
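These rules of thumb are easy to apply to the text output with a few lines of Python. The sketch below is only an illustration: the function names and the label "intermediate" for LLR values between 10 and 50 are hypothetical, and the thresholds are the ones quoted above.

```python
def cem_is_informative(strength, dataset_probs,
                       min_strength=5.0, min_datasets=20, min_prob=0.5):
    """Check the two rule-of-thumb conditions for an informative CEM.

    dataset_probs: posterior selection probability of each dataset.
    """
    n_selected = sum(1 for p in dataset_probs if p > min_prob)
    return strength > min_strength and n_selected > min_datasets


def classify_prediction(llr):
    """Label a CEM+ gene by its LLR score."""
    if llr > 50:
        return "confident"
    if llr < 10:
        return "low value"
    return "intermediate"  # hypothetical label for 10 <= LLR <= 50
```

For instance, a CEM with strength 2.774 does not meet the first condition, no matter how many datasets are selected.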
- What is the Summary view?
The Summary view shows you the results for CEM1 by default. To see other CEMs, use the drop-down menu at the top right.
- The large blue/green matrix in the Summary view shows you all the genes in CEM1 (blue text) as well as CEM1+ predictions that have similar co-expression (green text). The rows are CEM1 genes present on the microarray chip. The columns are the top 200 datasets in which your CEM genes are co-expressed. The color intensity of blue/green in each matrix cell indicates the strength with which this gene is co-expressed with the CEM1 genes in this dataset.
- The upper panel in the Summary view shows the scores of all the >1000 datasets. Strongly co-expressed CEMs (e.g. OXPHOS, ribosome) are strongly co-expressed in hundreds of datasets. Other input gene sets may be co-expressed in only a handful of datasets. This upper panel lets you see how many datasets are relevant for your CEM.
- What is the Top Datasets view?
By default, the Top Datasets view shows you the results for the top dataset for CEM1. To see other CEMs use the left-hand drop-down menu. To see other co-expressed datasets use the right-hand drop-down menu.
- The red/green matrix shows the Z-score expression values for your CEM and CEM+ genes (rows) against all the samples in this dataset (columns). Red color indicates high expression (max Z-score 3), green color indicates low expression (minimum Z-score -3), black color indicates average expression (Z-score 0).
- Where can I download C++ Source code?
From the main CLIC page, scroll down and click the button labeled "C++ Source".
- Where can I download CLIC executables?
From the main CLIC page, scroll down and you will find buttons to download either the Mac OSX or the Linux executable.
- Where can I download preprocessed GEO microarray datasets?
Use the executable links from the main CLIC page. These will download the executable software along with the preprocessed GEO microarray datasets and examples of how to run the tool.
- How do I run the downloaded tool?
Use the executable links from the main CLIC page to download the executable software and the preprocessed GEO microarray datasets. The download includes a user_manual.txt text file with additional instructions on how to run the tool.
CLIC input / output
Interpreting the results
You have a pathway of interest, say the heme biosynthesis pathway, and you want to know:
- Are the heme biosynthesis genes tightly co-regulated? Are they all co-expressed or do some show different expression?
- What tissues or conditions are the heme genes varying in? What might be good cell models for studying this system?
- What other genes co-express with the heme biosynthesis pathway?
- A compendium of RNA expression profiles. By default we have 4 RNA expression compendia from GEO to choose from. In general we get best results from mouse GPL1261, however in some cases all the signal comes from multi-tissue datasets and in these cases we find it useful to use the mouse GPL1261_single_tissue subset. Here are the 4 RNA expression compendia:
- mouse GPL1261: 1774 datasets (total of 28628 samples) performed on the Mouse430_2 Affymetrix microarray
- mouse GPL1261_single_tissue: Subset of the above compendium excluding multi-tissue datasets
- human GPL570: 1887 datasets (total of 45158 samples) performed on the HG-U133_Plus_2 Affymetrix microarray
- human GPL570_single_tissue: Subset of the above human compendium excluding multi-tissue datasets
- A user gene set of interest (in this case the heme biosynthesis genes)
- Go to www.gene-clic.org. This will require a user login, since results will be emailed to you.
- Create a new user login
- Click on the big green button "Submit a CLIC job"
- Now we will submit a job:
- Under the Name field enter: HemeMouse
- In the Gene List field enter your genes of interest. These can be gene symbols, NCBI Entrez IDs, Ensembl GeneIDs, Ensembl TranscriptIDs, or UniProtIDs. The names can be either human or mouse names -- genes will be mapped across species using best-bidirectional matches. Try cutting & pasting these IDs:
- Hmbs Urod Uros Alad Alas2 Ppox Cpox Fech BADNAME
- Click "Submit CLIC Job"
- An error arises indicating that it cannot map BADNAME, and that this identifier has been removed
- Click "Submit CLIC Job" again (now without BADNAME)
- You will get a message that your job has been submitted. Now you wait. CLIC is a sophisticated algorithm and requires computationally intensive processing, thus is run by a scheduler on a large compute cluster. Expect it to take from a few minutes to a few hours to get your results back (depending on the size of your input set and how busy the compute cluster is). When I tried this, it took ~10 minutes for the jobs to complete.
- Go to the page My CLIC Jobs, which summarizes the status and results of all your submitted jobs.
- First let us look at the Summary view. From the web page My CLIC Jobs, click the dark blue Summary button for HemeMouse.
- This Summary view shows you (i) the number of Co-Expression Modules (CEMs), (ii) the strength of each CEM, (iii) top gene predictions for each CEM, (iv) top datasets for each CEM.
- For HemeMouse, 6/8 input genes were clustered into 1 CEM with strength 2.774. In general, CEM strength >1 is okay co-expression and CEM strength >10 is very good co-expression.
- Next, Summary shows the Top Genes. First it shows you the names & descriptions of the 6 input Heme genes, followed by a list of genes with similar co-expression ranked by LLR (log likelihood ratio of co-expression with the input genes versus a background null model). LLR>10 is a good prediction, and LLR>100 shows very good co-expression. You can see here many genes co-expressed with LLR>100. You will see many of the top genes are involved in red blood cells.
- Note that only 6/8 input genes were clustered into CEM1: Hmbs, Urod, Uros, Alas2, Cpox, Alad. This means that Fech and Ppox showed RNA expression more similar to the background distribution than to these 6 CEM genes. This could be for either biological or technical reasons (e.g. the microarray probes were not sensitive or specific).
- Next, in the same Summary view, look at the Top Datasets (you can scroll down or use the blue Top Datasets button at the top of the page).
- Here are listed the top 150 datasets in which the input CEM genes are co-expressed and varying. A quick look tells you the top datasets are involved in hematopoietic lineages -- consistent with known functions of the heme biosynthesis pathway. Hyperlinks to GEO give more information about each experimental dataset.
- Now let us look at the Top Datasets view to look at expression of our input Heme genes in each of these top datasets. Go to the My CLIC Jobs page (eg by clicking the Back button at the top of the Summary view). For the HemeMouse input set, click the green button Top Datasets.
- Top Datasets are the datasets in which the input CEM genes are the most co-expressed and varying. Note that the CEM+ predictions are based on integrating across many datasets (where datasets are weighted based on how tightly the CEM genes themselves are co-expressed), and may not co-express with the CEM genes in any given dataset.
- For HemeMouse, the top dataset is shown in the drop-down menu at the top right: GSE42548 TH-MYCN Mice with Caspase-8 Deficiency Develop Advanced Neuroblastoma with Bone Marrow Metastasis
- This dataset contains 29 microarray samples (columns). The expression of the 6 heme CEM genes is shown in the first 6 rows, where red indicates high relative expression and green indicates low relative expression (based on z-score). You can see that the 6 heme CEM genes are highly expressed in only 5 samples, all corresponding to total bone marrow.
- Next you can look at the microarray expression for the CEM+ genes, which co-express with the input CEM genes over potentially hundreds of datasets. In this case, the CEM+ genes, such as Kif1 and Trim10, show the exact same pattern of high expression in the same 5 microarray samples.
- Next go to the second dataset, by choosing from the drop-down menu GSE20604 Downstream Targets of HOXB4 in the primitive hematopoietic progenitor EML Cell line. Then click the Update button.
- This dataset has 6 microarray samples. The 6 input CEM genes are highly expressed in 3 samples (Control Clone) and weakly expressed in 3 samples (HOXB4 Overexpressing). This dataset may tell you something interesting about when your input genes are varying together.
- Again in this dataset, the top CEM1+ genes show the same pattern as the CEM genes.
- Now let us look at the Co-expression Summary view. Go to the My CLIC Jobs page. For the HemeMouse input set, click the green button Co-expression Summary.
- This view summarizes the number and scores of the top datasets (shown as columns) and the names of the co-expressing genes in CEM1+.
- First, it shows you genes in CEM1 (other CEMs are listed in drop-down menu at top)
- The 6 heme CEM genes are shown as the top 6 rows (blue rows), where the blue color indicates the strength of co-expression in each dataset.
- The histogram bar at the top shows you the dataset weights for all datasets, weighted by how tightly the CEM genes co-express in each dataset.
- Next the CEM1+ genes are listed by their LLR score. LLR>10 indicates co-expression and LLR>100 indicates strong co-expression.
Each CLIC query creates a PDF output (along with a text file output described below). This PDF has three sections described below.
- CEMs summary: Overview of Co-Expression Modules with Dataset Weighting
- Heatmap shows pairwise correlations between all genes in the input query gene set G across all datasets X, where each dataset is weighted based on CLIC’s posterior dataset probability. Each row shows one gene, and the brightness of squares indicates the degree of correlation.
- Red boxes indicate the partition of input genes G into CEMs, ordered by CEM strength.
- CEM: Details of each CEM and its expansion CEM+
- Top panel shows the posterior selection probability (dataset weights) for top 200 GEO series datasets.
- Bottom panel shows the CEM genes (blue rows) as well as expanded CEM+ genes (green rows).
- Each column is one GEO series dataset, sorted by their posterior selection probability (dataset weight).
- The brightness of squares indicates the gene’s correlations with CEM genes in the given dataset.
- CEM+ includes genes that co-express with CEM genes in high-weight datasets, measured by LLR score.
- GEO Series: Details of each GEO series dataset and its expression profile:
- Top panel shows the detailed information (e.g. title, summary) for the GEO series dataset.
- Middle panel shows the background null distribution of expression correlation between two randomly selected genes in this dataset, which is estimated in the preprocessing stage of CLIC.
- Bottom panel shows the expression profiles for all input genes G in this dataset. Red indicates high relative expression and green indicates low relative expression measured by row Z-score.
The interactive visualizer provides an interactive HTML representation of the CLIC results. The interactive visualizer has two views: “Co-expression Summary” and “Top Datasets”. The “Co-expression Summary” shows a CEM (blue rows) and its expansion (CEM+, green rows), where each row is one gene and each column summarizes one dataset (e.g. GSE24920: Gene Expression in Myotonic Dystrophy). “Top Datasets” shows expression profiles for each dataset (e.g. GSE24920: Gene Expression in Myotonic Dystrophy) where each row is one gene and each column is one microarray within this dataset.
Co-expression Summary View
There is one “Co-expression Summary View” for each CEM. You can select which CEM to view using the drop-down menu at the upper right. For each CEM, at the top is a summary of the number of genes in the CEM, the number of genes in the CEM+, and the CEM strength score. The “Co-expression Summary” shows a heatmap of the CEM genes (blue rows) and CEM+ genes (green rows) across the top 200 GEO series datasets (columns). The GEO series dataset names are hyperlinked to the “Top Datasets” view (to view detailed expression profiles within this dataset). The brightness of each heatmap square indicates the gene’s correlations with all CEM genes in the given dataset. CEM+ includes genes that co-express with CEM genes in high-weight datasets, measured by the LLR score.
Top Datasets View
There is one “Top Datasets” view for each CEM for each of the top 200 datasets. You can select which CEM and dataset to view using the drop-down menu at the upper right. Each column is one microarray sample, and each row is a gene. The expression values are normalized for each row to have mean 0 and standard deviation 1. The red-to-green color gradient shows high to low normalized expression values, from 3 down to -3. A button at the top of this page links to the GEO webpage for this GEO series dataset.
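The row normalization behind this heatmap can be sketched in Python as below. The function name is hypothetical, and two details are assumptions: the population standard deviation is used (so each row has standard deviation exactly 1), and values beyond the displayed range are clipped to -3 or 3 for coloring.

```python
import statistics


def row_zscores(row, clip=3.0):
    """Normalize one gene's expression values to mean 0 and standard deviation 1,
    then clip to the displayed color range [-clip, clip]."""
    mean = statistics.fmean(row)
    sd = statistics.pstdev(row)
    if sd == 0:
        return [0.0] * len(row)  # constant row: show as average expression
    return [max(-clip, min(clip, (value - mean) / sd)) for value in row]
```

A sample with a Z-score of 3 or more would then be drawn in the brightest red, and one at -3 or below in the brightest green.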
Compared to the PDF output, the interactive visualizer has the following added features.
- You select the CEM of interest via a drop down menu at the top left. Unlike the PDF output, there is no summary of the entire partition in the Interactive Visualizer.
- Gene symbols (left column of result matrix) are hyperlinked to the relevant gene page in NCBI Entrez Gene or other database.
- GEO series dataset names (column headers of result matrix) are hyperlinked to the page showing the expression profiles for CEM and CEM+ genes.
- Mouse Hover: Hovering on a cell in the results table will provide a tooltip with the gene symbol and GEO series name for your convenience.
Downloaded ZIP results include a text file in addition to the PDF file. The text file contains the same information as the PDF file. This text file can be useful to view in Excel and to link gene symbols to other annotations.