GEView (Gene Expression View) Tool
Download and Installation: http://www.weizmann.ac.il/complex/compphys/software/geview/
Note on Linux version: The result Excel file can be opened in OpenOffice, with functionality similar to that on Windows.
General Description:
This tool was developed by Libi Hertzberg and Assif Yitzhaky (Domany group, Weizmann Institute of Science) and Metsada Pasmanik-Chor (Head of Bioinformatics Unit, Faculty of Life Science, TAU).
The following flow chart summarizes the functionalities of GEView.
Input:
Microarray format:
A tab-delimited text file with normalized and summarized expression values (log 2 scale). Each row (beginning from the 3rd row) represents a probe set for a specific gene, and each column (beginning from the 3rd column) represents a sample.
· The first column contains the probe set name.
· The second column contains the gene symbol name.
· The first row contains the samples' individual names.
· The second row contains the samples' labels. Labels are short and clear descriptions of the sample type, containing letters and numbers only, with no spaces. Underscores ( _ ) separate the label levels. An example of the label text for one sample: CNT_Bt1_C. In this example Label 1 is CNT (control), Label 2 is Bt1 (batch 1) and Label 3 is C (child). This means that the sample is from the control group, was measured in batch number 1, and is from a child. Label 1 represents the different conditions of the experiment, and is used for the t-test or ANOVA and for grouping samples in the box plots. The number of labels (= number of underscores (_) + 1) is not limited, and the number of conditions inside each label is also not limited.
The data begin from the 3rd row and the 3rd column.
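The header layout described above can be illustrated with a short Python sketch. This is not GEView code: the function name `parse_header`, the validation pattern and the three-sample example text are assumptions made for illustration only. It reads the first two rows of a tab-delimited table, checks that each label contains only letters, digits and underscores, and splits each label into its levels.

```python
import csv
import io
import re

# Assumed validation rule: letters and digits only, levels separated by "_".
LABEL_PATTERN = re.compile(r"[A-Za-z0-9]+(_[A-Za-z0-9]+)*")

def parse_header(text, n_descriptive_cols=2):
    """Return (sample_names, label_levels) from the first two rows.

    n_descriptive_cols is 2 for the microarray format
    (probe set name + gene symbol).
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    names = rows[0][n_descriptive_cols:]
    label_strings = rows[1][n_descriptive_cols:]
    for s in label_strings:
        if not LABEL_PATTERN.fullmatch(s):
            raise ValueError(f"invalid label: {s!r}")
    return names, [s.split("_") for s in label_strings]

# Hypothetical three-sample table in the microarray layout.
example = (
    "\t\tSample1\tSample2\tSample3\n"
    "\t\tCNT_Bt1_M\tCNT_Bt1_F\tDIS_Bt2_M\n"
    "34689_at\tTREX1\t5.5\t5.3\t7.5\n"
)
names, labels = parse_header(example)
conditions = [levels[0] for levels in labels]  # Label 1: experiment condition
print(names)       # ['Sample1', 'Sample2', 'Sample3']
print(conditions)  # ['CNT', 'CNT', 'DIS']
```

Running such a check before loading a file into GEView is one way to catch labels that contain spaces or other disallowed characters.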
Note on data scale: the expression values can be on log 2 scale (default) or raw data. In the latter case, please make sure that all values are positive, since GEView will apply a log 2 transformation to your data.
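The reason for the positivity requirement is that log 2 is undefined for zero and negative values. A minimal Python sketch of such a check follows; the function name `to_log2` is hypothetical and this is not how GEView itself is implemented.

```python
import math

def to_log2(values):
    """Log2-transform raw expression values, rejecting non-positive input."""
    bad = [v for v in values if v <= 0]
    if bad:
        raise ValueError(f"log2 is undefined for non-positive values: {bad}")
    return [math.log2(v) for v in values]

print(to_log2([8, 32, 1024]))  # [3.0, 5.0, 10.0]
```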
In the following input example, the first label level (the conditions for the statistical analysis) has either the value “CNT” or “DIS”. The second level (after the first underscore _ ) designates the batch (Bt1 or Bt2).
|  |  | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 | Sample6 |
|  |  | CNT_Bt1_M | CNT_Bt1_F | CNT_Bt2_F | DIS_Bt1_M | DIS_Bt2_F | DIS_Bt2_M |
| 34689_at | TREX1 | 5.5 | 5.3 | 4.7 | 7.5 | 12 | 4.4 |
| 34697_at | LRP6 | 3.7 | 7.6 | 5.7 | 5.6 | 8.5 | 3.7 |
| 34726_at | CACNB3 | 9 | 5.9 | 4.6 | 7.6 | 8 | 6.7 |
A small tab-delimited input file example can be found here: data_example_small.txt
Tip: In order to create a similar expression table, you can use the EXPANDER tool (http://acgt.cs.tau.ac.il/expander/) or the Expression Console tool (for Affymetrix microarray data): http://www.affymetrix.com/estore/browse/level_seven_software_products_only.jsp?productId=131414#1_1
Next Generation Sequencing (NGS) / protein format:
Same as the microarray format described above, but without the first column. In this case the only descriptive column is the gene symbol name.
Example:
|  | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 | Sample6 |
|  | CNT_Bt1_M | CNT_Bt1_F | CNT_Bt2_F | DIS_Bt1_M | DIS_Bt2_F | DIS_Bt2_M |
| TREX1 | 5.5 | 5.3 | 4.7 | 7.5 | 12 | 4.4 |
| LRP6 | 3.7 | 7.6 | 5.7 | 5.6 | 8.5 | 3.7 |
| CACNB3 | 9 | 5.9 | 4.6 | 7.6 | 8 | 6.7 |
GUI Layout – 3 stages:
Stage 1 – Load data
Use the “Load” button to browse for the Data File. After you choose the Data File, its path will appear in the Data File text box.
Stage 2 – Preprocessing
Click “PCA” to plot the Principal Component Analysis of the samples, based on (up to) the 1000 highest-variance genes. Using the new figure toolbar, you can rotate (3D) the PCA figure, zoom in, save, etc.
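The gene-selection step that precedes the PCA can be sketched in Python as follows. This is only an illustration under assumptions: the function name `top_variance_rows` is hypothetical, the sample variance is computed per row (probe set), and the PCA computation itself is not shown. How GEView ranks the genes internally is not specified here.

```python
import statistics

def top_variance_rows(table, max_rows=1000):
    """Return up to max_rows row IDs, ordered by decreasing sample variance.

    table maps a row ID (e.g. a probe set name) to its list of
    log2 expression values across samples.
    """
    ranked = sorted(
        table,
        key=lambda row_id: statistics.variance(table[row_id]),
        reverse=True,
    )
    return ranked[:max_rows]

# Values taken from the microarray input example above.
table = {
    "34689_at": [5.5, 5.3, 4.7, 7.5, 12, 4.4],
    "34697_at": [3.7, 7.6, 5.7, 5.6, 8.5, 3.7],
    "34726_at": [9, 5.9, 4.6, 7.6, 8, 6.7],
}
print(top_variance_rows(table, max_rows=2))  # ['34689_at', '34697_at']
```

The "up to" in the description corresponds to the slice: if the table has fewer than 1000 rows, all of them are used.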
If you right-click anywhere on the white space between the points (but not on the points themselves), a context menu will appear, enabling you to recolor the points according to any one of the labels (as defined in the second row of the input file, see the input section above).
If you right-click on a point (sample), a context menu will appear, enabling you to remove (filter out) this point from the analysis.
If you click “Remove Sample…”, this point will be designated as “Removed”, and the sample name will appear in the stage 2 “filtered samples” list (see the figures below).
After removing a sample, you can add it again (cancel the sample removal) by right-clicking the sample again and choosing “Add Sample…”.
After removing the unwanted samples, click “PCA” again to recalculate and redisplay the PCA of the desired samples.
Batch correction:
Use this button if you have several batches and would like to perform batch effect correction. Note: the batch label must be the second level, after the first underscore (_), in the second row of the input file (see the input section above). Moreover, in order to run the batch correction, every condition (first label level) must appear in each batch group.
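The requirement that every condition appears in each batch can be checked before running the correction. The following Python sketch is an illustration only; the function name `missing_conditions_per_batch` is hypothetical and is not part of GEView.

```python
def missing_conditions_per_batch(label_strings):
    """Return {batch: sorted conditions absent from that batch}.

    Assumes Label 1 is the condition and Label 2 is the batch.
    An empty result means the requirement is satisfied.
    """
    pairs = []
    for s in label_strings:
        levels = s.split("_")
        if len(levels) < 2:
            raise ValueError(f"label has no batch level: {s!r}")
        pairs.append((levels[0], levels[1]))
    all_conditions = {cond for cond, _ in pairs}
    seen = {}
    for cond, batch in pairs:
        seen.setdefault(batch, set()).add(cond)
    return {
        batch: sorted(all_conditions - conds)
        for batch, conds in seen.items()
        if all_conditions - conds
    }

# Labels from the input example above: both batches contain CNT and DIS.
labels = ["CNT_Bt1_M", "CNT_Bt1_F", "CNT_Bt2_F",
          "DIS_Bt1_M", "DIS_Bt2_F", "DIS_Bt2_M"]
print(missing_conditions_per_batch(labels))  # {}

# Hypothetical failing case: batch Bt2 has no DIS sample.
print(missing_conditions_per_batch(["CNT_Bt1", "DIS_Bt1", "CNT_Bt2"]))
# {'Bt2': ['DIS']}
```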
After the batch correction is completed, an additional PCA figure will open, reflecting the batch correction output.
The batch correction algorithm used in GEView is ComBat, as implemented in the sva (Surrogate Variable Analysis) package of Bioconductor: https://bioconductor.org/packages/release/bioc/html/sva.html
References:
- Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007; 8(1):118–27. Epub 2006/04/25. doi: 10.1093/biostatistics/kxj037 PMID: 16632515.
- Jeffrey T. Leek, W. Evan Johnson, Hilary S. Parker, Andrew E. Jaffe and John D. Storey (). sva: Surrogate Variable Analysis, R package version 3.6.0.
Batch correction example:
Before: in the following figure, the samples are grouped according to batches.
After: in the following figure, the samples are grouped according to the experiment conditions.
Stage 3 – Run
Running the statistical analysis may take a while, depending on the number of probe sets processed. You can follow the progress by watching the status line at the bottom of the graphical interface.
This stage generates an Excel file with one row per probe set. The probe sets are sorted (starting from the most significant) by the FDR ANOVA Q-value (corrected p-value, see reference below). Each row contains a link to the figure containing the ANOVA box plots. Note: if there are more than two experiment conditions (more than two groups in the first level of the labels), Tukey’s test for multiple comparisons is performed; for each pair of groups, the fold change (“f.change”) and the p-value (“p-val”) are given.
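For readers unfamiliar with the FDR correction, the Benjamini-Hochberg procedure can be sketched in a few lines of Python. This is a generic textbook version under the assumption that the Q-value is the standard BH-adjusted p-value; the function name `benjamini_hochberg` and the example p-values are illustrative and are not taken from GEView.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted q-values, in the same order as the input p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, ascending p
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

# Hypothetical p-values for four probe sets.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(x, 4) for x in q])  # [0.02, 0.04, 0.04, 0.02]
```

Sorting the probe sets by these q-values, smallest first, gives the "most significant first" ordering described above.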
Example of an output table:
| Probe set ID | Gene Symbol | GeneCards Link | UCSC Link | NCBI Link | ANOVA P-value | Q-value (corrected) | Box plots | E2A/HD f.change | p-val | E2A/TEL f.change | p-val | HD/TEL f.change | p-val |
| 212148_at | PBX1 |  |  |  | 2.30E-15 | 3.80E-11 |  | 20.3 | 1.90E-09 | 21 | 1.90E-09 | 1.03 | 0.9 |
| 212151_at | PBX1 |  |  |  | 3.50E-13 | 2.90E-09 |  | 25.2 | 1.90E-09 | 20.6 | 1.90E-09 | 0.817 | 0.2 |
| 200953_s_at | CCND2 |  |  |  | 2.10E-07 | 0.00022 |  | 0.0787 | 1.70E-07 | 0.223 | 1.40E-05 | 2.83 | 0.00089 |
| 201005_at | CD9 |  |  |  | 1.30E-06 | 0.00055 |  | 0.874 | 0.92 | 10.7 | 6.00E-06 | 12.3 | 6.90E-06 |
| 205253_at | PBX1 |  |  |  | 3.70E-06 | 0.0011 |  | 10.8 | 2.60E-05 | 10.8 | 6.10E-06 | 1 | 1 |
| 204849_at | TCFL5 |  |  |  | 2.20E-05 | 0.0029 |  | 0.639 | 0.34 | 0.152 | 2.70E-05 | 0.238 | 0.00069 |
| 200951_s_at | CCND2 |  |  |  | 3.80E-05 | 0.004 |  | 0.372 | 3.10E-05 | 0.744 | 0.083 | 2 | 0.00049 |
| 221773_at | ELK3 |  |  |  | 0.0011 | 0.024 |  | 0.195 | 0.0016 | 0.279 | 0.0039 | 1.43 | 0.56 |
| 200952_s_at | CCND2 |  |  |  | 0.13 | 0.27 |  | 0.789 | 0.18 | 0.812 | 0.18 | 1.03 | 0.97 |
| 206127_at | ELK3 |  |  |  | 0.28 | 0.43 |  | 0.988 | 0.99 | 0.905 | 0.32 | 0.916 | 0.45 |
FDR reference:
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57, 289–300.
If a particular gene symbol has multiple probe sets, all probe sets for that gene are combined in the same figure (see an example below). The figure displays 2 examples: A) 3 probe sets of PBX1. B) 3 probe sets of ELK3. The Y-axis represents the log2 expression value and the X-axis represents the 3 experimental conditions (see the E2A, HD and TEL X-labels). The number of samples in each experimental condition is written in parentheses. Each sample's expression value is marked by a red dot, and the mean (default) and standard deviation in each experimental condition are marked by a black line and a blue box, respectively.
Run parameters
1. You have the option to display in the figure either the mean (default) or the median value of each condition group. In the example below, the mean is shown as a black line.
2. The parameter “Display sample names in Fig.” determines which sample names will be displayed in the figure. The default value (min & max exp.) displays the names of only 2 samples in each box: the samples with the minimal and maximal expression. You can also choose to display all the sample names (note that the figure may become overcrowded), or none of them. In the example above, no sample names are shown.
3. The checkbox “only probe sets with gene symbols” (checked by default) determines whether all probe sets will be analyzed or only probe sets that have gene symbols (default).
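To make the default of parameter 2 (min & max exp.) concrete, the following Python sketch picks the two sample names that would be shown for one box. The function name `min_max_sample_names` is hypothetical and the code is an illustration, not GEView's implementation.

```python
def min_max_sample_names(names, values):
    """Return the names of the samples with the lowest and highest expression."""
    lo = min(range(len(values)), key=values.__getitem__)
    hi = max(range(len(values)), key=values.__getitem__)
    return names[lo], names[hi]

# TREX1 values of the three CNT samples from the input example above.
names = ["Sample1", "Sample2", "Sample3"]
values = [5.5, 5.3, 4.7]
print(min_max_sample_names(names, values))  # ('Sample3', 'Sample1')
```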