Module 3: Subtype Analysis
Run in one command
mkdir s3
cd s3
baseDir=/your/path/to/PipeOne/
nextflow run ${baseDir}/s3_Subtype.nf -profile docker \
--rawdir ../test_dat/s3_subtype/00_rawdata/ \
--clinical ../test_dat/s3_subtype/KIRP_cli.OS.csv
Options
Required:
--rawdir <str> directory containing the RNA-seq result tables produced in module 1
--clinical <str> Clinical information file
Optional:
--var_topK <int> number of highest-variance features to keep from each table. default [1000]
--cluster_range <str> range of cluster numbers to test in K-means. default ["3-8"]
--clinical Clinical information
The clinical file must contain three columns named Sample, Event, and Time_to_event.
For example:
$ head KIRP_cli.OS.csv
Sample,Event,Time_to_event
105248a5-eb2a-4336-b76c-cb759f45e45b_gdc_realn_rehead,0,214
e263e4a0-1489-40a9-8e89-9ee6aa19cdc7_gdc_realn_rehead,0,2298
be3d303c-ba1a-4cf8-a8a5-b5adcae05d14_gdc_realn_rehead,0,1795
dc5d11b5-742f-4740-acd0-2806962d9d1b_gdc_realn_rehead,1,1771
291bab1e-5d83-4b8f-9212-ecec1278ea1a_gdc_realn_rehead,0,3050
5779df14-6e54-4f72-97d3-5486e31ffc5d_gdc_realn_rehead,0,1731
a58cd3b2-3b69-4677-87f3-82b256bbcc48_gdc_realn_rehead,1,139
753f54d4-1994-4a3a-adeb-a37df28973d6_gdc_realn_rehead,0,2790
ff1a9e27-04bb-4732-8631-4ffad26c700e_gdc_realn_rehead,0,2294
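Before starting a long run, it can help to confirm that the clinical file has the expected layout. The following is a minimal sketch using only the Python standard library; it is not part of PipeOne. The function name check_clinical is hypothetical, and the checks on Event (0 or 1) and Time_to_event (a non-negative number) are assumptions inferred from the example above.

```python
import csv
import io

REQUIRED_COLUMNS = ["Sample", "Event", "Time_to_event"]


def check_clinical(handle):
    """Return the number of data rows; raise ValueError if the file is malformed."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    n_rows = 0
    for line_no, row in enumerate(reader, start=2):
        # Assumption: Event is coded 0/1, as in the example file.
        if row["Event"] not in ("0", "1"):
            raise ValueError(f"line {line_no}: Event must be 0 or 1")
        # Assumption: Time_to_event is a non-negative number.
        if float(row["Time_to_event"]) < 0:
            raise ValueError(f"line {line_no}: Time_to_event must be >= 0")
        n_rows += 1
    return n_rows


# Two rows copied from the example above, held in memory instead of a file.
example = io.StringIO(
    "Sample,Event,Time_to_event\n"
    "105248a5-eb2a-4336-b76c-cb759f45e45b_gdc_realn_rehead,0,214\n"
    "dc5d11b5-742f-4740-acd0-2806962d9d1b_gdc_realn_rehead,1,1771\n"
)
print(check_clinical(example))
```

To check a real file, pass an open file handle, for example open("KIRP_cli.OS.csv", newline="").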
Results
Main results
- record_log_rank_test_pvalue.csv: survival differences among the different clustering results
- clusters: clustering results and survival curve plots
    - clusters/eval_cluster_num/lowDim=*_alpha=*_gamma=*_clusters=*_clustering.csv: K-means clustering based on the H matrix
    - clusters/eval_cluster_num/lowDim=*_alpha=*_gamma=*_silhouette_score.png: silhouette width plot of the clustering results
    - clusters/surv_curve/low_dim=*_alpha=*_gamma=*_clustering.pdf: survival plot of the K-means clustering results
- FeatureSelection: Random Forest results
    - RF_best_params_settings_for_feature_selection.csv: best Random Forest parameters and accuracy
    - feature*_importance.csv: Random Forest feature importance
    - RF_params_setting_record.txt: all Random Forest parameters and accuracies
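The clustering file names encode the parameters used for each run, so they can be parsed when comparing results. The sketch below is an illustration, not part of PipeOne. It assumes the parameter values themselves contain no underscore, and the file name in the example is hypothetical.

```python
import re

# Matches names such as lowDim=*_alpha=*_gamma=*_clusters=*_clustering.csv
# Assumption: none of the parameter values contains an underscore.
PATTERN = re.compile(
    r"lowDim=(?P<lowDim>[^_]+)"
    r"_alpha=(?P<alpha>[^_]+)"
    r"_gamma=(?P<gamma>[^_]+)"
    r"_clusters=(?P<clusters>[^_]+)"
    r"_clustering\.csv$"
)


def parse_clustering_name(path):
    """Return the parameters encoded in a clustering file name as strings."""
    match = PATTERN.search(path)
    if match is None:
        raise ValueError(f"not a clustering result file: {path}")
    return match.groupdict()


# Hypothetical file name, for illustration only.
name = "clusters/eval_cluster_num/lowDim=5_alpha=0.1_gamma=0.01_clusters=4_clustering.csv"
print(parse_clustering_name(name))
```

The values are returned as strings so that nothing is assumed about their numeric type.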
Intermediate files
- NMF: results of non-negative matrix factorization (NMF) under different parameters
    - weight_*: weight matrix W of NMF
    - X_*: H matrix
- data_randomForest: selected data for running Random Forest
    - lowDim=*_alpha=*_gamma=*_clusters=*_clustering.csv: selected clustering result
    - top*: top K features in the W matrix used for Random Forest training
- rf_my_record.csv: Random Forest accuracy record file
Run module 3 step by step (alternative)
Users can run the module step by step to access more options.
1. Select the top-K variance features and run NMF
source activate pipeOne_ml
baseDir=/path/to/PipeOne/
python3 ${baseDir}/bin/NMF/proc_raw_data.py proc --rawdir 00_rawdata/ --sample_info sample.cli.csv --var_topk 1000
## defusion; this step takes a long time to run
python3 ${baseDir}/bin/NMF/run_defusion.py --threads 24
python3 ${baseDir}/bin/NMF/check_convergence.py
Options
--rawdir <str> directory containing the RNA-seq result tables, such as expression level (TPM value), RNA editing rate, fusion events, etc.
--clinical <str> Clinical information file
--var_topK <int> number of highest-variance features to keep. default [1000]
--threads <int> number of threads to use. default [24]
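To illustrate what selecting the top K variance features means, here is a minimal standard-library sketch. It is not the code in proc_raw_data.py; the function name top_k_variance, the toy table, and the use of population variance are all assumptions made for the example.

```python
from statistics import pvariance


def top_k_variance(table, k):
    """Return the names of the k features with the highest variance.

    table maps a feature name to its list of values across samples.
    """
    ranked = sorted(table, key=lambda feature: pvariance(table[feature]), reverse=True)
    return ranked[:k]


# Toy table with three features measured in four samples (made-up values).
toy = {
    "geneA": [1.0, 1.0, 1.0, 1.0],    # variance 0
    "geneB": [0.0, 10.0, 0.0, 10.0],  # variance 25
    "geneC": [2.0, 4.0, 2.0, 4.0],    # variance 1
}
print(top_k_variance(toy, 2))  # ['geneB', 'geneC']
```

Features that barely change across samples, such as geneA here, carry little information for separating subtypes, which is why they are dropped first.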
2. Clustering and evaluation
python3 ${baseDir}/bin/NMF/eval_cluster_num.py --cluster_range "3-8"
Rscript ${baseDir}/bin/NMF/survival_eval.R ./data/sample.cli.csv ./clusters/surv_curve/ "3-8"
Options
--cluster_range <str> range of cluster numbers to test in K-means. default ['3-8']
The script survival_eval.R requires three mandatory inputs:
* clinical information file
* cluster_result, which is produced by the script eval_cluster_num.py
* cluster_range, the same value as --cluster_range
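Because both scripts must receive the same cluster range, it can be convenient to define it once and build both command lines from it. The sketch below only constructs the argument lists and does not run anything. The helper names are hypothetical, and reading "3-8" as the inclusive range 3 to 8 is an assumption.

```python
def parse_cluster_range(spec):
    """Expand a string such as "3-8" into a list of cluster numbers.

    Assumption: both ends of the range are included.
    """
    low, high = (int(part) for part in spec.split("-"))
    if low < 2 or high < low:
        raise ValueError(f"invalid cluster range: {spec}")
    return list(range(low, high + 1))


def build_commands(base_dir, clinical, surv_dir, cluster_range):
    """Return the two step-2 command lines, sharing one cluster range."""
    parse_cluster_range(cluster_range)  # fail early on a malformed range
    return [
        ["python3", f"{base_dir}/bin/NMF/eval_cluster_num.py",
         "--cluster_range", cluster_range],
        ["Rscript", f"{base_dir}/bin/NMF/survival_eval.R",
         clinical, surv_dir, cluster_range],
    ]


print(parse_cluster_range("3-8"))  # [3, 4, 5, 6, 7, 8]
for command in build_commands("/path/to/PipeOne", "./data/sample.cli.csv",
                              "./clusters/surv_curve/", "3-8"):
    print(" ".join(command))
```

Each list can be passed to subprocess.run(command, check=True) to execute the step.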
3. Select features
python3 ${baseDir}/bin/NMF/select_topk_nong.py
python3 ${baseDir}/bin/NMF/find_best_RFparams.py
Options
select_topk_nong.py
--topK_importance <str> numbers of most important features to use for retraining. default ['50,100,200']
--outdir <str> output directory. default [./data_randomForest]
--cluster_survival_file <str> default [record_log_rank_test_pvalue.csv]. When this option is provided, the program selects the clustering result with the maximum silhouette width whose log-rank test p-value is significant.
--cluster_file <str> a clustering result file under ./clusters/surv_curve/. default [None]. When this option is provided, the program uses that clustering result.
Note: The --cluster_survival_file option is mutually exclusive with the --cluster_file option.
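The mutual exclusion described in the note can be expressed with an argparse mutually exclusive group. This is a sketch of how such options behave, not the actual parser in select_topk_nong.py; the option names and defaults are taken from the list above, and the file name passed in the example is hypothetical.

```python
import argparse


def build_parser():
    """Illustrative parser that mirrors the documented options."""
    parser = argparse.ArgumentParser(description="illustrative sketch only")
    parser.add_argument("--topK_importance", default="50,100,200")
    parser.add_argument("--outdir", default="./data_randomForest")
    # argparse rejects a command line that supplies both options of the group.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--cluster_survival_file",
                       default="record_log_rank_test_pvalue.csv")
    group.add_argument("--cluster_file", default=None)
    return parser


# Hypothetical file name, for illustration only.
args = build_parser().parse_args(
    ["--cluster_file", "clusters/surv_curve/example_clustering.csv"]
)
print(args.cluster_file)
# The comma-separated string is split into integers.
print([int(k) for k in args.topK_importance.split(",")])  # [50, 100, 200]
```

Passing both --cluster_survival_file and --cluster_file to this parser makes argparse exit with an error message, which matches the constraint in the note.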
find_best_RFparams.py
--ddir <str> input directory generated by the previous step. default [./data_randomForest]
--tdir <str> output directory. default [./FeatureSelection/]