
__________________________________________________________________________________________

COMPILING C/C++ FUNCTIONS WITH MATLAB MEX
__________________________________________________________________________________________

To improve runtime efficiency, some routines called by the MATLAB functions documented below
are written in C/C++ and must be compiled within MATLAB. To compile these routines it is
necessary that
1) the Boost library is installed (see http://www.boost.org)
2) the MATLAB mex compiler is configured (see http://mathworks.com/help/matlab/ref/mex.html)

Then all relevant mex-functions are generated by the following commands from the MATLAB
console (change to the folder GRF_estimation/C_code to execute these commands):

mex -I/PATH_TO_BOOST_INCLUDE_FILES ./SimAnnFit.cpp ./LamSimAnn.cpp ./Energy_fit.cpp
mex -I/PATH_TO_BOOST_INCLUDE_FILES ./SimAnnFit_lambda.cpp ./LamSimAnn.cpp ./Energy_fit_lambda.cpp
mex -I/PATH_TO_BOOST_INCLUDE_FILES ./SimAnnFit_4d.cpp ./LamSimAnn.cpp ./Energy_fit_4d.cpp
mex ./ConstructNetworks.cpp ./NetworkFinder.cpp

After compilation, either move the generated MEX files to the GRF_estimation main folder
or include GRF_estimation/C_code in your MATLAB search path.
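
The second option can be done from the MATLAB console. A minimal sketch, assuming the
current folder is still GRF_estimation/C_code after the mex commands above:

```matlab
% Add the current folder (assumed to be GRF_estimation/C_code) to the search path.
addpath(pwd);
% savepath;   % optional: keep the entry for future sessions (may need write access)
```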


__________________________________________________________________________________________

REQUIRED DATA FORMAT
__________________________________________________________________________________________

DTA data required for the functions documented below is stored in the file 'DTA_data.mat'
and must be loaded into the MATLAB workspace prior to running them. The data arrays within 
this file are declared as 'global' and most functions access them as such. Therefore, the
variable names within 'DTA_data.mat' cannot be changed for the functions to run properly.
The following data arrays are contained within 'DTA_data.mat':

expr1,expr2: 5656 x 42 matrices containing the total mRNA expression levels for 5656 ORFs
			 and 42 timepoints (t = 0, 5, 10, ..., 205 min)
			 
syn1, syn2:	 5656 x 42 matrices containing the newly transcribed mRNA levels for 5656 ORFs
			 and 42 timepoints (t = 0, 5, 10, ..., 205 min). This is used as a proxy for
			 current gene activity

gene_names_std: 5656 x 1 cell array containing the standard names for the measured ORFs

gene_names_sys: 5656 x 1 cell array containing the systematic names for the measured ORFs
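
A quick sanity check after loading can catch a wrong or renamed file early. The sketch
below assumes 'DTA_data.mat' is on the MATLAB path; the size checks only restate the
dimensions listed above and are not part of the package.

```matlab
t = 0:5:205;                          % the 42 measurement timepoints in minutes
if exist('DTA_data.mat', 'file')      % skipped silently if the file is not found
    load('DTA_data.mat');
    assert(isequal(size(expr1), [5656, numel(t)]));
    assert(isequal(size(syn1), size(expr1)));
    assert(numel(gene_names_std) == 5656 && numel(gene_names_sys) == 5656);
end
```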


__________________________________________________________________________________________

AUXILIARY FUNCTIONS
__________________________________________________________________________________________

num = g(name)

Returns the index of the gene with standard name 'name'. If the gene name is not contained
in the data, -1 is returned.

EXAMPLE:
swi4_num = g('swi4');

------------------------------------------------------------------------------------------

num = geneStd2Num(list)

Returns a list of gene indices for a list of standard names. If a gene name is not 
contained in the data, -1 is returned.

ARGUMENT:
list: cell array containing standard gene names

EXAMPLE:
gn = geneStd2Num({'swi4','swi5'});
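
The lookup convention (index on success, -1 otherwise) can be illustrated on a toy name
list. This is a sketch, not the package's code: 'names' stands in for the global
gene_names_std, and the case-insensitive match is an assumption.

```matlab
names = {'SWI4'; 'SWI5'; 'FKH1'};     % toy stand-in for gene_names_std
query = {'swi5', 'abc1'};             % 'abc1' is deliberately not in the list
num = -ones(size(query));             % -1 marks "not contained in the data"
for k = 1:numel(query)
    hit = find(strcmpi(names, query{k}), 1);   % assumed: case-insensitive exact match
    if ~isempty(hit)
        num(k) = hit;
    end
end
% num is now [2, -1]
```

Checking the result with any(num == -1) before indexing into the data avoids an
out-of-range index error further down.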

------------------------------------------------------------------------------------------

num = geneSys2Num(list)

Returns a list of gene indices for a list of systematic names. If a gene name is not 
contained in the data, -1 is returned.

ARGUMENT:
list: cell array containing systematic gene names

EXAMPLE:
gn = geneSys2Num({'ybr142w','ybr138c'});

------------------------------------------------------------------------------------------

res = geneNum2Name(list)

Returns a cell array of gene names for a list of gene indices. The standard gene name is
used if it exists, otherwise the systematic name is returned.

ARGUMENT:
list: vector containing gene indices

EXAMPLE:
names = geneNum2Name([35,992,3341]);
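
The fallback from standard to systematic names can be sketched as follows. The name
lists are invented placeholders, and representing a missing standard name by an empty
string is an assumption about the data file.

```matlab
std_names = {'GENE_A'; ''; 'GENE_C'};      % '' marks a missing standard name (assumption)
sys_names = {'ORF_1'; 'ORF_2'; 'ORF_3'};   % placeholder systematic names
list = [1, 2];
res = std_names(list);                     % start from the standard names
missing = cellfun(@isempty, res);          % entries without a standard name
res(missing) = sys_names(list(missing));   % fall back to the systematic name
% res is now {'GENE_A'; 'ORF_2'}
```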

------------------------------------------------------------------------------------------

p_t = protein_traj(mrna,t,p_0, lambdaP, alphaP)

Computes a 'protein model' time series from mRNA data.

ARGUMENTS:
mrna:   row vector containing the mRNA data
t:      row vector containing the timepoints for the mRNA data (for our data t = 0:5:205)
p_0:    initial value of the protein model (see set_p0 below)
lambdaP: effective degradation rate of the protein model (parameter lambda)
alphaP:  effective translation rate of the protein model (can be omitted for alphaP = lambdaP)

RETURNS:
row vector of length(mrna) with protein model

EXAMPLE:
p_t = protein_traj(expr1(g('swi4'),:),0:5:205,set_p0(expr1(g('swi4'),:),0.1386,0:5:205),0.1386);
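
To give an idea of what such a 'protein model' looks like, the sketch below integrates
dp/dt = alphaP*m(t) - lambdaP*p(t) on a synthetic mRNA profile. The model form is an
assumption based on the parameter descriptions above, and ode45 with linear
interpolation is a choice made for this illustration, not necessarily what protein_traj
does internally.

```matlab
t = 0:5:205;                              % timepoints in minutes
mrna = 1 + 0.5*sin(2*pi*t/60);            % synthetic mRNA profile, not real data
lambdaP = log(2)/5;                       % degradation rate for a half-life of 5 min
alphaP = lambdaP;                         % default documented above
p_0 = 1;                                  % initial protein value (chosen by hand here)
% Assumed model: linear production from mRNA, first-order decay.
rhs = @(tt, p) alphaP*interp1(t, mrna, tt) - lambdaP*p;
[~, p_t] = ode45(rhs, t, p_0);            % solution reported at the data timepoints
p_t = p_t.';                              % row vector, like the mRNA input
```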

------------------------------------------------------------------------------------------

p = set_p0(m,lp,t)

Estimates the initial value of a protein time series by minimizing linear trends

ARGUMENTS:
m:  mRNA data vector
lp: effective degradation rate of the protein model (parameter lambda)
t:  row vector containing the timepoints for the mRNA data (for our data t = 0:5:205)

EXAMPLE:
p0 = set_p0(expr1(g('swi4'),:),0.1386,0:5:205);

------------------------------------------------------------------------------------------

pt = protein_grid(tf,rep,Ngrid)

Creates a grid of protein time series with the parameter lambda systematically varied 
between its boundaries log(2)/5 and log(2)/70 (Required for GRF estimation -- see below).

ARGUMENTS:
tf:    gene index of the corresponding transcription factor
rep:   replicate dataset for which the grid is created (1 or 2)
Ngrid: number of lambda values with which the grid is created

RETURNS:
Ngrid x 42 matrix containing protein time series for different lambdas in the rows

EXAMPLE:
pt_swi4 = protein_grid(g('swi4'),1,10000);
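
The boundaries log(2)/5 and log(2)/70 correspond to half-lives of 5 and 70 minutes,
since half-life = log(2)/lambda. The sketch below builds such a lambda grid; the linear
spacing is an assumption, protein_grid may space its values differently.

```matlab
lam_min = log(2)/70;                        % slowest decay: half-life of 70 min
lam_max = log(2)/5;                         % fastest decay: half-life of 5 min (~0.1386)
Ngrid = 10000;
lams = linspace(lam_min, lam_max, Ngrid);   % linear spacing is an assumption
half_life = log(2) ./ lams;                 % convert back to half-lives in minutes
```

This also explains the value 0.1386 used in the examples of this document: it is
log(2)/5 rounded to four digits.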

------------------------------------------------------------------------------------------

[] = plot_data(list)

Plots the timeseries of total expression level and activity of one or multiple genes
from both replicate datasets.

ARGUMENT:
list: index or list of indices of genes in the dataset

EXAMPLE:
plot_data(g('swi4'))

__________________________________________________________________________________________

INFERENCE OF GRFs
__________________________________________________________________________________________

model = LearnGRF (tf,trg,logic,rep,p_t1,p_t2)

Fits a gene regulation function with one or two input TFs to the activity pattern of a 
target gene by simulated annealing

ARGUMENTS:
tf:    index or indices of regulating transcription factor; multiple TFs must be in a vector: [tf1,tf2]
trg:   index of the target gene
logic: number of logic to use in the fit; assign 0 if logic is unknown and to be inferred with the GRF
rep:   number of replicate dataset to be used in the fit (1 or 2)

p_t1, p_t2: protein grids (computed by protein_grid -- see above) for the input TFs
			if left unassigned or empty [], p_t1, p_t2 will be computed within the function

RETURNS:
A model structure containing the indices of input TFs and target gene, the used logic,
the GRF parameters, and the fitting score. The model details in this structure can be 
plotted by plot_fit(model) (see below).

EXAMPLE:
m = LearnGRF([g('fkh1'),g('ndd1')],g('swi5'),0,1,[],protein_grid(g('ndd1'),1,10000));

------------------------------------------------------------------------------------------

model = LearnGRF_no_lambda (tf,lp,trg,logic,rep)

Fits a gene regulation function with one or two input TFs to the activity pattern of a 
target gene by simulated annealing. The 'protein model' parameter lambda must be
pre-determined (see Learn_lambda below).

ARGUMENTS:
tf:    index or indices of regulating transcription factor; multiple TFs must be in a row vector: [tf1,tf2]
lp:	   'protein model' parameter(s) lambda corresponding to the TFs defined by 'tf'
trg:   index of the target gene
logic: number of logic to use in the fit; assign 0 if logic is unknown and to be inferred with the GRF
rep:   number of replicate dataset to be used in the fit (1 or 2)

RETURNS:
A model structure containing the indices of input TFs and target gene, the used logic,
the GRF parameters, and the fitting score. The model details in this structure can be 
plotted by plot_fit(model) (see below).

EXAMPLE:
m = LearnGRF_no_lambda([g('fkh1'),g('ndd1')],[0.045,0.1386],g('swi5'),0,1);

------------------------------------------------------------------------------------------

model = LearnGRF_4d (tf,trg,logic1,logic2,rep,p_t1,p_t2,p_t3,p_t4)

Fits a gene regulation function with four input TFs to the activity pattern of a 
target gene by simulated annealing (this was necessary to fit the expression pattern
of Swi4 by Swi4/Cln2 + Yhp1,Yox1). For this function two separate logics must be assigned,
logic1 connecting the first two TFs and logic2 connecting the second two TFs. For the 
final GRF the two logics are connected additively: tf1 [logic1] tf2 + tf3 [logic2] tf4.

ARGUMENTS:
tf:    vector with 4 indices of regulating transcription factors
trg:   index of the target gene
logic1: number of the first logic to use in the fit
logic2: number of the second logic to use in the fit
rep:   number of replicate dataset to be used in the fit (1 or 2)

p_t1-4: Either full protein grids (as in LearnGRF) or 'protein model' parameters lambda
		for all TFs

RETURNS:
A model structure containing the indices of input TFs and target gene, the used logics,
the GRF parameters, and the fitting score. The model details in this structure can be 
plotted by plot_fit(model) (see below).

EXAMPLE:
m = LearnGRF_4d([g('cln2'),g('swi4'),g('yhp1'),g('yox1')],g('swi4'),1,9,1,0.12,0.1386,0.04,0.0354);

------------------------------------------------------------------------------------------

[] = plotGRF(model)

Creates a comprehensive plot of the model structure 'model' (obtained from one of the
3 fitting functions).


__________________________________________________________________________________________

INFERENCE OF GLOBAL PROTEIN PARAMETERS LAMBDA
__________________________________________________________________________________________

[s,m] = MCMC_flam (tf,pt,trg,rep,N,model)

Uses a GRF to sample the likelihood of its parameters by Markov Chain Monte Carlo. The GRF
is either estimated within the function or provided by the user. The likelihood 
distribution of lambda can be estimated by creating histograms of the samples and combining
them across different target genes (see functions listed below).

ARGUMENTS:
tf:    one or two indices of TFs
pt:    protein grids for the TFs; for two TFs this must be a 2 x N_grid x 42 array
trg:   index of the target gene
rep:   replicate dataset to use (1 or 2)
N:     number of samples to generate from MCMC (1e6 - 1e7 are reasonable values)
model: model structure to use; if model is not assigned, it is estimated within the function

RETURNS:
s: (5 or 8) x N array of sampled model parameters; for one input TF the parameters are, in order:
    lambda, b, a, K, n; for two input TFs they are: lambda1, lambda2, b, a, K1, K2, n1, n2

m: regulation model estimated within the function or provided by the user

EXAMPLE:
[s,m] = MCMC_flam(g('swi5'),protein_grid(g('swi5'),1,10000),g('ash1'),1,1e6);
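
For two input TFs the two protein grids have to be stacked into one 2 x N_grid x 42
array. A sketch with random stand-ins for the grids (in practice pt1 and pt2 would come
from protein_grid):

```matlab
Ngrid = 100;
pt1 = rand(Ngrid, 42);          % stand-in for protein_grid(tf1, rep, Ngrid)
pt2 = rand(Ngrid, 42);          % stand-in for protein_grid(tf2, rep, Ngrid)
pt = zeros(2, Ngrid, 42);
pt(1, :, :) = pt1;              % first TF in the first slice
pt(2, :, :) = pt2;              % second TF in the second slice
```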

------------------------------------------------------------------------------------------

[ns,n,l] = Learn_lambda (tf,trgs,rep,N)

Uses MCMC_flam to estimate single-input GRFs and sample lambda from them to create
likelihood distribution histograms. For multiple target genes a product histogram is created.
The resulting histogram can be plotted by 'bar(l,n)'. See 'Estimate_lambda' below for 
combining multiple histograms. Returned histograms are normalized to sum(histogram) = 1.

ARGUMENTS:
tf:   TF index
trgs: Single or list of target gene indices
rep:  replicate dataset to be used (1 or 2)
N:     number of samples to generate from MCMC (1e6 - 1e7 are reasonable values)

RETURNS:
ns: length(trgs) x 500 array of histograms over likelihood of the parameter lambda
n (optional):  product histogram of all histograms in 'ns'
l (optional):  vector of center points over which lambda histograms are created

EXAMPLE:
[ns,n,l] = Learn_lambda (g('fkh1'),[g('alk1'),g('clb1'),g('clb2'),g('hst3'),g('kip2')],2,1e6);
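
How a normalized 500-bin histogram over bin centers can be obtained from lambda samples
is sketched below. The samples are uniform random numbers standing in for MCMC output,
and the use of hist is an assumption made for illustration; Learn_lambda may bin
differently.

```matlab
lam_lo = log(2)/70;
lam_hi = log(2)/5;
lam_samples = lam_lo + (lam_hi - lam_lo)*rand(1, 1e5);   % stand-in for MCMC samples
[n, l] = hist(lam_samples, 500);    % counts and bin centers for 500 bins
n = n / sum(n);                     % normalize so that sum(n) = 1
% bar(l, n) would plot the histogram
```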

------------------------------------------------------------------------------------------

[ns1,ns2,n1,n2,l] = Learn_lambda_2d (tf,trgs,rep,N)

Uses MCMC_flam to estimate two-input GRFs and sample lambda from them to create
likelihood distribution histograms. For multiple input TFs a product histogram is created.
The resulting histogram can be plotted by 'bar(l,n1)'. See 'Estimate_lambda' below for 
combining multiple histograms. Returned histograms are normalized to sum(histogram) = 1.

ARGUMENTS:
tf:   vector of two TF indices
trgs: Single or list of target gene indices
rep:  replicate dataset to be used (1 or 2)
N:     number of samples to generate from MCMC (1e6 - 1e7 are reasonable values)

RETURNS:
ns1,ns2: length(trgs) x 500 arrays of histograms over likelihood of the parameter lambda
		 for first and second TF
n1,n2 (optional):  product histograms of all histograms in 'ns1' and 'ns2'
l (optional):  vector of center points over which lambda histograms are created

EXAMPLE:
[ns1,n1,ns2,n2,l] = Learn_lambda_2d ([g('fkh1'),g('ndd1')],[g('alk1'),g('clb1'),g('clb2'),g('hst3'),g('kip2')],2,1e6);

------------------------------------------------------------------------------------------

[phist,lam_est] = Estimate_lambda (l,ns,plot_res)

Takes multiple lambda-histograms for one TF but for multiple target genes and computes
the product histogram. The mode of the product histogram is returned as a global
estimate for lambda.

ARGUMENTS:
ns: array where each row is a lambda histogram
l:  vector of center points over which lambda histograms are created
plot_res: boolean flag that, if true, plots the resulting product histogram and marks
		  the estimate for lambda

RETURNS:
phist:   product histogram
lam_est: mode of the product histogram - estimation for lambda of the corresponding TF

EXAMPLE:
% single input target genes
[ns_1,n_1,l] = Learn_lambda (g('ndd1'),[g('alk1'),g('clb1'),g('clb2'),g('hst3')],2,1e6);
% two input target genes
[ns_tmp,n_tmp,ns_2,n_2] = Learn_lambda_2d ([g('fkh1'),g('ndd1')],[g('swi5'),g('kip2')],2,1e6);
%combine
[phist_ndd1,lambda_ndd1] = Estimate_lambda(l,[ns_1;ns_2],true);
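
The product-and-mode step itself is short. The sketch below applies it to two synthetic
bell-shaped histograms; summing logarithms instead of multiplying directly is a choice
made here to avoid numerical underflow with many target genes, and is not necessarily
how Estimate_lambda is implemented.

```matlab
l = linspace(0.01, 0.14, 500);                       % bin centers
ns = [exp(-(l - 0.05).^2 / (2*0.02^2)); ...
      exp(-(l - 0.07).^2 / (2*0.02^2))];             % two synthetic histograms
ns = bsxfun(@rdivide, ns, sum(ns, 2));               % each row sums to 1
log_phist = sum(log(ns + realmin), 1);               % product computed in log space
phist = exp(log_phist - max(log_phist));
phist = phist / sum(phist);                          % normalized product histogram
[~, idx] = max(phist);
lam_est = l(idx);                                    % mode, close to 0.06 here
```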

__________________________________________________________________________________________

CONSTRUCTING THE TRANSCRIPTIONAL CELL CYCLE OSCILLATOR
__________________________________________________________________________________________

REMARK:
For the reconstruction of a transcriptional cell cycle oscillator we first selected a
set of candidate genes that are potentially contained in such a regulatory module.
The indices of the genes in this set are stored in a vector, which here is referred
to as 'tfs'. Furthermore, a square [length(tfs) x length(tfs)] boolean 'interaction matrix'
is required. This matrix, here referred to as 'imat', has an entry 1 in row i, column j
if gene i listed in 'tfs' regulates gene j listed in 'tfs', and 0 otherwise.
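
A toy example of this convention is sketched below. The gene names and the two
interactions are placeholders chosen for illustration only; with the package loaded,
'tfs' would be obtained with geneStd2Num.

```matlab
tf_names = {'swi4', 'swi5', 'fkh1'};      % illustrative candidate set only
imat = false(numel(tf_names));            % square boolean matrix, no interactions yet
imat(3, 2) = true;                        % toy entry: gene 3 regulates gene 2
imat(1, 3) = true;                        % toy entry: gene 1 regulates gene 3
j = 2;
regulators_of_j = find(imat(:, j)).';     % rows with a 1 in column j
% regulators_of_j is 3
```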

------------------------------------------------------------------------------------------

models = construct_node_models (imat, tfs, lambdas, rep)

Estimates for each gene in 'tfs' all possible GRFs with one or two inputs according to 
the interaction matrix imat.

ARGUMENTS:
imat:    'interaction matrix' (see REMARK above)
tfs:     set of candidate genes (see REMARK above)
lambdas: vector of pre-estimated lambdas for each gene in 'tfs' (see Learn_lambda and Estimate_lambda above)
rep:     replicate dataset to be used in GRF estimation (1 or 2)

RETURNS:
An array of model structures with all possible GRFs for each gene in 'tfs'

------------------------------------------------------------------------------------------

[networks, scores, stats] = FindNetworks(tfs,imat,models,N)

Takes the previously fitted GRFs, combinatorially constructs all possible networks from
them and filters out those which are not fully connected. A network score is computed as
the average score of its constituent GRFs. Returns a list of the best scoring networks 
ordered by score.

ARGUMENTS:
tfs:     set of candidate genes (see REMARK above)
imat:    'interaction matrix' (see REMARK above)
models:  all possible GRFs for 'tfs', constructed by 'construct_node_models'
N:       number of best scoring networks to be returned (e.g. ~100)

RETURNS:
networks: an array of N network structures ordered by score; a network structure contains
		  (i) a vector with the indices of the constituent GRFs (index referring to 'models')
		  (ii) network score (average GRF score); (iii) score of the worst GRF in network
		
scores (optional): separate vector with network scores (for statistics)
stats:	vector with two entries: (i) number of fully connected networks 
		(ii) total number of constructed networks
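
The scoring step can be illustrated with toy numbers. The GRF scores and networks below
are invented, and the sketch assumes that a higher score is better; if the fitting score
is an error measure, the sort direction and the min would have to be reversed.

```matlab
grf_scores = [0.9, 0.8, 0.7, 0.6];        % toy scores of four fitted GRFs
nets = [1 3; 2 4; 1 4];                   % each row: GRF indices of one toy network
net_score = mean(grf_scores(nets), 2);    % network score = average GRF score
worst = min(grf_scores(nets), [], 2);     % worst GRF per network (higher = better assumed)
[sorted_score, order] = sort(net_score, 'descend');
% order is [1; 3; 2]
```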

------------------------------------------------------------------------------------------

[] = print_network(net,models)

Prints a summary of a network structure (as returned by 'FindNetworks') in the command
window.

ARGUMENTS:
net: 	network structure
models: array of GRFs with which 'net' was constructed

------------------------------------------------------------------------------------------

networks = combine_replicate_networks(net1,net2)

Takes a set of best scoring network structures constructed on both replicate datasets,
respectively, and finds the best compromise. This is done by (i) determining the set
of networks that occur equivalently in both input sets and (ii) computing an average score
and ordering the output set by it.

ARGUMENTS:
net1, net2: sets of best ranking network structures (as returned by 'FindNetworks'), 
			constructed from both replicate datasets, respectively
			
RETURNS:
networks: array of network structures that occur equivalently in both input sets
		  ordered by their average score in both datasets
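
The two steps can be sketched on toy data as follows. Here a network is reduced to a row
of GRF indices and two networks count as equivalent if their rows are identical; this
notion of equivalence, the scores, and the 'higher is better' ordering are assumptions
made for illustration.

```matlab
nets1 = [1 3; 2 4; 1 4];  score1 = [0.80; 0.70; 0.75];   % toy results, replicate 1
nets2 = [1 4; 5 6; 1 3];  score2 = [0.65; 0.90; 0.50];   % toy results, replicate 2
[common, i1, i2] = intersect(nets1, nets2, 'rows');      % (i) networks found in both sets
avg = (score1(i1) + score2(i2)) / 2;                     % (ii) average score
[avg, order] = sort(avg, 'descend');
common = common(order, :);
% common is [1 4; 1 3] with average scores [0.70; 0.65]
```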
		  
__________________________________________________________________________________________

CONSTRUCTING AND SIMULATING ODE MODELS
__________________________________________________________________________________________

REMARK:
The following functions construct an ODE simulation model from a network structure and
simulate it. All GRFs within the network must have input TFs that are also contained in
the network (closed network). An exception to this is 'Cln2', which the simulation uses
as an external input.

------------------------------------------------------------------------------------------

netw_ode = fit_network(net, models,rep)

Constructs an output structure that can be simulated by 'sim_network' from an input
network structure. The parameters in the network structure are re-fitted by gradient
ascent using the original parameters as start values. This is done to avoid a build-up
of propagated error in the simulation which can disturb the overall dynamical behavior.

ARGUMENTS:
net: 	network structure (as returned by 'FindNetworks') for which an ODE model is to be constructed
models: array of GRFs with which 'net' was constructed
rep:	replicate dataset on which 'models' have been estimated

RETURNS:
A structure which contains all information needed by 'sim_network' to simulate the network


------------------------------------------------------------------------------------------

[t, m, p] = sim_network(netw,t_end,kos)

Simulates an ODE structure constructed by 'fit_network' from t=0 to t=t_end. If GRFs
in the network have the input TF Cln2, data is used for the input until t=205; then Cln2
input is set to 0.

ARGUMENTS:
netw:  network ODE structure as returned by 'fit_network'
t_end: time until the system is to be simulated in minutes

kos (optional): used for in-silico knock-out experiments; vector with indices of genes
				whose expression levels are set to zero. An index in 'kos' corresponds to
				the number of the gene in the network. E.g. kos = [1,3] means that the
				output of the 1st and the 3rd GRF in the network is set to constant 0.
				
RETURNS:
t: vector of timepoints for the simulation results in 'm' and 'p'
m: simulated mRNA expression level; the columns correspond to the individual genes in
   the network, rows correspond to the timepoints returned in 't'
p: simulated 'protein model'; the columns correspond to the individual genes in
   the network
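
The treatment of the external Cln2 input (data until t=205, then 0) can be reproduced
with interp1 and an extrapolation value of 0. This is a sketch on a synthetic trace; how
sim_network implements the input internally is not documented here.

```matlab
t_data = 0:5:205;                              % measurement timepoints in minutes
cln2 = 1 + cos(2*pi*t_data/60);                % synthetic trace, not real data
% Linear interpolation inside the data range, constant 0 outside of it.
cln2_input = @(tq) interp1(t_data, cln2, tq, 'linear', 0);
u_in  = cln2_input(102.5);                     % interpolated between two measurements
u_out = cln2_input(300);                       % beyond t=205, so 0
```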