
Advanced Data Analysis Techniques for Statistical Inference
Explore the importance of computers in generating random numbers for statistical inference methods, testing new statistical approaches, and simulating complex models in data analysis. Understand the significance of pseudo-random number generation and linear congruential generators in producing apparently random sequences for various applications.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Advanced Topics in Data Analysis 2019 Shai Carmi
Outline Introduction Basic method Discrete random variables Continuous random variables
Why do we need the computer to generate random numbers? Statistical inference: We are often interested in hypothesis testing or estimation for problems without known recipe Using randomness allows us to perform statistical inference even under those conditions more later
Testing methods Create realistic new datasets for testing new statistical methods Example: o A method estimates gene expression from sequencing reads o We would like to test the method, but for any real data we don t know the true levels o We generate many synthetic datasets, for which we know the true levels, by randomly drawing an expression levels, reads, and errors
Simulations to test complex models Predict the behavior of complex stochastic models Example: o An RNA molecule is traveling in the nucleus o How long will it take it to exist? o Sometimes the theory is too complicated o Instead, we simulate a random walk Nucleus Pore RNA
Generating random numbers Traditional methods: o Flip a coin o Roll a die o Pick a card from a deck o Pick a ball from an urn o Roulette But computers are deterministic For a given input, the output is always the same (i.e., non-random!)
Pseudo-random numbers We can generate a sequence of apparently random numbers Different sequences can be generated with different seeds
Outline Introduction Basic method Discrete random variables Continuous random variables
Linear Congruential Generator Provide a seed: ?1 Generate a sequence of numbers: ?2,?3,?4, ??= ??? 1+ ? mod ? mod( modulo ) is the remainder after division by ?
Linear Congruential Generator ??= ??? 1+ ? mod ? 2?? 1 + 3 5 13 9 21 5 13 9 21 5 ? ?? 1 2?? 1 ??= mod 10 2?? 1+ 3 2 3 4 5 6 7 8 9 1 5 3 9 1 5 3 9 1 2 ?2= 5 ?3= 3 ?4= 9 ?5= 1 ?6= 5 ?7= 3 ?8= 9 ?9= 1 ?10= 5 Example: o ? = 2,? = 3,? = 10,?1= 1 o The sequence: 1,5,3,9,1,5,3,9,1,5, o Highly repetitive! 10 6 18 2 10 6 18 2 The method generates integers in the range [0,? 1] We can t draw more than ? numbers without repeating ourselves 10
Large ? To avoid repetitions, we can use very large ? and ? ? = 1, ? = 1234567, ? = 7654321, ?1= 1 The obtained sequence is 1, 1234568, 5551574, 3874565, 2920147, 819239, 1729179, 4857915, 797392, 3971134, 116874, 4832709, 4770776, 3345234, 3645166, 4361914, 10825, 7397631, 3194813, 324240, 5633065, 1578738, 7462933, 7580275, 807622, 3961894, 5310405, 2364000, 333911, 4389762, 4677030, 960772, 860602, 5148609, 1522484, 5785348, 7367118, 7419904, 6384572, 7072797, 6698088, 7530362, 4803322, 6031887, 2206166, 7390051, 6411536, 7295356, 3534104, 2733833, 3703572, 6778296, 5729521, 4701493, 5876948, 1111543, 7638002, 6928321, 2384138, 809549, 2478672, 3934040, 891119, 5861786, 5075855, 5291901, 6084417, x = numeric(100) x[1] = 1 for (i in 2:100) { } Code in R x[i] = (1234567 * x[i-1] + 1) %% 7654321 cat(sprintf("%d, ",x[i]))
How to generate random numbers in [0,1]? We can simply divide the resulting numbers by m Example: ? = 0, ? = 12345, ? = 54321, ?1= 1 n=200000 x = numeric(n) y = 1 x[1]=y/54321 for (i in 2:n) { y = (12345*y)%%54321 x[i] = y/54321 } hist(x,breaks=25)
Practice Practically used generator: Mersenne-Twister The seed can be either fixed for reproducibility Or equal the time to generate a new sequence in each run All common software packages have a method for a random number ? uniformly distributed between [0,1] o This means that ? will take any value between [0,1] with equal probability o In R: runif Using draws of ?, we can generate numbers with any distribution
Real story Mersenne-Twister in the US lottery draw https://www.nytimes.com/interactive/2018/05/03/magazine/money-issue-iowa- lottery-fraud-mystery.html
Are those numbers random? Subsequent numbers should be independent of each other: o Prob ???1,?2, ,?? 1 = Prob ?? o Prob ????+1,??+2, = Prob ?? The probability of each number (or a combination of numbers) should be the same as any other number (or combination) set.seed(1) n = 10000 x = runif(n) y = runif(n) plot(x,y,pch=". )
Is it possible to generate true random numbers? Physical phenomena can be exploited to create truly random numbers o Thermal noise (random voltage on a resistor) o Nuclear decay (number of emitted particles) o Human behavior (moving a mouse) Such methods are typically not used in bioinformatics
Outline Introduction Basic method Discrete random variables Continuous random variables
Generating integers using ? In practice, we often only have a method to generate ? What if we need random integer ? between 1 and ?? Draw a random number ? uniformly distributed in (0,1] Calculate ? = ?? The desired number ? is the ceiling of ? Ceiling of ?: The smallest integer that s larger than or equal to ?(rounding upwards) ceil[1.1]=2, ceil[4]=4, ceil[0.8]=1, ceil[5.23]=6
Generating integers within a range Draw a random number ?, uniformly distributed in (0,1] The desired number ? is the ceiling of ? = ?? Example: ? = 10 o ? = 0.23 ? = 2.3 ? = 3 o ? = 0.78 ? = 7.8 ? = 8 o ? = 0.03 ? = 0.3 ? = 1 o ? = 0.91 ? = 9.1 ? = 10 o ? = 0.6 ? = 6 ? = 6
Random variables: reminder A random variable ? can take different numbers with certain probabilities based on the results of an experiment Example: roll a die. ? can be: o The number drawn o 1 if the number is even, 0 if odd o The square of the number drawn o The sum of the numbers drawn in two rolls
Random variables: reminder Discrete random variables can take only specific values: ?1,?2, ,?? The distribution of ? is the list of possible values (?1,?2, ,??)along with their probabilities (?1,?2, ,??) ? ? = ?? = ??, ? = 1,2, ,? ?=1 ??= 1 ? To draw discrete random variables with an arbitrary distribution, we use the inverse transformation method
The inverse transformation method: example Suppose we are given the distribution o ?(? = 1) = 0.4 o ?(? = 3) = 0.5 o ?(? = 6) = 0.1 To draw numbers: First, draw a number ? uniformly distributed between [0,1] Then, set: 1, 3, 6, if 0 < ? 0.4 if 0.4 < ? 0.9 if 0.9 < ? 1 ? =
The inverse transformation method: why? Suppose ? is a random variable with equal probability to take any value in [0,1] The probability of ? to take value in an interval [?,?] is equal to the length of the interval Possible values of ? 0 1 ? ? Prob ? < ? < ? = (? ?)
The inverse transformation method: why? Back to the example: o ?(? = 1) = 0.4 o ?(? = 3) = 0.5 o ?(? = 6) = 0.1 Why does ? have the desired distribution? Possible values of ? 1 3 6 0 0.9 1 0.4 0.4 0.5 0.1 Probabilities Values of ?
The inverse transformation method: formal definition Denote the values as ?1, ,?? and the probabilities as ?1, ,?? To generate a sample from ?: Draw a number ? uniformly distributed between [0,1] ? 1??< ? ?=1 ? Set ? = ?? if ?=1 ?? 0 (By convention, for ? = 1, we define ?=1 ??= 0)
The inverse transformation method: why? Draw a number ? uniformly distributed between [0,1] ? 1??< ? ?=1 ? Set ? = ?? if ?=1 ?? Possible values of U Event Probability ? ?1 ?2 ?3 0 < ? ?1 ?1< ? ?1+ ?2 ?1+ ?2< ? ?1+ ?2+ ?3 ?1 ?2 ?3 ?? ?? ?? ?? 0 1 ?1 ?1+ ?2 ?1+ ?2+ ?3 ?1 ?2 ?3 ?4 ?1+ + ?? 1< ? ?1+ + ?? 1+ ??= 1 ?? ??
Example ?? 1 3 6 ?? 0.4 0.5 0.1 ? To generate a sample from ?: 1 2 3 o Draw a number ? uniformly distributed between [0,1] o Set ? = ?? if ?=1 ? 1??< ? ?=1 ? ?? ? 1 ? Value of ?? Condition on ? ? ?? ?? ?=1 ?=1 1 2 3 1 3 6 0 ?1= 0.4 ?1+ ?2= 0.9 ?1+ ?2+ ?3= 1 0 ? 0.4 ?1= 0.4 ?1+ ?2= 0.9 0.4 < ? 0.9 0.9 < ? 1
Example: tossing coins Suppose we would like to draw a random number ?, equal to the number of heads out of three coin tosses ? is binomial with ? = 3 and ? = 0.5 Event TTT TTH THT THH HTT HTH HHT HHH Probability 1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8 ? 0 1 1 2 1 2 2 3 ?(? = 0) = 1/8 ?(? = 1) = 3/8 ?(? = 2) = 3/8 ?(? = 3) = 1/8
The inverse transformation method To generate a sample from ?: o Draw a number ? uniformly distributed between [0,1] o Set ? = ?? if ?=1 ? 1??< ? ?=1 ? ?? Possible values of U ?? 0 1 2 3 ?? 1/8 3/8 3/8 1/8 ? 0 1 2 3 1 2 3 4 0 1/8 1/2 7/8 1 1/8 3/8 3/8 1/8
The inverse transformation method ?? 0 1 2 3 ?? 1/8 3/8 3/8 1/8 ? To generate a sample from ?: 1 2 3 4 o Draw a number ? uniformly distributed between [0,1] o Set ? = ?? if ?=1 ? 1??< ? ?=1 ? ?? ? 1 ? Value of ?? Condition on ? Probability ? ?? ?? ?=1 ?=1 1 2 3 4 0 1 2 3 1/8 3/8 3/8 1/8 0 ? 0 ?1= 1/8 ?1+ ?2= 1/2 ?1+ ?2+ ?3= 7/8 ?1+ ?2+ ?3+ ?4 = 1 1 8 1 8 < ? ?1= 1/8 ?1+ ?2= 1/2 ?1+ ?2+ ?3= 7/8 1 2 1 2 < ? 7 8 7 8 < ? 1
Trick For some special cases, we can draw ? in a simpler way Consider the example of three coin tosses To toss a coin, draw a uniformly distributed random number ? If ? < 0.5, set the coin as T , otherwise set the coin as H Repeat three times Set ?as the number of coins that came out as H Limited to specific cases where the problem has some structure, symmetry, etc.
Continuous random variables Can take any value within a range Examples: o Standard uniform variables can have any value between [0,1] o Normal variables can take any real value Continuous random variables have probability density function (PDF) ?(?) ?? ? ?? Prob ? < ? < ? = ? ? ? ?? = 1
Examples Normal distribution with mean ? and variance ?2 Standard uniform variable ?(?) 1 ? ? ? 0 1
The cumulative distribution function Standard uniform variable ? ? = ? ? ? ?(?) ?(?) ? ?(?) = ? = 0 ? = 1 ? ? ?? 1 1 ? ? 0 0 1 1 Normal distribution with mean ? and variance ?2 ?(?) ?(?)
The inverse transformation method for continuous variables To draw a continuous random variable ? with density ?(?): Draw a standard uniform random number ? Set ? = ? to satisfy ?(?) = ? ?(?) Intuition: o The random number ? is a position on the ?-axis o We find ? for which ?(?) = ? o The larger ?(?) in a region, the more values of ? are covered by ?(?) ?(?) ? ?
Example: non-standard uniform variable Consider a variable ? that is uniform in the range [0,4] f(x) ?(?) = 1/4 for ? [0,4] 1/4 x ? ??? =? ? ? ?? =1 ? ? = 4 0 0 4 4 F(x) To draw a variable ?, we draw a standard uniform ? ?(?) = ? ?/4 = ? ? = 4? An obvious solution, after all 1 x 0 4
Drawing normal variables The Box-Muller method: o Draw two standard uniform variables, ?1 and ?2 o Set ? = 2ln?1 and ? = 2??2 o Set ?1= ?cos? and ?2= ?sin? o ?1 and ?2 are standard normal variables (mean 0, variance 1) How to generate a normal variable ? with mean ? and variance ?2? o Generate standard normal variable ? o Set ? = ?? + ?
Example: normal variable Generate random normal variables with ? = 7,? = 5 set.seed(1) n = 10000 mu = 7 sigma = 5 Z = numeric(2*n) for (i in 1:n) { U1 = runif(1) U2 = runif(1) R = sqrt(-2*log(U1)) theta = 2*pi*U2 Z[2*i-1] = R*cos(theta) Z[2*i] = R*sin(theta) } X = Z*sigma+mu h = hist(X,breaks=50,prob=T) curve(dnorm(x,mu,sigma),add=T)
An exponential random variable An exponential variable has PDF ? ? = ?? ??,0 ? < It represents the time to an event that happens at rate ? per unit time, independently of past events Examples: o The number of generations until a mutation happens o The time until a protein binds to DNA o The time until a phone call in a call center o The time until a lightbulb stops working o The time to cancer, heart attack, injury At least as a null model ?(?)
Properties of exponential random variables ?? ? ?? = 1/? The mean of ?( mean waiting time ) is 0 ? is memoryless: ? ? > ? + ? ? > ? = ?(? > ?) o Interpretation: the waiting time is the same regardless of whether we start counting from time zero or from any other time
How to draw an exponential random variable ?? ? ?? = 1 ? ?? ? ? = 0 o ? ? = ? o 1 ? ??= ? o ? = 1 ?ln(1 ?) Can also use ? = 1 ?ln? set.seed(1) U = runif(10000) X = -log(U) h = hist(X,breaks=50,prob=T) curve(dexp,xlim=c(0,5),add=T) Example: ? = 1
Drawing from complex/non-standard distributions Rejection sampling Markov chain Monte Carlo (MCMC): Metropolis-Hastings algorithm o For drawing points in high dimension Basic idea: o We need to draw a number from ?(?), but there is no direct method o We draw instead numbers from ?(?), from which sampling is easy ?(?) can be uniform, normal, etc. o We then accept or reject the new sample ? depending on the relationship between ?(?) and ?(?) o MCMC generates a sequenceof correlated samples, which we need to thin to obtain independent draws