Key Concepts and Steps in Data Science and Statistics at University of Minnesota, Morris

Explore the key concepts and steps in data science and statistics as presented in the UMM-HHMI Undergraduate Summer Research Program at the University of Minnesota, Morris. Learn about the stages of data science: question formulation, data collection, data manipulation, exploratory and confirmatory data analysis, communicating the findings, and formulating new research questions. Get familiar with statistical techniques, methods, and tools, identify the seven stages of data science, and understand the statistical aspects of research projects. Get ready to formalize your statistical needs and prepare for upcoming research questions.

Presentation Transcript


  1. The Key Concepts and Steps in Data Science. Engin A. Sungur, Statistics Discipline, University of Minnesota, Morris. UMM-HHMI Undergraduate Summer Research Program, June 4, 2014.

  2. OUTLINE. Presentation survey; introductions and background information; steps/stages of data science/statistics (question/problem; data collection; data manipulations; exploratory data analysis; confirmatory data analysis; communicating the findings; formulating new questions/problems); general remarks; test.

  3. LEARNING OBJECTIVES. Presentation survey. For you: identify the seven stages of data science; learn common concepts in each; get familiar with some of the statistical techniques/methods/tools that are available. For me: learn about the statistical aspect of the research projects; formalize the type of statistical need of your projects; get ready for your questions.

  4. STAGES OF DATA SCIENCE. Diagram of the stages: Question/Problem (Inquiry); Data Collection (Collecting Evidence); Data Manipulations (First Encounter); Exploratory Data Analysis (Describing what we have); Confirmatory Data Analysis (Confirmation of what we have found); Communicating the Findings; Formulation of New Questions/Problems. Also shown: Other Fields/Disciplines; Probability Models.

  5. STAGES OF DATA SCIENCE (Contd.). Same diagram as the previous slide, adding at Question/Problem: Hypothesis vs. No Hypothesis; Supervised vs. Unsupervised; Model vs. No Model.

  6. STAGES OF DATA SCIENCE (Contd.). Same diagram as the previous slide, adding at Data Collection: Population vs. Sample; Available vs. Produced; Observational vs. Experimental; Measurable vs. Not Measurable; Sampling Design; Experimental Design.

  7. STAGES OF DATA SCIENCE (Contd.). Same diagram as the previous slide, adding at Data Manipulations: Database Creation; Data Reduction; Data Condensation; Data Reliability.

  8. STAGES OF DATA SCIENCE (Contd.). Same diagram as the previous slide, adding at Exploratory Data Analysis: Graphical; Numerical.

  9. STAGES OF DATA SCIENCE (Contd.). Same diagram as the previous slide, adding at Confirmatory Data Analysis: Model Selection; Model Fitting; Model Checking; Model Revision.

  10. STAGES OF DATA SCIENCE (Contd.). Same diagram as the previous slide, adding at Communicating the Findings: Explanation; Interpretation; Formal vs. Informal; Written vs. Oral.

  11. STAGES OF DATA SCIENCE (Contd.). The complete diagram. Question/Problem (Inquiry): Hypothesis vs. No Hypothesis; Supervised vs. Unsupervised; Model vs. No Model. Data Collection (Collecting Evidence): Population vs. Sample; Available vs. Produced; Observational vs. Experimental; Measurable vs. Not Measurable; Sampling Design; Experimental Design. Data Manipulations (First Encounter): Database Creation; Data Reduction; Data Condensation; Data Reliability. Exploratory Data Analysis (Describing what we have): Graphical; Numerical. Confirmatory Data Analysis (Confirmation of what we have found): Model Selection; Model Fitting; Model Checking; Model Revision. Communicating the Findings: Explanation; Interpretation; Formal vs. Informal; Written vs. Oral. Formulation of New Questions/Problems: Assessment; Evaluation; New Knowledge. Also shown: Other Fields/Disciplines; Probability Models.

  12. QUESTION/PROBLEM. The stage diagram as on slide 5, with the Question/Problem (Inquiry) annotations: Hypothesis vs. No Hypothesis; Supervised vs. Unsupervised; Model vs. No Model.

  13. QUESTION/PROBLEM

  14. QUESTION/PROBLEM

  15. DATA COLLECTION. The stage diagram as on slide 6, with the Data Collection (Collecting Evidence) annotations: Population vs. Sample; Available vs. Produced; Observational vs. Experimental; Measurable vs. Not Measurable; Sampling Design; Experimental Design.

  16. DATA COLLECTION: DATA TYPES. Double multivariate (repeated measures); multivariate; univariate.

  17. DATA COLLECTION: DATA TYPES. Categorical: nominal, ordinal. Numerical: discrete, continuous; interval, ratio.

  18. DATA COLLECTION: DATA TYPES. Response variable: categorical (nominal or ordinal) or numerical (interval or ratio). Explanatory variable: categorical (nominal or ordinal) or numerical (interval or ratio).

  19. DATA COLLECTION: DATA TYPES. Techniques by variable type. Categorical explanatory, categorical response: chi-square analysis through crosstabulation. Categorical explanatory, numerical response: independent/dependent t-test; ANOVA. Numerical explanatory, categorical response: logistic regression. Numerical explanatory, numerical response: regression; correlation. Also listed: log-linear models.
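
As a rough illustration of the chi-square analysis through crosstabulation named on the slide above, the sketch below computes the Pearson chi-square statistic for a two-way table of counts. The function name `chi_square_statistic` and the counts are hypothetical, chosen only for illustration; they are not from the presentation.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows = explanatory categories, columns = response categories.
counts = [[10, 20], [20, 10]]
chi2 = chi_square_statistic(counts)
df = (len(counts) - 1) * (len(counts[0]) - 1)
print(f"chi-square = {chi2:.3f}, df = {df}")
```

The statistic would then be compared with a chi-square distribution on `df` degrees of freedom; a statistics package would normally do that lookup.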

  20. DATA COLLECTION: TYPES OF RELATIONSHIPS. Association/correlation does not imply causation. Dependence does not imply causation (but it sure is a hint; Lynd & Stevenson (2007), Tufte (2006), von Eye & DeShon (2011)). Diagram: three possible explanations for an association between X and Y, labeled CAUSAL (X and Y), COMMON RESPONSE (X, Y, and a third variable Z), and CONFOUNDING (X, Y, and a third variable Z); association/correlation versus cause-and-effect relationship.
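
The common-response case can be sketched with a small simulation: X and Y are each generated from a third variable Z plus independent noise, so they are strongly correlated although neither causes the other. The helper `pearson_r`, the noise levels, and the sample size are assumptions made for this sketch only.

```python
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(2000)]   # lurking variable Z
x = [zi + rng.gauss(0, 0.5) for zi in z]     # X responds to Z only
y = [zi + rng.gauss(0, 0.5) for zi in z]     # Y responds to Z only
r_xy = pearson_r(x, y)
print(f"correlation between X and Y: {r_xy:.2f}")
```

In this setup X never enters the equation for Y, yet the correlation is high, which is why an observed association alone cannot settle the causal question.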

  21. DATA COLLECTION: DESIGN OF EXPERIMENTS. Causal relationships can only be established through experiments. Principles of design of experiments: control; randomize; replicate.
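
A minimal sketch of the randomize and replicate principles, assuming a simple completely randomized design: experimental units are shuffled and dealt out to treatments so that each treatment receives several units. The function `randomize`, the unit labels, and the treatment names are hypothetical.

```python
import random


def randomize(units, treatments, seed=None):
    """Randomly assign units to treatments in (nearly) equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for k, unit in enumerate(shuffled):
        # Deal the shuffled units out to the treatments in turn.
        groups[treatments[k % len(treatments)]].append(unit)
    return groups


plots = [f"plot{i}" for i in range(1, 13)]
assignment = randomize(plots, ["control", "treatment"], seed=1)
for name, members in assignment.items():
    print(name, members)
```

Including a control group gives a baseline for comparison, and six units per group provide replication.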

  22. DATA MANIPULATIONS. The stage diagram as on slide 7, with the Data Manipulations (First Encounter) annotations: Database Creation; Data Reduction; Data Condensation; Data Reliability.

  23. DATA MANIPULATIONS: DATA RELIABILITY. Data reliability is a state that exists when data is sufficiently complete and error-free to be convincing for its purpose and context. COMPLETE: includes all of the data elements (variables/fields) needed for the analysis. ACCURATE: (1) CONSISTENT: the data was obtained and used in a manner that is clear and well-defined enough to yield similar results in similar analyses; (2) CORRECT: the data set reflects the data entered at the source and/or properly represents the intended results. UNALTERED: the data reflects the source and has not been tampered with.

  24. DATA MANIPULATIONS: DATABASE. A database is an organized collection of data. Properties: easy to use (data entry and data manipulations); dynamic; interactive; open to collaboration; integrated. Options: piece of paper; word processor (Microsoft Word); Microsoft Excel; Microsoft Access; statistical software package (R, StatCrunch, SPSS, SAS, etc.); any program that uses SQL (Structured Query Language); Google Docs; Google Fusion Tables. (UMM Data Services Center: http://mnstats.morris.umn.edu/UMMDataServicesCenter.html )
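
To make the SQL option concrete, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table name `measurements`, its columns, and the rows are invented for illustration and are not part of the presentation.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (subject_id INTEGER, grp TEXT, response REAL)"
)

# Hypothetical data entry: one row per subject.
rows = [
    (1, "control", 4.1),
    (2, "control", 3.9),
    (3, "treatment", 5.2),
    (4, "treatment", 5.6),
]
conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)", rows)

# Data manipulation: group size and mean response per group.
summary = conn.execute(
    "SELECT grp, COUNT(*), AVG(response) FROM measurements "
    "GROUP BY grp ORDER BY grp"
).fetchall()
for grp, n, mean in summary:
    print(f"{grp}: n = {n}, mean = {mean:.2f}")
conn.close()
```

Keeping one row per observation and one column per variable makes later reduction and summary steps a matter of writing a query.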

  25. EXPLORATORY DATA ANALYSIS. The stage diagram as on slide 8, with the Exploratory Data Analysis (Describing what we have) annotations: Graphical; Numerical.

  26. EXPLORATORY DATA ANALYSIS. Dynamic, interactive, database-integrated graphical displays. Correct selection of numerical and graphical summary techniques and methods.
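
The numerical side of exploratory analysis can be sketched with the standard-library `statistics` module. The function `five_number_summary` and the sample values are hypothetical; graphical summaries are left out of this sketch.

```python
import statistics


def five_number_summary(values):
    """Minimum, quartiles, and maximum of a numerical sample."""
    data = sorted(values)
    # method="inclusive" treats the data as spanning the full range of interest.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1]}


sample = [7, 1, 4, 9, 2, 5, 8, 3, 6]   # hypothetical measurements
summary = five_number_summary(sample)
print(summary)
print("mean:", statistics.mean(sample), "stdev:", round(statistics.stdev(sample), 3))
```

For a skewed sample the five-number summary is usually a safer first description than the mean and standard deviation.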

  27. CONFIRMATORY DATA ANALYSIS. The stage diagram as on slide 9, with the Confirmatory Data Analysis (Confirmation of what we have found) annotations: Model Selection; Model Fitting; Model Checking; Model Revision.

  28. CONFIRMATORY DATA ANALYSIS. Data: $Y_1, Y_2, \ldots, Y_n$. Model: $Y_1, Y_2, \ldots, Y_n$ are i.i.d. from $N(\mu, \sigma^2)$, where i.i.d. (independent and identically distributed) means a simple random sample from the same population, and $N(\mu, \sigma^2)$ is the normal distribution. Linear model: $Y_1, Y_2, \ldots, Y_n$ are i.i.d. from $N\left(\beta_0 + \sum_{i=1}^{p} \beta_i X_i, \sigma^2\right)$, with constant variance $\sigma^2$ and linear mean $\beta_0 + \sum_{i=1}^{p} \beta_i X_i$.
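
A minimal sketch of the linear model with one explanatory variable (p = 1), assuming normal errors with constant variance: data are simulated from known coefficients and then fitted by least squares. The function `fit_simple_linear`, the true coefficients, and the sample size are assumptions made for illustration.

```python
import random


def fit_simple_linear(xs, ys):
    """Least-squares estimates (b0, b1) and residual variance for y = b0 + b1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    s2 = sum(e * e for e in residuals) / (n - 2)   # estimate of sigma^2
    return b0, b1, s2


rng = random.Random(42)
beta0, beta1, sigma = 2.0, 0.5, 1.0                # hypothetical true values
xs = [i / 10 for i in range(200)]
# Each response has mean beta0 + beta1*x and the same variance sigma^2.
ys = [beta0 + beta1 * x + rng.gauss(0, sigma) for x in xs]
b0_hat, b1_hat, s2_hat = fit_simple_linear(xs, ys)
print(f"b0 = {b0_hat:.2f}, b1 = {b1_hat:.2f}, s^2 = {s2_hat:.2f}")
```

Plotting the residuals against `xs` would be a natural model-checking step before trusting the fit.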

  29. CONFIRMATORY DATA ANALYSIS: TRANSFORMATIONS. What to do when the model assumptions are violated? Transformations, in sequence: original data; power transformation; ranking; categorizing. Moving along the sequence toward nonparametric/distribution-free statistics means fewer assumptions but a loss of information.
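
The sequence of transformations can be sketched on a small skewed sample. The helpers `power_transform`, `ranks`, and `categorize`, the cutoff, and the data are hypothetical; treating a power of 0 as the natural log is a common convention assumed here.

```python
import math


def power_transform(values, power):
    """Power transformation; power == 0 is taken to mean the natural log."""
    if power == 0:
        return [math.log(v) for v in values]
    return [v ** power for v in values]


def ranks(values):
    """Ranks starting at 1, with ties given the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def categorize(values, cutoff):
    """Reduce each value to a two-level category."""
    return ["low" if v < cutoff else "high" for v in values]


skewed = [1.0, 2.0, 2.0, 10.0, 100.0]        # hypothetical skewed sample
print(power_transform(skewed, 0))            # keeps order and relative spacing
print(ranks(skewed))                         # keeps order only
print(categorize(skewed, 5.0))               # keeps only which side of the cutoff
```

Each step keeps less of the original information, which is the trade-off against needing fewer distributional assumptions.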

  30. CONFIRMATORY DATA ANALYSIS: NONPARAMETRIC STATISTICS. Rank-based methods. Permutation tests: R.A. Fisher (1935). Bootstrap methods: take a sample of the same size from the sample, with replacement. Curve smoothing: no linear or nonlinear model.
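
Two of the methods above can be sketched in a few lines: a bootstrap that resamples the sample with replacement at its own size, and a two-sample permutation test on the difference in means. The function names, the two groups of values, and the number of resamples are assumptions for illustration only.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples, rng):
    """Means of resamples drawn with replacement, each the size of the sample."""
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)]


def permutation_p_value(group_a, group_b, n_permutations, rng):
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)   # relabel the observations at random
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            count += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (count + 1) / (n_permutations + 1)


rng = random.Random(2014)
a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]   # hypothetical group A
b = [4.2, 4.4, 3.9, 4.8, 4.1, 4.5]   # hypothetical group B

n_boot = 2000
boot = sorted(bootstrap_means(a, n_boot, rng))
lower = boot[n_boot * 25 // 1000]            # 2.5th percentile
upper = boot[n_boot * 975 // 1000 - 1]       # 97.5th percentile
p_value = permutation_p_value(a, b, 2000, rng)
print(f"bootstrap 95% percentile interval for mean of A: ({lower:.2f}, {upper:.2f})")
print(f"permutation p-value: {p_value:.4f}")
```

Neither procedure assumes a normal model; both rely only on resampling or relabeling the observed data.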

  31. CONFIRMATORY DATA ANALYSIS. Statistics and the probability it rests on: Bayesian statistics, subjective probability; classical statistics, frequentist probability.

  32. CONFIRMATORY DATA ANALYSIS. How would you describe in plain English the characteristics that distinguish Bayesian from Frequentist reasoning? (http://stats.stackexchange.com) Here is how I would explain the basic difference to my grandma: I have misplaced my phone somewhere in the home. I can use the phone locator on the base of the instrument to locate the phone and when I press the phone locator the phone starts beeping. Problem: Which area of my home should I search? Frequentist Reasoning: I can hear the phone beeping. I also have a mental model which helps me identify the area from which the sound is coming from. Therefore, upon hearing the beep, I infer the area of my home I must search to locate the phone. Bayesian Reasoning: I can hear the phone beeping. Now, apart from a mental model which helps me identify the area from which the sound is coming from, I also know the locations where I have misplaced the phone in the past. So, I combine my inferences using the beeps and my prior information about the locations I have misplaced the phone in the past to identify an area I must search to locate the phone.
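
The phone example can be turned into a small calculation with Bayes' rule: a prior over rooms (where the phone was misplaced in the past) is combined with the likelihood of the beep sounding as it does from each room. All rooms and probabilities below are invented for illustration, and `posterior` is a hypothetical helper.

```python
def posterior(prior, likelihood):
    """Posterior probabilities proportional to prior times likelihood."""
    unnormalized = {room: prior[room] * likelihood[room] for room in prior}
    total = sum(unnormalized.values())
    return {room: weight / total for room, weight in unnormalized.items()}


# Hypothetical prior: how often the phone was misplaced in each room before.
prior = {"kitchen": 0.6, "bedroom": 0.2, "living room": 0.2}
# Hypothetical likelihood: how well the beep heard now matches each room.
likelihood = {"kitchen": 0.3, "bedroom": 0.4, "living room": 0.3}

post = posterior(prior, likelihood)
best_by_sound = max(likelihood, key=likelihood.get)   # uses the beep only
best_by_posterior = max(post, key=post.get)           # beep plus past experience
print("sound alone points to:", best_by_sound)
print("sound plus prior points to:", best_by_posterior)
print({room: round(p, 3) for room, p in post.items()})
```

With these made-up numbers the beep alone favors the bedroom, while the beep combined with past experience favors the kitchen.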

  33. CONFIRMATORY DATA ANALYSIS. Tongue firmly in cheek: A Bayesian defines a "probability" in exactly the same way that most non-statisticians do - namely an indication of the plausibility of a proposition or a situation. If you ask him a question, he will give you a direct answer assigning probabilities describing the plausibilities of the possible outcomes for the particular situation (and state his prior assumptions). A Frequentist is someone that believes probabilities represent long run frequencies with which events occur; if needs be, he will invent a fictitious population from which your particular situation could be considered a random sample so that he can meaningfully talk about long run frequencies. If you ask him a question about a particular situation, he will not give a direct answer, but instead make a statement about this (possibly imaginary) population. Many non-frequentist statisticians will be easily confused by the answer and interpret it as Bayesian probability about the particular situation. P-VALUE? See https://www.youtube.com/watch?feature=endscreen&NR=1&v=ax0tDcFkPic and https://www.youtube.com/watch?v=ez4DgdurRPg

  34. CONFIRMATORY DATA ANALYSIS. Very crudely I would say that: Frequentist: Sampling is infinite and decision rules can be sharp. Data are a repeatable random sample - there is a frequency. Underlying parameters are fixed i.e. they remain constant during this repeatable sampling process. Bayesian: Unknown quantities are treated probabilistically and the state of the world can always be updated. Data are observed from the realised sample. Parameters are unknown and described probabilistically. It is the data which are fixed.

  35. CONFIRMATORY DATA ANALYSIS: SOME TECHNIQUES. Multivariate techniques: http://mnstats.morris.umn.edu/multivariatestatistics/overview.html Nonparametric/distribution-free techniques: http://mnstats.morris.umn.edu/introstat/nonparametric/learningtools.html

  36. COMMUNICATING THE FINDINGS. The stage diagram as on slide 10, with the Communicating the Findings annotations: Explanation; Interpretation; Formal vs. Informal; Written vs. Oral.

  37. FORMULATING NEW QUESTIONS/PROBLEMS. The complete stage diagram as on slide 11, with the Formulation of New Questions/Problems annotations: Assessment; Evaluation; New Knowledge.

  38. CONCLUDING REMARKS. See me for help. Questions? If not, I have some for you. Please take the test before you leave.
