Data Mining and Warehousing Teaching Scheme and Course Objectives

Explore the teaching scheme and course objectives of Data Mining and Warehousing, covering fundamentals, methods, techniques, and outcomes. Understand the processes of data mining, data warehousing, and analysis techniques. The course is designed to apply basic to advanced mining techniques and to optimize the mining process.

Uploaded by holdridge_j on Apr 04, 2025

Presentation Transcript

210242: Data Mining & Warehousing. Teaching Scheme: TH: 03 Hours/Week. Credit: 3. Examination Scheme: Mid_Semester (TH): 30 Marks; End_Semester (TH): 70 Marks.

Course Instructors: Prof. Deepali Prashant Pawar, M.E. (CSE), Assistant Professor, Department of Computer Engineering, SNJB's Late Sau. Kantabai Bhavarlalji Jain College of Engineering. My Blog: https://deepalipawar.wordpress.com/ 9823076585

Prerequisite Courses & Course Objectives. Prerequisite Courses: DBMS. Course Objectives: To understand the fundamentals of Data Mining. To identify the appropriateness and need of mining the data. To learn the preprocessing, mining and post-processing of the data. To understand various methods, techniques and algorithms in data mining.

Course Outcomes: Apply basic, intermediate and advanced techniques to mine the data. Design data warehouses. Explore the hidden patterns in the data using similarity and dissimilarity measures. Optimize the mining process by choosing the best data mining technique using association rules. Analyze the output generated by the classification process of data mining. Apply the acquired knowledge to understand reinforcement learning, holistic learning, and multiclass classification for data analysis.

Google Classroom Code: nxahqcj

Course Outcomes: On completion of the course the student should be able to: Apply basic, intermediate and advanced techniques to mine the data. Analyze the output generated by the process of data mining. Explore the hidden patterns in the data. Optimize the mining process by choosing the best data mining technique.

Syllabus. Unit I: Introduction (8 Hrs.). Data Mining, Data Mining Task Primitives; Data: Data, Information and Knowledge; Attribute Types: Nominal, Binary, Ordinal and Numeric attributes, Discrete versus Continuous Attributes; Introduction to Data Preprocessing; Data Cleaning: Missing values, Noisy data; Data Integration: Correlation analysis; Data Transformation: Min-max normalization, z-score normalization and decimal scaling; Data Reduction: Data Cube Aggregation, Attribute Subset Selection, Sampling; Data Discretization: Binning, Histogram Analysis.

SYLLABUS. Unit II: Data Warehouse (8 Hrs.). Data Warehouse; Operational Database Systems and Data Warehouses (OLTP vs. OLAP); A Multidimensional Data Model: Data Cubes, Stars, Snowflakes, and Fact Constellations Schemas; OLAP Operations in the Multidimensional Data Model; Concept Hierarchies; Data Warehouse Architecture; The Process of Data Warehouse Design; A three-tier data warehousing architecture; Types of OLAP Servers: ROLAP versus MOLAP versus HOLAP.

SYLLABUS. Unit III: Measuring Data Similarity and Dissimilarity (8 Hrs.). Measuring Data Similarity and Dissimilarity; Proximity Measures for Nominal Attributes, Binary Attributes, and interval-scaled variables; Dissimilarity of Numeric Data: Minkowski distance, Euclidean distance and Manhattan distance; Proximity Measures for Categorical Attributes, Ordinal Attributes, and Ratio-scaled variables; Dissimilarity for Attributes of Mixed Types; Cosine Similarity.

SYLLABUS. Unit IV: Association Rules Mining (8 Hrs.). Market Basket Analysis, Frequent Item Set, Closed Item Set, Association Rules; Apriori Algorithm; Generating Association Rules from Frequent Item Sets; Improving the Efficiency of Apriori; Mining Frequent Item Sets without Candidate Generation: FP-Growth Algorithm; Mining Various Kinds of Association Rules: Mining multilevel association rules, constraint-based association rule mining, Meta-rule-Guided Mining of Association Rules.

SYLLABUS. Unit V: Classification (8 Hrs.). Introduction to Classification and Regression for Predictive Analysis; Decision Tree Induction; Rule-Based Classification: Using IF-THEN Rules for Classification, Rule Induction Using a Sequential Covering Algorithm; Bayesian Belief Networks, Training Bayesian Belief Networks; Classification Using Frequent Patterns, Associative Classification; Lazy Learners: k-Nearest-Neighbor Classifiers, Case-Based Reasoning.

SYLLABUS. Unit VI: Multiclass Classification (8 Hrs.). Multiclass Classification, Semi-Supervised Classification, Reinforcement learning, Systematic Learning, Wholistic learning and multi-perspective learning. Metrics for Evaluating Classifier Performance: Accuracy, Error Rate, Precision, Recall, Sensitivity, Specificity; Evaluating the Accuracy of a Classifier: Holdout Method, Random Subsampling and Cross-Validation.

TEXT BOOKS AND REFERENCE BOOKS. Text Books: 1. Jiawei Han, Micheline Kamber, and Jian Pei, "Data Mining: Concepts and Techniques", Elsevier Publishers, ISBN: 9780123814791, 9780123814807. 2. Parag Kulkarni, "Reinforcement and Systemic Machine Learning for Decision Making", Wiley-IEEE Press, ISBN: 978-0-470-91999-6. Reference Books: 1. Matthew A. Russell, "Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More", Shroff Publishers, 2nd Edition, ISBN: 9780596006068. 2. Maksim Tsvetovat, Alexander Kouznetsov, "Social Network Analysis for Startups: Finding connections on the social web", Shroff Publishers, ISBN-10: 1449306462.

INTRODUCTION. Motivation: Why data mining? What is data mining? Data Mining: On what kind of data? Data mining functionality. Are all the patterns interesting? Classification of data mining systems. Data Mining Task Primitives. Integration of a data mining system with a DB and DW system. Major issues in data mining.

UNIT I: INTRODUCTION. Data Mining, Data Mining Task Primitives; Data: Data, Information and Knowledge; Attribute Types: Nominal, Binary, Ordinal and Numeric attributes, Discrete versus Continuous Attributes; Introduction to Data Preprocessing; Data Cleaning: Missing values, Noisy data; Data Integration: Correlation analysis; Data Transformation: Min-max normalization, z-score normalization and decimal scaling; Data Reduction: Data Cube Aggregation, Attribute Subset Selection, Sampling; Data Discretization: Binning, Histogram Analysis.

SOME DEFINITIONS. Data: Data are any facts, numbers, or text that can be processed by a computer. This includes operational or transactional data, such as sales, cost, inventory, payroll, and accounting; nonoperational data, such as industry sales, forecast data, and macroeconomic data; and metadata, which is data about the data itself, such as logical database design or data dictionary definitions. Information: The patterns, associations, or relationships among all this data can provide information.

DEFINITIONS CONTINUED.. Knowledge: Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in terms of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts. Data Warehouses: Data warehousing is defined as a process of centralized data management and retrieval.

WHY DO WE NEED DATA MINING? The explosive growth of data: from terabytes to petabytes. Data collection and data availability: automated data collection tools, database systems, the Web, a computerized society. Major sources of abundant data: Business (Web, e-commerce, transactions, stocks); Science (remote sensing, bioinformatics, scientific simulation); Society and everyone (news, digital cameras, YouTube). We are drowning in data, but starving for knowledge! "Necessity is the mother of invention": data mining is the automated analysis of massive data sets.

EVOLUTION OF DATABASE TECHNOLOGY. 1960s: Data collection, database creation, IMS and network DBMS. 1970s: Relational data model, relational DBMS implementation. 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.). 1990s: Data mining, data warehousing, multimedia databases, and Web databases. 2000s: Stream data management and mining; data mining and its applications; Web technology (XML, data integration) and global information systems.

WHAT IS DATA MINING? Data mining (knowledge discovery from data): finding hidden information in a database. It is the extraction of interesting (non-trivial (relevant), implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data. Data mining: a misnomer? The goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of the data itself. Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, exploratory data analysis, data-driven discovery, deductive learning, etc. Watch out: is everything data mining? No: simple search and query processing, and (deductive) expert systems, are not data mining.

WHY DATA MINING? POTENTIAL APPLICATIONS. Data analysis and decision support. Market analysis and management: target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation. Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis. Fraud detection and detection of unusual patterns (outliers). Other applications: text mining (news groups, email, documents) and Web mining; stream data mining; bioinformatics and bio-data analysis.

EX. 1: MARKET ANALYSIS AND MANAGEMENT. Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies. Target marketing: find clusters of model customers who share the same characteristics (interest, income level, spending habits, etc.); determine customer purchasing patterns over time. Cross-market analysis: find associations/correlations between product sales, and predict based on such associations. Customer profiling: what types of customers buy what products (clustering or classification). Customer requirement analysis: identify the best products for different customers; predict what factors will attract new customers. Provision of summary information: multidimensional summary reports; statistical summary information (data central tendency and variation).

EX. 2: CORPORATE ANALYSIS & RISK MANAGEMENT. Finance planning and asset evaluation: cash flow analysis and prediction; contingent claim analysis to evaluate assets; cross-sectional and time series analysis (financial ratio, trend analysis, etc.). Resource planning: summarize and compare the resources and spending. Competition: monitor competitors and market directions; group customers into classes and apply a class-based pricing procedure; set pricing strategy in a highly competitive market.

EX. 3: FRAUD DETECTION & MINING UNUSUAL PATTERNS. Approaches: clustering and model construction for frauds, outlier analysis. Applications: health care, retail, credit card service, telecommunications. Auto insurance: ring of collisions. Money laundering: suspicious monetary transactions. Medical insurance: professional patients, ring of doctors, and ring of references; unnecessary or correlated screening tests. Telecommunications: phone-call fraud. Phone call model: destination of the call, duration, time of day or week; analyze patterns that deviate from an expected norm. Retail industry: analysts estimate that 38% of retail shrink is due to dishonest employees. Anti-terrorism.

[Figure: Data mining tasks. Predictive: Classification, Regression, Time Series Analysis, Prediction. Descriptive: Clustering, Summarization, Association Rules, Sequence Discovery.] 7/31/2020

Classification: maps data into predefined groups or classes. It uses supervised learning: in a learning phase, the algorithm builds a classifier from a training data set containing data attributes and their associated class labels. Regression: maps data into a real-valued prediction variable. The algorithm tries to find the function (linear or non-linear) that best fits the training data. Time Series Analysis: the value of an attribute is examined as it varies over time. It can be used to determine similarities, classify behavior, or predict future values. Prediction: predicts future values using regression, time series analysis, or other approaches. (Slide footer: Data Mining, by Dr. S. C. Shirwaikar, 7/31/2020.)
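As a rough illustration of the regression idea above (finding the function that best fits the training data), the following minimal sketch fits a straight line by ordinary least squares in plain Python. The function name fit_line and the data points are made up for illustration; they are not taken from the course material.

```python
# Minimal sketch: fit y = a*x + b by ordinary least squares (pure Python).
# The training points below are hypothetical and lie exactly on y = 2x + 1.

def fit_line(xs, ys):
    """Return slope a and intercept b that minimize the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)

# Prediction for an unseen input, using the fitted function.
predicted = a * 5.0 + b
print(a, b, predicted)
```

A linear model is only one possible choice; the slide also allows non-linear functions, which this sketch does not cover.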

Clustering: finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters. It is unsupervised learning: there are no predefined classes. Interpretability and usability: results should be comprehensible and usable, so a domain expert is required. Summarization: maps data into subsets with simple descriptions. It extracts or derives representative, summary-type information. Association rules: discover relationships among data; used in market basket analysis to find items frequently purchased together. Sequence Discovery: discovers sequential patterns in data, such as the order in which items are purchased or data is accessed.
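To make the market basket idea concrete, here is a small sketch that computes the two standard measures behind an association rule, support and confidence, over a handful of transactions. The transactions and the helper names support and confidence are hypothetical, chosen only for illustration.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of (antecedent and consequent together) divided by support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Rule under test: bread -> milk
s = support({"bread", "milk"}, transactions)
c = confidence({"bread"}, {"milk"}, transactions)
print(s, c)
```

Here bread and milk appear together in 2 of 4 transactions, and 2 of the 3 transactions containing bread also contain milk.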

KNOWLEDGE DISCOVERY (KDD) PROCESS. Data mining is the core of the knowledge discovery process. [Figure: Databases, then Data Cleaning and Data Integration, then Data Warehouse, then Selection and Transformation, then Task-relevant Data, then Data Mining, then Pattern Evaluation, then Knowledge.]

Data Mining process

KDD PROCESS: SEVERAL KEY STEPS. Learning the application domain: relevant prior knowledge and goals of the application. Creating a target data set: data selection. Data cleaning and preprocessing (may take 60% of the effort!). Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation. Choosing functions of data mining: summarization, classification, regression, association, clustering. Choosing the mining algorithm(s). Data mining: search for patterns of interest. Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc. Use of discovered knowledge.
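The steps above can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the records, the field names, and the helpers select, clean, min_max, and mine are all hypothetical, and the "mining" step is a simple threshold filter standing in for a real algorithm.

```python
# Toy records; None marks a missing value. All names and data are hypothetical.
records = [
    {"customer": "a", "age": 25, "spend": 120.0},
    {"customer": "b", "age": None, "spend": 80.0},
    {"customer": "c", "age": 40, "spend": 300.0},
    {"customer": "d", "age": 33, "spend": 200.0},
]

def select(rows, fields):
    """Data selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in rows]

def clean(rows):
    """Data cleaning: drop rows that contain a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def min_max(rows, field):
    """Transformation: min-max normalize one numeric field to [0, 1]."""
    values = [r[field] for r in rows]
    lo, hi = min(values), max(values)
    return [{**r, field: (r[field] - lo) / (hi - lo)} for r in rows]

def mine(rows, field, threshold):
    """Stand-in mining step: keep rows whose value exceeds the threshold."""
    return [r for r in rows if r[field] > threshold]

target = select(records, ["age", "spend"])
cleaned = clean(target)
transformed = min_max(cleaned, "spend")
patterns = mine(transformed, "spend", 0.5)
print(patterns)
```

Dropping incomplete rows is just one way to handle missing values; filling them in is another, and the choice depends on the application.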

DATA MINING AND BUSINESS INTELLIGENCE. [Figure: pyramid with increasing potential to support business decisions toward the top. From top to bottom: Decision Making (End User); Data Presentation: Visualization Techniques (Business Analyst); Data Mining: Information Discovery (Data Analyst); Data Exploration: Statistical Summary, Querying, and Reporting; Data Preprocessing/Integration, Data Warehouses; Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA).]

KDD VS DATA MINING KDD-(Knowledge Discovery in Databases) is a field of computer science, which includes the tools and theories to help humans in extracting useful and previously unknown information (i.e. knowledge) from large collections of digitized data. KDD consists of several steps, and Data Mining is one of them.

CONTINUED.. This process deals with the mapping of low-level data into other forms that are more compact, abstract and useful. This is achieved by creating short reports, modelling the process of generating data, and developing predictive models that can predict future cases. Data Mining: the application of a specific algorithm in order to extract patterns from data.

WHAT IS THE DIFFERENCE BETWEEN KDD AND DATA MINING? Although the two terms KDD and Data Mining are often used interchangeably, they refer to two related yet slightly different concepts. KDD is the overall process of extracting knowledge from data, while Data Mining is a step inside the KDD process that deals with identifying patterns in data. In other words, Data Mining is only the application of a specific algorithm based on the overall goal of the KDD process.

Architecture of a typical data mining System

DATA MINING: CONFLUENCE OF MULTIPLE DISCIPLINES. [Figure: Data Mining at the center, drawing on Database Technology, Statistics, Visualization, Machine Learning, Pattern Recognition, Algorithm, and Other Disciplines.]

TECHNOLOGIES USED. Data mining includes many techniques from the domains below: Statistics; Machine Learning; Database Systems and Data Warehouses; Information Retrieval; Visualization; High Performance Computing; Pattern Matching.

TECHNOLOGIES CONTINUED.. Statistics: It studies the collection, analysis, interpretation and presentation of data. Statistical research develops tools for prediction and forecasting using data. Statistical methods can also be used to verify data mining results.

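CONTINUED.. Information Retrieval: It is the science of searching for documents or for information in documents. Text Retrieval: basic measures of text retrieval are $\text{Precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$ and $\text{Recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$.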
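Precision and recall are set ratios: the share of retrieved documents that are relevant, and the share of relevant documents that were retrieved. A minimal sketch with Python sets follows; the document identifiers are invented for illustration, and returning 0.0 for an empty denominator is a convention assumed here, not something the slide specifies.

```python
# Hypothetical document IDs: what is truly relevant vs. what the system returned.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}

def precision(relevant, retrieved):
    """Share of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0  # assumed convention for an empty result set
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant, retrieved):
    """Share of relevant documents that were retrieved."""
    if not relevant:
        return 0.0  # assumed convention when nothing is relevant
    return len(relevant & retrieved) / len(relevant)

p = precision(relevant, retrieved)
r = recall(relevant, retrieved)
print(p, r)
```

Two of the three retrieved documents are relevant, and two of the four relevant documents were found.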

CONTINUED.. Database Systems and Data Warehouses: This research focuses on the creation, maintenance and use of databases for organizations and end users.

CONTINUED.. Machine Learning: It investigates how computers can learn or improve their performance based on data.

CONTINUED.. High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business.

CONTINUED.. Data visualization is a general term that describes any effort to help people understand the significance of data by placing it in a visual context. Patterns, trends and correlations that might go undetected in text-based data can be exposed and recognized more easily with data visualization software.

MAJOR ISSUES IN DATA MINING. Mining methodology: mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web; performance (efficiency, effectiveness, and scalability); pattern evaluation (the interestingness problem); incorporation of background knowledge; handling noise and incomplete data; parallel, distributed and incremental mining methods; integration of the discovered knowledge with existing knowledge (knowledge fusion). User interaction: data mining query languages and ad-hoc mining; expression and visualization of data mining results; interactive mining of knowledge at multiple levels of abstraction. Applications and social impacts: domain-specific data mining and invisible data mining; protection of data security, integrity, and privacy.

WHY NOT TRADITIONAL DATA ANALYSIS? Tremendous amount of data: algorithms must be highly scalable to handle, for example, terabytes of data. High dimensionality of data: a micro-array may have tens of thousands of dimensions. High complexity of data: data streams and sensor data; time-series data, temporal data, sequence data; structured data, graphs, social networks and multi-linked data; heterogeneous databases and legacy databases; spatial, spatiotemporal, multimedia, text and Web data; software programs, scientific simulations. New and sophisticated applications.

MULTI-DIMENSIONAL VIEW OF DATA MINING. Data to be mined: relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW. Knowledge to be mined: characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.; multiple/integrated functions and mining at multiple levels. Techniques utilized: database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc. Applications adapted: retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

DATA MINING: CLASSIFICATION SCHEMES. General functionality: descriptive data mining; predictive data mining. Different views lead to different classifications. Data view: kinds of data to be mined. Knowledge view: kinds of knowledge to be discovered. Method view: kinds of techniques utilized. Application view: kinds of applications adapted.
