FOODIE: A Data-centric Sifting Framework for Social Media Analytics in the FnB Industry
Aiming to improve social media data analysis for the FnB sector, FOODIE offers solutions to challenges like homographs, context variations, and irrelevant hashtags. By leveraging heuristics and ML techniques, FOODIE enhances data quality and helps in preparing data for analytical tasks.
Presentation Transcript
FOODIE: A Data-centric Sifting Framework for Social Media Analytics
Dolly Agarwal (dolly@aipalette.com), Abdullah Al Imran (imran@aipalette.com), Kasun S Perera (kasun@aipalette.com)
SEKE 2023: The 35th International Conference on Software Engineering & Knowledge Engineering. Paper ID: 175
The Problem
A great deal of research has been conducted on using social media analytics to derive consumer insights and understand consumer behavior, especially in the CPG/FnB domain. When these methods are applied to real-world data in an industrial setting, the results are often found to be incorrect. There is a significant gap in the process of social media data sifting: sifting processes vary and pose domain-specific challenges. We propose FOODIE, a data-centric sifting framework for social media analytics designed specifically for the FnB industry. It addresses key challenges in sifting FnB social media data and proposes empirical solutions.
Contributions
We propose FOODIE, an end-to-end framework that prepares social media data for downstream analytical tasks. We show how heuristics and ML-based approaches can be used in the denoising and sifting process. We propose a novel method to evaluate social data quality quantitatively when annotations are expensive to obtain, so that metrics like precision and recall on a downstream task are not straightforward to measure.
Data Collection
Hashtag-based data collection: keywords and hashtags are listed by domain experts and research analysts. Examples: hashtags used to extract data from the USA include bostonfoodies, newyorkfoodguide, and miamifood.
There is a trade-off between high precision (employing targeted keywords to get fewer but more accurate results) and high recall (using broad terms, which also produces a lot of irrelevant results).
In our experience, it is cheap to collect the data but expensive to annotate and sift it.
Challenges of Social Media Data in FnB
Homographs: Posts containing words that are spelled the same but have different meanings. For example, "nut" can refer to both a dry fruit and a type of hardware.
Context variation: Food-related items also appear in other domains, especially skincare. For example, green tea is consumed as a beverage, but it can also be an ingredient in a skincare product.
Business/Promotional posts: Many businesses use social media for advertising, promotional giveaways, etc.
Viral/Influencer posts: Some posts get more engagement (likes, comments, shares) than others, as they may be posted by influencers or popular figures on the internet.
Irrelevant hashtags: Hashtags are often misused in an effort to get more engagement. For example: "New music coming soon to all streaming platforms. #Music #NewArtist #Foodie #HealthyFood".
FOODIE: Key Components
1. Data Preparation: Removing and fixing noise unique to social media data.
2. Sifting: Filtering out noisy and irrelevant data samples and bucketing the relevant samples into food categories. Removing duplicates to prevent over-indexing of a few samples.
3. Validation: Validating the data quality returned by the framework.
FOODIE: Data Preparation
Noise Removal: URL, phone number, and mention removal; symbol and punctuation removal; emoji and emoticon transformation; Unicode normalization.
Text: #chocolate #coffee #cake made for hubbys' @abcde birthday #foodie #instafood #instafoodie #streetfood
Text_only: chocolate coffee cake for hubby s <username> birthday
Hashtags: foodie, instafood, instafoodie, streetfood
OOV Handling: typos, elongated words used to express feelings, and concatenated words used to get around word limits.
Example: I looove Baklava. The layers of flaky filo pastyr, pistachio filling and syrup has myheart
Processed Text: I love baklava the layers of flaky filo pastry, pistachio filling and syrup has my heart
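The noise-removal steps on this slide can be sketched with a few regular expressions. This is a minimal illustration, not the authors' implementation: the function name `prepare_post` and every pattern are assumptions, all hashtags are simply moved out of the text (the slide's example keeps some in-sentence hashtags), and emoji transformation, typo correction, and splitting of concatenated words are left out because they need external dictionaries.

```python
import re
import unicodedata

# Every pattern here is an illustrative assumption, not the authors' rule set.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
ELONGATED_RE = re.compile(r"([A-Za-z])\1{2,}")  # a letter repeated 3+ times


def prepare_post(post: str) -> dict:
    """Split a raw post into cleaned text and a list of hashtags."""
    text = unicodedata.normalize("NFKC", post)       # Unicode normalization
    hashtags = [tag.lower() for tag in HASHTAG_RE.findall(text)]
    text = URL_RE.sub(" ", text)                     # drop URLs
    text = PHONE_RE.sub(" ", text)                   # drop phone numbers
    text = MENTION_RE.sub(" <username> ", text)      # mask mentions
    text = HASHTAG_RE.sub(" ", text)                 # hashtags are kept separately
    text = ELONGATED_RE.sub(r"\1", text)             # "looove" -> "love"
    text = re.sub(r"[^\w\s<>]", " ", text)           # symbols and punctuation
    text = re.sub(r"\s+", " ", text).strip().lower() # collapse whitespace
    return {"text_only": text, "hashtags": hashtags}


result = prepare_post("I looove Baklava!! @abcde #foodie #InstaFood")
print(result["text_only"])
print(result["hashtags"])
```

Collapsing a repeated letter down to a single one is a deliberate simplification; a production pipeline would likely check the result against a vocabulary before accepting it.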
FOODIE: Heuristics & Statistics based Sifting
Short-text removal: We considered only the text-only data (without hashtags) to infer the lower bound of text length. A post of only 15 characters gives little information about any trend that can be used for analytics.
Flagging promotional content: Promotional posts share some characteristics: they usually contain a URL, a phone number, or one of the keywords such as "order now", "Call for orders", or "DM to know more". Flagged posts are a useful source for analyzing what local businesses or brands are promoting in FnB.
Flagging viral content: We transformed each record's engagement values (likes, shares, and comments) to the range 1-100. An engagement score is then computed as a weighted combination of these values (the weights in the slide's formula are not legible in this transcript). Outliers are detected using a statistical z-score threshold.
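The three heuristics above could look roughly like the following sketch. The function names, the regular expressions, the `<=` treatment of the 15-character bound, and the z-score threshold of 3.0 are all assumptions made for illustration; the engagement score is taken as a precomputed input because its formula is not recoverable here.

```python
import re
from statistics import mean, pstdev

MIN_TEXT_LEN = 15  # bound quoted on the slide; treating it as inclusive is an assumption
PROMO_KEYWORDS = ("order now", "call for orders", "dm to know more")
URL_RE = re.compile(r"https?://\S+|www\.\S+")        # assumed pattern
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")      # assumed pattern


def is_short(text_only: str) -> bool:
    """True when the hashtag-free text is too short to carry a trend signal."""
    return len(text_only.strip()) <= MIN_TEXT_LEN


def is_promotional(raw_post: str) -> bool:
    """True when the post has a URL, a phone number, or a promo keyword."""
    lowered = raw_post.lower()
    return bool(
        URL_RE.search(raw_post)
        or PHONE_RE.search(raw_post)
        or any(keyword in lowered for keyword in PROMO_KEYWORDS)
    )


def flag_viral(scores: list, z_threshold: float = 3.0) -> list:
    """Flag engagement scores whose z-score exceeds the (assumed) threshold."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [False] * len(scores)  # no spread, so nothing is an outlier
    return [(score - mu) / sigma > z_threshold for score in scores]


print(is_promotional("Fresh cupcakes, DM to know more!"))
print(flag_viral([10] * 20 + [100]))
```

Flagging rather than dropping matches the slide's remark that promotional posts remain useful for analyzing what businesses are promoting.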
FOODIE: ML-based Sifting
Semi-supervised approach for labeling data.
FOODIE: ML-based Sifting
Multi-modal approach to Food/Non-Food classification.
FOODIE: ML-based Sifting
Food Category Classification: This is a multi-label classification problem, since a single post can belong to multiple categories. Example: "Rainy Days call for chocolates And endless cups of Tea" belongs to Confectionery as well as Beverage. Approach: learning a representation for the label using a Siamese neural network architecture with triplet loss.
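For readers unfamiliar with triplet loss, the standard form is max(0, d(anchor, positive) - d(anchor, negative) + margin). The sketch below shows that loss and one way label representations could yield multi-label predictions. It is not the authors' model: the toy 2-D vectors, the Euclidean distance, the distance threshold, and the "Dairy" category are all hypothetical, and the Siamese network that would produce the embeddings is omitted.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the positive closer than the negative by a margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def predict_labels(post_embedding, label_embeddings, max_distance):
    """Assumed inference rule: return every label within max_distance of the post."""
    return [
        label
        for label, embedding in label_embeddings.items()
        if euclidean(post_embedding, embedding) <= max_distance
    ]


# Hypothetical label representations in a toy 2-D space.
labels = {"Confectionery": [0.0, 1.0], "Beverage": [1.0, 0.0], "Dairy": [5.0, 5.0]}
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
print(predict_labels([0.5, 0.5], labels, max_distance=1.0))
```

Because each label is scored independently against the post, a single post can fall within range of several labels, which is what makes the setup multi-label.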
FOODIE: ML-based Sifting
Data Deduplication: We extracted embedding vectors for the text-only part of each post using the Sentence Transformer model nli-roberta-base-v2 and computed cosine similarities between the embedding vectors. Posts with a cosine similarity >= 0.9 are considered duplicates.
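The thresholding step can be illustrated as follows. To stay self-contained, this sketch assumes the embeddings have already been computed and uses toy 2-D vectors in their place. The greedy keep-the-first-occurrence policy and the function names are assumptions; the slide only states the >= 0.9 threshold.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def deduplicate(embeddings, threshold=0.9):
    """Return indices of posts to keep.

    A post is dropped when its similarity to an already kept post is
    at or above the threshold (assumed policy: keep the first occurrence).
    """
    kept = []
    for index, embedding in enumerate(embeddings):
        if all(cosine_similarity(embedding, embeddings[k]) < threshold for k in kept):
            kept.append(index)
    return kept


# Toy stand-ins for sentence embeddings; posts 0 and 1 are near-duplicates.
toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(deduplicate(toy_embeddings))
```

The pairwise comparison is quadratic in the number of posts, so a large collection would likely need an approximate nearest-neighbour index instead.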
FOODIE: Inspection and Evaluation
We performed human validation (is the post related to food?) on a data sample (n = 1000) and proposed two metrics. Data-to-Error Ratio: out of the total validated samples, how many are irrelevant. Relevancy: out of the total samples collected, how many were marked as relevant by the FOODIE framework. We achieved a low data-to-error ratio (0.01%) with a relevancy score of 68.87%. The FOODIE framework can help achieve more than 8% improvement in social media data quality for practical applications.
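Read literally, both metrics are simple ratios. The sketch below follows the one-line definitions on this slide; the function and parameter names are assumptions, and the counts in the usage lines are hypothetical numbers chosen for illustration, not the paper's data.

```python
def data_to_error_ratio(n_validated: int, n_irrelevant: int) -> float:
    """Share of human-validated samples that were judged irrelevant."""
    if n_validated <= 0:
        raise ValueError("n_validated must be positive")
    return n_irrelevant / n_validated


def relevancy(n_collected: int, n_marked_relevant: int) -> float:
    """Share of collected samples that the framework marked as relevant."""
    if n_collected <= 0:
        raise ValueError("n_collected must be positive")
    return n_marked_relevant / n_collected


# Hypothetical counts, for illustration only.
print(f"{data_to_error_ratio(1000, 10):.2%}")
print(f"{relevancy(200, 150):.2%}")
```

Only the first metric needs human labels, and only on a sample, which is what keeps the evaluation cheap when full annotation is infeasible.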
FOODIE: Inspection and Evaluation
Insufficient: Most data points are valid, but the number of data points returned is low. This can arise if the framework accepts a high false-negative rate in order to achieve high precision, i.e. high precision, low recall.
Appropriate: Sufficient and valid data points that give good estimates on various analytics, i.e. high precision and high recall.
Corrupt: There are only a few data points and they are mostly unrelated to the study, which prevents analytics or gives erratic results. This can arise if the framework returns only a few data points and those are mostly false positives.
Noisy: This is the current state of the data, where the collected data has some invalid data points, i.e. false positives.
Conclusion
We presented FOODIE for cleaning and preparing social media data for analytics in FnB. We proposed two generalized data quality metrics that can be useful for assessing data quality in scenarios where getting labeled data is expensive or infeasible. FOODIE can be customized to any domain that requires social media data sifting.