Friday, April 11, 2025 - 14:00
The School of Computer Science is pleased to present…
BigHUSRec: A High Utility Sequential Patterns-Based Recommendation System for Big Data using MapReduce
MSc Thesis Defense by: Esther Umoh
Date: Friday, April 11th, 2025
Time: 2:00 pm – 4:00 pm
Location: Essex Hall, Room 122
Abstract:
High Utility Sequential Pattern Mining (HUSPM) is a data mining technique that identifies valuable patterns in sequential data by considering both the frequency of occurrence and the utility (e.g., profit, importance) of itemsets. With the rise of online transactions, platforms such as e-commerce and social media sites generate huge amounts of data, often called big data. Big data refers to large, complex datasets that are difficult to analyze, process, or visualize using traditional methods. In e-commerce, these datasets often reach terabytes (TB) or even petabytes (PB) in size, posing significant challenges for data mining and recommendation systems. To address this, existing systems such as HUSRec21, Scalable Three-Tier MapReduce_HUSP21, and HUSP-SP23 apply HUSPM to extract high-utility sequential patterns from large datasets, improving scalability, efficiency, and recommendation accuracy. HUSRec21 builds on an earlier system (HSPRec19), replacing frequent sequential patterns with high-utility sequential patterns to enhance recommendation performance. The Scalable Three-Tier MapReduce_HUSP21 system tackles the challenge of mining high-utility sequential patterns from large-scale datasets by employing a three-tier MapReduce model. HUSP-SP23 introduces a new structure called seqPro, along with two techniques, Early Pruning and Irrelevant Item Pruning, to make pattern mining more efficient. These techniques remove low-value patterns from the projected database early, reducing processing time and speeding up pattern discovery. The HUSRec21 and HUSP-SP23 systems operate on a single-machine processing architecture in which each step of HUSPM is computed sequentially. As real-world datasets continue to grow in size, sequential processing becomes increasingly inefficient, leading to longer runtimes and high memory usage, which can degrade overall recommendation performance.
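As a rough illustration of what "utility" means for a sequential pattern, the sketch below scores a pattern against a toy sequence database. It is not taken from the thesis: the data, the profit table, and the function names are hypothetical. It assumes the common HUSPM convention that a pattern's utility in one sequence is the maximum over all of its occurrences, and it restricts patterns to one item per element for brevity.

```python
from typing import Dict, List, Optional

# One customer sequence: a list of itemsets, each mapping item -> quantity.
Sequence = List[Dict[str, int]]


def pattern_utility_in_sequence(
    pattern: List[str], seq: Sequence, profit: Dict[str, int]
) -> int:
    """Maximum utility over all occurrences of `pattern` in `seq` (0 if absent).

    Assumption: each pattern element is a single item, and consecutive
    pattern items must match in strictly later itemsets.
    """
    # best[j] = highest utility found so far for the first j pattern items.
    best: List[Optional[int]] = [0] + [None] * len(pattern)
    for itemset in seq:
        # Walk backwards so one itemset matches at most one pattern position.
        for j in range(len(pattern), 0, -1):
            item = pattern[j - 1]
            prev = best[j - 1]
            if item in itemset and prev is not None:
                candidate = prev + itemset[item] * profit[item]
                if best[j] is None or candidate > best[j]:
                    best[j] = candidate
    return best[-1] if best[-1] is not None else 0


def pattern_utility(
    pattern: List[str], db: List[Sequence], profit: Dict[str, int]
) -> int:
    """Utility of `pattern` in the database: sum of its per-sequence utilities."""
    return sum(pattern_utility_in_sequence(pattern, s, profit) for s in db)


# Hypothetical toy data: unit profit per item and three customer sequences.
PROFIT = {"a": 2, "b": 5, "c": 1}
DB: List[Sequence] = [
    [{"a": 1}, {"b": 2}, {"a": 3, "c": 1}, {"b": 1}],
    [{"b": 1}, {"a": 2}],
    [{"a": 2, "c": 4}, {"b": 1}],
]

if __name__ == "__main__":
    print(pattern_utility(["a", "b"], DB, PROFIT))  # 21
```

With a minimum utility threshold of, say, 20, the pattern <a, b> would count as high-utility here even though it occurs in only two of the three sequences.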
This thesis proposes a system called Big High Utility Sequential Pattern Recommendation System (BigHUSRec), an extension of the HUSRec21 system designed to mine high-utility sequential patterns from large datasets through a "Top-K" approach integrated with the MapReduce framework. This approach focuses on extracting the Top-K most valuable patterns, aiming to improve recommendation accuracy while minimizing execution time. BigHUSRec incorporates both purchase and clickstream data, capturing a comprehensive view of customer behavior to create more targeted recommendations. BigHUSRec uses MapReduce to partition large datasets into smaller, manageable parts, enabling parallel processing that speeds up data analysis and improves scalability. The MapReduce process divides the data into smaller parts (Mapping), analyzes each part in parallel to identify patterns efficiently, and then aggregates the results (Reducing) for a comprehensive output. This combination of purchase and clickstream data, processed through the MapReduce framework, makes BigHUSRec more effective at generating recommendations. Experiments on synthetic and real-world datasets, evaluated with Mean Absolute Error, Precision, Recall, and F1 score, show that the proposed BigHUSRec performs approximately 15.5% better overall in recommendation accuracy than the tested existing systems.
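The partition, map, reduce, and Top-K steps described above can be sketched in a few lines. This is a single-process simulation under stated assumptions, not the BigHUSRec implementation: the function names and toy data are hypothetical, candidate patterns are limited to ordered item pairs, and each event already carries its utility (quantity times unit profit).

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, int]     # (item, utility of that event)
Sequence = List[Event]      # one customer's ordered events
Pattern = Tuple[str, str]   # 2-item sequential pattern <x, y>


def partition(db: List[Sequence], n_parts: int) -> List[List[Sequence]]:
    """Split the database into n_parts roughly equal chunks."""
    return [db[i::n_parts] for i in range(n_parts)]


def map_partition(part: List[Sequence]) -> List[Tuple[Pattern, int]]:
    """Map step: emit (pattern, utility) once per pattern per sequence.

    Within a sequence, a pattern's utility is its best occurrence.
    """
    emitted: List[Tuple[Pattern, int]] = []
    for seq in part:
        best: Dict[Pattern, int] = {}
        for i, (x, ux) in enumerate(seq):
            for y, uy in seq[i + 1:]:
                key = (x, y)
                best[key] = max(best.get(key, 0), ux + uy)
        emitted.extend(best.items())
    return emitted


def reduce_pairs(pairs: Iterable[Tuple[Pattern, int]]) -> Dict[Pattern, int]:
    """Reduce step: sum the utilities emitted for each pattern."""
    totals: Dict[Pattern, int] = defaultdict(int)
    for pattern, utility in pairs:
        totals[pattern] += utility
    return dict(totals)


def mine_top_k(db: List[Sequence], k: int, n_parts: int) -> List[Tuple[Pattern, int]]:
    """Partition, map each part, reduce, then keep the k highest-utility patterns."""
    # In a real cluster each partition would be mapped on a separate worker.
    mapped = [pair for part in partition(db, n_parts) for pair in map_partition(part)]
    totals = reduce_pairs(mapped)
    # Ties are broken by pattern name so the output is deterministic.
    return heapq.nlargest(k, totals.items(), key=lambda kv: (kv[1], kv[0]))


# Hypothetical toy database of four customer sequences.
DEMO_DB: List[Sequence] = [
    [("a", 2), ("b", 10), ("a", 6), ("b", 5)],
    [("b", 5), ("a", 4)],
    [("a", 4), ("c", 4), ("b", 5)],
    [("c", 3), ("b", 10)],
]

if __name__ == "__main__":
    for pattern, utility in mine_top_k(DEMO_DB, k=3, n_parts=2):
        print(pattern, utility)
```

Because the reduce step only sums per-sequence utilities, the result does not depend on how the sequences are partitioned, which is what allows the map step to run in parallel.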
Keywords: High Utility Sequential Pattern Mining, Data Mining, Big Data, MapReduce, Sequential Database, E-commerce Recommendation Systems
Thesis Committee:
Internal Reader: Dr. Jianguo Lu
External Reader: Dr. Dennis Borisov
Advisor: Dr. Christie Ezeife
Chair: Dr. Xiaobu Yuan