PrefixSpan Algorithm with PySpark


Hi guys, today I am going to teach you about the PrefixSpan algorithm, which is not very popular but is available in Spark as an inbuilt function.

Introduction to the PrefixSpan Algorithm

In order to understand the PrefixSpan algorithm, you need some basic knowledge of the following topics.

What is a sequence database?

A sequence database contains ordered elements or events; sequence databases are what we mine in sequential pattern mining.

e.g., a transaction DB: the set of transactions of a supermarket.

Sequence DB: for each customer, we take the list of items the customer bought in each transaction during a period of three months. Putting all those lists together gives a sequence, and in the sequence DB we consider it as one row. The collection of such sequences, one per customer, forms the sequence DB.

What is Sequential Pattern Mining?

Given a set of sequences, sequential pattern mining identifies or finds the complete set of frequent subsequences.

Consider the sequence <a(abc)(ac)d(cf)>. Items inside parentheses belong to the same element (itemset), so this sequence has:

Number of elements: 5

Number of items: 9

Subsequence vs. Supersequence

To understand this better, consider an example: <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>, and conversely <a(abc)(ac)d(cf)> is a supersequence of <a(bc)dc>.
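To make the definition concrete, here is a small illustrative Python helper (not part of the original article) that checks whether one sequence of itemsets is a subsequence of another, representing each element as a Python set:

```python
def is_subsequence(sub, seq):
    """Return True if `sub` is a subsequence of `seq`.

    Both arguments are lists of sets; each set is one element (itemset).
    """
    i = 0  # current position in `seq`
    for element in sub:
        # Advance until an element of `seq` contains this itemset.
        while i < len(seq) and not element <= seq[i]:
            i += 1
        if i == len(seq):
            return False
        i += 1
    return True

seq = [{'a'}, {'a', 'b', 'c'}, {'a', 'c'}, {'d'}, {'c', 'f'}]
print(is_subsequence([{'a'}, {'b', 'c'}, {'d'}, {'c'}], seq))  # True
print(is_subsequence([{'a', 'd'}], seq))                       # False
```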

What is a Prefix?

Given a sequence, a prefix is a leading part of it. For the sequence <a(abc)(ac)d(cf)>, the sequences <a>, <aa>, <a(ab)>, and <a(abc)> are prefixes, but <ab> and <a(bc)> are not.

What is Projection?

Given a prefix, the remaining part of the sequence is its projection (also called the suffix). For example, with the prefix <a>, the projection of <a(abc)(ac)d(cf)> is <(abc)(ac)d(cf)>; with the prefix <aa>, it is <(_bc)(ac)d(cf)>, where "_" marks items that belong to the same element as the last item of the prefix. Collecting these projections over all sequences that contain a given prefix gives that prefix's projected database.

When do we use Sequential Pattern Mining?

· Customer shopping sequences

· Patients' medical treatments in a hospital over a period of time

· Natural disasters (e.g., earthquakes)

· Stocks and markets

· Calling patterns of individual customers

· Weblog click-stream sequences: most people who come to a webpage won't buy goods

· DNA or RNA sequences and different gene structures: amino acid sequences

Other Algorithms Available for Sequential Pattern Mining

· Apriori-based method: GSP (Generalized Sequential Patterns)

· Pattern-growth methods: FreeSpan

· Vertical format-based mining: SPADE

· Mining closed sequential patterns: CloSpan

Why do we need the PrefixSpan algorithm over other algorithms?

· It is an efficient pattern-growth method.

· It outperforms the GSP and FreeSpan algorithms.

· It explores prefix-projection in sequential pattern mining.

· It can complete the mining of sequences without much effort.

· Prefix-projection reduces the size of the projected database, which makes processing more efficient.

· Pseudo-projection can also be used to improve mining efficiency.

PrefixSpan Algorithm

PrefixSpan is a projection-based, pattern-growth algorithm; specifically, it is based on prefix projection.

PrefixSpan Algorithm Explanation

Here in this instance, we use the prefix-projection method with a threshold value (minimum support) of 2; any item whose support is less than the threshold is discarded.

figure 1

In the above scenario we have a sequence DB. First, we calculate the support value for each element; if an element's support is less than the threshold, it is not considered a length-1 sequential pattern. Then we build the x-, y-, z-, m-, n-, and o-projected databases as shown.

Here I have used only the m-projected database to explain the rest of the process. Using the m-projected DB, we find the length-2 subsequences that can be formed with the prefix "m".

We then build the <mz> and <my> projected databases, use <mz> and <my> as prefixes, and form subsequences of length 3. The rest happens recursively, as in figure 1 above; a minimal sketch of this recursion follows.
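To make the recursion concrete, here is a minimal, illustrative Python sketch (not Spark's implementation). It assumes each element of a sequence is a single item, which simplifies the full algorithm, and the tiny database and threshold of 2 below are made up for the example:

```python
# Minimal PrefixSpan sketch: sequences of single items only.
def prefixspan(sequences, min_support, prefix=None):
    prefix = prefix or []
    patterns = []

    # Count how many sequences contain each candidate item.
    counts = {}
    for seq in sequences:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1

    for item, support in counts.items():
        if support < min_support:
            continue  # prune: support below the threshold

        new_prefix = prefix + [item]
        patterns.append((new_prefix, support))

        # Projected database: the suffix after the first occurrence
        # of `item` in every sequence that contains it.
        projected = [seq[seq.index(item) + 1:]
                     for seq in sequences if item in seq]

        # Recurse to grow the pattern by one more item.
        patterns.extend(prefixspan(projected, min_support, new_prefix))

    return patterns

# Toy database echoing the m-projected example above.
db = [['m', 'z', 'y'], ['m', 'z'], ['m', 'y'], ['z', 'x']]
for pattern, support in prefixspan(db, min_support=2):
    print(pattern, support)
```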

PrefixSpan with PySpark Example

Dataset Explanation

Here I have used the dataset below, which includes the customer ID as well as the set of goods that customer bought at a time. This dataset is filtered by customer ID and sorted by the date and time at which the customers arrived at the supermarket.

dataset

The objective here is to recognize the most frequent sequences of goods that people buy within a three-week period. In each row of this dataset, the first element is the customer ID and the remaining elements are the lists of goods acquired on specific dates and times; within those three weeks, a customer may have visited the supermarket multiple times.

Use the link below to get the processed dataset.

Code Explanation

PySpark terms and code

Spark provides an inbuilt function to mine frequent sequential patterns:

PrefixSpan(minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000)

In the above function:

minSupport =>

The minimum support level of a sequential pattern. Any sequential pattern that appears more than (minSupport × size of the dataset) times will be output. The value must be greater than or equal to 0; the default value is 0.1.

The data type is double.

maxPatternLength =>

The maximal length of a sequential pattern. It must be greater than 0.

maxLocalProjDBSize =>

The maximum number of items allowed in a projected database before local processing. If a projected database surpasses this size, another iteration of distributed prefix growth is run. Must be > 0.

The values shown in the function call above are the default values of these parameters.
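As a quick, self-contained illustration of the API (using a made-up three-customer dataset rather than the article's CSV), the whole pipeline looks roughly like this:

```python
from pyspark.sql import SparkSession, Row
from pyspark.ml.fpm import PrefixSpan

spark = SparkSession.builder.appName("PrefixSpanToyExample").getOrCreate()

# Each Row is one customer's buying sequence: a list of transactions,
# where each transaction is a list of items.
df = spark.createDataFrame([
    Row(sequence=[["milk", "bread"], ["milk"], ["eggs", "milk"]]),
    Row(sequence=[["milk"], ["bread", "eggs"]]),
    Row(sequence=[["bread"], ["milk", "eggs"]]),
])

ps = PrefixSpan(minSupport=0.5, maxPatternLength=5,
                maxLocalProjDBSize=32000000)

# Returns a DataFrame with `sequence` and `freq` columns.
ps.findFrequentSequentialPatterns(df).show(truncate=False)
```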

The findFrequentSequentialPatterns function is explained further below.

Preparing the dataset before applying the function

Here we read the CSV file and process it row by row. In every row, the first element is the customer ID, and the customer ID consists only of digits, so we identify it with the isdigit() function. The rest of the elements in a row are collected into the list2[] array. When we move on to the next row and the customer ID has changed, the item lists collected in list2[] are appended to the anotherlist[] list and the customer ID is appended to the customerlist[] list. At the end, all the anotherlist[] lists are gathered into the final list[] list.
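The original preprocessing code appears in the article only as an image, so below is a hedged reconstruction based on the description above; the variable names (list2, anotherlist, customerlist) follow the text, while the file name dataset.csv is an assumption:

```python
import csv

from pyspark.sql import Row

customerlist = []   # customer IDs, in order of first appearance
anotherlist = []    # one buying sequence (list of transactions) per customer
list2 = []          # transactions collected for the current customer
current_id = None

with open("dataset.csv") as f:   # file name is an assumption
    for row in csv.reader(f):
        if not row:
            continue
        customer_id = row[0]
        items = [item for item in row[1:] if item]  # drop empty cells
        # The first element of each row is the digits-only customer ID.
        if customer_id.isdigit() and customer_id != current_id:
            if list2:   # customer ID changed: flush the finished sequence
                anotherlist.append(list2)
            customerlist.append(customer_id)
            current_id = customer_id
            list2 = []
        list2.append(items)

if list2:
    anotherlist.append(list2)   # flush the last customer's sequence

# Wrap each sequence in a Row so Spark can infer the schema later.
rows = [Row(sequence=seq) for seq in anotherlist]
```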

Then the output of that process will be as below.

[Row(sequence=[['parmesan cheese', 'spaghetti', 'soup', 'avocado'], ['ground beef', 'spaghetti', 'mineral water', 'milk'], ['sparkling water'], ['mineral water', 'eggs', 'chicken', 'chocolate']]), Row(sequence=[['spaghetti', 'chocolate', 'brownies', 'white wine'], ['fresh tuna', 'mineral water', 'eggs'], ['spaghetti', 'muffins'], ['spaghetti', 'chocolate'], ['french fries', 'escalope', 'champagne'], ['tomato sauce', 'light mayo'], ['turkey', 'fresh tuna', 'frozen vegetables', 'tomatoes'], ['eggs', 'cookies'], ['soup', 'chicken', 'gums', 'soda'], ['turkey', 'frozen vegetables', 'mineral water', 'cider'], ['spaghetti'], ['clothes accessories']]),(Row………..)]

In the above output, each row corresponds to one customer and contains the buying sequence of that specific customer.

The parallelize function

The output mentioned above is passed to this function to create an RDD (Resilient Distributed Dataset) from the list collection. The RDD is a fundamental data structure of PySpark.
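A sketch of this step, assuming `rows` is the list of Row(sequence=...) objects built above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(rows)        # distribute the Python list as an RDD
df = spark.createDataFrame(rdd)   # DataFrame with a single `sequence` column
```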

The PrefixSpan inbuilt function

Here 0.1 is set as the minimum support value, so the absolute threshold is (minSupport × number of sequences) = 0.1 × 30 = 3, since the dataset has 30 customers, which equals the number of sequences (rows).

I gave 10 as the maximum pattern length. It does not constrain the input data; it applies to the output of the PrefixSpan algorithm: if we gave 3 as the length, it would only produce subsequences up to length 3.

For the projected database size (maxLocalProjDBSize), I used the default value.

The findFrequentSequentialPatterns function

This is the most important function: it finds the sequential patterns according to the given parameters and dataset. Its output is sorted by the frequency of the sequential patterns produced by the algorithm, and .show(10, False) then displays the top 10 sequential patterns in full.
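Putting the pieces together, the mining step looks roughly like this, assuming `df` is the DataFrame built above:

```python
from pyspark.ml.fpm import PrefixSpan

prefix_span = PrefixSpan(minSupport=0.1, maxPatternLength=10,
                         maxLocalProjDBSize=32000000)

patterns = prefix_span.findFrequentSequentialPatterns(df)

# Sort by pattern frequency and show the top 10 without truncating rows.
patterns.orderBy("freq", ascending=False).show(10, False)
```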

The output is shown below.

output of the whole process

Thank you. I hope my article helped you gain some knowledge about using the PrefixSpan algorithm with PySpark. If there are any issues in the content, please be kind enough to mention them in the comments.
