In the context of this package, streams refer to a series of clicks belonging to a given user. The time difference between clicks is defined by the user when assembling these streams, but is typically taken to be 30 minutes in the industry.
The pages refer to the individual clicks of the user, and thus the pages they visit. Rather than storing the entire URL of the page the user visits, it is better to encode pages using a simple code such as PXX where X can be any number. This strategy can be used to group similar pages under the same code, as modelling them as separate pages is sometimes not useful leading to an excessively large probability matrix.
Build a dummy Markov chain¶
To start using the package without any data,
produce dummy data for you to experiment with:
from markovclick import dummy clickstream = dummy.gen_random_clickstream(nOfStreams=100, nOfPages=12)
To build a Markov chain from the dummy data:
from markovclick.models import MarkovClickstream m = MarkovClickstream(clickstream)
m of the
MarkovClickstream class provides access the
class’s attributes such as the probability matrix (
m.prob_matrix) used to
model the Markov chain, and the list of unique pages (
in the clickstream.
Visualising as a heatmap¶
The probability matrix can be visualised as a heatmap as follows:
sns.heatmap(m.prob_matrix, xticklabels=m.pages, yticklabels=m.pages)
Visualising the Markov chain¶
A Markov chain can be thought of as a graph of nodes and edges, with the edges
representing the transitions from each state.
markovclick provides a
wrapper function around the
graphviz package to visualise the Markov chain
in this manner.
from markovclick.viz imoport visualise_markov_chain graph = visualise_markov_chain(m)
visualise_markov_chain() returns a
Digraph object, which
can be viewed directly inside a Jupyter notebook by simply calling the
reference to the object returned. It can also be outputted to a PDF file by
render() function on the object.
visualise_markov_chain(markov_chain: markovclick.models.MarkovClickstream) → graphviz.dot.Digraph¶
Visualises Markov chain for clickstream as a graph, with individual pages as nodes, and edges between the first and second most likely nodes (pages). Probabilities for these transitions are annotated on the edges (arrows).
Parameters: markov_chain (MarkovClickstream) – Initialised MarkovClickstream object with probabilities computed. Returns:
- Graphviz Digraph object, which can be rendered as an image or
- PDF, or displayed inside a Jupyter notebook.
Return type: Digraph
In the graph produced, the nodes representing the individual pages are shown in green, and up to 3 edges from each node are rendered. The first edge is in a thick blue arrow, depicting the most likely transition from this page / state to the next page / state. The second edge depicted by a thinner blue arrow, depicts the second most likely transition from this state. Finally, a third edge is shown that depicts the transition from this page / state back to itself (light grey). This edge is only shown if the the two most likely transitions are not already to itself. For all transitions, the probability is shown next to the edge (arrow).
Clickstream processing with
markovclick provides functions to process clickstream data such as server
logs, which contain unique identifiers such as cookie IDs associated with each
click. This allows clicks to be aggregated into groups, whereby clicks from
the same browser (identified by the unique identifier) are grouped such that
the difference between individual clicks does not exceed the maximum session
timeout (typically taken to be 30 minutes).
Sessionise clickstream data¶
To sessionise clickstream data, the following code can be used that require a pandas DataFrame object.
from markovclic.preprocessing import Sessionise sessioniser = Sessionise(df, unique_id_col='cookie_id', datetime_col='timestamp', session_timeout=30)
Sessionise(df, unique_id_col: str, datetime_col: str, session_timeout: int = 30)¶
Class with functions to sessionise a pandas DataFrame containing clickstream data.
__init__(df, unique_id_col: str, datetime_col: str, session_timeout: int = 30) → None¶
Instantiates object of
- df (pd.DataFrame) –
pandasDataFrame object containing clickstream data. Must contain atleast a timestamp column, unique identifier column such as cookie ID.
- unique_id_col (str) – Column name of unique identifier, e.g.
- datetime_col (str) – Column name of timestamp column.
- session_timeout (int, optional) – Defaults to 30. Maximum time in minutes after which a session is broken.
- df (pd.DataFrame) –
Sessionise object instantiated, the
can then be called. This function supports multi-processing, enabling you the
split job into multiple processes to take advantage of a multi-core CPU.
assign_sessions(self, n_jobs: int = 1)¶
Assigns unique session IDs to individual clicks that form the sessions. Supports parallel processing through setting
n_jobsto higher than 1.
Parameters: n_jobs (int, optional) – Defaults to 1. If 2 or higher, enables parallel processing. Returns: Returns sessionised DataFrame, with session IDs stored in
Return type: pd.DataFrame
assign_sessions() function returns the DataFrame, with an additional
column added storing the unique identifier for the session. Rows of the
DataFrame can then be grouped using this column.