majortrack.tracker¶

class MajorTrack(clusterings, history, **kwargs)[source]¶

Bases: object

Parameters

clusterings (list, dict) –
Sequence of clusterings. Each clustering must be given as a list of membership sets, i.e. a list of clusters, with each cluster defined by the set of data sources associated with it (its members).

If provided as a `dict`:

keys: float, datetime
The time points.

values: list, dict
The membership list of each clustering indicating to which cluster a data source belongs. See clusterings for details.
history (int) – sets the number of time points (or slices) the algorithm can maximally go back in time to check for majority matches.
optional parameter (**kwargs) –

timepoints: list
The time points of each clustering.

Note

If clusterings is of type dict, then the keys are used as time points and this optional parameter is ignored, even if provided.

slice_widths: list, float (default=None)
The temporal duration of each snapshot in the sequence of clusterings. If not provided then simply the difference between time point i and i+1 is used as the width of slice i. The width of the last slice is assumed to be the same as the duration of the 2nd last slice.

individuals: list
A list of all distinct data sources present in the dataset.

Todo

Build it from self.clusterings.

group_matchup_method: str (default=’fraction’)
Set the method to calculate the similarity between two clusters from different clusterings. By default the fraction of identical members is used as explained in the original article [LB19].

use_lazylists: bool (default=False)
Determine if LazyList’s should be used to store data about dynamic clusters or normal lists. Most likely you want to use normal lists.
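
As a rough illustration of the input format described above, the following sketch builds a small sequence of clusterings in both the list and the dict form, and shows how the default slice widths and the list of individuals could be derived from it. The helper names default_slice_widths and collect_individuals are hypothetical and not part of majortrack, and the sample data is made up.

```python
# Hypothetical sample data: three snapshots, each a list of membership sets.
clusterings_list = [
    [{"a", "b", "c"}, {"d", "e"}],
    [{"a", "b"}, {"c", "d", "e"}],
    [{"a", "b", "f"}, {"c", "d"}],
]
timepoints = [0.0, 1.0, 3.0]

# Equivalent dict form: the keys act as the time points.
clusterings_dict = dict(zip(timepoints, clusterings_list))


def default_slice_widths(timepoints):
    """Width of slice i is t[i+1] - t[i]; the last slice reuses the previous width."""
    widths = [t_next - t for t, t_next in zip(timepoints, timepoints[1:])]
    if widths:
        widths.append(widths[-1])
    return widths


def collect_individuals(clusterings):
    """Union of all data sources over all clusters and snapshots (sorted)."""
    members = set()
    for clustering in clusterings:
        for cluster in clustering:
            members |= cluster
    return sorted(members)


widths = default_slice_widths(timepoints)
individuals = collect_individuals(clusterings_list)
```

With these inputs, either clusterings_list together with timepoints, or clusterings_dict on its own, would describe the same sequence.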

clusterings¶

Holds for each time point the configuration of the respective clustering. The clustering is given by a list of member sets, with each set containing the data sources in a cluster.

Type: list(list(set)) or LazyList

dcs¶

Ensemble of all dynamic clusters.

Todo

What’s the type of an element? Is it just an identifier?

Type: list or LazyList

length¶

Number of slices present in the dataset.

Type: int

cluster_trace¶

Ensemble of all tracing paths of the dynamic clusters.

Todo

What’s the type of an element?

Type: list or LazyList

group_matchup¶

Holds for each time point the tracing and mapping sets of all clusters. Each element is a dict with the keys 'forward' and 'backward'. Both hold a dict indicating for a cluster the best matching cluster along with the similarity score of the particular relation in a tuple.

Example

self.group_matchup[1] = {
    'backward': {0: (0, 1.0), ...},
#                ^   ^  ^
#                |   |  similarity score
#                |   cluster from previous time point
#                cluster from current time point.
    }

Type: list
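
The structure above can be mimicked with a small stand-alone sketch. It assumes that the similarity of a cluster to a candidate is the fraction of the cluster's members that also appear in the candidate; the exact definition used by the 'fraction' method is given in [LB19] and may differ. The function names fraction_similarity and best_matches are hypothetical.

```python
def fraction_similarity(cluster, other):
    """Assumed similarity: share of `cluster`'s members also found in `other`."""
    if not cluster:
        return 0.0
    return len(cluster & other) / len(cluster)


def best_matches(clusters, candidates):
    """Map each cluster index to a tuple (best candidate index, similarity score)."""
    matches = {}
    for i, cluster in enumerate(clusters):
        scores = [fraction_similarity(cluster, other) for other in candidates]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        matches[i] = (best, scores[best])
    return matches


clusterings = [
    [{"a", "b", "c"}, {"d", "e"}],   # time point 0
    [{"a", "b"}, {"c", "d", "e"}],   # time point 1
]

# One dict per time point with the keys 'forward' and 'backward'.
group_matchup = [
    {"forward": best_matches(clusterings[0], clusterings[1]), "backward": {}},
    {"forward": {}, "backward": best_matches(clusterings[1], clusterings[0])},
]
```

Here group_matchup[1]['backward'][0] pairs cluster 0 of time point 1 with its best match at time point 0, in the same (cluster, score) layout as the example above.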

group_mappings¶

Holds for each slice a list of mapping sets. The list is ordered like clusterings.

Example

mt = MajorTrack(...)
idx, cluster_id = 0, 1
# get the set of data sources in this cluster
c_members = mt.clusterings[idx][cluster_id]
# get the corresponding mapping set (from index idx + 1)
mapping_set = mt.group_mappings[idx][cluster_id]

Type: list(list)

group_tracings¶

Holds for each slice a list of tracing sets. The list is ordered like clusterings.

Type: list(list)

group_mappers¶

Holds for each slice a list of mapper sets. The list is ordered like clusterings.

Type: list(list)

group_tracers¶

Holds for each slice a list of tracer sets. The list is ordered like clusterings.

Type: list(list)

comm_nbr¶

Number of dynamic clusters present in dataset.

Type: int

comm_all¶

List of all dynamic clusters present in the dataset.

Type: list(dc index what type?)

comm_group_members¶

Todo

Unsure about this.

Type: ?

comm_members¶

Holds for each slice of the dataset a dictionary mapping each dynamic cluster (key) to a list of its data sources (value).

Todo

Rename to dc_members

Type: list(dict)

individuals¶

Holds all data sources.

Type: list

individual_group_membership¶

Holds for each slice of the dataset a dictionary indicating, for each data source, the cluster it belongs to.

Type: list(dict)
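
A minimal sketch of how such a per-slice membership dictionary could be derived from a list-of-sets clustering is shown below. It assumes that a cluster is identified by its position within the clustering; the function name membership_per_slice is hypothetical.

```python
def membership_per_slice(clusterings):
    """For each slice, map every data source to the index of its cluster."""
    result = []
    for clustering in clusterings:
        membership = {}
        for cluster_id, members in enumerate(clustering):
            for source in members:
                membership[source] = cluster_id
        result.append(membership)
    return result


clusterings = [
    [{"a", "b", "c"}, {"d", "e"}],
    [{"a", "b"}, {"c", "d", "e"}],
]
individual_group_membership = membership_per_slice(clusterings)
```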

individual_membership¶

Holds for each slice of the dataset a dictionary indicating, for each data source, the dynamic cluster it belongs to.

Type: list(dict)

community_births¶

A list holding all dynamic cluster birth events.

Todo

Check and report format of this attribute.

Type: list(tuple)

community_deaths¶

A list holding all dynamic cluster death events.

Todo

Check and report format of this attribute.

Type: list(tuple)

community_lifespans¶

Provides for each dynamic cluster the lifespan in units of slices: {comm_id: nbr_slices}

Type: dict

community_splits¶

Holds all split events of dynamic clusters.

Type: list(list)

community_cby_splits¶

Dynamic clusters that were created through a split event.

Type: list(list)

community_cby_split_merges¶

Dynamic clusters that were created through a split-merge event.

Type: list(list)

community_dby_splits¶

Dynamic clusters that vanished after a split event.

Type: list(list)

community_dby_split_merges¶

Dynamic clusters that vanished after a split-merge event.

Type: list(list)

community_merges¶

Holds all merge events of dynamic clusters.

Type: list(list)

community_cby_merges¶

Dynamic clusters that were created through a merge event.

Type: list(list)

community_dby_merges¶

Dynamic clusters that vanished after a merge event.

Type: list(list)

community_growths¶

Reports all growth events, i.e. increases in the size of a dynamic cluster that are not related to split or merge events.

Type: list(list)

community_shrinkages¶

Reports all shrinkage events, i.e. decreases in the size of a dynamic cluster that are not related to split or merge events.

Type: list(list)

community_autocorrs¶

Holds for each dynamic cluster a dictionary with the auto-correlation (value) between the slice at a given index (key) and the previous slice. The auto-correlation is given by:

\[\frac{|dc_{i} \cap dc_{j}|}{|dc_{i} \cup dc_{j}|_{res}}\]

where \(i, j\) are the indices from_idx and to_idx and \(|<selection>|_{res}\) is the number of data sources within <selection>, counting either all data sources (residents=False) or only those present in both slices (residents=True).

Type: dict(dc index, list)
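
The formula can be sketched for a single dynamic cluster and a single pair of slices as follows. This is an illustration under stated assumptions rather than the library's implementation: the function name autocorrelation and its arguments are hypothetical, and returning 0.0 for an empty denominator is an arbitrary choice made here.

```python
def autocorrelation(dc_prev, dc_next, pop_prev, pop_next, residents=True):
    """Overlap of a dynamic cluster's members in two slices, relative to their union.

    dc_prev, dc_next: member sets of the same dynamic cluster in the two slices.
    pop_prev, pop_next: sets of all data sources present in each slice.
    """
    union = dc_prev | dc_next
    if residents:
        # Count only data sources that are present in both slices.
        union &= pop_prev & pop_next
    if not union:
        return 0.0  # assumption: no countable members means no correlation
    return len(dc_prev & dc_next) / len(union)


pop_prev = {"a", "b", "c", "d", "e"}
pop_next = {"a", "b", "c", "d", "f"}
dc_prev = {"a", "b", "e"}   # 'e' leaves the dataset afterwards
dc_next = {"a", "b", "f"}   # 'f' is new to the dataset

all_sources = autocorrelation(dc_prev, dc_next, pop_prev, pop_next, residents=False)
only_residents = autocorrelation(dc_prev, dc_next, pop_prev, pop_next, residents=True)
```

In this made-up case the value rises from 0.5 to 1.0 once arrivals and departures are excluded, which is the effect the residents flag is meant to capture.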

combined_population(idx_prev=None, idx_next=None, *args, **kwargs)[source]¶

Returns combination of the populations of two (or more) time points.

This is simply the union of the populations at the given time points. If no arguments are provided, an iterator is returned that yields the set of combined individuals for each pair of neighbouring slices.

If only one index is provided then the other one will be completed, i.e. idx_prev = idx_next - 1 or idx_next = idx_prev + 1

If further arguments are provided (all have to be unnamed), then the union is taken between all of these time points.

Example

self.combined_population(2, 4, 5)

This will return the combined population of the time points 2, 4 and 5.

Parameters

idx_prev (int (default=None)) – index of the 1st time point to get the population from.
idx_next (int (default=None)) –
index of the 2nd time point to get the population from.

Note

If both idx_prev and idx_next are None then a pairwise iterator is returned that allows to loop over the combined population of neighbouring time points.
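
The behaviour described above (union of populations, completion of a missing index, pairwise iterator when no index is given) can be sketched as a stand-alone function. This is a simplified stand-in operating on a plain list of clusterings, not the method itself; the helper population is hypothetical.

```python
def population(clustering):
    """All data sources present in one snapshot."""
    return set().union(*clustering)


def combined_population(clusterings, idx_prev=None, idx_next=None, *args):
    """Union of the populations at the given time point indices."""
    if idx_prev is None and idx_next is None:
        # No indices: lazily yield the union for each pair of neighbouring slices.
        pairs = zip(clusterings, clusterings[1:])
        return (population(a) | population(b) for a, b in pairs)
    if idx_prev is None:
        idx_prev = idx_next - 1
    elif idx_next is None:
        idx_next = idx_prev + 1
    combined = set()
    for idx in (idx_prev, idx_next, *args):
        combined |= population(clusterings[idx])
    return combined


clusterings = [
    [{"a", "b"}, {"c"}],
    [{"a", "b", "c"}],
    [{"a", "d"}],
    [{"e"}],
]

both = combined_population(clusterings, 0, 1)
completed = combined_population(clusterings, idx_next=2)   # idx_prev becomes 1
several = combined_population(clusterings, 0, 2, 3)
pairwise = list(combined_population(clusterings))
```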

get_alluvialdiagram(axes, iterator=None, cluster_width=datetime.timedelta(1), *args, **kwargs)[source]¶

Takes a matplotlib axes and draws an alluvial diagram on it. iterator is the iterator over the indices of the time series to draw the clusters for. If iterator is not provided, the alluvial diagram will contain all the clusterings in the time series.

Parameters

axes (matplotlib.axes.Axes) – Axes to draw an Alluvial diagram on.
iterator (iter (default=None)) – An iterator for the indices of the time series to include in the alluvial diagram. If not provided then the entire time series is used.
cluster_width (float) – width of the clusters. This should be provided in the same units as timepoints.
optional parameter (**kwargs) – Will be forwarded to the pyalluv.AlluvialPlot call.
optional parameter –
cluster_location: str (default=’center’)
one of ‘center’, ‘start’ or ‘end’: the location within the aggregation time window where the cluster should be put.

cluster_label: str (default=None)
determine how to label cluster. Possible options are:
- ’groupsize’
- ’group_index’
merged_edgecolor: str (default=None)
edgecolor of merged clusters.

merged_facecolor: str (default=None)
facecolor of merged clusters.

merged_linewidth: float (default=None)
linewidth of merged clusters.

cluster_facecolor: str, dict(dict)
facecolor of clusters. Either provide a single color or a dict with indices of the time series as keys, holding a dict with cluster_id as key and colours as values.

cluster_edgecolor: str, dict(dict)
edgecolor of clusters. Either provide a single color or a dict with indices of the time series as keys, holding a dict with cluster_id as key and colours as values.

flux_facecolor: str, dict
either provide a single color, a keyword or a dict.

Valid keywords are: 'cluster'.

If a dictionary is provided, then the indices idx of the time series must be the keys, each holding a dict with a tuple as key and a color as value. The tuple’s first element must be a group id from time step idx and the second a group id from time step idx+1.

new_coloring: bool (default=False)
if a new color sequence should be generated or not.

distinct_colors: colorseq.DistinctColors (default=None)
the sequence of distinct colours to use.

target_clusters: list (default=None)
list of dynamic cluster id’s to display in the alluvial diagram. If provided, only the dynamic clusters specified in this list will be displayed.
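
The nested dictionaries accepted by cluster_facecolor and flux_facecolor can be prepared ahead of the call. The sketch below reflects one reading of the descriptions above, namely {idx: {cluster_id: colour}} and {idx: {(cluster at idx, cluster at idx+1): colour}}; the palette and the rule of colouring a flux like its source cluster are arbitrary choices for illustration.

```python
clusterings = [
    [{"a", "b", "c"}, {"d", "e"}],
    [{"a", "b"}, {"c", "d", "e"}],
]
palette = ["tab:blue", "tab:orange", "tab:green"]  # matplotlib colour names

# {time series index: {cluster_id: colour}}
cluster_facecolor = {
    idx: {cid: palette[cid % len(palette)] for cid in range(len(clustering))}
    for idx, clustering in enumerate(clusterings)
}

# {time series index: {(cluster at idx, cluster at idx + 1): colour}}
flux_facecolor = {}
for idx, (current, following) in enumerate(zip(clusterings, clusterings[1:])):
    flux_facecolor[idx] = {
        (i, k): cluster_facecolor[idx][i]
        for i, source in enumerate(current)
        for k, target in enumerate(following)
        if source & target  # only colour fluxes that actually carry members
    }
```

Both dictionaries would then be passed as keyword arguments of the same names.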

get_auto_corrs(residents=True)[source]¶

Get the auto-correlation between any two consecutive slices.

This method computes for all dynamic clusters the auto-correlation between any two consecutive slices, if the dynamic community exists in both. If residents==True, then only the individuals present in both time points are considered.

Parameters

residents (bool (default=True)) – determines if only resident data sources (i.e. that are present in both slices) should be considered, or also data sources present in only one of the two slices.

Returns

None – Adds new attributes:

community_autocorrs

Return type

None

get_community_avg_lifespan(mode='ensemble')[source]¶

Determines the average lifespan of all dynamic clusters.

Parameters: mode (str (default='ensemble')) – Determines what type of average should be computed. Possible values are 'ensemble' (default) and 'weighted_per_indiv_per_slice'. The ensemble average is simply the arithmetic mean of all lifespans. The weighted_per_indiv_per_slice average yields the expected lifespan of the dynamic cluster to which a randomly picked data source belongs at a randomly picked slice.
Returns: avg_dc_lifespan – the average number of slices a dynamic cluster exists.
Return type: float
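
The difference between the two averaging modes can be made concrete with a small sketch. It works on a structure shaped like comm_members (one dict per slice, mapping a dynamic cluster to its list of data sources) and assumes that a lifespan is the number of slices in which a dynamic cluster has at least one member. The function names are hypothetical and this is not the library's implementation.

```python
def lifespans(comm_members):
    """Number of slices in which each dynamic cluster has at least one member."""
    spans = {}
    for members_by_dc in comm_members:
        for dc, members in members_by_dc.items():
            if members:
                spans[dc] = spans.get(dc, 0) + 1
    return spans


def average_lifespan(comm_members, mode="ensemble"):
    spans = lifespans(comm_members)
    if not spans:
        return 0.0
    if mode == "ensemble":
        # Plain arithmetic mean over all dynamic clusters.
        return sum(spans.values()) / len(spans)
    if mode == "weighted_per_indiv_per_slice":
        # Every (data source, slice) pair contributes the lifespan of its cluster.
        total = count = 0
        for members_by_dc in comm_members:
            for dc, members in members_by_dc.items():
                total += spans.get(dc, 0) * len(members)
                count += len(members)
        return total / count
    raise ValueError(f"unknown mode: {mode!r}")


comm_members = [
    {0: ["a", "b", "c"], 1: ["d"]},
    {0: ["a", "b"], 2: ["c", "d"]},
    {0: ["a"], 2: ["b", "c", "d"]},
]
ensemble_avg = average_lifespan(comm_members)
weighted_avg = average_lifespan(comm_members, mode="weighted_per_indiv_per_slice")
```

The weighted average is larger here because the long-lived cluster 0 holds most data sources in the early slices.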

get_community_births()[source]¶

Determines all birth events.

Returns

None – Adds new attributes:

community_births

Return type

None

get_community_coloring(n=None, iterator=None, **kwargs)[source]¶

get_community_deaths()[source]¶

Determines all dynamic community death events.

Returns

None – Adds new attributes:

community_deaths

Return type

None

get_community_events()[source]¶

Compute all dynamic community life-cycle related events.

get_community_group_membership()[source]¶

Defines per timepoint a list of clusters belonging to a dynamic cluster

Returns

None – Adds new attributes:

comm_group_members
comm_all
comm_nbr

Return type

None

get_community_growths()[source]¶

Returns

None – Adds new attributes:

community_growths

Return type

None

get_community_lifespans()[source]¶

Determines the lifespans of all dynamic clusters.

Returns

None – Adds new attributes:

community_lifespans

Return type

None

get_community_membership()[source]¶

Defines for each time point a membership list of data sources for each existing dynamic cluster.

Returns

None – Adds new attributes:

comm_members

Return type

None

get_community_merges()[source]¶

Get merge events and determine the DC’s born through merge events.

A merge event occurs whenever members of two distinct DCs at some time point are found together in the same DC one time point later.

Returns

None – Adds new attributes:

community_merges
community_cby_merges
community_dby_merges

Return type

None
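
The merge criterion stated above can be illustrated with a naive detector that works on a structure shaped like individual_membership (one dict per slice, mapping a data source to its dynamic cluster). The event format (idx, target DC, source DCs) and the function name find_merges are hypothetical; the format of community_merges itself is not specified here. Note that this literal reading also flags a single member moving between DCs.

```python
def find_merges(individual_membership):
    """Return (idx, target_dc, source_dcs) for each DC at idx fed by >= 2 DCs of idx - 1."""
    events = []
    for idx in range(1, len(individual_membership)):
        before = individual_membership[idx - 1]
        after = individual_membership[idx]
        sources = {}
        for indiv, dc_after in after.items():
            if indiv in before:
                sources.setdefault(dc_after, set()).add(before[indiv])
        for dc_after, origins in sorted(sources.items()):
            if len(origins) > 1:
                events.append((idx, dc_after, sorted(origins)))
    return events


individual_membership = [
    {"a": 0, "b": 0, "c": 1, "d": 1},
    {"a": 0, "b": 0, "c": 0, "d": 1},   # 'c' joins members of DC 0
    {"a": 0, "b": 0, "c": 0, "d": 0},   # 'd' joins as well
]
merges = find_merges(individual_membership)
```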

get_community_shrinkages()[source]¶

Returns

None – Adds new attributes:

community_shrinkages

Return type

None

get_community_splits()[source]¶

Get all split events and determine what clusters arise through a pure split event, i.e. not a split-merge combination.

Returns

None – Adds new attributes:

community_splits
community_cby_splits
community_cby_split_merges
community_dby_splits
community_dby_split_merges

Return type

None

get_dcs(bijective_paths=True, **kwargs)[source]¶

Derives the history of dynamic clusters from group_matchup.

Todo

Rename to get_dc

Parameters

bijective_paths (bool (default=True)) – If set to True then at each step in the construction of the tracing flow a mapping flow needs to map forward to the target cluster in order to continue to extend the tracing flow.
optional parameter (**kwargs) –

from_idx: int
starting index.

Note

At the starting index all clusters are per definition new dynamic clusters.

to_idx: int
Stopping index. The community detection algorithm will stop at this index (including it).

get_flow(idx, source_set, bwd=True, max_dist=None, **kwargs)[source]¶

Parameters

idx (int) – time series index defining the starting point
source_set (set) – set of clusters at the starting point slice.
bwd (bool (default=True)) – indicating the direction, True is backward, False forward.
max_dist (int (default=None)) – set the maximal length of the flow.
optional parameter (**kwargs) –

majority: bool (default=True)
specifies whether only the majority should be used to move between time points.

validate_path: function (default=MajorTrack._from_flow)
Provide a validation method to use during the construction of a flow.

Returns

flow – identity flow starting (including) from the source set.

Return type

list

get_group_matchup(matchup_method=None)[source]¶

Determine majority relation between neighbouring snapshots.

Parameters: matchup_method (str (default=None)) – If provided this overwrites group_matchup_method. It determines the method to use when calculating similarities between clusters from neighbouring snapshots.
Returns: self – with new attribute group_matchup.
Return type: MajorTrack

get_individual_group_membership()[source]¶

Defines for each time point a dict holding for each data source its cluster membership.

Returns

None – Adds new attributes:

individual_group_membership

Return type

None

get_individual_membership()[source]¶

Defines for each time point a dict holding for each data source its dynamic cluster membership.

Returns

None – Adds new attributes:

individual_membership

Return type

None

get_marginal_flows(idx, included_flows)[source]¶

Determines the ensemble of marginal clusters given a target cluster and its identity flow.

Parameters

idx (int) – index of the slice in which target cluster is situated.
included_flows (list) – Identity flow of the target cluster.

get_span(idx, span_set, get_indivs=True)[source]¶

Create the tracer tree.

Parameters

idx (int) – index of the slice in which to start.
span_set (int, str) –
If an int is provided it specifies the index of the target cluster. If a str is given, it is considered as the label of a data source and the containing cluster is selected.

Todo

The label of a cluster should be the only option.
get_indivs (bool (default=True)) – If set to True a list of sets of individuals is returned for each slice starting from the index. If it is set to False a list of cluster labels is returned for each slice.

resident_fraction(idx_prev=None, idx_next=None, *args)[source]¶

Get the resident fraction of the combined population of two (or more) slices.

This indicates the population fraction that is present at all time points (or slices).

This is simply the size of the intersection divided by the size of the union of the populations. If further arguments are provided (all have to be unnamed), then the resident fraction is computed between all of these time points.

Parameters

idx_prev (int (default=None)) –
index of a slice.

Note

If no index is provided then an iterator is returned that yields the resident fraction of the data sources present in neighbouring time points.
idx_next (int (default=None)) – index of a slice.

Returns

resident_fraction – indicating the fraction of the population of data sources (union of all) that is present in all slices. If no values for the parameters idx_prev and idx_next are provided this method returns an iterator that will yield the fraction of the resident population between any two consecutive slices.

Return type

float, iterator
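
The ratio of intersection to union described above can be sketched on a plain list of clusterings. The functions below are simplified stand-ins rather than the method itself: indices are passed positionally, the iterator variant is a separate generator named resident_fractions, and an empty population is mapped to 0.0 by assumption.

```python
def population(clustering):
    """All data sources present in one snapshot."""
    return set().union(*clustering)


def resident_fraction(clusterings, *indices):
    """Size of the intersection over size of the union of the given populations."""
    pops = [population(clusterings[idx]) for idx in indices]
    union = set().union(*pops)
    if not union:
        return 0.0  # assumption: an empty population has no residents
    residents = set.intersection(*pops)
    return len(residents) / len(union)


def resident_fractions(clusterings):
    """Yield the resident fraction for each pair of neighbouring slices."""
    for idx in range(len(clusterings) - 1):
        yield resident_fraction(clusterings, idx, idx + 1)


clusterings = [
    [{"a", "b"}, {"c"}],
    [{"a", "b", "c", "d"}],
    [{"a", "d"}],
]
first_pair = resident_fraction(clusterings, 0, 1)
all_three = resident_fraction(clusterings, 0, 1, 2)
neighbours = list(resident_fractions(clusterings))
```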

resident_population(idx_prev=None, idx_next=None, *args, **kwargs)[source]¶

Return the resident population between two time points.

The resident population is simply the intersection of the populations at both time points.

If no arguments are provided, an iterator is returned that yields the set of resident individuals for each pair of neighbouring slices.

If only one index is provided then the other one will be completed, i.e. idx_prev = idx_next - 1 or idx_next = idx_prev + 1

If further arguments are provided (all have to be unnamed), then the intersection is taken between all of these time points.

Example:

self.resident_population(2,4,5) will return the resident
population between the time points 2, 4 and 5

Parameters

idx_prev (int (default=None)) – index of one of the two data slices.
idx_next (int (default=None)) – index of one of the two data slices.

Returns

resident_population – contains all data sources that are in both data slices.

Return type

set