majortrack.tracker

This Python package implements the evolutionary clustering algorithm presented in [LB19].

[LB19]

Jonas I Liechti and Sebastian Bonhoeffer. A time resolved clustering method revealing longterm structures and their short-term internal dynamics. arXiv preprint arXiv:1912.04261, 2019.

class MajorTrack(clusterings, history, **kwargs)[source]

Bases: object

Parameters
  • clusterings (list, dict) –

    Sequence of clusterings.

    If provided as a `dict`:

    keys: float, datetime

    The time points.

    values: list, dict

    The membership list of each clustering indicating to which cluster a data source belongs. See memberships for details.

  • history (int) – sets the number of time points (or slices) the algorithm can maximally go back in time to check for majority matches.

  • optional parameter (**kwargs) –

    timepoints: list

    The time points of each clustering.

    Note

    If clusterings is of type dict then the keys will be used as time points and this optional parameter is ignored, even if provided.

    slice_widths: list, float (default=None)

    The temporal duration of each snapshot in the sequence of clusterings. If not provided, the difference between time points i and i+1 is used as the width of slice i. The width of the last slice is assumed to equal that of the second-to-last slice.

    individuals: list

    A list of all distinct data sources present in the dataset.

    Todo

    Build it from self.clusterings.

    group_matchup_method: str (default=’fraction’)

    Set the method to calculate the similarity between two clusters from different clusterings. By default the fraction of identical members is used as explained in the original article [LB19].

    use_lazylists: bool (default=False)

    Determines whether LazyLists or normal lists should be used to store data about dynamic clusters. Most likely you want to use normal lists.
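
The default rule for slice_widths described above can be sketched as follows. This is only an illustration of the rule as documented, not the package's own code; the helper name default_slice_widths is hypothetical, and it assumes at least two time points that support subtraction (floats or datetimes).

```python
def default_slice_widths(timepoints):
    """Hypothetical helper: width of slice i is timepoints[i + 1] - timepoints[i];
    the last slice reuses the width of the second-to-last slice."""
    if len(timepoints) < 2:
        raise ValueError("need at least two time points to infer widths")
    widths = [t_next - t for t, t_next in zip(timepoints, timepoints[1:])]
    widths.append(widths[-1])  # last slice: same width as the one before it
    return widths


timepoints = [0.0, 1.0, 3.0, 4.5]
print(default_slice_widths(timepoints))  # [1.0, 2.0, 1.5, 1.5]
```

Passing slice_widths explicitly is presumably only needed when the snapshots do not tile the time axis in this way.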

clusters

Ensemble of all dynamic clusters.

Todo

What’s the type of an element? Is it just an identifier?

Type

list or LazyList

length

Number of slices present in the dataset.

Type

int

cluster_trace

Ensemble of all tracing paths of the dynamic clusters.

Todo

What’s the type of an element?

Type

list or LazyList

group_matchup

Holds for each time point the tracing and mapping sets of all clusters. Each element is a dict with the keys 'forward' and 'backward'. Both hold a dict indicating for a cluster the best matching cluster along with the similarity score of the particular relation in a tuple.

Example

self.group_matchup[1] = {
    'backward': {0: (0, 1.0), ...},
#                ^   ^  ^
#                |   |  similarity score
#                |   cluster from previous time point
#                cluster from current time point.
    }
Type

list
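
To make the structure of the 'backward' entry concrete, the following sketch builds such a dict for two consecutive slices. It is an assumption-laden illustration, not the package's implementation: the function name backward_matchup is hypothetical, clusters are given as plain sets, and the similarity is taken to be the fraction of a current cluster's members found in a previous cluster (one possible reading of the 'fraction' method).

```python
def backward_matchup(prev_clusters, curr_clusters):
    """Hypothetical sketch: map each current cluster index to
    (best matching previous cluster index, similarity score)."""
    matchup = {}
    for c_id, members in enumerate(curr_clusters):
        if not members:
            continue  # an empty cluster has no meaningful match
        best = None
        for p_id, prev_members in enumerate(prev_clusters):
            # assumed similarity: share of the current cluster found in p_id
            score = len(members & prev_members) / len(members)
            if best is None or score > best[1]:
                best = (p_id, score)
        if best is not None and best[1] > 0:
            matchup[c_id] = best
    return matchup


prev_clusters = [{"a", "b", "c"}, {"d", "e"}]
curr_clusters = [{"a", "b"}, {"c", "d", "e"}]
entry = {"backward": backward_matchup(prev_clusters, curr_clusters)}
print(entry["backward"][0])  # (0, 1.0)
```

Ties are resolved here in favour of the first candidate, which is an arbitrary choice of this sketch.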

group_mappings

Holds for each slice a list of mapping sets. The list is ordered like groupings.

Example

mt = MajorTrack(...)
idx, cluster_id = 0, 1
# get the set of data sources in this cluster
c_members = mt.clusterings[idx][cluster_id]
# get the corresponding mapping set (from index idx + 1)
mapping_set = mt.group_mappings[idx][cluster_id]
Type

list(list)

group_tracings

Holds for each slice a list of tracing sets. The list is ordered like groupings.

Type

list(list)

group_mappers

Holds for each slice a list of mapper sets. The list is ordered like groupings.

Type

list(list)

group_tracers

Holds for each slice a list of tracer sets. The list is ordered like groupings.

Type

list(list)

comm_nbr

Number of dynamic clusters present in dataset.

Type

int

comm_all

List of all dynamic clusters present in the dataset.

Type

list(dc index what type?)

comm_group_members

Todo

Unsure about this.

Type

?

comm_members

A list holding for each slice of the dataset a dictionary indicating for each cluster (key) a list of data sources (value).

Todo

Rename to dc_members

Type

list(dict)

individuals

holds all data sources.

Type

list

individual_group_membership

A list holding for each slice of the dataset a dictionary indicating for a data source the cluster it belongs to.

Type

list(dict)

individual_membership

A list holding for each slice of the dataset a dictionary indicating for a data source the dynamic cluster it belongs to.

Type

list(dict)

community_births

A list holding all dynamic cluster birth events.

Todo

Check and report format of this attribute.

Type

list(tuple)

community_deaths

A list holding all dynamic cluster death events.

Todo

Check and report format of this attribute.

Type

list(tuple)

community_lifespans

A dict providing for each dynamic cluster the lifespan in units of slices: {comm_id: nbr_slices}

Type

dict

community_splits

holds all split events of dynamic clusters.

Type

list(list)

community_cby_splits

dynamic clusters that occurred through a split.

Type

list(list)

community_cby_split_merges

dynamic clusters that occurred through a split-merge event.

Type

list(list)

community_dby_splits

dynamic clusters that vanished after a split event.

Type

list(list)

community_dby_split_merges

dynamic clusters that vanished after a split-merge event.

Type

list(list)

community_merges

holds all merge events of dynamic clusters.

Type

list(list)

community_cby_merges

dynamic clusters that occurred through a merge event.

Type

list(list)

community_dby_merges

dynamic clusters that vanished after a merge event.

Type

list(list)

community_growths

reports all growth events, i.e. increases in the size of a dynamic cluster that are not related to split or merge events.

Type

list(list)

community_shrinkages

reports all shrinkage events, i.e. decreases in the size of a dynamic cluster that are not related to split or merge events.

Type

list(list)

community_autocorrs

Holds for each dynamic cluster a dictionary with the auto-correlation (value) between the slice at a given index (key) and the previous slice. The auto-correlation is given by:

\[\frac{|dc_{i} \cap dc_{j}|}{|dc_{i} \cup dc_{j}|_{res}}\]

where \(i, j\) are the indices from_idx and to_idx and \(|<selection>|_{res}\) is the number of data sources within <selection>, counting either all data sources (residents=False) or only those present in both slices (residents=True).

Type

dict(dc index, list)
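
The formula above can be written out for a single pair of slices as follows. This is a hedged sketch rather than the package's code: the function name autocorrelation and its arguments (the member sets dc_i, dc_j and the full slice populations pop_i, pop_j) are assumptions made for illustration.

```python
def autocorrelation(dc_i, dc_j, pop_i, pop_j, residents=True):
    """Hypothetical sketch of |dc_i & dc_j| / |dc_i | dc_j|_res.

    dc_i, dc_j: members of one dynamic cluster in slices i and j (sets).
    pop_i, pop_j: all data sources present in slices i and j (sets).
    """
    intersection = dc_i & dc_j
    union = dc_i | dc_j
    if residents:
        # count only data sources that are present in both slices
        union &= pop_i & pop_j
    if not union:
        return None  # undefined for an empty denominator (choice of this sketch)
    return len(intersection) / len(union)


pop_i, pop_j = {"a", "b", "c", "d"}, {"a", "b", "c", "e"}
dc_i, dc_j = {"a", "b", "d"}, {"a", "b", "e"}
print(autocorrelation(dc_i, dc_j, pop_i, pop_j, residents=True))   # 1.0
print(autocorrelation(dc_i, dc_j, pop_i, pop_j, residents=False))  # 0.5
```

In the example, "d" leaves and "e" enters the dataset between the two slices, so they lower the score only when non-residents are counted.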

combined_population(idx_prev=None, idx_next=None, *args, **kwargs)[source]

Returns combination of the populations of two (or more) time points.

This is simply the union of the populations at both time points. If no arguments are provided then an iterator is returned that yields the set of combined individuals for each pair of neighbouring slices.

If only one index is provided then the other one will be completed, i.e. idx_prev = idx_next - 1 or idx_next = idx_prev + 1

If further arguments are provided (all have to be unnamed), then the union is taken between all of these time points.

Example

self.combined_population(2, 4, 5)

This will return the combined population of the time points 2, 4 and 5.

Parameters
  • idx_prev (int (default=None)) – index of the 1st time point to get the population from.

  • idx_next (int (default=None)) –

    index of the 2nd time point to get the population from.

    Note

    If both idx_prev and idx_next are None then a pairwise iterator is returned that allows looping over the combined population of neighbouring time points.
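
The argument handling described for combined_population (index completion, extra positional indices, and the pairwise iterator) can be sketched in a standalone form. The function below is an illustration under assumptions, not the method itself: it takes the per-slice populations as an explicit list of sets, which is a hypothetical stand-in for the state held by a MajorTrack instance.

```python
def combined_population(populations, idx_prev=None, idx_next=None, *more):
    """Hypothetical sketch: union of the populations of the selected slices."""
    if idx_prev is None and idx_next is None:
        # no indices: iterate over unions of neighbouring slices
        return (a | b for a, b in zip(populations, populations[1:]))
    if idx_prev is None:
        idx_prev = idx_next - 1  # complete the missing index
    elif idx_next is None:
        idx_next = idx_prev + 1
    indices = (idx_prev, idx_next, *more)
    return set().union(*(populations[i] for i in indices))


populations = [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"d"}]
print(sorted(combined_population(populations, 0)))        # ['a', 'b', 'c']
print(sorted(combined_population(populations, 0, 2, 3)))  # ['a', 'b', 'c', 'd']
print([sorted(s) for s in combined_population(populations)])
```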

get_alluvialdiagram(axes, iterator=None, cluster_width=datetime.timedelta(1), *args, **kwargs)[source]

Takes a matplotlib axes and draws an alluvial diagram on it. iterator is the iterator over the slices to draw the clusters for. If iterator is not provided, then the alluvial diagram will contain all the clusterings in the time series.

Parameters

kwargs

  • cluster_location: ‘center’ (default), ‘start’ or ‘end’; location within the aggregation time window where the cluster should be put

  • cluster_width: width of the clusters

  • cluster_label: None (default), ‘groupsize’, ‘group_index’

  • merged_edgecolor: edgecolor of merged groups

  • merged_facecolor: facecolor of merged groups

  • merged_linewidth: linewidth of merged groups

  • cluster_facecolor: either a single color or a dict with idx as keys, each holding a dict with group_id as key

  • cluster_edgecolor: either a single color or a dict with idx as keys, each holding a dict with group_id as key

  • flux_facecolor: either a single color or a dict with idx as keys, each holding a dict with a cluster tuple as key, where the first element is a group id from time step idx and the second a group id from time step idx+1

  • new_coloring: False

  • distinct_colors: can be an instance of DistinctColors that will be used for the coloring
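
The nested colour dictionaries accepted by cluster_facecolor and flux_facecolor may be easier to read as data. The snippet below only builds such dictionaries following the description above; the concrete slice indices, group ids and colours are made up, and the commented-out call assumes an existing MajorTrack instance mt and a matplotlib axes ax.

```python
# slice index -> {group id -> colour}
cluster_facecolor = {
    0: {0: "tab:blue", 1: "tab:orange"},
    1: {0: "tab:blue", 1: "tab:orange"},
}

# slice index idx -> {(group id at idx, group id at idx + 1) -> colour}
flux_facecolor = {
    0: {(0, 0): "lightblue", (1, 1): "moccasin", (1, 0): "lightgrey"},
}

plot_kwargs = dict(
    cluster_location="center",
    cluster_label="groupsize",
    cluster_facecolor=cluster_facecolor,
    flux_facecolor=flux_facecolor,
)

# Assuming `mt` is a MajorTrack instance and `ax` a matplotlib axes:
# mt.get_alluvialdiagram(ax, **plot_kwargs)
```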

get_auto_corrs(residents=True)[source]

Get the auto-correlation between any two consecutive slices.

This method computes for all dynamic clusters the auto-correlation between any two consecutive slices, if the dynamic community exists in both. If residents==True, then only the individuals present in both time points are considered.

Parameters

residents (bool (default=True)) – determines if only resident data sources (i.e. that are present in both slices) should be considered, or also data sources present in only one of the two slices.

Returns

None – Adds new attributes:

  • community_autocorrs

Return type

None

get_community_avg_lifespan(mode='ensemble')[source]

Determines the average lifespan of the dynamic clusters.

Parameters

mode (str (default='ensemble')) – Determines what type of average should be computed. Possible values are ‘ensemble’ (default) and ‘weighted_per_indiv_per_slice’. The ensemble average is the arithmetic mean of all lifespans. The weighted_per_indiv_per_slice average is the expected lifespan of the dynamic cluster that a randomly picked data source belongs to at a randomly picked slice.

Returns

avg_dc_lifespan – the average number of slices a dynamic cluster exists.

Return type

float
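
The difference between the two modes can be illustrated with a small standalone sketch. It is not the package's implementation: the function average_lifespan is hypothetical, it takes data shaped like community_lifespans and individual_membership as explicit arguments, and it assumes the weighted mode samples uniformly over all (slice, data source) pairs.

```python
from statistics import mean


def average_lifespan(lifespans, individual_membership=None, mode="ensemble"):
    """Hypothetical sketch of the two averaging modes.

    lifespans: {dc id: number of slices}
    individual_membership: per slice, {data source: dc id}
    """
    if mode == "ensemble":
        return mean(lifespans.values())
    if mode == "weighted_per_indiv_per_slice":
        # one sample per (slice, data source) pair: assumption of this sketch
        samples = [
            lifespans[dc]
            for membership in individual_membership
            for dc in membership.values()
        ]
        return mean(samples)
    raise ValueError(f"unknown mode: {mode!r}")


lifespans = {0: 3, 1: 1}
individual_membership = [
    {"a": 0, "b": 0, "c": 1},
    {"a": 0, "b": 0},
    {"a": 0, "b": 0},
]
print(average_lifespan(lifespans))  # 2
print(average_lifespan(lifespans, individual_membership,
                       mode="weighted_per_indiv_per_slice"))
```

The weighted average is larger here because most (slice, data source) pairs fall into the long-lived dynamic cluster 0.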

get_community_births()[source]

Determines all birth events.

Returns

None – Adds new attributes:

  • community_births

Return type

None

get_community_coloring(n=None, iterator=None, *args, **kwargs)[source]

get_community_deaths()[source]

Determines all dynamic community death events.

Returns

None – Adds new attributes:

  • community_deaths

Return type

None

get_community_events()[source]

Compute all dynamic community life-cycle related events.
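
As a rough picture of the simplest life-cycle events, births and deaths can be read off a structure like comm_members. The sketch below is not the package's algorithm: the function births_and_deaths is hypothetical, the (slice index, dynamic cluster id) tuple format is an assumption (the attribute formats are not documented here), and a death is recorded at the last slice in which a dynamic cluster is seen, unless that is the final slice of the dataset.

```python
def births_and_deaths(comm_members):
    """Hypothetical sketch. comm_members: per slice, {dc id: [data sources]}."""
    first_seen, last_seen = {}, {}
    for idx, members in enumerate(comm_members):
        for dc in members:
            first_seen.setdefault(dc, idx)  # keep the earliest slice only
            last_seen[dc] = idx             # overwritten until the latest slice
    last_idx = len(comm_members) - 1
    births = sorted((idx, dc) for dc, idx in first_seen.items())
    deaths = sorted((idx, dc) for dc, idx in last_seen.items() if idx < last_idx)
    return births, deaths


comm_members = [
    {0: ["a", "b"], 1: ["c"]},
    {0: ["a", "b", "c"]},
    {0: ["a"], 2: ["b", "c"]},
]
births, deaths = births_and_deaths(comm_members)
print(births)  # [(0, 0), (0, 1), (2, 2)]
print(deaths)  # [(0, 1)]
```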

get_community_group_membership()[source]

Defines per time point a list of clusters belonging to a dynamic cluster.

Returns

None – Adds new attributes:

  • comm_group_members

  • comm_all

  • comm_nbr

Return type

None

get_community_growths()[source]

Determines all growth events.

Returns

None – Adds new attributes:

  • community_growths

Return type

None

get_community_lifespans()[source]

Determines the lifespans of all dynamic clusters.

Returns

None – Adds new attributes:

  • community_lifespans

Return type

None
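
The lifespan in units of slices amounts to counting the slices in which a dynamic cluster has members. A minimal sketch, assuming input shaped like comm_members (the function name is hypothetical and this is not the package's code):

```python
from collections import Counter


def count_lifespans(comm_members):
    """Hypothetical sketch: {dc id: number of slices in which it occurs}."""
    lifespans = Counter()
    for members in comm_members:
        lifespans.update(members.keys())  # one count per slice and dc id
    return dict(lifespans)


comm_members = [
    {0: ["a", "b"], 1: ["c"]},
    {0: ["a", "b", "c"]},
    {0: ["a"], 2: ["b", "c"]},
]
print(count_lifespans(comm_members))  # {0: 3, 1: 1, 2: 1}
```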

get_community_membership()[source]

Defines for each time point a membership list of data sources for each existing dynamic cluster.

Returns

None – Adds new attributes:

  • comm_members

Return type

None

get_community_merges()[source]

Get all merge events and determine what clusters arise through pure merge events.

Returns

None – Adds new attributes:

  • community_merges

  • community_cby_merges

  • community_dby_merges

Return type

None

get_community_shrinkages()[source]

Determines all shrinkage events.

Returns

None – Adds new attributes:

  • community_shrinkages

Return type

None

get_community_splits()[source]

Get all split events and determine what clusters arise through a pure split event, i.e. not a split-merge combination.

Returns

None – Adds new attributes:

  • community_splits

  • community_cby_splits

  • community_cby_split_merges

  • community_dby_splits

  • community_dby_split_merges

Return type

None

get_dcs(bijective_paths=True, **kwargs)[source]

Derives the history of dynamic clusters from group_matchup.

Todo

Rename to get_dc

Parameters
  • bijective_paths (bool (default=True)) – If set to True then at each step in the construction of the tracing flow a mapping flow needs to map forward to the target cluster in order to continue to extend the tracing flow.

  • optional parameter (**kwargs) –

    from_idx: int

    starting index.

    Note

    At the starting index all clusters are per definition new dynamic clusters.

    to_idx: int

    Stopping index. The community detection algorithm will stop at this index (including it).

get_flow(idx, source_set, bwd=True, max_dist=None, **kwargs)[source]

Parameters
  • idx (int) – time series index defining the starting point

  • source_set (set) – set of clusters at the starting point slice.

  • bwd (bool (default=True)) – indicating the direction, True is backward, False forward.

  • max_dist (int (default=None)) – set the maximal length of the flow.

  • optional parameter (**kwargs) –

    majority: bool (default=True)

    Specifies whether only the majority should be used to move between time points.

    validate_path: function (default=MajorTrack._from_flow)

    Provide a validation method to use during the construction of a flow.

Returns

flow – identity flow starting from (and including) the source set.

Return type

list
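
A strongly simplified picture of a backward flow is to follow the best matches stored in group_matchup from slice to slice. The sketch below illustrates only that idea: the function backward_flow is hypothetical, it assumes the 'backward' structure shown for group_matchup, and it ignores the majority and validate_path options as well as the forward direction.

```python
def backward_flow(group_matchup, idx, source_set, max_dist=None):
    """Hypothetical sketch: list of cluster sets, starting with source_set
    at slice idx and stepping back one slice per element."""
    flow = [set(source_set)]
    current = set(source_set)
    steps = 0
    while idx > 0 and (max_dist is None or steps < max_dist):
        backward = group_matchup[idx].get("backward", {})
        # replace every cluster by its best match in the previous slice
        current = {backward[c][0] for c in current if c in backward}
        if not current:
            break  # no cluster could be traced any further
        flow.append(current)
        idx -= 1
        steps += 1
    return flow


group_matchup = [
    {},
    {"backward": {0: (0, 1.0), 1: (0, 0.6)}},
    {"backward": {0: (1, 0.8)}},
]
print(backward_flow(group_matchup, 2, {0}))              # [{0}, {1}, {0}]
print(backward_flow(group_matchup, 2, {0}, max_dist=1))  # [{0}, {1}]
```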

get_group_matchup(matchup_method=None)[source]

Determine majority relation between neighbouring snapshots.

Parameters

matchup_method (str (default=None)) – If provided, this overrides group_matchup_method. It determines the method to use when calculating similarities between clusters from neighbouring snapshots.

Returns

self – with new attribute group_matchup.

Return type

MajorTrack

get_individual_group_membership()[source]

Defines for each time point a dict holding for each data source its cluster membership.

Returns

None – Adds new attributes:

  • individual_group_membership

Return type

None

get_individual_membership()[source]

Defines for each time point a dict holding for each data source its dynamic cluster membership.

Returns

None – Adds new attributes:

  • individual_membership

Return type

None
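
The relation between comm_members (dynamic cluster to data sources) and individual_membership (data source to dynamic cluster) is a per-slice inversion. A minimal sketch, assuming each data source belongs to exactly one dynamic cluster per slice; the function name is hypothetical and this is not the package's code:

```python
def invert_membership(comm_members):
    """Hypothetical sketch: per slice, turn {dc id: [data sources]}
    into {data source: dc id}."""
    return [
        {indiv: dc for dc, indivs in members.items() for indiv in indivs}
        for members in comm_members
    ]


comm_members = [
    {0: ["a", "b"], 1: ["c"]},
    {0: ["a", "b", "c"]},
]
print(invert_membership(comm_members))
# [{'a': 0, 'b': 0, 'c': 1}, {'a': 0, 'b': 0, 'c': 0}]
```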

get_marginal_flows(idx, included_flows)[source]

Determines the ensemble of marginal clusters given a target cluster and its identity flow.

Parameters
  • idx (int) – index of the slice in which target cluster is situated.

  • included_flows (list) – Identity flow of the target cluster.

get_span(idx, span_set, get_indivs=True)[source]

Create the tracer tree.

Parameters
  • idx (int) – index of the slice in which to start.

  • span_set (int, str) –

    If an int is provided it specifies the index of the target cluster. If a str is given, it is considered as the label of a data source and the containing cluster is selected.

    Todo

    The label of a cluster should be the only option.

  • get_indivs (bool (default=True)) – If set to True a list of sets of individuals is returned for each slice starting from the index. If it is set to False a list of cluster labels is returned for each slice.

resident_fraction(idx_prev=None, idx_next=None, *args)[source]

Get the resident fraction of the combined population of two (or more) slices.

This indicates the population fraction that is present at all time points (or slices).

This is simply the size of the intersection divided by the size of the union of the populations. If further arguments are provided (all have to be unnamed), then the resident fraction is computed between all of these time points.

Parameters
  • idx_prev (int (default=None)) –

    index of a slice.

    Note

    If no index is provided then an iterator is returned that yields the resident fraction of the data sources present in neighbouring time points.

  • idx_next (int (default=None)) – index of a slice.

Returns

resident_fraction – indicating the fraction of the population of data sources (union of all) that is present in all slices. If no values for the parameters idx_prev and idx_next are provided this method returns an iterator that will yield the fraction of the resident population between any two consecutive slices.

Return type

float, iterator
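
The ratio described above (intersection over union of the selected populations) can be sketched as a standalone function. This is an illustration under assumptions, not the method itself: the populations are passed as an explicit list of sets and the indices as positional arguments, both hypothetical conventions of this sketch.

```python
def resident_fraction(populations, *indices):
    """Hypothetical sketch: share of the combined population of the
    selected slices that is present in every one of them."""
    if len(indices) < 2:
        raise ValueError("provide at least two slice indices")
    selected = [populations[i] for i in indices]
    residents = set.intersection(*selected)  # present in all selected slices
    combined = set.union(*selected)          # present in at least one of them
    if not combined:
        return None  # undefined for empty populations (choice of this sketch)
    return len(residents) / len(combined)


populations = [{"a", "b", "c"}, {"b", "c", "d"}, {"c", "d", "e"}]
print(resident_fraction(populations, 0, 1))     # 0.5
print(resident_fraction(populations, 0, 1, 2))  # 0.2
```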

resident_population(idx_prev=None, idx_next=None, *args, **kwargs)[source]

Return the resident population between two time points.

The resident population is simply the intersection of the populations at both time points.

If no arguments are provided then an iterator is returned that yields the set of resident individuals for each pair of neighbouring slices.

If only one index is provided then the other one will be completed, i.e. idx_prev = idx_next - 1 or idx_next = idx_prev + 1

If further arguments are provided (all have to be unnamed), then the intersection is taken between all of these time points.

Example

self.resident_population(2, 4, 5)

This will return the resident population between the time points 2, 4 and 5.

Parameters
  • idx_prev (int (default=None)) – index of one of the two data slices.

  • idx_next (int (default=None)) – index of one of the two data slices.

Returns

resident_population – contains all data sources that are in both data slices.

Return type

set