majortrack.tracker

class MajorTrack(clusterings, history, **kwargs)[source]

Bases: object

Parameters
  • clusterings (list, dict) –

    Sequence of clusterings. Each clustering must be given as a list of membership sets, i.e. a list of clusters, each defined by the set of data sources associated with it (its members).

    If provided as a `dict`:

    keys: float, datetime

    The time points.

    values: list, dict

    The membership list of each clustering indicating to which cluster a data source belongs. See clusterings for details.

  • history (int) – sets the number of time points (or slices) the algorithm can maximally go back in time to check for majority matches.

  • optional parameter (**kwargs) –

    timepoints: list

    The time points of each clustering.

    Note

    If clusterings is of type dict then the keys are used as time points and this optional parameter is ignored, even if provided.

    slice_widths: list, float (default=None)

    The temporal duration of each snapshot in the sequence of clusterings. If not provided, the difference between time points i and i+1 is used as the width of slice i. The width of the last slice is assumed to equal that of the second-to-last slice.

    individuals: list

    A list of all distinct data sources present in the dataset.

    Todo

    Build it from self.clusterings.

    group_matchup_method: str (default=’fraction’)

    Set the method to calculate the similarity between two clusters from different clusterings. By default the fraction of identical members is used as explained in the original article [LB19].

    use_lazylists: bool (default=False)

    Determine if LazyList’s should be used to store data about dynamic clusters or normal lists. Most likely you want to use normal lists.

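As an illustration of the two accepted input forms and of the default slice widths described above, the following minimal sketch builds a toy sequence of clusterings in plain Python. The helpers `default_slice_widths` and `individuals_from` are hypothetical names and not part of majortrack; they only mirror the documented behaviour.

```python
# Toy input: three snapshots, each given as a list of member sets.
clusterings = [
    [{"a", "b", "c"}, {"d", "e"}],
    [{"a", "b"}, {"c", "d", "e"}],
    [{"a", "b", "f"}, {"c", "d", "e"}],
]
timepoints = [0.0, 1.0, 3.0]

# Equivalent dict form: time points as keys, clusterings as values.
clusterings_by_time = dict(zip(timepoints, clusterings))


def default_slice_widths(timepoints):
    """Hypothetical helper: the width of slice i is t[i+1] - t[i]; the last
    slice reuses the width of the second-to-last one."""
    widths = [t_next - t for t, t_next in zip(timepoints, timepoints[1:])]
    if widths:
        widths.append(widths[-1])
    return widths


def individuals_from(clusterings):
    """Hypothetical helper: all distinct data sources in the dataset."""
    members = set()
    for clustering in clusterings:
        for cluster in clustering:
            members |= cluster
    return sorted(members)


print(default_slice_widths(timepoints))  # [1.0, 2.0, 2.0]
print(individuals_from(clusterings))     # ['a', 'b', 'c', 'd', 'e', 'f']
```

Either `clusterings_by_time` alone, or `clusterings` together with the `timepoints` keyword, would then be passed to `MajorTrack` along with a `history` value, following the signature above.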

clusterings

Holds for each time point the configuration of the respective clustering. The clustering is given by a list of member sets, with each set containing the data sources in a cluster.

Type

list(list(set)) or LazyList

dcs

Ensemble of all dynamic clusters.

Todo

What’s the type of an element? Is it just an identifier?

Type

list or LazyList

length

Number of slices present in the dataset.

Type

int

cluster_trace

Ensemble of all tracing paths of the dynamic clusters.

Todo

What’s the type of an element?

Type

list or LazyList

group_matchup

Holds for each time point the tracing and mapping sets of all clusters. Each element is a dict with the keys 'forward' and 'backward'. Both hold a dict indicating for a cluster the best matching cluster along with the similarity score of the particular relation in a tuple.

Example

self.group_matchup[1] = {
    'backward': {0: (0, 1.0), ...},
#                ^   ^  ^
#                |   |  similarity score
#                |   cluster from previous time point
#                cluster from current time point.
    }
Type

list
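To make the structure of a `group_matchup` element more concrete, here is a minimal sketch of a majority match between two neighbouring clusterings. It assumes that the similarity score is the fraction of a cluster's members found in the candidate cluster, which is one reading of the 'fraction' method; the function names `best_match` and `pairwise_matchup` are hypothetical and the actual implementation may differ.

```python
def best_match(cluster, candidates):
    """Return (index, score) of the candidate holding the largest fraction
    of `cluster`. Assumes `cluster` is non-empty. Ties keep the first
    candidate; (None, 0.0) is returned if nothing overlaps."""
    best_idx, best_score = None, 0.0
    for idx, candidate in enumerate(candidates):
        score = len(cluster & candidate) / len(cluster)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx, best_score


def pairwise_matchup(prev, curr):
    """Return the forward matches of `prev` and the backward matches of `curr`."""
    forward = {i: best_match(c, curr) for i, c in enumerate(prev)}
    backward = {i: best_match(c, prev) for i, c in enumerate(curr)}
    return forward, backward


prev = [{"a", "b", "c"}, {"d", "e"}]
curr = [{"a", "b"}, {"c", "d", "e"}]
forward, backward = pairwise_matchup(prev, curr)

print(backward[0])  # (0, 1.0): all members of curr[0] come from prev[0]
print(forward[1])   # (1, 1.0): all members of prev[1] end up in curr[1]
```

The `backward` dict has the same shape as the 'backward' entry shown in the example above: current cluster index as key, and a tuple of matched previous cluster and similarity score as value.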

group_mappings

Holds for each slice a list of mapping sets. The list is ordered like clusterings.

Example

mt = MajorTrack(...)
idx, cluster_id = 0, 1
# get the set of data sources in this cluster
c_members = mt.clusterings[idx][cluster_id]
# get the corresponding mapping set (from index idx + 1)
mapping_set = mt.group_mappings[idx][cluster_id]
Type

list(list)

group_tracings

Holds for each slice a list of tracing sets. The list is ordered like clusterings.

Type

list(list)

group_mappers

Holds for each slice a list of mapper sets. The list is ordered like clusterings.

Type

list(list)

group_tracers

Holds for each slice a list of tracer sets. The list is ordered like clusterings.

Type

list(list)

comm_nbr

Number of dynamic clusters present in dataset.

Type

int

comm_all

List of all dynamic clusters present in the dataset.

Type

list(dc index what type?)

comm_group_members

Todo

Unsure about this.

Type

?

comm_members

Holds for each slice of the dataset a dictionary mapping each cluster (key) to a list of its data sources (value).

Todo

Rename to dc_members

Type

list(dict)

individuals

Holds all data sources.

Type

list

individual_group_membership

Holds for each slice of the dataset a dictionary mapping each data source to the cluster it belongs to.

Type

list(dict)
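The list(dict) layout can be illustrated with a short sketch that derives such a structure from a sequence of clusterings. It assumes that a cluster is identified by its position within its clustering, as in the `clusterings[idx][cluster_id]` example above; the helper name `group_membership_per_slice` is hypothetical.

```python
def group_membership_per_slice(clusterings):
    """Hypothetical helper: for each slice, map data source -> cluster index."""
    return [
        {
            indiv: cluster_id
            for cluster_id, cluster in enumerate(clustering)
            for indiv in cluster
        }
        for clustering in clusterings
    ]


clusterings = [
    [{"a", "b"}, {"c"}],
    [{"a"}, {"b", "c"}],
]
membership = group_membership_per_slice(clusterings)

print(membership[0]["b"])  # 0: "b" sits in the first cluster of slice 0
print(membership[1]["b"])  # 1: "b" sits in the second cluster of slice 1
```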

individual_membership

Holds for each slice of the dataset a dictionary mapping each data source to the dynamic cluster it belongs to.

Type

list(dict)

community_births

A list holding all dynamic cluster birth events.

Todo

Check and report format of this attribute.

Type

list(tuple)

community_deaths

A list holding all dynamic cluster death events.

Todo

Check and report format of this attribute.

Type

list(tuple)

community_lifespans

Provides for each dynamic cluster its lifespan in units of slices: {comm_id: nbr_slices}

Type

dict

community_splits

Holds all split events of dynamic clusters.

Type

list(list)

community_cby_splits

Dynamic clusters that arose through a split event.

Type

list(list)

community_cby_split_merges

Dynamic clusters that arose through a split-merge event.

Type

list(list)

community_dby_splits

Dynamic clusters that vanished after a split event.

Type

list(list)

community_dby_split_merges

Dynamic clusters that vanished after a split-merge event.

Type

list(list)

community_merges

Holds all merge events of dynamic clusters.

Type

list(list)

community_cby_merges

Dynamic clusters that arose through a merge event.

Type

list(list)

community_dby_merges

Dynamic clusters that vanished after a merge event.

Type

list(list)

community_growths

Reports all growth events, i.e. increases in the size of a dynamic cluster that are not related to split or merge events.

Type

list(list)

community_shrinkages

Reports all shrinkage events, i.e. decreases in the size of a dynamic cluster that are not related to split or merge events.

Type

list(list)

community_autocorrs

Holds for each dynamic cluster a dictionary with the auto-correlation (value) between the slice at a given index (key) and the previous slice. The auto-correlation is given by:

\[\frac{|dc_{i} \cap dc_{j}|}{|dc_{i} \cup dc_{j}|_{res}}\]

where \(i, j\) are the indices from_idx and to_idx and \(|<selection>|_{res}\) is the number of data sources within <selection>, counting all data sources (if residents=False) or only those present in both slices (if residents=True).

Type

dict(dc index, list)
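The formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `autocorr`, its explicit population arguments, and the return value of 0.0 for an empty union are all hypothetical choices and not taken from the library.

```python
def autocorr(dc_i, dc_j, pop_i, pop_j, residents=True):
    """Jaccard-style auto-correlation of a dynamic cluster between two slices.

    dc_i, dc_j   -- member sets of the dynamic cluster in slices i and j
    pop_i, pop_j -- full populations (all data sources) of slices i and j
    residents    -- if True, only data sources present in both slices count
    """
    if residents:
        resident = pop_i & pop_j
        dc_i, dc_j = dc_i & resident, dc_j & resident
    union = dc_i | dc_j
    if not union:
        return 0.0  # assumption: undefined case mapped to 0.0
    return len(dc_i & dc_j) / len(union)


dc_i, dc_j = {"a", "b", "c"}, {"b", "c", "d"}
pop_i, pop_j = {"a", "b", "c", "x"}, {"b", "c", "d", "x"}

print(autocorr(dc_i, dc_j, pop_i, pop_j, residents=False))  # 0.5
print(autocorr(dc_i, dc_j, pop_i, pop_j, residents=True))   # 1.0
```

With residents=False the departure of "a" and the arrival of "d" lower the score; with residents=True both are ignored because neither is present in both slices.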

combined_population(idx_prev=None, idx_next=None, *args, **kwargs)[source]

Returns combination of the populations of two (or more) time points.

This is simply the union of the populations at both time points. If no arguments are provided then an iterator is returned that yields the set of combined individuals for each pair of neighbouring slices.

If only one index is provided then the other one will be completed, i.e. idx_prev = idx_next - 1 or idx_next = idx_prev + 1

If further arguments are provided (all have to be unnamed), then the union is taken between all of these time points.

Example

self.combined_population(2, 4, 5)

This will return the combined population of the time points 2, 4 and 5.

Parameters
  • idx_prev (int (default=None)) – index of the 1st time point to get the population from.

  • idx_next (int (default=None)) –

    index of the 2nd time point to get the population from.

    Note

    If both idx_prev and idx_next are None then a pairwise iterator is returned that allows looping over the combined population of neighbouring time points.
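The behaviour described for this method amounts to set unions over slice populations. The sketch below reproduces two of the documented cases (explicit indices and the pairwise iterator) on plain data; it is a standalone illustration with hypothetical function names, not the method itself, and it does not cover the completion of a single missing index.

```python
def slice_population(clustering):
    """All data sources present in one clustering (list of member sets)."""
    return set().union(*clustering)


def combined(clusterings, *indices):
    """Union of the populations at `indices`; without indices, a generator
    over the combined populations of neighbouring slices."""
    if not indices:
        return (
            slice_population(a) | slice_population(b)
            for a, b in zip(clusterings, clusterings[1:])
        )
    return set().union(*(slice_population(clusterings[i]) for i in indices))


clusterings = [
    [{"a", "b"}, {"c"}],
    [{"a"}, {"c", "d"}],
    [{"d", "e"}],
]

print(sorted(combined(clusterings, 0, 2)))           # ['a', 'b', 'c', 'd', 'e']
print([sorted(p) for p in combined(clusterings)])    # [['a', 'b', 'c', 'd'], ['a', 'c', 'd', 'e']]
```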

get_alluvialdiagram(axes, iterator=None, cluster_width=datetime.timedelta(1), *args, **kwargs)[source]

Takes a matplotlib axes and draws an alluvial diagram on it. iterator yields the indices of the time series for which to draw the clusters. If iterator is not provided, the alluvial diagram will contain all the clusterings in the time series.

Parameters
  • axes (matplotlib.axes.Axes) – Axes to draw an Alluvial diagram on.

  • iterator (iter (default=None)) – An iterator for the indices of the time series to include in the alluvial diagram. If not provided then the entire time series is used.

  • cluster_width (float) – width of the clusters. This should be provided in the same units as timepoints.

  • optional parameter (**kwargs) – Will be forwarded to the pyalluv.AlluvialPlot call.

  • optional parameter

    cluster_location: str (default=’center’)

    Either ‘center’, ‘start’ or ‘end’: the location within the aggregation time window where the cluster should be put.

    cluster_label: str (default=None)

    Determines how to label the clusters. Possible options are:

    • ’groupsize’

    • ’group_index’

    merged_edgecolor: str (default=None)

    edgecolor of merged clusters.

    merged_facecolor: str (default=None)

    facecolor of merged clusters.

    merged_linewidth: float (default=None)

    linewidth of merged clusters.

    cluster_facecolor: str, dict(dict)

    facecolor of clusters. Either provide a single color or a dict with indices of the time series as keys, holding a dict with cluster_id as key and colours as values.

    cluster_edgecolor: str, dict(dict)

    edgecolor of clusters. Either provide a single color or a dict with indices of the time series as keys, holding a dict with cluster_id as key and colours as values.

    flux_facecolor: str, dict

    either provide a single color, a keyword or a dict.

    Valid keywords are: 'cluster'.

    If a dictionary is provided then the indices idx of the time series must be the keys, each holding a dict with a tuple as key and a color as value. The tuple’s first element must be a group id from time step idx and the second a group id from time step idx+1.

    new_coloring: bool (default=False)

    if a new color sequence should be generated or not.

    distinct_colors: colorseq.DistinctColors (default=None)

    the sequence of distinct colours to use.

    target_clusters: list (default=None)

    list of dynamic cluster ids to display in the alluvial diagram. If provided, only the dynamic clusters specified in this list will be displayed.
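The nested colour dictionaries are easier to read in a concrete form. The sketch below shows one possible layout for cluster_facecolor and flux_facecolor; the exact nesting of flux_facecolor is an assumption based on the description above, and `check_flux_colors` is a hypothetical sanity check, not a library function.

```python
# Index of the time series -> {cluster_id: colour}
cluster_facecolor = {
    0: {0: "tab:blue", 1: "tab:orange"},
    1: {0: "tab:blue", 1: "tab:orange"},
}

# Assumed nesting: idx -> {(group id at idx, group id at idx + 1): colour}
flux_facecolor = {
    0: {(0, 0): "tab:blue", (0, 1): "lightgrey", (1, 1): "tab:orange"},
}


def check_flux_colors(flux_colors, cluster_colors):
    """Hypothetical check: every flux must connect two known clusters."""
    for idx, fluxes in flux_colors.items():
        sources = cluster_colors.get(idx, {})
        targets = cluster_colors.get(idx + 1, {})
        for src, dst in fluxes:
            if src not in sources or dst not in targets:
                return False
    return True


options = {
    "cluster_location": "center",
    "cluster_label": "groupsize",
    "cluster_facecolor": cluster_facecolor,
    "flux_facecolor": flux_facecolor,
}
print(check_flux_colors(flux_facecolor, cluster_facecolor))  # True

# With a tracker `mt` and a matplotlib axes `ax` at hand, the options
# would be forwarded as keyword arguments:
# mt.get_alluvialdiagram(ax, **options)
```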

get_auto_corrs(residents=True)[source]

Get the auto-correlation between any two consecutive slices.

This method computes for all dynamic clusters the auto-correlation between any two consecutive slices, if the dynamic community exists in both. If residents==True, then only the individuals present in both time points are considered.

Parameters

residents (bool (default=True)) – determines if only resident data sources (i.e. that are present in both slices) should be considered, or also data sources present in only one of the two slices.

Returns

None – Adds new attributes:

Return type

None

get_community_avg_lifespan(mode='ensemble')[source]

Determines the average lifespan of all dynamic clusters.

Parameters

mode (str (default='ensemble')) – Determines what type of average should be computed. Possible values are 'ensemble' (default) or 'weighted_per_indiv_per_slice'. The ensemble average is the arithmetic mean of all lifespans. The weighted_per_indiv_per_slice average is the mean lifespan of the dynamic cluster that a randomly picked data source belongs to during a randomly picked slice.

Returns

avg_dc_lifespan – the average number of slices a dynamic cluster exists.

Return type

float
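The difference between the two averaging modes can be shown on toy data. The sketch below assumes a list(dict) in the format of individual_membership as input, and reads the weighted mode as a uniform average over all (slice, data source) pairs; both function names are hypothetical and the library's computation may differ in detail.

```python
def lifespans_from(individual_membership):
    """Count in how many slices each dynamic cluster has at least one member."""
    lifespans = {}
    for slice_membership in individual_membership:
        for dc in set(slice_membership.values()):
            lifespans[dc] = lifespans.get(dc, 0) + 1
    return lifespans


def avg_lifespan(individual_membership, mode="ensemble"):
    """Average lifespan in slices; assumes a non-empty dataset."""
    lifespans = lifespans_from(individual_membership)
    if mode == "ensemble":
        # arithmetic mean over all dynamic clusters
        return sum(lifespans.values()) / len(lifespans)
    if mode == "weighted_per_indiv_per_slice":
        # one sample per (slice, data source) pair
        samples = [
            lifespans[dc]
            for slice_membership in individual_membership
            for dc in slice_membership.values()
        ]
        return sum(samples) / len(samples)
    raise ValueError(f"unknown mode: {mode!r}")


# data source -> dynamic cluster, one dict per slice
individual_membership = [
    {"a": 0, "b": 0, "c": 1},
    {"a": 0, "b": 0, "c": 2},
    {"a": 0, "b": 2, "c": 2},
]

print(lifespans_from(individual_membership))  # {0: 3, 1: 1, 2: 2}
print(avg_lifespan(individual_membership))    # 2.0
print(avg_lifespan(individual_membership, "weighted_per_indiv_per_slice"))
```

The weighted average comes out higher than the ensemble average here (22/9 versus 2) because long-lived dynamic clusters contribute one sample for every member in every slice they exist in.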

get_community_births()[source]

Determines all birth events.

Returns

None – Adds new attributes:

Return type

None

get_community_coloring(n=None, iterator=None, **kwargs)[source]

get_community_deaths()[source]

Determines all dynamic community death events.

Returns

None – Adds new attributes:

Return type

None

get_community_events()[source]

Compute all dynamic community life-cycle related events.

get_community_group_membership()[source]

Defines per timepoint a list of clusters belonging to a dynamic cluster

Returns

None – Adds new attributes:

Return type

None

get_community_growths()[source]
Returns

None – Adds new attributes:

Return type

None

get_community_lifespans()[source]

Determines the lifespans of all dynamic clusters.

Returns

None – Adds new attributes:

Return type

None

get_community_membership()[source]

Defines for each time point a membership list of data sources for each existing dynamic cluster.

Returns

None – Adds new attributes:

Return type

None

get_community_merges()[source]

Get merge events and determine the DCs born through merge events.

A merge event occurs whenever members of two distinct DCs at some time point are found together in the same DC one time point later.

Returns

None – Adds new attributes:

Return type

None
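The definition of a merge event given above can be turned into a small detection routine. The following sketch applies it literally to a list(dict) in the format of individual_membership; the function name `find_merges` and the (index, source DCs, target DC) tuple format are assumptions made for illustration, since the format of the resulting attributes is not documented here.

```python
from collections import defaultdict


def find_merges(individual_membership):
    """Return (slice index, sorted source DCs, target DC) for every DC whose
    members came from two or more distinct DCs in the previous slice."""
    merges = []
    for idx in range(1, len(individual_membership)):
        prev = individual_membership[idx - 1]
        curr = individual_membership[idx]
        sources = defaultdict(set)
        for indiv, dc in curr.items():
            if indiv in prev:  # newcomers have no previous DC
                sources[dc].add(prev[indiv])
        for dc, src in sorted(sources.items()):
            if len(src) > 1:
                merges.append((idx, sorted(src), dc))
    return merges


# data source -> dynamic cluster, one dict per slice
individual_membership = [
    {"a": 0, "b": 0, "c": 1, "d": 1},
    {"a": 0, "b": 0, "c": 0, "d": 1},
]

print(find_merges(individual_membership))  # [(1, [0, 1], 0)]
```

Note that under this literal reading a single data source moving from DC 1 to DC 0 already counts as a merge event.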

get_community_shrinkages()[source]
Returns

None – Adds new attributes:

Return type

None

get_community_splits()[source]

Get all split events and determine what clusters arise through a pure split event, i.e. not a split-merge combination.

Returns

None – Adds new attributes:

Return type

None

get_dcs(bijective_paths=True, **kwargs)[source]

Derives the history of dynamic clusters from group_matchup.

Todo

Rename to get_dc

Parameters
  • bijective_paths (bool (default=True)) – If set to True then at each step in the construction of the tracing flow a mapping flow needs to map forward to the target cluster in order to continue to extend the tracing flow.

  • optional parameter (**kwargs) –

    from_idx: int

    starting index.

    Note

    At the starting index all clusters are per definition new dynamic clusters.

    to_idx: int

    Stopping index. The community detection algorithm will stop at this index (including it).

get_flow(idx, source_set, bwd=True, max_dist=None, **kwargs)[source]
Parameters
  • idx (int) – time series index defining the starting point

  • source_set (set) – set of clusters at the starting point slice.

  • bwd (bool (default=True)) – indicating the direction, True is backward, False forward.

  • max_dist (int (default=None)) – set the maximal length of the flow.

  • optional parameter (**kwargs) –

    majority: bool (default=True)

    Specifies whether only the majority should be used to move between time points.

    validate_path: function (default=MajorTrack._from_flow)

    Provide a validation method to use during the construction of a flow.

Returns

flow – identity flow starting from (and including) the source set.

Return type

list

get_group_matchup(matchup_method=None)[source]

Determine majority relation between neighbouring snapshots.

Parameters

matchup_method (str (default=None)) – If provided this overwrites group_matchup_method. It determines the method to use when calculating similarities between clusters from neighbouring snapshots.

Returns

self – with new attribute group_matchup.

Return type

MajorTrack

get_individual_group_membership()[source]

Defines for each time point a dict holding for each data source its cluster membership.

Returns

None – Adds new attributes:

Return type

None

get_individual_membership()[source]

Defines for each time point a dict holding for each data source its dynamic cluster membership.

Returns

None – Adds new attributes:

Return type

None

get_marginal_flows(idx, included_flows)[source]

Determines the ensemble of marginal clusters given a target cluster and its identity flow.

Parameters
  • idx (int) – index of the slice in which target cluster is situated.

  • included_flows (list) – Identity flow of the target cluster.

get_span(idx, span_set, get_indivs=True)[source]

Create the tracer tree.

Parameters
  • idx (int) – index of the slice in which to start.

  • span_set (int, str) –

    If an int is provided it specifies the index of the target cluster. If a str is given, it is considered as the label of a data source and the containing cluster is selected.

    Todo

    The label of a cluster should be the only option.

  • get_indivs (bool (default=True)) – If set to True a list of sets of individuals is returned for each slice starting from the index. If it is set to False a list of cluster labels is returned for each slice.

resident_fraction(idx_prev=None, idx_next=None, *args)[source]

Get the resident fraction of the combined population of two (or more) slices.

This indicates the population fraction that is present at all time points (or slices).

This is simply the size of the intersection divided by the size of the union of the populations. If further arguments are provided (all have to be unnamed), then the resident fraction is computed between all of these time points.

Parameters
  • idx_prev (int (default=None)) –

    index of a slice.

    Note

    If no index is provided then an iterator is returned that yields the resident fraction of the data sources present in neighbouring time points.

  • idx_next (int (default=None)) – index of a slice.

Returns

resident_fraction – indicating the fraction of the population of data sources (union of all) that is present in all slices. If no values for the parameters idx_prev and idx_next are provided this method returns an iterator that will yield the fraction of the resident population between any two consecutive slices.

Return type

float, iterator
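As a worked illustration of the ratio described above (size of the intersection divided by size of the union), the following sketch computes the resident fraction on plain data. The function names are hypothetical, the sketch takes the clusterings explicitly instead of reading them from a tracker object, and returning 0.0 for an empty union is an assumption.

```python
def slice_population(clustering):
    """All data sources present in one clustering (list of member sets)."""
    return set().union(*clustering)


def resident_fraction(clusterings, idx_prev, idx_next, *more):
    """Fraction of the combined population present in all given slices."""
    pops = [slice_population(clusterings[i]) for i in (idx_prev, idx_next, *more)]
    union = set().union(*pops)
    if not union:
        return 0.0  # assumption: empty populations yield 0.0
    return len(set.intersection(*pops)) / len(union)


def pairwise_resident_fractions(clusterings):
    """Generator over the resident fractions of neighbouring slices."""
    return (
        resident_fraction(clusterings, i, i + 1)
        for i in range(len(clusterings) - 1)
    )


clusterings = [
    [{"a", "b"}, {"c", "d"}],
    [{"a", "b"}, {"c"}],
    [{"a"}, {"c", "e"}],
]

print(resident_fraction(clusterings, 0, 1))            # 0.75
print(resident_fraction(clusterings, 0, 1, 2))         # 0.4
print(list(pairwise_resident_fractions(clusterings)))  # [0.75, 0.5]
```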

resident_population(idx_prev=None, idx_next=None, *args, **kwargs)[source]

Return the resident population between two time points.

The resident population is simply the intersection of the populations at both time points.

If no arguments are provided then an iterator is returned that yields the set of resident individuals for each pair of neighbouring slices.

If only one index is provided then the other one will be completed, i.e. idx_prev = idx_next - 1 or idx_next = idx_prev + 1.

If further arguments are provided (all have to be unnamed), then the intersection is taken between all of these time points.

Example

self.resident_population(2, 4, 5)

This will return the resident population between the time points 2, 4 and 5.
Parameters
  • idx_prev (int (default=None)) – index of one of the two data slices.

  • idx_next (int (default=None)) – index of one of the two data slices.

Returns

resident_population – contains all data sources that are in both data slices.

Return type

set