Search

Methods for searching for conjunctions in a binary (formal) search context.

Overview

Context(attributes, objects[, sort_attributes])

Formal context, i.e., a binary relation between a set of objects and a set of attributes, i.e., Boolean functions that are defined on the objects.

CoreQueryTreeSearch(ctx, obj, bnd[, order, …])

Searches the prefix-tree of core queries given a certain Context.

GreedySearch(ctx, obj[, bdn, verbose])

Simple best-in greedy search for conjunctive query.

search_methods

Dictionary of available search methods.

Details

class realkd.search.Context(attributes, objects, sort_attributes=True)

Formal context, i.e., a binary relation between a set of objects and a set of attributes, i.e., Boolean functions that are defined on the objects.

A formal context provides a search context (search space) over conjunctions that can be formed from the individual attributes.

complete_closure(gen_idx, bit_extension, part_closure)
Parameters
  • gen_idx

  • bit_extension

  • closure_prefix

Returns

>>> table = [[0, 1, 0, 1],
...          [1, 1, 1, 0],
...          [1, 0, 1, 0],
...          [0, 1, 0, 1]]
>>> ctx = Context.from_tab(table)
>>> clo = bitarray('1000')
>>> ctx.complete_closure(0, bitarray('0110'), clo)
2
>>> clo
bitarray('1010')
>>> clo = bitarray('1110')
>>> ctx.complete_closure(1, bitarray('0100'), clo)
4
>>> clo
bitarray('1110')
extension(intent)
Parameters

intent – attributes describing a set of objects

Returns

indices of objects that have all attributes in intent in common

find_small_crit_index(gen_idx, bit_extension, part_closure)
>>> table = [[0, 1, 0, 1],
...          [1, 1, 1, 0],
...          [1, 0, 1, 0],
...          [0, 1, 0, 1]]
>>> ctx = Context.from_tab(table)
>>> root = Node([], bitarray('0000'), array([0,1,2,3]), bitarray('1111'), -1, -4, 1, inf)
>>> ctx.find_small_crit_index(0, bitarray('0110'), bitarray('1000'))
4
>>> ctx.find_small_crit_index(1, bitarray('1101'), bitarray('0100'))
4
>>> ctx.find_small_crit_index(2, bitarray('0110'), bitarray('0010'))
0
>>> ctx.find_small_crit_index(3, bitarray('1001'), bitarray('0001'))
1
>>> ctx.find_small_crit_index(3, bitarray('1001'), bitarray('0101'))
4
static from_df(df, without=None, max_col_attr=10, sort_attributes=True, discretization=<function qcut>, **kwargs)

Generates formal context from pandas dataframe by applying inter-ordinal scaling to numerical data columns and for object columns creating one attribute per value.

For inter-ordinal scaling a maximum number of attributes per column can be specified. If required, threshold values are then selected by the provided discretization function (per default quantile-based).

The restriction should also be implemented for object columns in the future (by merging small categories into disjunctive propositions).

The generated attributes correspond to pandas-compatible query strings. For example:

>>> titanic_df = pd.read_csv("../datasets/titanic/train.csv")
>>> titanic_df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)
>>> titanic_ctx = Context.from_df(titanic_df, max_col_attr=6, sort_attributes=False)
>>> titanic_ctx.m
891
>>> titanic_ctx.attributes 
[Survived<=0, Survived>=1, Pclass<=1, Pclass<=2, Pclass>=2, Pclass>=3, Sex==male, Sex==female, Age<=23.0,
Age>=23.0, Age<=34.0, Age>=34.0, Age<=80.0, Age>=80.0, SibSp<=8.0, SibSp>=8.0, Parch<=6.0, Parch>=6.0,
Fare<=8.6625, Fare>=8.6625, Fare<=26.0, Fare>=26.0, Fare<=512.3292, Fare>=512.3292, Embarked==S, Embarked==C,
Embarked==Q, Embarked==nan]
>>> titanic_ctx.n
28
>>> titanic_df.query('Survived>=1 & Pclass>=3 & Sex=="male" & Age>=34')
     Survived  Pclass   Sex   Age  SibSp  Parch   Fare Embarked
338         1       3  male  45.0      0      0  8.050        S
400         1       3  male  39.0      0      0  7.925        S
414         1       3  male  44.0      0      0  7.925        S
>>> titanic_ctx.extension([1, 5, 6, 11])
array([338, 400, 414])
>>> titanic_ctx = Context.from_df(titanic_df, max_col_attr=defaultdict(lambda: None, Age=6, Fare=6),
...                               sort_attributes=False)
>>> titanic_ctx.attributes 
[Survived<=0, Survived>=1, Pclass<=1, Pclass<=2, Pclass>=2, Pclass>=3, Sex==male, Sex==female, Age<=23.0,
Age>=23.0, Age<=34.0, Age>=34.0, Age<=80.0, Age>=80.0, SibSp<=0, SibSp<=1, SibSp>=1, SibSp<=2, SibSp>=2,
SibSp<=3, SibSp>=3, SibSp<=4, SibSp>=4, SibSp<=5, SibSp>=5, SibSp>=8, Parch<=0, Parch<=1, Parch>=1, Parch<=2,
Parch>=2, Parch<=3, Parch>=3, Parch<=4, Parch>=4, Parch<=5, Parch>=5, Parch>=6, Fare<=8.6625, Fare>=8.6625,
Fare<=26.0, Fare>=26.0, Fare<=512.3292, Fare>=512.3292, Embarked==S, Embarked==C, Embarked==Q, Embarked==nan]
Parameters
  • df (DataFrame) – pandas dataframe to be converted to formal context

  • max_col_attr (int) – maximum number of attributes generated per column; or None if an arbitrary number of attributes is permitted; or dict (usually defaultdict) with keys being columns ids of df and values being the maximum number of attributes for the corresponding column (again using None if no bound for a specific column); Note: use defaultdict(lambda: None) instead of defaultdict(None) to specify no maximum per default

  • discretization (callable) – the discretization function to be used when number of thresholds has to be reduced to a specificed maximum (function has to have identical signature to pandas.qcut, which is the default)

  • without (Iterable[str]) – columns to ommit

Returns

Context representing dataframe

static from_tab(table, sort_attributes=False)

Converts an input table where each row represents an object into a formal context (which uses column-based representation).

Uses Boolean interpretation of table values to determine attribute presence for an object.

For instance:

>>> table = [[0, 1, 0, 1],
...          [1, 1, 1, 0],
...          [1, 0, 1, 0],
...          [0, 1, 0, 1]]
>>> ctx = Context.from_tab(table)
>>> list(ctx.extension([0, 2]))
[1, 2]
Parameters

table – table that explicitly encodes object/attribute relation

Returns

context over table row indices as objects and one tabulated feature per column index

class realkd.search.CoreQueryTreeSearch(ctx, obj, bnd, order='bestboundfirst', apx=1.0, max_depth=10, verbose=False, **kwargs)

Searches the prefix-tree of core queries given a certain Context.

The idea of core queries is that there is only one core query per potential query extensions. Thus using them as solution candidates results in a strongly condensed search space compared to naively searching all conjunctions.

Core queries have been originally introduced for closed itemset mining [UAUA04]. The following is a simple recursive definition that does not require the notion of closures:

The trivial query \(q = \top\) is a core query. Moreover, the tail augmentation \(qp_i\) of a core query \(q\) is also a core query if \(q \not\rightarrow p_i\) and

\begin{equation} \text{for all } j < i, \text{ if } q' \rightarrow p_j \text{ then } q \rightarrow p_j \enspace . \end{equation}
Parameters
  • ctx (Context) – the context defining the search space

  • obj (callable) – objective function

  • bnd (callable) – bounding function satisfying that bnd(q) >= max{obj(r) for r in successors(q)}

  • order (str) – traversal order ('breadthfirst', 'depthfirst', 'bestboundfirst', or 'bestvaluefirst'; see traversal_orders)

  • apx (float) – approximation factor that determines guarantee of what fraction of search space will be traversed, i.e., all nodes q are visited with obj(q) >= apx * opt (default 1.0, i.e., optimum will be visited; smaller values means less traversal elements

  • max_depth (int) – maximum depth of explored search nodes

  • verbose (int) – level of verbosity

run()

Runs the configured search.

Returns

Conjunction that (approximately) maximizes objective

traversal()

A first example with trivial objective and bounding function is as follows. In this example the optimal extension is the empty extension, which is generated via the the non-lexicographically smallest generator [0, 3].

>>> table = [[0, 1, 0, 1],
...          [1, 1, 1, 0],
...          [1, 0, 1, 0],
...          [0, 1, 0, 1]]
>>> ctx = Context.from_tab(table)
>>> search = CoreQueryTreeSearch(ctx, lambda e: -len(e), lambda e: 0, order='breadthfirst')
>>> for n in search.traversal():
...     print(n)
N([], [], -4, inf, [0 1 2 3])
N([0], [0 2], -2, 0, [1 2])
N([1], [1], -3, 0, [0 1 3])
N([2], [0 2], -2, 0, [1 2])
N([3], [1 3], -2, 0, [0 3])
N([0, 1], [0 1 2], -1, 0, [1])
N([0, 3], [0 1 2 3], 0, 0, [])
>>> search.verbose = True
>>> search.run()

Found optimum after inspecting 7 nodes: [0, 3]
Completing closure
Greedy simplification: [0, 3]
c0 & c3
>>> CoreQueryTreeSearch(ctx, lambda e: 5-len(e), lambda e: 4+(len(e)>=2), apx=0.7).run()
c0 & c1

Let’s use more realistic objective and bounding functions based on values associated with each object (row in the table).

>>> values = [-1, 1, 1, -1]
>>> f = lambda e: sum((values[i] for i in e))/4
>>> g = lambda e: sum((values[i] for i in e if values[i]==1))/4
>>> search = CoreQueryTreeSearch(ctx, f, g)
>>> for n in search.traversal():
...     print(n)
N([], [], 0, inf, [0 1 2 3])
N([0], [0 2], 0.5, 0.5, [1 2])
N([2], [0 2], 0.5, 0.5, [1 2])

Finally, here is a complex example taken from the UdS seminar on subgroup discovery.

>>> table = [[1, 1, 1, 1, 0],
...          [1, 1, 0, 0, 0],
...          [1, 0, 1, 0, 0],
...          [0, 1, 1, 1, 1],
...          [0, 0, 1, 1, 1],
...          [1, 1, 0, 0, 1]]
>>> ctx = Context.from_tab(table)
>>> labels = [1, 0, 1, 0, 0, 0]
>>> from realkd.legacy import impact, cov_incr_mean_bound, impact_count_mean
>>> f = impact(labels)
>>> g = cov_incr_mean_bound(labels, impact_count_mean(labels))
>>> search = CoreQueryTreeSearch(ctx, f, g)
>>> for n in search.traversal():
...     print(n)
N([], [], 0, inf, [0 1 2 3 4 5])
N([0], [0], 0.11111, 0.22222, [0 1 2 5])
N([1], [1], -0.055556, 0.11111, [0 1 3 5])
N([2], [2], 0.11111, 0.22222, [0 2 3 4])
N([3], [2 3], 0, 0.11111, [0 3 4])
N([0, 2], [0 2], 0.22222, 0.22222, [0 2])
>>> search.run()
c0 & c2
>>> labels = [1, 0, 0, 1, 1, 0]
>>> f = impact(labels)
>>> g = cov_incr_mean_bound(labels, impact_count_mean(labels))
>>> search = CoreQueryTreeSearch(ctx, f, g)
>>> for n in search.traversal():
...     print(n)
N([], [], 0, inf, [0 1 2 3 4 5])
N([0], [0], -0.16667, 0.083333, [0 1 2 5])
N([1], [1], 0, 0.16667, [0 1 3 5])
N([2], [2], 0.16667, 0.25, [0 2 3 4])
N([3], [2 3], 0.25, 0.25, [0 3 4])
traversal_orders = {'bestboundfirst': <class 'realkd.search.BestBoundFirstBoundary'>, 'bestvaluefirst': <class 'realkd.search.BestValueFirstBoundary'>, 'breadthfirst': <class 'realkd.search.BreadthFirstBoundary'>, 'depthfirst': <class 'realkd.search.DepthFirstBoundary'>}

dictionary of available traversal orders for core query search

class realkd.search.GreedySearch(ctx, obj, bdn=None, verbose=False, **kwargs)

Simple best-in greedy search for conjunctive query.

Search starts from the trivial (empty) query and adds an objective-maximizing conditions until no further improvement with a single augmentation is possible.

>>> table = [[1, 1, 1, 1, 0],
...          [1, 1, 0, 0, 0],
...          [1, 0, 1, 0, 0],
...          [0, 1, 1, 1, 1],
...          [0, 0, 1, 1, 1],
...          [1, 1, 0, 0, 1]]
>>> ctx = Context.from_tab(table)
>>> labels = [1, 0, 1, 0, 0, 0]
>>> from realkd.legacy import impact
>>> f = impact(labels)
>>> GreedySearch(ctx, f).run()
c0 & c2
Parameters
  • ctx (Context) – the context defining the search space

  • obj (callable) – objective function

  • bnd (callable) – bounding function satisfying that bnd(q) >= max{obj(r) for r in successors(q)} (for signature compatibility only, not currently used)

  • verbose (int) – level of verbosity

run()

Runs the configured search.

Returns

Conjunction that (approximately) maximizes objective

realkd.search.search_methods = {'exhaustive': <class 'realkd.search.CoreQueryTreeSearch'>, 'greedy': <class 'realkd.search.GreedySearch'>}

Dictionary of available search methods.

References

UAUA04

Takeaki Uno, Tatsuya Asai, Yuzo Uchida, and Hiroki Arimura. An efficient algorithm for enumerating closed patterns in transaction databases. In Disc Science, 16–31. Springer, 2004.