Search¶

Methods for searching for conjunctions in a binary (formal) search context.

Overview¶

`Context`(attributes, objects[, sort_attributes])	Formal context, i.e., a binary relation between a set of objects and a set of attributes, i.e., Boolean functions that are defined on the objects.
`CoreQueryTreeSearch`(ctx, obj, bnd[, order, …])	Searches the prefix-tree of core queries given a certain `Context`.
`GreedySearch`(ctx, obj[, bdn, verbose])	Simple best-in greedy search for conjunctive query.
`search_methods`	Dictionary of available search methods.

Details¶

class realkd.search.Context(attributes, objects, sort_attributes=True)¶

Formal context, i.e., a binary relation between a set of objects and a set of attributes, i.e., Boolean functions that are defined on the objects.

A formal context provides a search context (search space) over conjunctions that can be formed from the individual attributes.

complete_closure(gen_idx, bit_extension, part_closure)¶

Parameters

gen_idx –
bit_extension –
closure_prefix –

Returns

>>> table = [[0, 1, 0, 1],
...          [1, 1, 1, 0],
...          [1, 0, 1, 0],
...          [0, 1, 0, 1]]
>>> ctx = Context.from_tab(table)
>>> clo = bitarray('1000')
>>> ctx.complete_closure(0, bitarray('0110'), clo)
2
>>> clo
bitarray('1010')
>>> clo = bitarray('1110')
>>> ctx.complete_closure(1, bitarray('0100'), clo)
4
>>> clo
bitarray('1110')

extension(intent)¶

Parameters: intent – attributes describing a set of objects
Returns: indices of objects that have all attributes in intent in common

find_small_crit_index(gen_idx, bit_extension, part_closure)¶

>>> table = [[0, 1, 0, 1],
...          [1, 1, 1, 0],
...          [1, 0, 1, 0],
...          [0, 1, 0, 1]]
>>> ctx = Context.from_tab(table)
>>> root = Node([], bitarray('0000'), array([0,1,2,3]), bitarray('1111'), -1, -4, 1, inf)
>>> ctx.find_small_crit_index(0, bitarray('0110'), bitarray('1000'))
4
>>> ctx.find_small_crit_index(1, bitarray('1101'), bitarray('0100'))
4
>>> ctx.find_small_crit_index(2, bitarray('0110'), bitarray('0010'))
0
>>> ctx.find_small_crit_index(3, bitarray('1001'), bitarray('0001'))
1
>>> ctx.find_small_crit_index(3, bitarray('1001'), bitarray('0101'))
4

static from_df(df, without=None, max_col_attr=10, sort_attributes=True, discretization=<function qcut>, **kwargs)¶

Generates formal context from pandas dataframe by applying inter-ordinal scaling to numerical data columns and for object columns creating one attribute per value.

For inter-ordinal scaling a maximum number of attributes per column can be specified. If required, threshold values are then selected by the provided discretization function (per default quantile-based).

The restriction should also be implemented for object columns in the future (by merging small categories into disjunctive propositions).

The generated attributes correspond to pandas-compatible query strings. For example:

>>> titanic_df = pd.read_csv("../datasets/titanic/train.csv")
>>> titanic_df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)
>>> titanic_ctx = Context.from_df(titanic_df, max_col_attr=6, sort_attributes=False)
>>> titanic_ctx.m
891
>>> titanic_ctx.attributes 
[Survived<=0, Survived>=1, Pclass<=1, Pclass<=2, Pclass>=2, Pclass>=3, Sex==male, Sex==female, Age<=23.0,
Age>=23.0, Age<=34.0, Age>=34.0, Age<=80.0, Age>=80.0, SibSp<=8.0, SibSp>=8.0, Parch<=6.0, Parch>=6.0,
Fare<=8.6625, Fare>=8.6625, Fare<=26.0, Fare>=26.0, Fare<=512.3292, Fare>=512.3292, Embarked==S, Embarked==C,
Embarked==Q, Embarked==nan]
>>> titanic_ctx.n
28
>>> titanic_df.query('Survived>=1 & Pclass>=3 & Sex=="male" & Age>=34')
     Survived  Pclass   Sex   Age  SibSp  Parch   Fare Embarked
338         1       3  male  45.0      0      0  8.050        S
400         1       3  male  39.0      0      0  7.925        S
414         1       3  male  44.0      0      0  7.925        S
>>> titanic_ctx.extension([1, 5, 6, 11])
array([338, 400, 414])

>>> titanic_ctx = Context.from_df(titanic_df, max_col_attr=defaultdict(lambda: None, Age=6, Fare=6),
...                               sort_attributes=False)
>>> titanic_ctx.attributes 
[Survived<=0, Survived>=1, Pclass<=1, Pclass<=2, Pclass>=2, Pclass>=3, Sex==male, Sex==female, Age<=23.0,
Age>=23.0, Age<=34.0, Age>=34.0, Age<=80.0, Age>=80.0, SibSp<=0, SibSp<=1, SibSp>=1, SibSp<=2, SibSp>=2,
SibSp<=3, SibSp>=3, SibSp<=4, SibSp>=4, SibSp<=5, SibSp>=5, SibSp>=8, Parch<=0, Parch<=1, Parch>=1, Parch<=2,
Parch>=2, Parch<=3, Parch>=3, Parch<=4, Parch>=4, Parch<=5, Parch>=5, Parch>=6, Fare<=8.6625, Fare>=8.6625,
Fare<=26.0, Fare>=26.0, Fare<=512.3292, Fare>=512.3292, Embarked==S, Embarked==C, Embarked==Q, Embarked==nan]

Parameters

df (DataFrame) – pandas dataframe to be converted to formal context
max_col_attr (int) – maximum number of attributes generated per column; or None if an arbitrary number of attributes is permitted; or dict (usually defaultdict) with keys being columns ids of df and values being the maximum number of attributes for the corresponding column (again using None if no bound for a specific column); Note: use defaultdict(lambda: None) instead of defaultdict(None) to specify no maximum per default
discretization (callable) – the discretization function to be used when number of thresholds has to be reduced to a specificed maximum (function has to have identical signature to pandas.qcut, which is the default)
without (Iterable[str]) – columns to ommit

Returns

Context representing dataframe

static from_tab(table, sort_attributes=False)¶

Converts an input table where each row represents an object into a formal context (which uses column-based representation).

Uses Boolean interpretation of table values to determine attribute presence for an object.

For instance:

>>> table = [[0, 1, 0, 1],
...          [1, 1, 1, 0],
...          [1, 0, 1, 0],
...          [0, 1, 0, 1]]
>>> ctx = Context.from_tab(table)
>>> list(ctx.extension([0, 2]))
[1, 2]

Parameters: table – table that explicitly encodes object/attribute relation
Returns: context over table row indices as objects and one tabulated feature per column index

class realkd.search.CoreQueryTreeSearch(ctx, obj, bnd, order='bestboundfirst', apx=1.0, max_depth=10, verbose=False, **kwargs)¶

Searches the prefix-tree of core queries given a certain Context.

The idea of core queries is that there is only one core query per potential query extensions. Thus using them as solution candidates results in a strongly condensed search space compared to naively searching all conjunctions.

Core queries have been originally introduced for closed itemset mining [UAUA04]. The following is a simple recursive definition that does not require the notion of closures:

The trivial query \(q = \top\) is a core query. Moreover, the tail augmentation \(qp_i\) of a core query \(q\) is also a core query if \(q \not\rightarrow p_i\) and

\begin{equation} \text{for all } j < i, \text{ if } q' \rightarrow p_j \text{ then } q \rightarrow p_j \enspace . \end{equation}

Parameters

ctx (Context) – the context defining the search space
obj (callable) – objective function
bnd (callable) – bounding function satisfying that bnd(q) >= max{obj(r) for r in successors(q)}
order (str) – traversal order ('breadthfirst', 'depthfirst', 'bestboundfirst', or 'bestvaluefirst'; see traversal_orders)
apx (float) – approximation factor that determines guarantee of what fraction of search space will be traversed, i.e., all nodes q are visited with obj(q) >= apx * opt (default 1.0, i.e., optimum will be visited; smaller values means less traversal elements
max_depth (int) – maximum depth of explored search nodes
verbose (int) – level of verbosity

run()¶

Runs the configured search.

Returns: Conjunction that (approximately) maximizes objective

traversal()¶

A first example with trivial objective and bounding function is as follows. In this example the optimal extension is the empty extension, which is generated via the the non-lexicographically smallest generator [0, 3].

>>> table = [[0, 1, 0, 1],
...          [1, 1, 1, 0],
...          [1, 0, 1, 0],
...          [0, 1, 0, 1]]
>>> ctx = Context.from_tab(table)
>>> search = CoreQueryTreeSearch(ctx, lambda e: -len(e), lambda e: 0, order='breadthfirst')
>>> for n in search.traversal():
...     print(n)
N([], [], -4, inf, [0 1 2 3])
N([0], [0 2], -2, 0, [1 2])
N([1], [1], -3, 0, [0 1 3])
N([2], [0 2], -2, 0, [1 2])
N([3], [1 3], -2, 0, [0 3])
N([0, 1], [0 1 2], -1, 0, [1])
N([0, 3], [0 1 2 3], 0, 0, [])

>>> search.verbose = True
>>> search.run()

Found optimum after inspecting 7 nodes: [0, 3]
Completing closure
Greedy simplification: [0, 3]
c0 & c3

>>> CoreQueryTreeSearch(ctx, lambda e: 5-len(e), lambda e: 4+(len(e)>=2), apx=0.7).run()
c0 & c1

Let’s use more realistic objective and bounding functions based on values associated with each object (row in the table).

>>> values = [-1, 1, 1, -1]
>>> f = lambda e: sum((values[i] for i in e))/4
>>> g = lambda e: sum((values[i] for i in e if values[i]==1))/4
>>> search = CoreQueryTreeSearch(ctx, f, g)
>>> for n in search.traversal():
...     print(n)
N([], [], 0, inf, [0 1 2 3])
N([0], [0 2], 0.5, 0.5, [1 2])
N([2], [0 2], 0.5, 0.5, [1 2])

Finally, here is a complex example taken from the UdS seminar on subgroup discovery.

>>> table = [[1, 1, 1, 1, 0],
...          [1, 1, 0, 0, 0],
...          [1, 0, 1, 0, 0],
...          [0, 1, 1, 1, 1],
...          [0, 0, 1, 1, 1],
...          [1, 1, 0, 0, 1]]
>>> ctx = Context.from_tab(table)
>>> labels = [1, 0, 1, 0, 0, 0]
>>> from realkd.legacy import impact, cov_incr_mean_bound, impact_count_mean
>>> f = impact(labels)
>>> g = cov_incr_mean_bound(labels, impact_count_mean(labels))
>>> search = CoreQueryTreeSearch(ctx, f, g)
>>> for n in search.traversal():
...     print(n)
N([], [], 0, inf, [0 1 2 3 4 5])
N([0], [0], 0.11111, 0.22222, [0 1 2 5])
N([1], [1], -0.055556, 0.11111, [0 1 3 5])
N([2], [2], 0.11111, 0.22222, [0 2 3 4])
N([3], [2 3], 0, 0.11111, [0 3 4])
N([0, 2], [0 2], 0.22222, 0.22222, [0 2])
>>> search.run()
c0 & c2

>>> labels = [1, 0, 0, 1, 1, 0]
>>> f = impact(labels)
>>> g = cov_incr_mean_bound(labels, impact_count_mean(labels))
>>> search = CoreQueryTreeSearch(ctx, f, g)
>>> for n in search.traversal():
...     print(n)
N([], [], 0, inf, [0 1 2 3 4 5])
N([0], [0], -0.16667, 0.083333, [0 1 2 5])
N([1], [1], 0, 0.16667, [0 1 3 5])
N([2], [2], 0.16667, 0.25, [0 2 3 4])
N([3], [2 3], 0.25, 0.25, [0 3 4])

traversal_orders = {'bestboundfirst': <class 'realkd.search.BestBoundFirstBoundary'>, 'bestvaluefirst': <class 'realkd.search.BestValueFirstBoundary'>, 'breadthfirst': <class 'realkd.search.BreadthFirstBoundary'>, 'depthfirst': <class 'realkd.search.DepthFirstBoundary'>}¶: dictionary of available traversal orders for core query search

class realkd.search.GreedySearch(ctx, obj, bdn=None, verbose=False, **kwargs)¶

Simple best-in greedy search for conjunctive query.

Search starts from the trivial (empty) query and adds an objective-maximizing conditions until no further improvement with a single augmentation is possible.

>>> table = [[1, 1, 1, 1, 0],
...          [1, 1, 0, 0, 0],
...          [1, 0, 1, 0, 0],
...          [0, 1, 1, 1, 1],
...          [0, 0, 1, 1, 1],
...          [1, 1, 0, 0, 1]]
>>> ctx = Context.from_tab(table)
>>> labels = [1, 0, 1, 0, 0, 0]
>>> from realkd.legacy import impact
>>> f = impact(labels)
>>> GreedySearch(ctx, f).run()
c0 & c2

Parameters

ctx (Context) – the context defining the search space
obj (callable) – objective function
bnd (callable) – bounding function satisfying that bnd(q) >= max{obj(r) for r in successors(q)} (for signature compatibility only, not currently used)
verbose (int) – level of verbosity

run()¶

Runs the configured search.

Returns: Conjunction that (approximately) maximizes objective

realkd.search.search_methods = {'exhaustive': <class 'realkd.search.CoreQueryTreeSearch'>, 'greedy': <class 'realkd.search.GreedySearch'>}¶: Dictionary of available search methods.

References¶

UAUA04: Takeaki Uno, Tatsuya Asai, Yuzo Uchida, and Hiroki Arimura. An efficient algorithm for enumerating closed patterns in transaction databases. In Disc Science, 16–31. Springer, 2004.