[WIP] Experiments/better fix topics and other mods #104


Status: Open. Wants to merge 50 commits into base branch master.

Commits (50):
7c94ebe  fix fix topics (Alvant, Mar 23, 2024)
5733971  fix diversity, debug fix in topic bank (Alvant, Jul 20, 2024)
37fa99e  add semantic var and focon intratext (Alvant, Jul 20, 2024)
d058374  lick code (Alvant, Jul 20, 2024)
baa026b  add tests for new old coherences (Alvant, Jul 20, 2024)
1df50f7  fix tests (Alvant, Jul 20, 2024)
d2dd283  return cautious get relatedness (allow unknown words) (Alvant, Jul 20, 2024)
eebb429  tributize newly added coherences (Alvant, Jul 20, 2024)
831a20a  xfail semantic var and focon in tests (Alvant, Jul 20, 2024)
6796a0b  fix scores tests (Alvant, Jul 20, 2024)
1db643a  fix main modality usage in topic bank train and init funcs (Alvant, Jul 20, 2024)
f8e90fa  fix topic bank modality in tests (Alvant, Jul 20, 2024)
57c775f  fix topic bank modality in tests try 2 (Alvant, Jul 20, 2024)
462d803  fix arora, fix copy phi in init func, enhance topic bank tests (Alvant, Jul 20, 2024)
18ab9c5  Merge branch 'master' into experiments/better-fix-topics-and-other-mods (Alvant, Jul 20, 2024)
5db31b2  fix topic bank tests with regularization func (Alvant, Jul 20, 2024)
e0b48e7  update reqs as in tested code, add setup file (Alvant, Jul 20, 2024)
793d788  allow bigartm 10 in reqs (Alvant, Jul 20, 2024)
e6c7568  remove protobuf from reqs (it will go with topicnet) (Alvant, Jul 21, 2024)
0418fc8  move regularizers from notebooks to files (Alvant, Jul 21, 2024)
ac348c8  refine regularizers usage in topic bank (Alvant, Jul 21, 2024)
26c506e  fix topic bank (experiment vs code conflict) (Alvant, Jul 21, 2024)
9572fa2  fix bank phi equality assert (atol) (Alvant, Jul 21, 2024)
fb93ab6  accelerate intratext (Alvant, Jul 21, 2024)
caa4056  turn off should compute for intratext (compute only on last iter) (Alvant, Jul 21, 2024)
b47f17c  return should compute for intratext to sane default (should) (Alvant, Jul 21, 2024)
f8a316b  make equal semi windows for sum over window coherence (Alvant, Jul 21, 2024)
31ef6ec  soften assert equal check in topic bank (increase stability) (Alvant, Jul 21, 2024)
88322ad  add debug message for tb equality assert (Alvant, Jul 21, 2024)
7a597c8  soften atol in tb check as low as possible to remain decent (Alvant, Jul 21, 2024)
20e0cb8  add test for sum over different windows (Alvant, Jul 21, 2024)
2af66ea  trying to speed up topden (try instead if) (Alvant, Jul 21, 2024)
3a9f102  trying to speed up topden try 2: remove np floor from window (Alvant, Jul 21, 2024)
b2567c6  speeding up topdep try 3: remove density intersections (Alvant, Jul 21, 2024)
97579e5  speeding up topdep try 3: remove density intersections (fix) (Alvant, Jul 21, 2024)
2dbfcab  speeding up topdep: np.sum -> sum (Alvant, Jul 22, 2024)
255c72d  speeding up topdep: sum(list) -> v += dv (Alvant, Jul 22, 2024)
b00a4f8  fix right border in +dv (Alvant, Jul 22, 2024)
d605f50  use lru cache (unlimited) for get_relatedness (Alvant, Jul 22, 2024)
a05a8bc  use lru cache for get word topic index (Alvant, Jul 22, 2024)
7a9b928  remove lru cache for get topic index (no speed up) (Alvant, Jul 22, 2024)
dda0e5b  remove pre-save in topicbank (may lead to inconsistent results) (Alvant, Aug 7, 2024)
98437d3  comment something in topic bank for somebody (Alvant, Mar 19, 2025)
7050805  add tests for regs (Alvant, Mar 19, 2025)
87732f3  refactor regs tests (Alvant, Mar 19, 2025)
2afc4ae  refine has_bcg usage (as much as possible) (Alvant, Mar 19, 2025)
fb55653  fix topic bank (Alvant, Mar 19, 2025)
1bc1813  add some comments for older comments in tb (Alvant, Mar 19, 2025)
cab4451  return input mode for tb (Alvant, Mar 19, 2025)
d27e44b  refine code, add pytest rerun (for a couple of intratext coherence te… (Alvant, Mar 19, 2025)
23 changes: 12 additions & 11 deletions requirements.txt
@@ -1,12 +1,13 @@
 anchor-topic==0.1.2
-bigartm==0.9.2
-dill==0.3.1.1
-lapsolver==1.0.2
-matplotlib
-numpy==1.22.0
-pandas==1.0.1
-pytest==5.3.5
-scikit-learn==1.5.0
-scipy==1.10.0
-topicnet>=0.8.0
-tqdm==4.66.3
+bigartm>=0.9.2
+dill==0.3.8
+lapsolver==1.1.0
+matplotlib==3.7.5
+numpy==1.24.4
+pandas==2.0.3
+pytest==8.1.1
+pytest-rerunfailures==14.0
+scikit-learn==1.3.2
+scipy==1.10.1
+topicnet>=0.9.0
+tqdm==4.66.2
2 changes: 2 additions & 0 deletions setup.cfg
@@ -0,0 +1,2 @@
[metadata]
description-file = README.md
45 changes: 45 additions & 0 deletions setup.py
@@ -0,0 +1,45 @@
from distutils.core import setup


setup(
    name='topnum',
    packages=[
        'topnum',
        'topnum.data',
        'topnum.scores',
        'topnum.search_methods',
        'topnum.search_methods.topic_bank',
        'topnum.search_methods.topic_bank.phi_initialization',
        'topnum.tests'
    ],
    version='0.3.0',
    license='MIT',
    description='A set of methods for finding an appropriate number of topics in a text collection',
    author='Machine Intelligence Laboratory',
    author_email='vasiliy.alekseyev@phystech.edu',
    url='https://github.com/machine-intelligence-laboratory/OptimalNumberOfTopics',
    keywords=[
        'topic modeling',
        'document clustering',
        'number of clusters',
        'ARTM',
        'regularization',
    ],
    install_requires=[
        'anchor-topic==0.1.2',
        'bigartm>=0.9.2',
        'dill==0.3.8',
        'lapsolver==1.1.0',
        'matplotlib==3.7.5',
        'numpy==1.24.4',
        'pandas==2.0.3',
        'pytest==8.1.1',
        'scikit-learn==1.3.2',
        'scipy==1.10.1',
        'topicnet>=0.9.0',
        'tqdm==4.66.2',
    ],
    classifiers=[
        'Programming Language :: Python :: 3.8',
    ],
)
79 changes: 79 additions & 0 deletions topnum/model_constructor.py
@@ -102,6 +102,9 @@ def init_model_from_family(
        model = init_decorrelated_plsa(
            dataset, modalities_to_use, main_modality, num_topics, model_params
        )
        # model = init_decorrelated_artm(
        #     dataset, modalities_to_use, main_modality, num_topics, 1, model_params
        # )
    elif family == "ARTM":
        model = init_baseline_artm(
            dataset, modalities_to_use, main_modality, num_topics, 1, model_params
@@ -213,6 +216,82 @@
    return model


# TODO: is it the same as init_baseline_artm?
def init_decorrelated_artm(
        dataset,
        modalities_to_use,
        main_modality,
        num_topics,
        bcg_topics,
        model_params: dict = None
):
    """
    Creates a simple ARTM model with standard scores.

    Parameters
    ----------
    dataset : Dataset
    modalities_to_use : list of str
    main_modality : str
    num_topics : int
    bcg_topics : int
        Number of background topics
    model_params : dict

    Returns
    -------
    model : artm.ARTM instance
    """
    if model_params is None:
        model_params = dict()

    model = init_plsa(
        dataset, modalities_to_use, main_modality, num_topics
    )
    tau = model_params.get('decorrelation_tau', 0.01)

    specific_topic_names = model.topic_names  # let's decorrelate everything
    model.regularizers.add(
        artm.DecorrelatorPhiRegularizer(
            gamma=0,
            tau=tau,
            name='decorrelation',
            topic_names=specific_topic_names,
            class_ids=modalities_to_use,
        )
    )

    dictionary = dataset.get_dictionary()
    baseline_class_ids = {class_id: 1 for class_id in modalities_to_use}
    data_stats = count_vocab_size(dictionary, baseline_class_ids)

    background_topic_names = model.topic_names[-bcg_topics:]
    specific_topic_names = model.topic_names[:-bcg_topics]

    # all coefficients are relative
    regularizers = [
        artm.SmoothSparsePhiRegularizer(
            name='smooth_phi_bcg',
            topic_names=background_topic_names,
            tau=model_params.get("smooth_bcg_tau", 0.1),
            class_ids=[main_modality],
        ),
        artm.SmoothSparseThetaRegularizer(
            name='smooth_theta_bcg',
            topic_names=background_topic_names,
            tau=model_params.get("smooth_bcg_tau", 0.1),
        ),
    ]

    for reg in regularizers:
        model.regularizers.add(transform_regularizer(
            data_stats,
            reg,
            model.class_ids,
            n_topics=len(reg.topic_names)
        ))

    return model


def _init_dirichlet_prior(name, num_topics, num_terms):
    """
    Adapted from github.com/RaRe-Technologies/gensim/blob/master/gensim/models/ldamodel.py#L521
5 changes: 5 additions & 0 deletions topnum/regularizers/__init__.py
@@ -0,0 +1,5 @@
from .fix_phi import FastFixPhiRegularizer
from .decorrelate_with_other_phi import (
DecorrelateWithOtherPhiRegularizer,
DecorrelateWithOtherPhiRegularizer2,
)
122 changes: 122 additions & 0 deletions topnum/regularizers/decorrelate_with_other_phi.py
@@ -0,0 +1,122 @@
from typing import List, Optional

import numpy as np
from numpy import ndarray
from pandas import DataFrame
from scipy.spatial.distance import cdist

from artm import ARTM
from topicnet.cooking_machine.models.base_regularizer import BaseRegularizer


# TODO: find (and make possible to use) relative taus for these regularizers

class DecorrelateWithOtherPhiRegularizer(BaseRegularizer):
    def __init__(
            self,
            name: str,
            tau: float,
            topic_names: List[str],
            other_phi: DataFrame,
    ):
        """
        Parameters
        ----------
        name
        tau
            To select a value, try a few test runs to find the tau
            that affects the perplexity (worsens it, but not very much).
            Recommendation based on experimentation: try 1e5 or 1e6.
        topic_names
        other_phi
        """
        super().__init__(name, tau=tau)

        self._topic_names = topic_names
        self._other_phi = other_phi
        self._other_topic_sum = self._other_phi.values.sum(
            axis=1, keepdims=True
        )

        self._topic_indices = None

    def grad(self, pwt: DataFrame, nwt: DataFrame) -> ndarray:
        rwt = np.zeros_like(pwt)
        rwt[:, self._topic_indices] += (
            pwt.values[:, self._topic_indices] * self._other_topic_sum
        )

        return -1 * self.tau * rwt

    def attach(self, model: ARTM) -> None:
        super().attach(model)

        phi = model.get_phi()
        self._topic_indices = [
            phi.columns.get_loc(topic_name)
            for topic_name in self._topic_names
        ]


class DecorrelateWithOtherPhiRegularizer2(BaseRegularizer):
    def __init__(
            self,
            name: str,
            tau: float,
            topic_names: List[str],
            other_phi: DataFrame,
            num_iters: Optional[int] = None,
    ):
        """
        Parameters
        ----------
        name
        tau
            To select a value, try a few test runs to find the tau
            that affects the perplexity (worsens it, but not very much).
            Recommendation based on experimentation: try 1e8, 1e9, or 1e10.
        topic_names
        other_phi
        num_iters
        """
        super().__init__(name, tau=tau)

        self._topic_names = topic_names
        self._other_phi = other_phi
        self._num_iters = num_iters
        self._cur_iter = 0

        self._topic_indices = None

    def grad(self, pwt: DataFrame, nwt: DataFrame) -> ndarray:
        rwt = np.zeros_like(pwt)

        if self._num_iters is not None and self._cur_iter >= self._num_iters:
            return rwt

        correlations = cdist(
            self._other_phi.values.T,
            pwt.values[:, self._topic_indices].T,
            lambda u, v: (u * v).sum()
        )
        weighted_other_topics = self._other_phi.values.dot(correlations)

        rwt[:, self._topic_indices] += (
            pwt.values[:, self._topic_indices] * weighted_other_topics
        )
        self._cur_iter += 1

        return -1 * self.tau * rwt

    def attach(self, model: ARTM) -> None:
        super().attach(model)

        phi = model.get_phi()
        self._topic_indices = [
            phi.columns.get_loc(topic_name)
            for topic_name in self._topic_names
        ]
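The custom `cdist` "metric" in `DecorrelateWithOtherPhiRegularizer2.grad` is just a dot product of two topic columns, so the correlation matrix it builds equals a plain matrix product. A minimal NumPy sketch with toy shapes (not the library's API; names here are illustrative) showing the equivalence:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n_words, n_other, n_sel = 100, 5, 3
other_phi = rng.random((n_words, n_other))  # stand-in for the fixed "other" Phi
pwt_sel = rng.random((n_words, n_sel))      # stand-in for pwt[:, topic_indices]

# The regularizer computes topic-by-topic correlations via cdist
# with a dot-product callback, as in the grad method above
correlations = cdist(other_phi.T, pwt_sel.T, lambda u, v: (u * v).sum())

# The same matrix as one matrix product (no Python-level callback)
correlations_mm = other_phi.T @ pwt_sel

assert correlations.shape == (n_other, n_sel)
assert np.allclose(correlations, correlations_mm)

# The gradient then weights the other topics by these correlations
weighted_other_topics = other_phi @ correlations  # shape (n_words, n_sel)
```

Given the PR's several speed-up commits, replacing the callback-based `cdist` with the matrix product could be a further optimization, since a Python callback is evaluated once per pair of columns.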
57 changes: 57 additions & 0 deletions topnum/regularizers/fix_phi.py
@@ -0,0 +1,57 @@
from typing import List, Optional

import numpy as np
from numpy import ndarray
from pandas import DataFrame

from artm import ARTM
from topicnet.cooking_machine.models.topic_model import TopicModel
from topicnet.cooking_machine.models.base_regularizer import BaseRegularizer


class FastFixPhiRegularizer(BaseRegularizer):
    _VERY_BIG_TAU = 10 ** 9

    def __init__(
            self,
            name: str,
            topic_names: List[str],
            parent_model: Optional[TopicModel] = None,  # TODO: TopicModel or ARTM?
            parent_phi: DataFrame = None,
            tau: float = _VERY_BIG_TAU,
    ):
        super().__init__(name, tau=tau)

        if parent_phi is None and parent_model is None:
            raise ValueError('Neither parent_phi nor parent_model is specified.')

        self._topic_names = topic_names
        self._topic_indices = None
        self._parent_model = parent_model
        self._parent_phi = parent_phi

    def grad(self, pwt: DataFrame, nwt: DataFrame) -> ndarray:
        rwt = np.zeros_like(pwt)

        if self._parent_phi is not None:
            # parent_phi is expected to contain exactly the columns being fixed
            parent_phi = self._parent_phi
            vals = parent_phi.values
        else:
            parent_phi = self._parent_model.get_phi()
            vals = parent_phi.values[:, self._topic_indices]

        assert vals.shape[0] == rwt.shape[0]
        assert vals.shape[1] == len(self._topic_indices), (vals.shape[1], len(self._topic_indices))

        rwt[:, self._topic_indices] += vals

        return self.tau * rwt

    def attach(self, model: ARTM) -> None:
        super().attach(model)

        phi = self._model.get_phi()
        self._topic_indices = [
            phi.columns.get_loc(topic_name)
            for topic_name in self._topic_names
        ]
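`FastFixPhiRegularizer` pins selected Phi columns by adding a huge additive term (`tau` around 1e9) proportional to the parent's columns, so after the M-step normalization those columns are dominated by the parent values. A small NumPy sketch of the `grad` logic under toy shapes (illustrative only, not the library's API):

```python
import numpy as np

rng = np.random.default_rng(1)
n_words, n_topics = 50, 4
tau = 10 ** 9  # _VERY_BIG_TAU

pwt = rng.random((n_words, n_topics))   # current Phi estimate (toy)
parent_phi = rng.random((n_words, 2))   # columns to be kept fixed (toy)
topic_indices = [0, 2]                  # positions of the fixed topics in pwt

# r_wt as computed by FastFixPhiRegularizer.grad:
# zero everywhere except the fixed columns, which get tau * parent_phi
rwt = np.zeros_like(pwt)
rwt[:, topic_indices] += parent_phi
rwt = tau * rwt

# Columns outside topic_indices receive zero gradient
assert rwt[:, 1].sum() == 0 and rwt[:, 3].sum() == 0
# For the fixed columns, n_wt + r_wt is dominated by tau * parent_phi,
# so those columns of the re-estimated Phi stay close to parent_phi
# (up to normalization)
```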