Rolling Windows API Documentation

The Lexos Rolling Windows module is used to analyse the frequency of patterns across rolling windows (also known as sliding windows) of units in a text. The Rolling Windows module is under development for the next release of the Lexos API. It implements a new programming interface for the Rolling Windows tool currently available in the Lexos web app, but with added functionality. For further information on the use of Rolling Windows, users are encouraged to click the "Help" button at the top right of the Lexos user interface and try out the web app.

Architecture of the Module

The __init__.py file contains the main logic for the module, including the main RollingWindows class used to manage the workflow. There are three main submodules, the functions of which are listed below:

  • filters: Classes to manage the filtering of documents prior to analysis.
  • calculators: Classes to manage statistical calculations.
  • plotters: Classes to manage the plotting of the results of Rolling Windows analyses.

Each submodule contains at least one class, which is registered in registry.py, allowing it to be treated as "built-in". Other built-in classes can be added in future releases, but users can also integrate their own custom classes to manage project-specific tasks not handled by built-in classes.

A fourth submodule, milestones, manages the labelling of structural divisions within documents. Since its use is not limited to Rolling Windows, it will become a component of the main Lexos library in the next release.

The file helpers.py contains functions used by more than one file in the module.

Note

Development of the rollingwindows module suffered from a (still unexplained) malfunction of the development environment, which caused a catastrophic loss of much of the code before it could be pushed to GitHub. The version here is a reconstruction which works but may not be as elegant or efficient as the original code. There may also be legacy code blocks with no function in the current code that have not yet been identified.

Each component of the module is documented separately below.

rollingwindows.__init__

This is the main component of the rollingwindows module, containing the RollingWindows class and associated functions.

rollingwindows.__init__.get_rw_component

Gets a component from the registry using a string id. Note that this is a near duplicate of rollingwindows.scrubber.registry.load_component.

def get_rw_component(id: str)
Parameter Description Required
id: str The string id of the component. Yes
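
The lookup-by-id idea can be illustrated with a minimal sketch. Everything in it is hypothetical: `REGISTRY`, `register`, `get_component`, and `MyCalculator` are illustrative names, and a plain dictionary stands in for whatever registry.py actually uses.

```python
from typing import Callable, Dict

# Hypothetical registry: maps string ids to component classes.
REGISTRY: Dict[str, Callable] = {}


def register(component_id: str) -> Callable:
    """Return a decorator that stores a class under the given id."""
    def decorator(cls: Callable) -> Callable:
        REGISTRY[component_id] = cls
        return cls
    return decorator


def get_component(component_id: str) -> Callable:
    """Look up a registered component by its string id."""
    try:
        return REGISTRY[component_id]
    except KeyError:
        raise ValueError(f"No component registered with id '{component_id}'.")


@register("my_calculator")
class MyCalculator:
    """A stand-in for a project-specific calculator class."""
    id = "my_calculator"


# Retrieve the class by id, as a string argument to a workflow method might.
calculator_cls = get_component("my_calculator")
```

Keeping the id on the class as well as in the registry mirrors the class attribute id described for the built-in components.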

rollingwindows.RollingWindows

The main class for managing the workflow and state of a Rolling Windows analysis.

class RollingWindows(doc: spacy.tokens.doc.Doc, model: str, *, patterns: Union[list, str] = None)

Attributes

Attribute Description Required
doc: spacy.tokens.doc.Doc A spaCy doc. Yes
model: str The name of the language model. Default: xx_sent_ud_sm. Yes
patterns: Union[list, str] A pattern or list of patterns to search in each window. Patterns can be strings, regex patterns, or spaCy Matcher rules. Default is None. No

The RollingWindows.metadata property returns a dictionary recording the current configuration state.

Private Methods

rollingwindows.RollingWindows._get_search_method

Gets a preliminary search method based on the type of window unit.

def _get_search_method(self, window_units: str = None) -> str
Parameter Description Required
window_units: str The units counted to construct windows: characters, lines, sentences, tokens. Default is None. Yes

rollingwindows.RollingWindows._get_units

Gets a list of characters, sentences, lines, or tokens from the doc.

def _get_units(self, doc: spacy.tokens.doc.Doc, window_units: str = "characters") -> Union[List[spacy.tokens.span.Span], spacy.tokens.doc.Doc]
Parameter Description Required
doc: spacy.tokens.doc.Doc A spaCy Doc object. Yes
window_units: str The units counted to construct windows: characters, lines, sentences, tokens. Default is characters. Yes

Public Methods

rollingwindows.RollingWindows.calculate

Uses a calculator to generate a rolling windows analysis and assigns the result to RollingWindows.result.

RollingWindows.calculate(patterns: Union[list, str] = None, calculator: Union[Callable, str] = "rw_calculator", query: str = "counts", show_spacy_rules: bool = False) -> None
Parameter Description Required
patterns: Union[list, str] The patterns to search for. Default is None. Yes
calculator: Union[Callable, str] The calculator to use. Default is the built-in "rw_calculator" calculator. Yes
query: str The type of data to return ("averages", "counts", or "ratios"). Default is "counts". Yes
show_spacy_rules: bool If the calculator uses a spaCy Matcher rule, tell the calculator's to_df method to display the rule as a column header; otherwise, only the value matched by the calculator will be displayed. Default is False. Yes

Note

For development purposes, RollingWindows.calculate() has a timer decorator, which will display the time elapsed when the windows are generated.

rollingwindows.RollingWindows.plot

Uses a plotter to generate a plot of a rolling windows analysis and assigns the result to RollingWindows.fig.

RollingWindows.plot(plotter: Union[Callable, str] = "rw_simple_plotter", file: str = None) -> None
Parameter Description Required
plotter: Union[Callable, str] The plotter to use. Default is the built-in "rw_simple_plotter". Yes
file: str The path to a file to save the plot. Default is None. No

rollingwindows.RollingWindows.set_windows

Generates rolling windows, creates a rollingwindows.Windows object, and assigns it to RollingWindows.windows.

RollingWindows.set_windows(n: int = 1000, window_units: str = "characters", alignment_mode: str = "strict", filter: Union[Callable, str] = None)
Parameter Description Required
n: int The number of units per window. Default is 1000. Yes
window_units: str The units counted to construct windows: characters, lines, sentences, tokens. Default is characters. Yes
alignment_mode: str How character-based windows snap to token boundaries.

- strict: No snapping
- contract: Window contains all tokens completely within the window's assigned start and end indices.
- expand: Window contains all tokens partially within the window's assigned start and end indices.

Default is strict.
No
filter: Union[Callable, str] The name of a filter or an instance of a filter to be applied to the doc before windows are generated. Default is None. No

Note

For development purposes, RollingWindows.set_windows() has a timer decorator, which will display the time elapsed when the windows are generated.
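
The three alignment modes can be illustrated without spaCy. The sketch below is an assumption-laden stand-in, not the module's code: `token_offsets` and `align_window` are hypothetical helpers, and tokens are approximated as runs of non-whitespace characters rather than spaCy tokens.

```python
import re
from typing import List, Tuple


def token_offsets(text: str) -> List[Tuple[int, int]]:
    """Approximate tokens as runs of non-whitespace and return their (start, end) offsets."""
    return [m.span() for m in re.finditer(r"\S+", text)]


def align_window(text: str, start: int, end: int, alignment_mode: str = "strict") -> str:
    """Return the text of a character window under the given alignment mode."""
    if alignment_mode == "strict":
        # No snapping: take the raw character slice.
        return text[start:end]
    offsets = token_offsets(text)
    if alignment_mode == "contract":
        # Keep only tokens lying completely inside the window.
        kept = [(s, e) for s, e in offsets if s >= start and e <= end]
    elif alignment_mode == "expand":
        # Keep every token that overlaps the window at all.
        kept = [(s, e) for s, e in offsets if s < end and e > start]
    else:
        raise ValueError(f"Unknown alignment mode: {alignment_mode}")
    if not kept:
        return ""
    return text[kept[0][0]:kept[-1][1]]


text = "The quick brown fox"
strict = align_window(text, 2, 12, "strict")
contract = align_window(text, 2, 12, "contract")
expand = align_window(text, 2, 12, "expand")
```

For the window covering characters 2 to 12, strict returns the raw slice "e quick br", contract returns "quick", and expand returns "The quick brown".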

rollingwindows.sliding_windows

Function to create a generator of sliding windows.

def sliding_windows(input: Union[List[spacy.tokens.span.Span], spacy.tokens.doc.Doc], n: int = 1000, window_units: str = "characters", alignment_mode: str = "strict") -> Iterator
Parameter Description Required
input: Union[List[spacy.tokens.span.Span], spacy.tokens.doc.Doc] Either a list of spaCy Span objects or a spaCy Doc object. Yes
n: int The number of units per window. Yes
window_units: str The units counted to construct windows: characters, lines, sentences, tokens. Default is characters. Yes
alignment_mode: str How character-based windows snap to token boundaries.

- strict: No snapping
- contract: Window contains all tokens completely within the window's assigned start and end indices.
- expand: Window contains all tokens partially within the window's assigned start and end indices.

Default is strict.
Yes
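
As a rough illustration of what a generator of sliding windows looks like, here is a minimal sketch over a plain Python sequence. It assumes that windows advance one unit at a time and that only full windows are produced; `sliding_windows_sketch` is a hypothetical name, and the real function also handles spaCy objects and alignment modes.

```python
from typing import Iterator, Sequence


def sliding_windows_sketch(units: Sequence, n: int = 1000) -> Iterator[Sequence]:
    """Yield every run of n consecutive units, advancing one unit at a time."""
    if n < 1:
        raise ValueError("n must be a positive integer.")
    # The last full window starts at len(units) - n.
    for start in range(len(units) - n + 1):
        yield units[start:start + n]


tokens = "the cat sat on the mat".split()
windows = list(sliding_windows_sketch(tokens, n=3))
```

Six tokens with n=3 produce four overlapping windows; a sequence shorter than n produces none.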

rollingwindows.Windows

A dataclass for storing a generator of rolling windows and associated metadata.

class Windows(windows: Iterable, window_units: str, n: int, alignment_mode: str = "strict")

    def __iter__(self):
        return self.windows

Parameter Description Required
windows: Iterable An iterable (normally a generator) of rolling windows. Yes
window_units: str The units counted to construct windows: characters, lines, sentences, tokens. Default is characters. Yes
n: int The number of units per window. Yes
alignment_mode: str How character-based windows snap to token boundaries.

- strict: No snapping
- contract: Window contains all tokens completely within the window's assigned start and end indices.
- expand: Window contains all tokens partially within the window's assigned start and end indices.

Default is strict.
Yes

Private Methods

rollingwindows.Windows.__iter__

Returns the value of self.windows.

Note

This value is a generator, so iterating through the Windows object will exhaust it: a second pass over the same object yields no windows.
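
The exhaustion behaviour is easy to reproduce with a stand-in dataclass. `WindowsSketch` below is a hypothetical simplification that copies only the fields and the `__iter__` method shown above.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class WindowsSketch:
    """A simplified stand-in for the Windows dataclass."""
    windows: Iterable
    window_units: str
    n: int
    alignment_mode: str = "strict"

    def __iter__(self) -> Iterator:
        # Returns the stored generator itself, not a fresh copy.
        return self.windows


text = "abcdef"
generator = (text[i:i + 3] for i in range(len(text) - 2))
windows = WindowsSketch(generator, "characters", 3)

first_pass = list(windows)   # consumes the generator
second_pass = list(windows)  # nothing left to yield
```

If the windows are needed more than once, one option is to convert them to a list on the first pass and reuse that list.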

rollingwindows.calculators

Contains registered calculators. There is currently one registered calculator: RWCalculator. Each calculator is a subclass of BaseCalculator, which has a metadata property.

rollingwindows.calculators.RWCalculator

A calculator class to calculate rolling averages, counts, and ratios of a matched pattern or patterns in a document. The property RWCalculator.regex_flags returns the flags used with methods that call the Python re module.

rollingwindows.calculators.RWCalculator has a class attribute id, the value of which is "rw_calculator". This is the id registered in the registry.

class RWCalculator(*, patterns: Union[List, str] = None, windows: Windows = None, mode: str = "exact", case_sensitive: bool = False, alignment_mode: str = "strict", model: str = "xx_sent_ud_sm", original_doc: spacy.tokens.doc.Doc = None, query: str = "counts")
Parameter Description Required
patterns: Union[list, str] A pattern or list of patterns to search in windows. Default is None No
windows: Windows A rollingwindows.Windows object containing the windows to search. No
mode: str The type of search method to use:

- exact: Search for exact matches to a string pattern.
- regex: Search for matches to a regex expression.
- spacy_rule: Search for matches using a spaCy Matcher class rule.
- multi_token: Search for matches to a regex expression across multiple tokens.
- multi_token_exact: Search for matches to string pattern across multiple tokens.

Default is exact. See the explanation below.
No
case_sensitive: bool Whether to make searches case-sensitive. Default is False. No
alignment_mode: str Whether to snap searches to token boundaries:

- strict: No snapping.
- contract: Count all matches that fall completely within token boundaries.
- expand: Count all matches that fall partially within token boundaries.

Default is strict.
No
model: str The name of the language model to be used with spaCy's Matcher class. Default is xx_sent_ud_sm. No
original_doc: spacy.tokens.doc.Doc A copy of the original doc. Default is None. No, except if mode is set to multi_token or multi_token_exact
query: str The type of data to return: "averages", "counts", or "ratios". Default is counts. No

If the window is a string, exact and regex modes will match patterns irrespective of token boundaries. If the window is a list of strings or a spaCy Span object, patterns will be matched within the boundaries of each token. The spacy_rule option allows for complex searches using spaCy's Matcher class; however, it searches for tokens or sequences of tokens that match the patterns. The alternative is the multi_token mode, which will match a regular expression both within and across token boundaries. The multi_token_exact version escapes regex special characters so that raw strings containing characters like +*?^$()[]{}|\ can be matched. Both multi_token options can be configured with alignment_mode to determine how the matcher responds to token boundaries. Note that because character indexes are required to locate the matched character range in the original document, the multi_token modes require that you also pass the source document to the original_doc attribute.
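
The difference between a regex search and an escaped, exact search over running text can be sketched with the standard re module. This is an illustration only: `count_matches` is a hypothetical helper, and it searches a plain string rather than a spaCy window.

```python
import re


def count_matches(pattern: str, text: str, exact: bool = False, case_sensitive: bool = False) -> int:
    """Count non-overlapping matches of a pattern in a string."""
    if exact:
        # Escape special characters so the pattern is matched literally.
        pattern = re.escape(pattern)
    flags = 0 if case_sensitive else re.IGNORECASE
    return len(re.findall(pattern, text, flags))


text = "What is 2+2? It is 4 (four)."

regex_count = count_matches(r"\bis\b", text)           # regex search
exact_count = count_matches("2+2?", text, exact=True)  # literal search
unescaped_count = count_matches("2+2?", text)          # "+" and "?" act as operators
```

Without escaping, "2+2?" is read as "one or more 2s followed by an optional 2", so it matches each "2" separately instead of the literal string.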

Private Methods

rollingwindows.calculators.RWCalculator._assign_variable

Try to use configured values if not passed by public functions.

def _assign_variable(self, var_name: str, var: Any) -> Any
Parameter Description Required
var_name: str The name of the variable. Yes
var: Any The variable to be evaluated. Yes
rollingwindows.calculators.RWCalculator._count_character_patterns_in_character_windows

Uses Python count() to count exact character matches in a character window.

def _count_character_patterns_in_character_windows(self, window: str, pattern: str) -> int
Parameter Description Required
pattern: str The pattern to match. Yes
window: str The window to search. Yes

rollingwindows.calculators.RWCalculator._count_in_character_window

Chooses the function for counting matches in character windows.

def _count_in_character_window(self, window: str, pattern: str) -> int
Parameter Description Required
pattern: str The pattern to match. Yes
window: str The window to search. Yes

rollingwindows.calculators.RWCalculator._count_token_patterns_in_token_lists

Counts patterns in lists of token strings.

def _count_token_patterns_in_token_lists(self, window: List[str], pattern: str) -> int
Parameter Description Required
pattern: str A string pattern to search for. Yes
window: List[str] A window consisting of a list of strings. Yes

rollingwindows.calculators.RWCalculator._count_token_patterns_in_span

Counts patterns in a spaCy Span object.

def _count_token_patterns_in_span(self, window: spacy.tokens.span.Span, pattern: Union[list, str]) -> int
Parameter Description Required
pattern: Union[list, str] A string pattern or spaCy rule to search for. Yes
window: spacy.tokens.span.Span A window consisting of a spaCy Span object. Yes

rollingwindows.calculators.RWCalculator._count_token_patterns_in_span_text

Counts patterns in span text with token alignment.

def _count_token_patterns_in_span_text(self, window: str, pattern: str) -> int
Parameter Description Required
pattern: str A string pattern or spaCy rule to search for. Yes
window: str A string window. Yes

rollingwindows.calculators.RWCalculator._count_in_token_window

Chooses the function for counting matches in token windows.

def _count_in_token_window(self, window: Union[List[str], spacy.tokens.span.Span], pattern: Union[list, str]) -> int
Parameter Description Required
pattern: Union[list, str] A string pattern or spaCy rule to search for. Yes
window: Union[List[str], spacy.tokens.span.Span] A window consisting of a list of token strings or a spaCy Span object. Yes

rollingwindows.calculators.RWCalculator._extract_string_pattern

Extracts a string pattern from a spaCy rule.

def _extract_string_pattern(self, pattern: Union[dict, list, str]) -> str
Parameter Description Required
pattern: Union[dict, list, str] A pattern to search. Yes

rollingwindows.calculators.RWCalculator._get_ratio

Calculates a ratio from a pair of counts.

def _get_ratio(self, counts: List[int]) -> float
Parameter Description Required
counts: List[int] A list of two counts. Yes

rollingwindows.calculators.RWCalculator._get_window_count

Calls character or token window methods, as appropriate.

def _get_window_count(self, window: Union[List[str], spacy.tokens.span.Span, str], pattern: Union[list, str]) -> int
Parameter Description Required
pattern: Union[list, str] A string pattern or spaCy rule to search for. Yes
window: Union[List[str], spacy.tokens.span.Span, str] A window consisting of a list of token strings, a spaCy Span object, or a string. Yes

Public Methods

rollingwindows.calculators.RWCalculator.get_averages

Calls rollingwindows.calculators.RWCalculator.run with the query set to "averages".

def get_averages(self, windows: Iterable = None, patterns: Union[List, str] = None) -> None
Parameter Description Required
patterns: Union[List, str] A string pattern or spaCy rule, or a list of either. No
windows: Iterable A rollingwindows.Windows object. No

rollingwindows.calculators.RWCalculator.get_counts

Calls rollingwindows.calculators.RWCalculator.run with the query set to "counts".

def get_counts(self, windows: Iterable = None, patterns: Union[List, str] = None) -> None
Parameter Description Required
patterns: Union[List, str] A string pattern or spaCy rule, or a list of either. No
windows: Iterable A rollingwindows.Windows object. No

rollingwindows.calculators.RWCalculator.get_ratios

Calls rollingwindows.calculators.RWCalculator.run with the query set to "ratios".

def get_ratios(self, windows: Iterable = None, patterns: Union[List, str] = None) -> None
Parameter Description Required
patterns: Union[List, str] A string pattern or spaCy rule, or a list of either. No
windows: Iterable A rollingwindows.Windows object. No

rollingwindows.calculators.RWCalculator.run

Runs the calculator, which performs calculations and saves the result to RWCalculator.data.

def run(self, windows: Iterable = None, patterns: Union[List, str] = None, query: str = "counts") -> None
Parameter Description Required
patterns: Union[List, str] A string pattern or spaCy rule, or a list of either. No
windows: Iterable A rollingwindows.Windows object. No
query: str String designating whether to return "counts", "averages", or "ratios". Default is counts. No

rollingwindows.calculators.RWCalculator.to_df

Converts the data in RWCalculator.data to a pandas DataFrame.

def to_df(self, show_spacy_rules: bool = False) -> pd.DataFrame
Parameter Description Required
show_spacy_rules: bool If the calculator uses a spaCy Matcher rule, tell the calculator's to_df method to display the rule as a column header; otherwise, only the value matched by the calculator will be displayed. Default is False. No

rollingwindows.filters

Contains registered filters. There are currently two registered filters: WordFilter and NonStopwordFilter. Each filter is a subclass of the BaseFilter class, which has a metadata property.

rollingwindows.filters.filter_doc

Applies a filter to a document and returns a new document.

def filter_doc(doc: spacy.tokens.doc.Doc, keep_ids: List[int], spacy_attrs: List[str] = SPACY_ATTRS, force_ws: bool = True) -> spacy.tokens.doc.Doc
Parameter Description Required
doc: spacy.tokens.doc.Doc A spaCy Doc object. Yes
keep_ids: List[int] A list of spaCy Token ids to keep in the filtered Doc. Yes
spacy_attrs: List[str] A list of spaCy Token attributes to keep in the filtered Doc. Default is the SPACY_ATTRS list imported with filters.* No
force_ws: bool Force a whitespace at the end of every token except the last. Default is True. No

* The default list of spaCy token attributes can be inspected by calling filters.SPACY_ATTRS.

rollingwindows.filters.get_doc_array

Converts a spaCy Doc object into a numpy array.

def get_doc_array(doc: spacy.tokens.doc.Doc, spacy_attrs: List[str] = SPACY_ATTRS, force_ws: bool = True) -> np.ndarray
Parameter Description Required
doc: spacy.tokens.doc.Doc A spaCy Doc object. Yes
spacy_attrs: List[str] A list of spaCy Token attributes to keep in the filtered Doc. Default is the SPACY_ATTRS list imported with filters.* No
force_ws: bool Force a whitespace at the end of every token except the last. Default is True. No

* The default list of spaCy token attributes can be inspected by calling filters.SPACY_ATTRS.

The following options are available for handling whitespace:

  1. force_ws=True ensures that the text_with_ws and whitespace_ attributes are preserved, but all tokens will be separated by whitespaces in the text of a doc created from the array.
  2. force_ws=False with SPACY in spacy_attrs preserves the text_with_ws and whitespace_ attributes and their original values. This may cause tokens to be merged if subsequent processing operates on the doc.text.
  3. force_ws=False without SPACY in spacy_attrs does not preserve the text_with_ws and whitespace_ attributes or their values. By default, doc.text displays a single space between each token.

rollingwindows.filters.is_not_roman_numeral

Returns True if a token is not a Roman numeral. Works only on upper-case Roman numerals.

def is_not_roman_numeral(s: str) -> bool
Parameter Description Required
s: str A string to match against the Roman numerals pattern. Yes
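
The restriction to upper-case numerals can be seen in a minimal sketch. The regular expression below is a commonly used pattern for Roman numerals up to 3999 and is an assumption; the pattern used by the module itself is not shown in this documentation, and `is_not_roman_numeral_sketch` is a hypothetical name.

```python
import re

# Assumed pattern: thousands, hundreds, tens, and units, upper case only.
ROMAN_NUMERAL = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def is_not_roman_numeral_sketch(s: str) -> bool:
    """Return True if the string is not an upper-case Roman numeral."""
    if s == "":
        # The pattern can match an empty string, which is not a numeral.
        return True
    return ROMAN_NUMERAL.fullmatch(s) is None


results = {s: is_not_roman_numeral_sketch(s) for s in ["XIV", "xiv", "MMXXIV", "word", "IIII"]}
```

Under this pattern the English pronoun "I" is indistinguishable from the numeral for one, which is worth bearing in mind when excluding Roman numerals from English texts.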

rollingwindows.filters.NonStopwordFilter

A filter class to remove stop words from a document. This is a minimal function that strips punctuation and returns the ids of words not flagged as stop words by the language model or in a list of additional_stopwords. The property NonStopwordFilter.word_ids returns the token ids for all tokens in the document that are not stop words according to these criteria.

rollingwindows.filters.NonStopwordFilter has a class attribute id, the value of which is "non_stopword_filter". This is the id registered in the registry.

class NonStopwordFilter(doc: spacy.tokens.doc.Doc, *, spacy_attrs: List[str] = SPACY_ATTRS, additional_stopwords: List[str] = None, case_sensitive: bool = False)
Parameter Description Required
doc: spacy.tokens.doc.Doc A spaCy Doc object. Yes
spacy_attrs: List[str] A list of spaCy Token attributes to keep in the filtered Doc. Default is the SPACY_ATTRS list imported with filters. No
additional_stopwords: List[str] A list of stop words to add to those labelled as stop words by the model. Default is None. No
case_sensitive: bool Use only lower case forms if False. Default is False. No

Private Methods

rollingwindows.filters.NonStopwordFilter._is_non_stopword

Returns True if a token is not a stop word.

def _is_non_stopword(self, token: spacy.tokens.Token) -> bool
Parameter Description Required
token: spacy.tokens.Token A spaCy Token object. Yes

Public Methods

rollingwindows.filters.NonStopwordFilter.apply

Applies the filter and returns a new, filtered doc.

def apply(self) -> spacy.tokens.doc.Doc

rollingwindows.filters.WordFilter

A filter class to remove non-words from a document. The property WordFilter.word_ids returns the token ids for all tokens in the document that are identified as words according to supplied criteria.

rollingwindows.filters.WordFilter has a class attribute id, the value of which is "word_filter". This is the id registered in the registry.

class WordFilter(doc: spacy.tokens.doc.Doc, *, spacy_attrs: List[str] = SPACY_ATTRS, exclude: Union[List[str], str] = [" ", "\n"], exclude_digits: bool = False, exclude_roman_numerals: bool = False, exclude_pattern: Union[List[str], str] = None)
Parameter Description Required
doc: spacy.tokens.doc.Doc A spaCy Doc object. Yes
spacy_attrs: List[str] A list of spaCy Token attributes to keep in the filtered Doc. Default is the SPACY_ATTRS list imported with filters. No
exclude: Union[List[str], str] A string/regex or list of strings/regex patterns to exclude. Default is [" ", "\n"]. No
exclude_digits: bool If True, digits will not be treated as words. Default is False. No
exclude_roman_numerals: bool If True, Roman numerals will not be treated as words. However, this only works with capitalised Roman numerals. Default is False. No
exclude_pattern: Union[List[str], str] Additional regex patterns to add to the default exclude list. Default is None. No

Public Methods

rollingwindows.filters.WordFilter.apply

Applies the filter and returns a new, filtered doc.

def apply(self) -> spacy.tokens.doc.Doc

rollingwindows.helpers

Contains helper functions used by multiple files in the module. rollingwindows.helpers.ensure_doc may be legacy code that is not used in the current version.

rollingwindows.helpers.ensure_doc

Converts input into a spaCy Doc object. The returned Doc is unannotated if it is derived from a string or a list of tokens.

def ensure_doc(input: Union[str, List[str], spacy.tokens.doc.Doc], nlp: Union[Language, str], batch_size: int = 1000) -> spacy.tokens.doc.Doc
Parameter Description Required
input: Union[str, List[str], spacy.tokens.doc.Doc] A string, list of tokens, or a spaCy doc. Yes
nlp: Union[Language, str] The language model to use. Yes
batch_size: int The number of texts to accumulate in an internal buffer. Default is 1000. No

rollingwindows.helpers.ensure_list

Wraps any input in a list if it is not already a list.

def ensure_list(input: Any) -> list
Parameter Description Required
input: Any An input variable. Yes
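
A minimal sketch of the wrapping behaviour, assuming that only list instances are passed through unchanged (`ensure_list_sketch` is a hypothetical name):

```python
from typing import Any


def ensure_list_sketch(item: Any) -> list:
    """Return the item unchanged if it is a list; otherwise wrap it in one."""
    return item if isinstance(item, list) else [item]


single = ensure_list_sketch("love")            # a string is wrapped, not split
multiple = ensure_list_sketch(["love", "hate"])  # a list is returned as is
```

This is what allows a single pattern and a list of patterns to be handled by the same code path.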

rollingwindows.helpers.spacy_rule_to_lower

Converts a spaCy Matcher rule to lower case.

def spacy_rule_to_lower(patterns: Union[Dict, List[Dict]], old_key: Union[List[str], str] = ["TEXT", "ORTH"], new_key: str = "LOWER") -> list
Parameter Description Required
patterns: Union[Dict, List[Dict]] A spaCy Matcher rule, given as a dict or a list of dicts. Yes
old_key: Union[List[str], str] A dictionary key or list of keys to rename. Default is ["TEXT", "ORTH"]. No
new_key: str The new key name. Default is LOWER. No
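
The key renaming can be sketched on plain dictionaries. This sketch assumes that string values are lower-cased along with the key change and that nested values are left untouched; `rule_to_lower_sketch` is a hypothetical name and the module's own handling may differ.

```python
from typing import Dict, List, Union


def rule_to_lower_sketch(
    patterns: Union[Dict, List[Dict]],
    old_key: Union[List[str], str] = ("TEXT", "ORTH"),
    new_key: str = "LOWER",
) -> list:
    """Rename case-sensitive keys in a Matcher rule and lower-case their string values."""
    if isinstance(patterns, dict):
        patterns = [patterns]
    old_keys = [old_key] if isinstance(old_key, str) else list(old_key)
    lowered = []
    for token_pattern in patterns:
        new_pattern = {}
        for key, value in token_pattern.items():
            if key in old_keys:
                # Assumption: only plain string values are lower-cased.
                new_pattern[new_key] = value.lower() if isinstance(value, str) else value
            else:
                new_pattern[key] = value
        lowered.append(new_pattern)
    return lowered


rule = [{"TEXT": "Rolling"}, {"ORTH": "Windows"}, {"POS": "NOUN"}]
lowered_rule = rule_to_lower_sketch(rule)
```

Keys other than those listed in old_key, such as POS here, pass through unchanged.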

rollingwindows.plotters

Contains registered plotters. There are currently two registered plotters: RWSimplePlotter and RWPlotlyPlotter. Each plotter is a subclass of the BasePlotter class, which has a metadata property.

rollingwindows.plotters.interpolate

Returns interpolated points for plots that use interpolation. The interpolation function may be either scipy.interpolate.pchip_interpolate, numpy.interp, or one of the options for scipy.interpolate.interp1d. Note however, that scipy.interpolate.interp1d is deprecated.

def interpolate(x: np.ndarray, y: np.ndarray, xx: np.ndarray, interpolation_kind: str = None) -> np.ndarray
Parameter Description Required
x: np.ndarray The x values. Yes
y: np.ndarray The y values. Yes
xx: np.ndarray The projected interpolation range. Yes
interpolation_kind: str The interpolation function to use. Default is None. No

rollingwindows.plotters.RWPlotlyPlotter

Generates a plot using Plotly.

rollingwindows.plotters.RWPlotlyPlotter has a class attribute id, the value of which is "rw_plotly_plotter". This is the id registered in the registry.

class RWPlotlyPlotter(width: int = 700, height: int = 450, title: Union[dict, str] = "Rolling Windows Plot", xlabel: str = "Token Count", ylabel: str = "Average Frequency", line_color: str = "variable", showlegend: bool = True, titlepad: float = None, show_milestones: bool = True, milestone_marker_style: dict = {"width": 1, "color": "teal"}, show_milestone_labels: bool = False, milestone_labels: List[dict] = None, milestone_label_rotation: float = 0.0, milestone_label_style: dict = {"size": 10.0, "family": "Open Sans, verdana, arial, sans-serif", "color": "teal"}, **kwargs)
Attribute Description Required
width: int The figure width in pixels. Default is 700. No
height: int The figure height in pixels. Default is 450. No
title: Union[dict, str] The title of the figure. Styling can be added by passing a dict with the keywords described in Plotly's documentation. Default is Rolling Windows Plot. No
xlabel: str The text to display along the x axis. Default is Token Count. No
ylabel: str The text to display along the y axis. Default is Average Frequency. No
line_color: str The colour to be used for the lines on the line graph. Default is variable. No
showlegend: bool Whether to show the legend. Default is True. No
titlepad: float The margin in pixels between the title and the top of the graph. If not set, the margin will be calculated automatically from milestone label heights if they are shown. Default is None. No
show_milestones: bool Whether to show the milestone markers. Default is False. No
milestone_marker_style: dict A dict containing the styles to apply to the milestone marker. For valid properties, see the Plotly documentation. Default is {"width": 1, "color": "teal"}. No
show_milestone_labels: bool Whether to show the milestone labels. Default is False. No
milestone_labels: Dict[str, int] A dict with keys as milestone labels and values as points on the x-axis. Default is None. No
milestone_label_rotation: Union[float, int] The clockwise rotation of the milestone labels up to 90 degrees. Default is 0.0. No
milestone_label_style: dict A dict containing the styling information for the milestone labels. For valid properties, see the Plotly documentation. Default is {"size": 10.0, "family": "Open Sans, verdana, arial, sans-serif", "color": "teal"}. No

Tip

When milestone labels are shown and titlepad is not set manually, the class attempts to detect a suitable margin by using the same trick as RWSimplePlotter: it constructs a plot in matplotlib and measures the longest label to use as a guide. This produces reasonable results unless you change the figure height. In that case, it is advisable to set titlepad manually.

Once the figure is generated, it can be accessed with self.fig. You can then call self.fig.update_layout() and modify the figure using any of the parameters available in the Plotly documentation. This is useful to make changes not enabled by the Lexos API.

Private Methods

rollingwindows.plotters.RWPlotlyPlotter._check_duplicate_labels

Adds numeric suffixes for duplicate milestone labels. Returns a dictionary containing unique keys.

def _check_duplicate_labels(self, locations: List[Dict[str, int]]) -> List[Dict[str, int]]
Parameter Description Required
locations: List[Dict[str, int]] A list of location dicts. Yes

Note

The method is not yet implemented. The documentation here is copied from RWSimplePlotter since it should be substantially the same. That said, the class currently requires milestones to be submitted as a dictionary, which requires unique keys. So this needs some further thought.
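
One possible approach to suffixing duplicates is sketched below. Since the method is not yet implemented, everything here is an assumption: the function name, the underscore-and-number suffix format, and the choice to number every occurrence of a repeated label.

```python
from collections import Counter
from typing import Dict, List


def add_suffixes_sketch(locations: List[Dict[str, int]]) -> List[Dict[str, int]]:
    """Append numeric suffixes to labels that occur more than once."""
    totals = Counter(label for location in locations for label in location)
    seen: Counter = Counter()
    result = []
    for location in locations:
        renamed = {}
        for label, x in location.items():
            if totals[label] > 1:
                # Number each occurrence of a repeated label in order.
                seen[label] += 1
                renamed[f"{label}_{seen[label]}"] = x
            else:
                renamed[label] = x
        result.append(renamed)
    return result


locations = [{"Chapter": 0}, {"Chapter": 1200}, {"Epilogue": 2500}]
unique = add_suffixes_sketch(locations)
```

The sketch does not guard against a suffixed label colliding with a label that already exists, which a full implementation would need to consider.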

rollingwindows.plotters.RWPlotlyPlotter._get_axis_and_title_labels

Ensures that the title, xlabel, and ylabel values are dicts.

def _get_axis_and_title_labels(self) -> Tuple[bool, str]

rollingwindows.plotters.RWPlotlyPlotter._get_titlepad

Get a titlepad value based on the height of the longest milestone label if the titlepad class attribute is not set.

def _get_titlepad(self, labels: Dict[str, int]) -> float
Parameter Description Required
labels: Dict[str, int] A dict with the labels as keys. Yes

rollingwindows.plotters.RWPlotlyPlotter._plot_milestone_label

Adds a milestone label to the Plotly figure.

def _plot_milestone_label(self, label: str, x: int) -> None
Parameter Description Required
label: str The text of a milestone label. Yes
x: int The location of the milestone label on the x axis. Yes

rollingwindows.plotters.RWPlotlyPlotter._plot_milestone_marker

Adds a milestone marker (vertical line) to the Plotly figure.

def _plot_milestone_marker(self, x: int, df_val_min: Union[float, int], df_val_max: Union[float, int]) -> None
Parameter Description Required
x: int The location of the milestone label on the x axis. Yes
df_val_min: Union[float, int] The minimum value in the pandas DataFrame. Yes
df_val_max: Union[float, int] The maximum value in the pandas DataFrame. Yes

Public Methods

rollingwindows.plotters.RWPlotlyPlotter.run

Runs the plotter and saves the figure to RWPlotlyPlotter.fig.

def run(self, df: pd.DataFrame) -> None
Parameter Description Required
df: pandas.DataFrame A pandas DataFrame, normally stored in RollingWindows.result. Yes

rollingwindows.plotters.RWPlotlyPlotter.save

Saves the plot to a file.

def save(self, path: str, **kwargs) -> None
Parameter Description Required
path: str The path to the file where the figure is to be saved. Yes

Note

If the path ends in .html, this method will attempt to save the figure as a dynamic HTML file. The method accepts any keyword available for Plotly's Figure.write_html method.

Otherwise, it will attempt to save the figure as a static file in the format suggested by the extension in the filename (e.g. .png, .jpg, .pdf). The method accepts any keyword available for Plotly's Figure.write_image method.
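
The extension-based dispatch can be sketched as follows. `save_figure_sketch` is a hypothetical function, and `FakeFigure` is a stand-in used so that the example runs without Plotly installed; a real Plotly figure provides write_html and write_image methods with the same names.

```python
from pathlib import Path


def save_figure_sketch(fig, path: str, **kwargs) -> None:
    """Save as dynamic HTML if the path ends in .html, otherwise as a static image."""
    if Path(path).suffix.lower() == ".html":
        fig.write_html(path, **kwargs)
    else:
        fig.write_image(path, **kwargs)


class FakeFigure:
    """Records which method was called instead of writing a file."""

    def __init__(self) -> None:
        self.calls = []

    def write_html(self, path: str, **kwargs) -> None:
        self.calls.append(("html", path))

    def write_image(self, path: str, **kwargs) -> None:
        self.calls.append(("image", path))


fake = FakeFigure()
save_figure_sketch(fake, "plot.html")
save_figure_sketch(fake, "plot.png")
```

Passing keyword arguments straight through is what lets callers use any option accepted by the underlying Plotly method.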

rollingwindows.plotters.RWPlotlyPlotter.show

Displays a generated figure. This method calls matplotlib.pyplot.show. However, since this does not work with an inline backend like Jupyter notebooks, the method tries to detect this environment via a UserWarning and then just calls the plot attribute.

def show(self, config={"displaylogo": False}, **kwargs) -> None

| Parameter | Description | Required |
| --- | --- | --- |
| config: dict | A dictionary supplying Plotly configuration values. | No |

rollingwindows.plotters.RWSimplePlotter

Generates a plot using matplotlib.pyplot.

rollingwindows.plotters.RWSimplePlotter has a class attribute id, the value of which is "rw_simple_plotter". This is the id registered in the registry.

class RWSimplePlotter(width: Union[float, int] = 6.4, height: Union[float, int] = 4.8, figsize: tuple = None, hide_spines: List[str] = ["top", "right"], title: str = "Rolling Windows Plot", titlepad: float = 6.0, title_position: str = "top", show_legend: bool = True, show_grid: bool = False, xlabel: str = "Token Count", ylabel: str = "Average Frequency", show_milestones: bool = False, milestone_colors: Union[List[str], str] = "teal", milestone_style: str = "--", milestone_width: int = 1, show_milestone_labels: bool = False, milestone_labels: List[dict] = None, milestone_labels_ha: str = "left", milestone_labels_va: str = "baseline", milestone_labels_rotation: int = 45, milestone_labels_offset: tuple = (-8, 4), milestone_labels_textcoords: str = "offset pixels", use_interpolation: bool = False, interpolation_num: int = 500, interpolation_kind: str = "pchip", **kwargs)

| Attribute | Description | Default |
| --- | --- | --- |
| width: Union[float, int] | The figure width in inches. Default is 6.4. | 6.4 |
| height: Union[float, int] | The figure height in inches. Default is 4.8. | 4.8 |
| figsize: tuple | A tuple containing the figure width and height in inches (overrides the width and height settings). Default is None. | None |
| hide_spines: List[str] | A list of ["top", "right", "bottom", "left"] indicating which spines to hide. Default is ["top", "right"]. | ["top", "right"] |
| title: str | The title to use for the plot. Default is Rolling Windows Plot. | Rolling Windows Plot |
| titlepad: float | The padding in points to place between the title and the plot, which may need to be increased if you are showing milestone labels. Default is 6.0. | 6.0 |
| title_position: str | Show the title on the "bottom" or the "top" of the figure. Default is top. | top |
| show_legend: bool | Whether to show the legend. Default is True. | True |
| show_grid: bool | Whether to show the grid. Default is False. | False |
| xlabel: str | The text to display along the x axis. Default is Token Count. | Token Count |
| ylabel: str | The text to display along the y axis. Default is Average Frequency. | Average Frequency |
| show_milestones: bool | Whether to show the milestone markers. Default is False. | False |
| milestone_colors: Union[List[str], str] | The colour or colours to use for milestone markers. See pyplot.vlines(). Default is teal. | teal |
| milestone_style: str | The style of the milestone markers. See pyplot.vlines(). Default is --. | -- |
| milestone_width: int | The width of the milestone markers. See pyplot.vlines(). Default is 1. | 1 |
| show_milestone_labels: bool | Whether to show the milestone labels. Default is False. | False |
| milestone_labels: List[dict] | A list of dicts with keys as milestone labels and values as token indexes. Default is None. | None |
| milestone_labels_ha: str | The horizontal alignment of the milestone labels. See pyplot.annotate(). Default is left. | left |
| milestone_labels_va: str | The vertical alignment of the milestone labels. See pyplot.annotate(). Default is baseline. | baseline |
| milestone_labels_rotation: int | The rotation of the milestone labels in degrees. See pyplot.annotate(). Default is 45. | 45 |
| milestone_labels_offset: tuple | A tuple containing the number of pixels along the x and y axes to offset the milestone labels. See pyplot.annotate(). Default is (-8, 4). | (-8, 4) |
| milestone_labels_textcoords: str | Whether to offset milestone labels by pixels or points. See pyplot.annotate(). Default is offset pixels. | offset pixels |
| use_interpolation: bool | Whether to use interpolation on values. Default is False. | False |
| interpolation_num: int | Number of values to add between points. Default is 500. | 500 |
| interpolation_kind: str | Algorithm to use for interpolation. Default is pchip. | pchip |

If your RollingWindows.doc has milestones, you can display them as vertical lines on the graph. If show_milestone_labels is set to True, the first token in each milestone will be displayed as a label above the vertical line. If the labels are the same, they will be numbered consecutively ("milestone1", "milestone2", etc.). You can also submit custom labels and locations using the milestone_labels keyword. The other milestone_labels_ parameters control the rotation and location of the labels.
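The consecutive numbering of identical labels can be sketched as below. This is an illustration only, assuming the list-of-dicts format described for milestone_labels (label as key, token index as value); the function name `number_duplicate_labels` is hypothetical, and the module's own handling may differ in detail.

```python
from collections import Counter
from typing import Dict, List


def number_duplicate_labels(locations: List[Dict[str, int]]) -> List[Dict[str, int]]:
    """Append a running number to every label that occurs more than once."""
    # Count how often each label occurs across all location dicts
    totals = Counter(label for loc in locations for label in loc)
    seen: Counter = Counter()
    result = []
    for loc in locations:
        renamed = {}
        for label, index in loc.items():
            if totals[label] > 1:
                seen[label] += 1
                renamed[f"{label}{seen[label]}"] = index
            else:
                # Unique labels are left untouched
                renamed[label] = index
        result.append(renamed)
    return result


labels = [{"Chapter": 0}, {"Chapter": 1200}, {"Epilogue": 5400}]
print(number_duplicate_labels(labels))
# [{'Chapter1': 0}, {'Chapter2': 1200}, {'Epilogue': 5400}]
```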

Rolling Windows plots can often produce unattractive, squarish lines, rather than the smooth curves you often see in line graphs for some types of data. This is because there tend to be very abrupt shifts in the frequencies of patterns, rather than gradual changes. With use_interpolation, you can attempt to introduce smoothing by interpolating points between the values calculated by the calculator to produce a more aesthetically pleasing graph. However, the resulting plots should only be used for presentation purposes where the interpretive value is established in a non-interpolated plot. This is because interpolations can introduce distortions which may be deceptive. The user is encouraged to compare interpolated and non-interpolated plots of their analysis.

The value of interpolation_num is the number of points to interpolate between points in your data. The interpolation_kind refers to the function used to interpolate the points. The default is scipy's interpolate.pchip_interpolate function. You can also supply any of the kinds allowed by scipy's interpolate.interp1d method, although in practice, only "cubic" and "quadratic" are likely to make a difference.
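The following sketch shows how the two scipy functions named above can be used to smooth a series of values. It is not the plotter's own code: the helper name `smooth` is hypothetical, and treating `num` as the total number of evenly spaced sample points is an assumption made for the example.

```python
import numpy as np
from scipy.interpolate import interp1d, pchip_interpolate


def smooth(values, num: int = 500, kind: str = "pchip"):
    """Return (x_new, y_new) with `num` evenly spaced interpolated points."""
    y = np.asarray(values, dtype=float)
    x = np.arange(len(y))
    x_new = np.linspace(x[0], x[-1], num)
    if kind == "pchip":
        y_new = pchip_interpolate(x, y, x_new)
    else:
        # Any kind accepted by interp1d, e.g. "quadratic" or "cubic"
        y_new = interp1d(x, y, kind=kind)(x_new)
    return x_new, y_new


# Abrupt shifts of the kind typical of rolling windows averages
averages = [0.0, 0.0, 0.4, 0.4, 0.1, 0.1, 0.3]
x_pchip, y_pchip = smooth(averages, num=61, kind="pchip")
x_cubic, y_cubic = smooth(averages, num=61, kind="cubic")

# PCHIP stays within the range of the data; a cubic spline may overshoot it
print(y_pchip.min(), y_pchip.max())
print(y_cubic.min(), y_cubic.max())
```

Comparing the printed ranges with the original values is one quick way to check whether a given interpolation kind has introduced values that were never in the data.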

Private Methods

rollingwindows.plotters.RWSimplePlotter._check_duplicate_labels

Adds numeric suffixes for duplicate milestone labels. Returns a list of unique location dictionaries.

def _check_duplicate_labels(self, locations: List[Dict[str, int]]) -> List[Dict[str, int]]

| Parameter | Description | Required |
| --- | --- | --- |
| locations: List[Dict[str, int]] | A list of location dicts. | Yes |

rollingwindows.plotters.RWSimplePlotter._get_label_height

Returns the height of the longest milestone label by using a separate plot to calculate the label height. The method is used to estimate how high to place the title above the plot.

def _get_label_height(self, milestone_labels: List[dict], milestone_labels_rotation: int) -> float

| Parameter | Description | Required |
| --- | --- | --- |
| milestone_labels: List[dict] | A list of milestone_label dicts. | Yes |
| milestone_labels_rotation: int | The rotation of the labels in degrees. | Yes |

Public Methods

rollingwindows.plotters.RWSimplePlotter.run

Runs the plotter and saves the figure to RWSimplePlotter.fig.

def run(self, df: pd.DataFrame) -> None

| Parameter | Description | Required |
| --- | --- | --- |
| df: pandas.DataFrame | A pandas DataFrame, normally stored in RollingWindows.result. | Yes |

rollingwindows.plotters.RWSimplePlotter.save

Saves the plot to a file. This method is a wrapper for matplotlib.pyplot.savefig().

def save(self, path: str) -> None

| Parameter | Description | Required |
| --- | --- | --- |
| path: str | The path to the file where the figure is to be saved. The image type (e.g. .png, .jpg, .pdf) is determined by the extension on the filename. | Yes |

rollingwindows.plotters.RWSimplePlotter.show

Displays a generated figure. This method calls matplotlib.pyplot.show. However, since this does not work with an inline backend like Jupyter notebooks, the method tries to detect this environment via a UserWarning and then just calls the plot attribute.

def show(self, **kwargs) -> None

rollingwindows.registry

A registry of "built-in" rolling windows calculators, filters, and plotters. These can be loaded using their string id attributes with the Python catalogue module.
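The idea of loading a component by its string id can be illustrated with a small stand-in registry. This sketch deliberately uses a plain dictionary rather than catalogue, so none of the names below (`register`, `get`, `DemoPlotter`, `_PLOTTERS`) belong to the Lexos or catalogue APIs; they are hypothetical and serve only to show the pattern.

```python
from typing import Callable, Dict, Type

# Stand-in registry mapping string ids to classes
_PLOTTERS: Dict[str, Type] = {}


def register(component_id: str) -> Callable[[Type], Type]:
    """Return a class decorator that stores the class under `component_id`."""

    def decorator(cls: Type) -> Type:
        _PLOTTERS[component_id] = cls
        return cls

    return decorator


def get(component_id: str) -> Type:
    """Look up a registered class by its string id."""
    try:
        return _PLOTTERS[component_id]
    except KeyError:
        raise KeyError(f"No component registered with id {component_id!r}") from None


@register("demo_plotter")
class DemoPlotter:
    """Hypothetical component used only for this illustration."""

    id: str = "demo_plotter"


plotter_cls = get("demo_plotter")
plotter = plotter_cls()
print(type(plotter).__name__)  # DemoPlotter
```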

Custom Components

Custom Calculators

Calculators are implemented with the Calculator protocol, which allows you to produce custom calculator classes. A skeleton calculator is given below.

from typing import Iterable, Union

import pandas as pd
import spacy

class MyCustomCalculator(Calculator):
    id: str = "my_custom_calculator"

    def __init__(
        self,
        patterns: Union[list, str],
        windows: Iterable
    ):
        """Create an instance of the calculator."""
        self.patterns = patterns
        self.windows = windows
        self.data = None

    def run(self) -> spacy.tokens.doc.Doc:
        """Run the calculator."""
        ...

    def to_df(self) -> pd.DataFrame:
        """Convert the data to a pandas DataFrame."""
        ...

The Calculator protocol automatically builds a metadata dictionary when the class is instantiated. It requires a run() method to perform calculations and save the data to the object's data attribute. It also requires a to_df() method to convert the data to a pandas DataFrame. The data and DataFrame can take any format, as required for your purpose. However, the data must be compatible with the chosen plotter. For instance, if using rollingwindows.plotters.RWSimplePlotter, the DataFrame must be organised with each pattern in a separate column and each window in a separate row.
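The DataFrame layout described above (one column per pattern, one row per window) can be illustrated with a toy example. This is only a sketch on plain lists of strings: the function `average_frequencies` is hypothetical, the windows are built by hand rather than by RollingWindows, and patterns are matched as exact tokens.

```python
from typing import Iterable, List

import pandas as pd


def average_frequencies(windows: Iterable[List[str]], patterns: List[str]) -> pd.DataFrame:
    """Return one row per window and one column per pattern."""
    rows = []
    for window in windows:
        # Proportion of tokens in the window that match each pattern exactly
        rows.append({p: window.count(p) / len(window) for p in patterns})
    return pd.DataFrame(rows, columns=patterns)


tokens = "the cat saw the dog and the dog saw the cat".split()
size = 4
# Toy rolling windows: every run of `size` consecutive tokens
windows = [tokens[i : i + size] for i in range(len(tokens) - size + 1)]
df = average_frequencies(windows, ["the", "dog"])
print(df.shape)  # (8, 2)
print(df.head(3))
```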

Custom Filters

Filters are implemented with the Filter protocol, which allows you to produce custom filter classes. A skeleton filter is given below.

from typing import List

import spacy

class MyCustomFilter(Filter):
    id: str = "my_custom_filter"

    def __init__(
        self,
        doc: spacy.tokens.doc.Doc,
        *,
        spacy_attrs: List[str] = SPACY_ATTRS
    ):
        self.doc = doc
        self.spacy_attrs = spacy_attrs

    @property
    def filtered_token_ids(self) -> set:
        """Get a set of token_ids to keep after filtering."""
        return {
            token.i for token in self.doc
            if token.text.startswith("a")
        }

    def apply(self) -> spacy.tokens.doc.Doc:
        """Apply the filter."""
        return filter_doc(
            self.doc,
            self.filtered_token_ids,
            self.spacy_attrs
        )

The name of the filter is stored in the class attribute id. The filtered_token_ids property retrieves the set of token ids to keep. The apply() method returns a new document with all tokens not in filtered_token_ids removed. Notice that it calls the filter_doc() function, which is imported with filters. This function returns a new document in which the attribute labels have been copied from the old one. However, you may call your own function if you wish to adopt a different procedure. Once you have a filtered document, you can use it to create a new RollingWindows instance.

!!! Note If you wish to pass an arbitrary list of token indexes to filter_doc(), it is wise to pass these indexes as a set. Although filter_doc() will accept a Python list, this can increase processing times from less than a second to several minutes, depending on the length of the document.
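The reason for the difference is that a membership test scans a list item by item, whereas a set uses hashing. The sketch below imitates only that membership test on plain integers; it does not call filter_doc(), and the helper `kept_indexes` and the document length are hypothetical values chosen for the illustration.

```python
import timeit

doc_length = 4_000
# Keep every other token index
keep_list = list(range(0, doc_length, 2))
keep_set = set(keep_list)


def kept_indexes(keep_ids) -> list:
    """Collect the token indexes that survive filtering."""
    return [i for i in range(doc_length) if i in keep_ids]


# Both containers give the same result...
assert kept_indexes(keep_list) == kept_indexes(keep_set)

# ...but the list version repeats a linear scan for every token
print("list:", timeit.timeit(lambda: kept_indexes(keep_list), number=1))
print("set: ", timeit.timeit(lambda: kept_indexes(keep_set), number=1))
```

The gap between the two timings grows with the length of the document, since the cost of each list lookup grows with the number of indexes kept.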

Custom Plotters

Plotters are implemented with the BasePlotter protocol, which allows you to produce custom plotter classes. A skeleton plotter is given below.

from typing import Any

class MyCustomPlotter(BasePlotter):
    id: str = "my_custom_plotter"

    def __init__(self, **kwargs):
        """Create an instance of the plotter."""
        # Define any attributes here

    def file(self) -> None:
        """Save the figure to a file."""
        ...

    def run(self, data: Any) -> None:
        """Run the plotter on a set of input data."""
        ...

    def show(self) -> None:
        """Display the plot."""
        ...

The BasePlotter protocol automatically builds a metadata dictionary when the class is instantiated. The data can be passed to the run() method in any format as long as the run() method handles the logic of generating a plot from it. However, if the data is to be compatible with a built-in calculator, it must take the form of a pandas DataFrame organised with each pattern in a separate column and each window in a separate row.