Rolling Windows API Documentation

The Lexos Rolling Windows module is used to analyse the frequency of patterns across rolling windows (also known as sliding windows) of units in a text. The module is under development for the next release of the Lexos API. It implements a new programming interface for the Rolling Windows tool currently available in the Lexos web app, but with added functionality. For further information on the use of Rolling Windows, users are encouraged to click the "Help" button at the top right of the Lexos user interface and try out the web app.

Architecture of the Module

The __init__.py file contains the main logic for the module, including the main RollingWindows class used to manage the workflow. There are three main submodules, the functions of which are listed below:

filters: Classes to manage the filtering of documents prior to analysis.
calculators: Classes to manage statistical calculations.
plotters: Classes to manage the plotting of the results of Rolling Windows analyses.

Each submodule contains at least one class, which is registered in registry.py, allowing it to be treated as "built-in". Other built-in classes can be added in future releases, but users can also integrate their own custom classes to manage project-specific tasks not handled by built-in classes.
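As a rough illustration of how a string id can be mapped to a class, here is a minimal registry sketch. It does not reproduce the contents of registry.py; the names `REGISTRY`, `register`, `get_component`, and `UpperCaseFilter` are hypothetical and exist only for this example.

```python
from typing import Dict

# Hypothetical registry: maps string ids to classes.
REGISTRY: Dict[str, type] = {}


def register(cls: type) -> type:
    """Class decorator that stores a class under its `id` attribute."""
    REGISTRY[cls.id] = cls
    return cls


def get_component(id: str) -> type:
    """Look up a registered class by its string id."""
    try:
        return REGISTRY[id]
    except KeyError:
        raise KeyError(f"No component registered with id {id!r}.") from None


@register
class UpperCaseFilter:
    """A made-up custom component used only for this illustration."""

    id = "upper_case_filter"

    def apply(self, text: str) -> str:
        return text.upper()


component_cls = get_component("upper_case_filter")
result = component_cls().apply("rolling windows")
```

A custom class registered in this way can then be requested by its string id in the same manner as a built-in one.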

A fourth submodule, milestones, manages the labelling of structural divisions within documents. Since its use is not limited to Rolling Windows, it will become a component of the main Lexos library in the next release.

The file helpers.py contains functions used by more than one file in the module.

Note

Development of the rollingwindows module suffered from a (still unexplained) malfunction of the development environment, which caused a catastrophic loss of much of the code before it could be pushed to GitHub. The version here is a reconstruction which works but may not be as elegant or efficient as the original code. There may also be legacy code blocks with no function in the current code that have not yet been identified.

Each component of the module is documented separately below.

`rollingwindows.__init__`

This is the main component of the rollingwindows module, containing the RollingWindows class and associated functions.

`rollingwindows.__init__.get_rw_component`

Gets a component from the registry using a string id. Note that this is a near duplicate of rollingwindows.scrubber.registry.load_component.

def get_rw_component(id: str)

Parameter	Description	Required
`id`: str	The string id of the component.	Yes

`rollingwindows.RollingWindows`

The main class for managing the workflow and state of a Rolling Windows analysis.

class RollingWindows(doc: spacy.tokens.doc.Doc, model: str, *, patterns: Union[list, str] = None)

Attributes

Attribute	Description	Required
`doc`: spacy.tokens.doc.Doc	A spaCy doc.	Yes
`model`: str	The name of the language model. Default: `xx_sent_ud_sm`.	Yes
`patterns`: Union[list, str]	A pattern or list of patterns to search in each window. Patterns can be strings, regex patterns, or spaCy Matcher rules. Default is `None`.	No

The RollingWindows.metadata property returns a dictionary recording the current configuration state.

Private Methods

`rollingwindows.RollingWindows._get_search_method`

Gets a preliminary search method based on the type of window unit.

def _get_search_method(self, window_units: str = None) -> str

Parameter	Description	Required
`window_units`: str	The units counted to construct windows: `characters`, `lines`, `sentences`, `tokens`. Default is `None`.	Yes

`rollingwindows.RollingWindows._get_units`

Gets a list of characters, sentences, lines, or tokens from the doc.

def _get_units(self, doc: spacy.tokens.doc.Doc, window_units: str = "characters") -> Union[List[spacy.tokens.span.Span], spacy.tokens.doc.Doc]

Parameter	Description	Required
`doc`: spacy.tokens.doc.Doc	A spaCy `Doc` object.	Yes
`window_units`: str	The units counted to construct windows: `characters`, `lines`, `sentences`, `tokens`. Default is `characters`.	Yes

Public Methods

`rollingwindows.RollingWindows.calculate`

Uses a calculator to generate a rolling windows analysis and assigns the result to RollingWindows.result.

RollingWindows.calculate(patterns: Union[list, str] = None, calculator: Union[Callable, str] = "rw_calculator", query: str = "counts", show_spacy_rules: bool = False) -> None

Parameter	Description	Required
`patterns`: Union[list, str]	The patterns to search for. Default is `None`.	Yes
`calculator`: Union[Callable, str]	The calculator to use. Default is the built-in "rw_calculator" calculator.	Yes
`query`: str	The type of data to return ("averages", "counts", or "ratios"). Default is "counts".	Yes
`show_spacy_rules`: bool	If the calculator uses a spaCy `Matcher` rule, tell the calculator's `to_df` method to display the rule as a column header; otherwise, only the value matched by the calculator will be displayed. Default is `False`.	Yes

Note

For development purposes, RollingWindows.calculate() has a timer decorator, which will display the time elapsed when the windows are generated.
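For readers unfamiliar with the pattern, a timer decorator can be sketched as follows. This is a generic illustration and not the decorator used in the module; the names `timer` and `count_vowels` are hypothetical.

```python
import functools
import time


def timer(func):
    """Print the elapsed wall-clock time of each call to `func`."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} finished in {elapsed:.4f} seconds")
        return result

    return wrapper


@timer
def count_vowels(text: str) -> int:
    """A stand-in for a slow method; counts lower-case vowels."""
    return sum(text.count(vowel) for vowel in "aeiou")


vowel_total = count_vowels("rolling windows")
```

The decorated function returns its normal result; the timing message is a side effect printed to standard output.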

`rollingwindows.RollingWindows.plot`

Uses a plotter to generate a plot of rolling windows analysis and assigns the result to RollingWindows.fig.

RollingWindows.plot(plotter: Union[Callable, str] = "rw_simple_plotter", file: str = None) -> None

Parameter	Description	Required
`plotter`: Union[Callable, str]	The plotter to use. Default is the built-in "rw_simple_plotter".	Yes
`file`: str	The path to a file to save the plot. Default is `None`.	No

`rollingwindows.RollingWindows.set_windows`

Generates rolling windows, creates a rollingwindows.Windows object, and assigns it to RollingWindows.windows.

RollingWindows.set_windows(n: int = 1000, window_units: str = "characters", alignment_mode: str = "strict", filter: Union[Callable, str] = None)

Parameter	Description	Required
`n`: int	The number of units per window. Default is `1000`.	Yes
`window_units`: str	The units counted to construct windows: `characters`, `lines`, `sentences`, `tokens`. Default is `characters`.	Yes
`alignment_mode`: str	How character-based windows snap to token boundaries. - `strict`: No snapping - `contract`: Window contains all tokens completely within the window's assigned start and end indices. - `expand`: Window contains all tokens partially within the window's assigned start and end indices. Default is `strict`.	No
`filter`: Union[Callable, str]	The name of a filter or an instance of a filter to be applied to the doc before windows are generated. Default is `None`.	No

Note

For development purposes, RollingWindows.set_windows() has a timer decorator, which will display the time elapsed when the windows are generated.
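The three alignment modes can be pictured with a small stand-alone sketch that follows the descriptions in the table above. It works on plain character offsets instead of spaCy objects, and `snap_to_tokens`, `text`, and `tokens` are hypothetical names used only here; the module's own logic may differ.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def snap_to_tokens(
    start: int, end: int, tokens: List[Span], mode: str = "strict"
) -> Optional[Span]:
    """Adjust a character range to token boundaries according to `mode`."""
    if mode == "strict":
        # No snapping: the character range is used as given.
        return (start, end)
    if mode == "contract":
        # Keep only tokens that lie completely inside the range.
        kept = [(s, e) for s, e in tokens if s >= start and e <= end]
    elif mode == "expand":
        # Keep every token that overlaps the range at all.
        kept = [(s, e) for s, e in tokens if s < end and e > start]
    else:
        raise ValueError(f"Unknown alignment mode: {mode!r}")
    if not kept:
        return None
    return (kept[0][0], kept[-1][1])


text = "rolling windows analysis"
tokens = [(0, 7), (8, 15), (16, 24)]  # offsets of the three words

strict_range = snap_to_tokens(3, 18, tokens, "strict")
contract_range = snap_to_tokens(3, 18, tokens, "contract")
expand_range = snap_to_tokens(3, 18, tokens, "expand")
```

With the character range 3 to 18, `contract` shrinks the window to the single word that is wholly inside it, while `expand` grows it to cover all three partially included words.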

`rollingwindows.sliding_windows`

Function to create a generator of sliding windows.

def sliding_windows(input: Union[List[spacy.tokens.span.Span], spacy.tokens.doc.Doc], n: int = 1000, window_units: str = "characters", alignment_mode: str = "strict") -> Iterator

Parameter	Description	Required
`input`: Union[List[spacy.tokens.span.Span], spacy.tokens.doc.Doc]	Either a list of spaCy `Span` objects or a spaCy `Doc` object.	Yes
`n`: int	The number of units per window.	Yes
`window_units`: str	The units counted to construct windows: `characters`, `lines`, `sentences`, `tokens`. Default is `characters`.	Yes
`alignment_mode`: str	How character-based windows snap to token boundaries. - `strict`: No snapping - `contract`: Window contains all tokens completely within the window's assigned start and end indices. - `expand`: Window contains all tokens partially within the window's assigned start and end indices. Default is `strict`.	Yes
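The general idea of a sliding-window generator can be sketched without spaCy. The version below assumes that windows advance one unit at a time, which is an assumption made for this example and not a statement about the module's behaviour; `sliding_windows_sketch` is a hypothetical name.

```python
from typing import Iterator, Sequence


def sliding_windows_sketch(units: Sequence, n: int = 3) -> Iterator[Sequence]:
    """Yield every run of `n` consecutive units, advancing one unit at a time."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # If there are fewer than n units, the range is empty and nothing is yielded.
    for start in range(len(units) - n + 1):
        yield units[start : start + n]


# Character windows over a string.
char_windows = list(sliding_windows_sketch("abcde", n=3))

# Token windows over a list of token strings.
token_windows = list(sliding_windows_sketch(["a", "b", "c", "d"], n=2))
```

Because the function is a generator, no window is built until it is requested, which keeps memory use low for long documents.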

`rollingwindows.Windows`

A dataclass for storing a generator of rolling windows and associated metadata.

class Windows(windows: Iterable, window_units: str, n: int, alignment_mode: str = "strict")

    def __iter__(self):
        return self.windows

Parameter	Description	Required
`windows`: Iterable	An iterable (normally a generator) of rolling windows.	Yes
`window_units`: str	The units counted to construct windows: `characters`, `lines`, `sentences`, `tokens`. Default is `characters`.	Yes
`n`: int	The number of units per window.	Yes
`alignment_mode`: str	How character-based windows snap to token boundaries. - `strict`: No snapping - `contract`: Window contains all tokens completely within the window's assigned start and end indices. - `expand`: Window contains all tokens partially within the window's assigned start and end indices. Default is `strict`.	Yes

Private Methods

`rollingwindows.Windows.__iter__`

Returns the value of self.windows.

Note

This value is a generator, so iterating through the Windows object will exhaust it: once the windows have been consumed, a second iteration yields nothing.
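The effect of storing a generator can be seen in a small stand-in dataclass. `WindowsSketch` is a hypothetical class that mirrors the fields listed above; it is not the module's own Windows class.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class WindowsSketch:
    """A stand-in with the same fields as the documented dataclass."""

    windows: Iterable
    window_units: str
    n: int
    alignment_mode: str = "strict"

    def __iter__(self):
        # Returns the stored generator itself, so it can be consumed only once.
        return self.windows


source = "abcde"
generator = (source[i : i + 3] for i in range(len(source) - 2))
windows = WindowsSketch(generator, window_units="characters", n=3)

first_pass = list(windows)   # consumes the generator
second_pass = list(windows)  # nothing is left to yield
```

If the windows are needed more than once, one option is to regenerate them, or to copy them into a list before the first iteration.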

`rollingwindows.calculators`

Contains registered calculators. There is currently one registered calculator: RWCalculator. Each calculator is a subclass of BaseCalculator, which has a metadata property.

`rollingwindows.calculators.RWCalculator`

A calculator class to calculate rolling averages, counts, and ratios of a matched pattern or patterns in a document. The property RWCalculator.regex_flags returns the flags used with methods that call the Python re module.

rollingwindows.calculators.RWCalculator has a class attribute id, the value of which is "rw_calculator". This is the id registered in the registry.

class RWCalculator(*, patterns: Union[List, str] = None, windows: Windows = None, mode: str = "exact", case_sensitive: bool = False, alignment_mode: str = "strict", model: str = "xx_sent_ud_sm", original_doc: spacy.tokens.doc.Doc = None, query: str = "counts")

Parameter	Description	Required
`patterns`: Union[list, str]	A pattern or list of patterns to search in windows. Default is `None`	No
`windows`: Windows	A `rollingwindows.Windows` object containing the windows to search.	No
`mode`: str	The type of search method to use: - `exact`: Search for exact matches to a string pattern. - `regex`: Search for matches to a regex expression. - `spacy_rule`: Search for matches using a spaCy Matcher class rule. - `multi_token`: Search for matches to a regex expression across multiple tokens. - `multi_token_exact`: Search for matches to string pattern across multiple tokens. Default is `exact`. See the explanation below.	No
`case_sensitive`: bool	Whether to make searches case-sensitive. Default is `False`.	No
`alignment_mode`: str	Whether to snap searches to token boundaries: - `strict`: No snapping. - `contract`: Count all matches that fall completely within token boundaries. - `expand`: Count all matches that fall partially within token boundaries. Default is `strict`.	No
`model`: str	The name of the language model to be used with spaCy's Matcher class. Default is `xx_sent_ud_sm`.	No
`original_doc`: spacy.tokens.doc.Doc	A copy of the original doc. Default is `None`.	No, except if `mode` is set to `multi_token` or `multi_token_exact`
`query`: str	The type of data to return: "averages", "counts", or "ratios". Default is `counts`.	No

If the window is a string, the exact and regex modes will match patterns irrespective of token boundaries. If the window is a list of strings or a spaCy Span object, patterns will be matched within the boundaries of each token. The spacy_rule mode allows for complex searches using spaCy's Matcher class; however, it searches only for tokens or sequences of tokens that match the patterns. The alternative is the multi_token mode, which will match a regular expression both within and across token boundaries. The multi_token_exact version escapes regex special characters so that raw strings containing characters like +*?^$()[]{}|\ can be matched. Both multi_token modes can be configured with alignment_mode to determine how the matcher responds to token boundaries. Note that because character indexes from the original document are required to locate the matched character range, the multi_token modes require that you also pass the source document to the original_doc parameter.
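The difference between exact matching, regex matching, and regex matching with escaped special characters can be shown with the standard library alone. The snippet below is only an illustration of the underlying Python behaviour on a string window; it does not call RWCalculator, and the variable names are hypothetical.

```python
import re

window = "Is 2+2 equal to 4? 2+2 is 4."

# Exact matching: str.count treats the pattern as a literal string.
exact_count = window.count("2+2")

# Regex matching: "+" is a quantifier, so r"2+2" means "one or more 2s, then a 2".
# The window has no run of consecutive 2s, so nothing matches.
regex_count = len(re.findall(r"2+2", window))

# Escaping the special characters makes the regex behave like a literal search.
escaped_count = len(re.findall(re.escape("2+2"), window))

# Case-insensitive matching with a regex flag, compared with a case-sensitive count.
insensitive_count = len(re.findall("is", window, flags=re.IGNORECASE))
sensitive_count = window.count("is")
```

This is why a raw string containing regex special characters is better searched for with an escaping mode than with a plain regex mode.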

Private Methods

`rollingwindows.calculators.RWCalculator._assign_variable`

Try to use configured values if not passed by public functions.

def _assign_variable(self, var_name: str, var: Any) -> Any

Parameter	Description	Required
`var_name`: str	The name of the variable.	Yes
`var`: Any	The variable to be evaluated.	Yes

`rollingwindows.calculators.RWCalculator._count_character_patterns_in_character_windows`

Uses Python count() to count exact character matches in a character window.

def _count_character_patterns_in_character_windows(self, window: str, pattern: str) -> int

Parameter	Description	Required
`pattern`: str	The pattern to match.	Yes
`window`: str	The window to search.	Yes

`rollingwindows.calculators.RWCalculator._count_in_character_window`

Chooses the function for counting matches in character windows.

def _count_in_character_window(self, window: str, pattern: str) -> int

Parameter	Description	Required
`pattern`: str	The pattern to match.	Yes
`window`: str	The window to search.	Yes

`rollingwindows.calculators.RWCalculator._count_token_patterns_in_token_lists`

Counts patterns in lists of token strings.

def _count_token_patterns_in_token_lists(self, window: List[str], pattern: str) -> int

Parameter	Description	Required
`pattern`: str	A string pattern to search for.	Yes
`window`: List[str]	A window consisting of a list of strings.	Yes

`rollingwindows.calculators.RWCalculator._count_token_patterns_in_span`

Counts patterns in a spaCy Span object.

_count_token_patterns_in_span(self, window: spacy.tokens.span.Span, pattern: Union[list, str]) -> int

Parameter	Description	Required
`pattern`: Union[list, str]	A string pattern or spaCy rule to search for.	Yes
`window`: spacy.tokens.span.Span	A window consisting of a spaCy `Span` object.	Yes

`rollingwindows.calculators.RWCalculator._count_token_patterns_in_span_text`

Counts patterns in span text with token alignment.

_count_token_patterns_in_span_text(self, window: str, pattern: str) -> int

Parameter	Description	Required
`pattern`: str	A string pattern or spaCy rule to search for.	Yes
`window`: str	A string window.	Yes

`rollingwindows.calculators.RWCalculator._count_in_token_window`

Chooses the function for counting matches in token windows.

_count_in_token_window(self, window: Union[List[str], spacy.tokens.span.Span], pattern: Union[list, str]) -> int

Parameter	Description	Required
`pattern`: Union[list, str]	A string pattern or spaCy rule to search for.	Yes
`window`: Union[List[str], spacy.tokens.span.Span]	A window consisting of a list of token strings or a spaCy `Span` object.	Yes

`rollingwindows.calculators.RWCalculator._extract_string_pattern`

Extracts a string pattern from a spaCy rule.

_extract_string_pattern(self, pattern: Union[dict, list, str]) -> str

Parameter	Description	Required
`pattern`: Union[dict, list, str]	A pattern to search.	Yes

`rollingwindows.calculators.RWCalculator._get_ratio`

Calculates a ratio from a list of two counts.

_get_ratio(self, counts: List[int]) -> float

Parameter	Description	Required
`counts`: List[int]	A list of two counts.	Yes

`rollingwindows.calculators.RWCalculator._get_window_count`

Calls character or token window methods, as appropriate.

_get_window_count(self, window: Union[List[str], spacy.tokens.span.Span, str], pattern: Union[list, str]) -> int

Parameter	Description	Required
`pattern`: Union[list, str]	A string pattern or spaCy rule to search for.	Yes
`window`: Union[List[str], spacy.tokens.span.Span, str]	A window consisting of a list of token strings, a spaCy `Span` object, or a string.	Yes

Public Methods

`rollingwindows.calculators.RWCalculator.get_averages`

Calls rollingwindows.calculators.RWCalculator.run with the query set to "averages".

def get_averages(self, windows: Iterable = None, patterns: Union[List, str] = None) -> None

Parameter	Description	Required
`patterns`: Union[List, str]	A string pattern or spaCy rule, or a list of either.	No
`windows`: Iterable	A `rollingwindows.Windows` object.	No

`rollingwindows.calculators.RWCalculator.get_counts`

Calls rollingwindows.calculators.RWCalculator.run with the query set to "counts".

def get_counts(self, windows: Iterable = None, patterns: Union[List, str] = None) -> None

Parameter	Description	Required
`patterns`: Union[List, str]	A string pattern or spaCy rule, or a list of either.	No
`windows`: Iterable	A `rollingwindows.Windows` object.	No

`rollingwindows.calculators.RWCalculator.get_ratios`

Calls rollingwindows.calculators.RWCalculator.run with the query set to "ratios".

def get_ratios(self, windows: Iterable = None, patterns: Union[List, str] = None) -> None

Parameter	Description	Required
`patterns`: Union[List, str]	A string pattern or spaCy rule, or a list of either.	No
`windows`: Iterable	A `rollingwindows.Windows` object.	No

`rollingwindows.calculators.RWCalculator.run`

Runs the calculator, which performs calculations and saves the result to RWCalculator.data.

def run(self, windows: Iterable = None, patterns: Union[List, str] = None, query: str = "counts") -> None

Parameter	Description	Required
`patterns`: Union[List, str]	A string pattern or spaCy rule, or a list of either.	No
`windows`: Iterable	A `rollingwindows.Windows` object.	No
`query`: str	String designating whether to return "counts", "averages", or "ratios". Default is `counts`.	No

`rollingwindows.calculators.RWCalculator.to_df`

Converts the data in RWCalculator.data to a pandas DataFrame.

def to_df(self, show_spacy_rules: bool = False) -> pd.DataFrame

Parameter	Description	Required
`show_spacy_rules`: bool	If the calculator uses a spaCy `Matcher` rule, tell the calculator's `to_df` method to display the rule as a column header; otherwise, only the value matched by the calculator will be displayed. Default is `False`.	No

`rollingwindows.filters`

Contains registered filters. There are currently two registered filters: WordFilter and NonStopwordFilter. Each filter is a subclass of the BaseFilter class, which has a metadata property.

`rollingwindows.filters.filter_doc`

Applies a filter to a document and returns a new document.

def filter_doc(doc: spacy.tokens.doc.Doc, keep_ids: List[int], spacy_attrs: List[str] = SPACY_ATTRS, force_ws: bool = True) -> spacy.tokens.doc.Doc

Parameter	Description	Required
`doc`: spacy.tokens.doc.Doc	A spaCy `Doc` object.	Yes
`keep_ids`: List[int]	A list of spaCy `Token` ids to keep in the filtered `Doc`.	Yes
`spacy_attrs`: List[str]	A list of spaCy `Token` attributes to keep in the filtered `Doc`. Default is the `SPACY_ATTRS` list imported with `filters`.*	No
`force_ws`: bool	Force a whitespace at the end of every token except the last. Default is `True`.	No

* The default list of spaCy token attributes can be inspected by printing filters.SPACY_ATTRS.

`rollingwindows.filters.get_doc_array`

Converts a spaCy Doc object into a numpy array.

def get_doc_array(doc: spacy.tokens.doc.Doc, spacy_attrs: List[str] = SPACY_ATTRS, force_ws: bool = True) -> np.ndarray

Parameter	Description	Required
`doc`: spacy.tokens.doc.Doc	A spaCy `Doc` object.	Yes
`spacy_attrs`: List[str]	A list of spaCy `Token` attributes to include in the array. Default is the `SPACY_ATTRS` list imported with `filters`.*	No
`force_ws`: bool	Force a whitespace at the end of every token except the last. Default is `True`.	No

* The default list of spaCy token attributes can be inspected by printing filters.SPACY_ATTRS.

The following options are available for handling whitespace:

force_ws=True ensures that token_with_ws and whitespace_ attributes are preserved, but all tokens will be separated by whitespaces in the text of a doc created from the array.
force_ws=False with SPACY in spacy_attrs preserves the token_with_ws and whitespace_ attributes and their original values. This may cause tokens to be merged if subsequent processing operates on the doc.text.
force_ws=False without SPACY in spacy_attrs does not preserve the token_with_ws and whitespace_ attributes or their values. By default, doc.text displays a single space between each token.

`rollingwindows.filters.is_not_roman_numeral`

Returns True if a token is not a Roman numeral. Works only on upper-case Roman numerals.

def is_not_roman_numeral(s: str) -> bool

Parameter	Description	Required
`s`: str	A string to match against the Roman numerals pattern.	Yes
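One common way to recognise upper-case Roman numerals is a regular expression over thousands, hundreds, tens, and units. The sketch below uses that well-known pattern; it is an assumption that the module works along similar lines, and `ROMAN_NUMERAL` and `is_not_roman_numeral_sketch` are hypothetical names.

```python
import re

# Thousands, hundreds, tens, units. Matches upper-case numerals up to 3999.
ROMAN_NUMERAL = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def is_not_roman_numeral_sketch(s: str) -> bool:
    """Return True if `s` is not an upper-case Roman numeral."""
    if not s:
        # The pattern also matches the empty string, so handle it explicitly.
        return True
    return ROMAN_NUMERAL.fullmatch(s) is None


checks = {
    token: is_not_roman_numeral_sketch(token)
    for token in ["XIV", "xiv", "MMXXIV", "window", "MIX"]
}
```

Note that with this pattern, ordinary upper-case words that happen to be valid numerals, such as "MIX" or "I", are also classified as Roman numerals.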

`rollingwindows.filters.NonStopwordFilter`

A filter class to remove stop words from a document. This is a minimal function that strips punctuation and returns the ids of words not flagged as stop words by the language model or in a list of additional_stopwords. The property NonStopwordFilter.word_ids returns the token ids for all tokens in the document that are not stop words according to these criteria.

rollingwindows.filters.NonStopwordFilter has a class attribute id, the value of which is "non_stopword_filter". This is the id registered in the registry.

class NonStopwordFilter(doc: spacy.tokens.doc.Doc, *, spacy_attrs: List[str] = SPACY_ATTRS, additional_stopwords: List[str] = None, case_sensitive: bool = False)

Parameter	Description	Required
`doc`: spacy.tokens.doc.Doc	A spaCy `Doc` object.	Yes
`spacy_attrs`: List[str]	A list of spaCy `Token` attributes to keep in the filtered `Doc`. Default is the `SPACY_ATTRS` list imported with `filters`.	No
`additional_stopwords`: List[str]	A list of stop words to add to those labelled as stop words by the model. Default is `None`.	No
`case_sensitive`: bool	Use only lower case forms if `False`. Default is `False`.	No

Private Methods

`rollingwindows.filters.NonStopwordFilter._is_non_stopword`

Returns True if a token is not a stop word.

def _is_non_stopword(self, token: spacy.tokens.Token) -> bool

Parameter	Description	Required
`token`: spacy.tokens.Token	A spaCy `Token` object.	Yes

Public Methods

`rollingwindows.filters.NonStopwordFilter.apply`

Applies the filter and returns a new, filtered doc.

def apply(self) -> spacy.tokens.doc.Doc

`rollingwindows.filters.WordFilter`

A filter class to remove non-words from a document. The property WordFilter.word_ids returns the token ids for all tokens in the document that are identified as words according to supplied criteria.

rollingwindows.filters.WordFilter has a class attribute id, the value of which is "word_filter". This is the id registered in the registry.

class WordFilter(doc: spacy.tokens.doc.Doc, *, spacy_attrs: List[str] = SPACY_ATTRS, exclude: Union[List[str], str] = [" ", "\n"], exclude_digits: bool = False, exclude_roman_numerals: bool = False, exclude_pattern: Union[List[str], str] = None)

Parameter	Description	Required
`doc`: spacy.tokens.doc.Doc	A spaCy `Doc` object.	Yes
`spacy_attrs`: List[str]	A list of spaCy `Token` attributes to keep in the filtered `Doc`. Default is the `SPACY_ATTRS` list imported with `filters`.	No
`exclude`: Union[List[str], str]	A string/regex or list of strings/regex patterns to exclude. Default is `[" ", "\n"]`.	No
`exclude_digits`: bool	If True, digits will not be treated as words. Default is `False`.	No
`exclude_roman_numerals`: bool	If True, Roman numerals will not be treated as words. However, this only works with capitalised Roman numerals. Default is `False`.	No
`exclude_pattern`: Union[List[str], str]	Additional regex patterns to add to the default `exclude` list. Default is `None`.	No

Public Methods

`rollingwindows.filters.WordFilter.apply`

Applies the filter and returns a new, filtered doc.

def apply(self) -> spacy.tokens.doc.Doc

`rollingwindows.helpers`

Contains helper functions used by multiple files in the module. rollingwindows.helpers.ensure_doc may be legacy code that is not used in the current version.

`rollingwindows.helpers.ensure_doc`

Converts input into a spaCy Doc object. The returned Doc is unannotated if it is derived from a string or a list of tokens.

def ensure_doc(input: Union[str, List[str], spacy.tokens.doc.Doc], nlp: Union[Language, str], batch_size: int = 1000) -> spacy.tokens.doc.Doc

Parameter	Description	Required
`input`: Union[str, List[str], spacy.tokens.doc.Doc]	A string, a list of tokens, or a spaCy doc.	Yes
`nlp`: Union[Language, str]	The language model to use.	Yes
`batch_size`: int	The number of texts to accumulate in an internal buffer. Default is `1000`.	No

`rollingwindows.helpers.ensure_list`

Wraps any input in a list if it is not already a list.

def ensure_list(input: Any) -> list

Parameter	Description	Required
`input`: Any	An input variable.	Yes
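The behaviour described above amounts to a very small helper. The following is a sketch of that idea, under the hypothetical name `ensure_list_sketch`, and is not copied from helpers.py.

```python
from typing import Any


def ensure_list_sketch(input: Any) -> list:
    """Return `input` unchanged if it is a list; otherwise wrap it in a list."""
    if isinstance(input, list):
        return input
    return [input]


single = ensure_list_sketch("the")
already_list = ensure_list_sketch(["the", "a"])
```

A helper of this kind lets methods accept either a single pattern or a list of patterns and then treat both cases uniformly.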

`rollingwindows.helpers.spacy_rule_to_lower`

Converts a spaCy Matcher rule to lower case.

def spacy_rule_to_lower(patterns: Union[Dict, List[Dict]], old_key: Union[List[str], str] = ["TEXT", "ORTH"], new_key: str = "LOWER") -> list

Parameter	Description	Required
`patterns`: Union[Dict, List[Dict]]	A spaCy `Matcher` rule (a dict) or a list of such dicts.	Yes
`old_key`: Union[List[str], str]	A dictionary key or list of keys to rename. Default is `["TEXT", "ORTH"]`.	No
`new_key`: str	The new key name. Default is `LOWER`.	No
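To make the conversion concrete, the sketch below renames the `TEXT` and `ORTH` keys of a Matcher rule to `LOWER`. It also lower-cases string values, which is an assumption about the intended behaviour. `rule_to_lower_sketch` is a hypothetical name, and the code operates on plain dicts without importing spaCy.

```python
from typing import Dict, List, Sequence, Union


def rule_to_lower_sketch(
    patterns: Union[Dict, List[Dict]],
    old_key: Union[Sequence[str], str] = ("TEXT", "ORTH"),
    new_key: str = "LOWER",
) -> List[Dict]:
    """Rename case-sensitive keys in each token pattern and lower-case their values."""
    if isinstance(patterns, dict):
        patterns = [patterns]
    old_keys = [old_key] if isinstance(old_key, str) else list(old_key)
    lowered = []
    for token_pattern in patterns:
        converted = {}
        for key, value in token_pattern.items():
            if key in old_keys:
                # Assumption: string values are lower-cased; others are kept as-is.
                converted[new_key] = value.lower() if isinstance(value, str) else value
            else:
                converted[key] = value
        lowered.append(converted)
    return lowered


rule = [{"TEXT": "Rolling"}, {"ORTH": "Windows", "OP": "?"}]
lowered_rule = rule_to_lower_sketch(rule)
```

Keys that are not listed in `old_key`, such as the `OP` operator here, pass through unchanged.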

`rollingwindows.plotters`

Contains registered plotters. There are currently two registered plotters: RWSimplePlotter and RWPlotlyPlotter. Each plotter is a subclass of the BasePlotter class, which has a metadata property.

`rollingwindows.plotters.interpolate`

Returns interpolated points for plots that use interpolation. The interpolation function may be scipy.interpolate.pchip_interpolate, numpy.interp, or one of the options for scipy.interpolate.interp1d. Note, however, that scipy.interpolate.interp1d is deprecated.

def interpolate(x: np.ndarray, y: np.ndarray, xx: np.ndarray, interpolation_kind: str = None) -> np.ndarray

Parameter	Description	Required
`x`: np.ndarray	The x values.	Yes
`y`: np.ndarray	The y values.	Yes
`xx`: np.ndarray	The projected interpolation range.	Yes
`interpolation_kind`: str	The interpolation function to use. Default is `None`.	No

`rollingwindows.plotters.RWPlotlyPlotter`

Generates a plot using Plotly.

rollingwindows.plotters.RWPlotlyPlotter has a class attribute id, the value of which is "rw_plotly_plotter". This is the id registered in the registry.

class RWPlotlyPlotter(width: int = 700, height: int = 450, title: Union[dict, str] = "Rolling Windows Plot", xlabel: str = "Token Count", ylabel: str = "Average Frequency", line_color: str = "variable", showlegend: bool = True, titlepad: float = None, show_milestones: bool = False, milestone_marker_style: dict = {"width": 1, "color": "teal"}, show_milestone_labels: bool = False, milestone_labels: Dict[str, int] = None, milestone_label_rotation: float = 0.0, milestone_label_style: dict = {"size": 10.0, "family": "Open Sans, verdana, arial, sans-serif", "color": "teal"}, **kwargs)

Attribute	Description	Required
`width`: int	The figure width in pixels. Default is `700`.	No
`height`: int	The figure height in pixels. Default is `450`.	No
`title`: Union[dict, str]	The title of the figure. Styling can be added by passing a dict with the keywords described in Plotly's documentation. Default is `Rolling Windows Plot`.	No
`xlabel`: str	The text to display along the x axis. Default is `Token Count`.	No
`ylabel`: str	The text to display along the y axis. Default is `Average Frequency`.	No
`line_color`: str	The colour to be used for the lines on the line graph. Default is `variable`.	No
`showlegend`: bool	Whether to show the legend. Default is `True`.	No
`titlepad`: float	The margin in pixels between the title and the top of the graph. If not set, the margin will be calculated automatically from milestone label heights if they are shown. Default is `None`.	No
`show_milestones`: bool	Whether to show the milestone markers. Default is `False`.	No
`milestone_marker_style`: dict	A dict containing the styles to apply to the milestone marker. For valid properties, see the Plotly documentation. Default is `{"width": 1, "color": "teal"}`.	No
`show_milestone_labels`: bool	Whether to show the milestone labels. Default is `False`.	No
`milestone_labels`: Dict[str, int]	A dict with keys as milestone labels and values as points on the x-axis. Default is `None`.	No
`milestone_label_rotation`: Union[float, int]	The clockwise rotation of the milestone labels up to 90 degrees. Default is `0.0`.	No
`milestone_label_style`: dict	A dict containing the styling information for the milestone labels. For valid properties, see the Plotly documentation. Default is `{"size": 10.0, "family": "Open Sans, verdana, arial, sans-serif", "color": "teal"}`.	No

Tip

When milestone labels are shown and titlepad is not set manually, the class attempts to detect a suitable margin by using the same trick as RWSimplePlotter: it constructs a plot in matplotlib and measures the longest label to use as a guide. This produces reasonable results unless you change the figure height. In that case, it is advisable to set titlepad manually.

Once the figure is generated, it can be accessed with self.fig. You can then call self.fig.update_layout() and modify the figure using any of the parameters available in the Plotly documentation. This is useful to make changes not enabled by the Lexos API.

Private Methods

`rollingwindows.plotters.RWPlotlyPlotter._check_duplicate_labels`

Adds numeric suffixes for duplicate milestone labels. Returns a dictionary containing unique keys.

def _check_duplicate_labels(self, locations: List[Dict[str, int]]) -> List[Dict[str, int]]

Parameter	Description	Required
`locations`: List[Dict[str, int]]	A list of location dicts.	Yes

Note

The method is not yet implemented. The documentation here is copied from RWSimplePlotter since it should be substantially the same. That said, the class currently requires milestones to be submitted as a dictionary, which requires unique keys. So this needs some further thought.
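Since the method is not yet implemented, the following is only one possible sketch of the behaviour described above. The suffix scheme (`_2`, `_3`, and so on for repeated labels) is an assumption made for this example, and `dedupe_labels_sketch` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List


def dedupe_labels_sketch(locations: List[Dict[str, int]]) -> List[Dict[str, int]]:
    """Append a numeric suffix to every repeated milestone label."""
    seen: Counter = Counter()
    unique = []
    for location in locations:
        renamed = {}
        for label, x in location.items():
            seen[label] += 1
            # The first occurrence keeps its label; later ones get _2, _3, ...
            suffix = "" if seen[label] == 1 else f"_{seen[label]}"
            renamed[f"{label}{suffix}"] = x
        unique.append(renamed)
    return unique


locations = [{"Chapter": 0}, {"Chapter": 120}, {"Epilogue": 300}, {"Chapter": 410}]
unique_locations = dedupe_labels_sketch(locations)
```

Labels made unique in this way could then be merged into a single dictionary without one milestone overwriting another.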

`rollingwindows.plotters.RWPlotlyPlotter._get_axis_and_title_labels`

Ensures that the title, xlabel, and ylabel values are dicts.

def _get_axis_and_title_labels(self) -> Tuple[bool, str]

`rollingwindows.plotters.RWPlotlyPlotter._get_titlepad`

Gets a titlepad value based on the height of the longest milestone label if the titlepad class attribute is not set.

def _get_titlepad(self, labels: Dict[str, int]) -> float

Parameter	Description	Required
`labels`: Dict[str, int]	A dict with the labels as keys.	Yes

`rollingwindows.plotters.RWPlotlyPlotter._plot_milestone_label`

Adds a milestone label to the Plotly figure.

def _plot_milestone_label(self, label: str, x: int) -> None

Parameter	Description	Required
`label`: str	The text of a milestone label.	Yes
`x`: int	The location of the milestone label on the x axis.	Yes

`rollingwindows.plotters.RWPlotlyPlotter._plot_milestone_marker`

Adds a milestone marker (vertical line) to the Plotly figure.

def _plot_milestone_marker(self, x: int, df_val_min: Union[float, int], df_val_max: Union[float, int]) -> None

Parameter	Description	Required
`x`: int	The location of the milestone marker on the x axis.	Yes
`df_val_min`: Union[float, int]	The minimum value in the pandas DataFrame.	Yes
`df_val_max`: Union[float, int]	The maximum value in the pandas DataFrame.	Yes

Public Methods

`rollingwindows.plotters.RWPlotlyPlotter.run`

Runs the plotter and saves the figure to RWPlotlyPlotter.fig.

def run(self, df: pd.DataFrame) -> None

Parameter	Description	Required
`df`: pandas.DataFrame	A pandas DataFrame, normally stored in `RollingWindows.result`.	Yes

`rollingwindows.plotters.RWPlotlyPlotter.save`

Saves the plot to a file.

def save(self, path: str, **kwargs) -> None

Parameter	Description	Required
`path`: str	The path to the file where the figure is to be saved.	Yes

[NOTE] If the path ends in .html, this method will attempt to save the figure as a dynamic HTML file. The method accepts any keyword available for Plotly's Figure.write_html method.

Otherwise, it will attempt to save the figure as a static file in the format suggested by the extension in the filename (e.g. .png, .jpg, .pdf). The method accepts any keyword available for Plotly's Figure.write_image method.
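The extension-based dispatch described in the note can be sketched as follows. This is only an illustration of the idea, not the module's actual implementation: the `save_figure` helper and the `FakeFigure` stand-in are hypothetical, while `write_html` and `write_image` are the Plotly `Figure` methods named above.

```python
from pathlib import Path


class FakeFigure:
    """Hypothetical stand-in for a Plotly figure; it only records calls."""

    def __init__(self):
        self.calls = []

    def write_html(self, path, **kwargs):
        self.calls.append(("write_html", str(path), kwargs))

    def write_image(self, path, **kwargs):
        self.calls.append(("write_image", str(path), kwargs))


def save_figure(fig, path, **kwargs):
    """Choose the writer from the file extension (assumed behaviour)."""
    if Path(path).suffix.lower() == ".html":
        # Dynamic HTML output; kwargs are forwarded to write_html
        fig.write_html(path, **kwargs)
    else:
        # Static output (.png, .jpg, .pdf, ...); kwargs go to write_image
        fig.write_image(path, **kwargs)


fig = FakeFigure()
save_figure(fig, "plot.html", include_plotlyjs="cdn")
save_figure(fig, "plot.png", scale=2)
print([name for name, _, _ in fig.calls])
```

Because the keyword arguments are forwarded unchanged, a keyword that is valid for one writer (such as `include_plotlyjs` for HTML) should not be passed when saving to the other format.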

`rollingwindows.plotters.RWPlotlyPlotter.show`

Displays a generated figure. This method calls matplotlib.pyplot.show. However, since this does not work with an inline backend like Jupyter notebooks, the method tries to detect this environment via a UserWarning and then just calls the plot attribute.

def show(self, config={"displaylogo": False}, **kwargs) -> None

Parameter	Description	Required
`config`: dict	A dictionary supplying Plotly configuration values.	No

`rollingwindows.plotters.RWSimplePlotter`

Generates a plot using matplotlib.pyplot.

rollingwindows.plotters.RWSimplePlotter has a class attribute id, the value of which is "rw_simple_plotter". This is the id registered in the registry.

class RWSimplePlotter(width: Union[float, int] = 6.4, height: Union[float, int] = 4.8, figsize: tuple = None, hide_spines: List[str] = ["top", "right"], title: str = "Rolling Windows Plot", titlepad: float = 6.0, title_position: str = "top", show_legend: bool = True, show_grid: bool = False, xlabel: str = "Token Count", ylabel: str = "Average Frequency", show_milestones: bool = False, milestone_colors: Union[List[str], str] = "teal", milestone_style: str = "--", milestone_width: int = 1, show_milestone_labels: bool = False, milestone_labels: List[dict] = None, milestone_labels_ha: str = "left", milestone_labels_va: str = "baseline", milestone_labels_rotation: int = 45, milestone_labels_offset: tuple = (-8, 4), milestone_labels_textcoords: str = "offset pixels", use_interpolation: bool = False, interpolation_num: int = 500, interpolation_kind: str = "pchip", **kwargs)

Attribute	Description	Default
`width`: Union[float, int]	The figure width in inches. Default is `6.4`.	`6.4`
`height`: Union[float, int]	The figure height in inches. Default is `4.8`.	`4.8`
`figsize`: tuple	A tuple containing the figure width and height in inches (overrides the `width` and `height` settings). Default is `None`.	`None`
`hide_spines`: List[str]	A list of ["top", "right", "bottom", "left"] indicating which spines to hide. Default is `["top", "right"]`.	`["top", "right"]`
`title`: str	The title to use for the plot. Default is `Rolling Windows Plot`.	`Rolling Windows Plot`
`titlepad`: float	The padding in points to place between the title and the plot, which may need to be increased if you are showing milestone labels. Default is `6.0`.	`6.0`
`title_position`: str	Show the title on the "bottom" or the "top" of the figure. Default is `top`.	`top`
`show_legend`: bool	Whether to show the legend. Default is `True`.	`True`
`show_grid`: bool	Whether to show the grid. Default is `False`.	`False`
`xlabel`: str	The text to display along the x axis. Default is `Token Count`.	`Token Count`
`ylabel`: str	The text to display along the y axis. Default is `Average Frequency`.	`Average Frequency`
`show_milestones`: bool	Whether to show the milestone markers. Default is `False`.	`False`
`milestone_colors`: Union[List[str], str]	The colour or colours to use for milestone markers. See pyplot.vlines(). Default is `teal`.	`teal`
`milestone_style`: str	The style of the milestone markers. See pyplot.vlines(). Default is `--`.	`--`
`milestone_width`: int	The width of the milestone markers. See pyplot.vlines(). Default is `1`.	`1`
`show_milestone_labels`: bool	Whether to show the milestone labels. Default is `False`.	`False`
`milestone_labels`: List[dict]	A list of dicts with keys as milestone labels and values as token indexes. Default is `None`.	`None`
`milestone_labels_ha`: str	The horizontal alignment of the milestone labels. See pyplot.annotate(). Default is `left`.	`left`
`milestone_labels_va`: str	The vertical alignment of the milestone labels. See pyplot.annotate(). Default is `baseline`.	`baseline`
`milestone_labels_rotation`: int	The rotation of the milestone labels in degrees. See pyplot.annotate(). Default is `45`.	`45`
`milestone_labels_offset`: tuple	A tuple containing the number of pixels along the x and y axes to offset the milestone labels. See pyplot.annotate(). Default is `(-8, 4)`.	`(-8, 4)`
`milestone_labels_textcoords`: str	Whether to offset milestone labels by pixels or points. See pyplot.annotate(). Default is `offset pixels`.	`offset pixels`
`use_interpolation`: bool	Whether to use interpolation on values. Default is `False`.	`False`
`interpolation_num`: int	Number of values to add between points. Default is `500`.	`500`
`interpolation_kind`: str	Algorithm to use for interpolation. Default is `pchip`.	`pchip`

If your RollingWindows.doc has milestones, you can display them as vertical lines on the graph. If show_milestone_labels is set to True, the first token in each milestone will be displayed as a label above the vertical line. If the labels are the same, they will be numbered consecutively ("milestone1", "milestone2", etc.). You can also submit custom labels and locations using the milestone_labels keyword. The other milestone_labels_ parameters control the rotation and location of the labels.

Rolling Windows plots can often produce unattractive, squarish lines, rather than the smooth curves you often see in line graphs for some types of data. This is because there tend to be very abrupt shifts in the frequencies of patterns, rather than gradual changes. With use_interpolation, you can attempt to introduce smoothing by interpolating points between the values calculated by the calculator to produce a more aesthetically pleasing graph. However, the resulting plots should only be used for presentation purposes where the interpretive value is established in a non-interpolated plot. This is because interpolations can introduce distortions which may be deceptive. The user is encouraged to compare interpolated and non-interpolated plots of their analysis. The value of interpolation_num is the number of points to interpolate between points in your data. The interpolation_kind refers to the function used to interpolate the points. The default is scipy's interpolate.pchip_interpolate function. You can also supply any of the kinds allowed by scipy's interpolate.interp1d class, although in practice, only "cubic" and "quadratic" are likely to make a difference.
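The smoothing described above can be sketched with scipy directly. This is a minimal illustration rather than the plotter's own code: the `smooth_series` helper is hypothetical, and it assumes that `num` is the total number of evenly spaced points in the smoothed series, which may differ from how `interpolation_num` is applied internally.

```python
import numpy as np
from scipy.interpolate import interp1d, pchip_interpolate


def smooth_series(x, y, num=500, kind="pchip"):
    """Return (x_new, y_new) with `num` evenly spaced interpolated points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumption: `num` is the total number of points in the output
    x_new = np.linspace(x.min(), x.max(), num)
    if kind == "pchip":
        y_new = pchip_interpolate(x, y, x_new)
    else:
        # Any kind accepted by interp1d, e.g. "cubic" or "quadratic"
        y_new = interp1d(x, y, kind=kind)(x_new)
    return x_new, y_new


# Window start positions and average frequencies (made-up values)
x = [0, 100, 200, 300, 400]
y = [0.0, 0.0, 0.4, 0.4, 0.1]

x_pchip, y_pchip = smooth_series(x, y, num=500, kind="pchip")
x_cubic, y_cubic = smooth_series(x, y, num=500, kind="cubic")

# PCHIP stays within the range of the original values, whereas a cubic
# fit is free to overshoot them.
print(y_pchip.min(), y_pchip.max())
print(y_cubic.min(), y_cubic.max())
```

Comparing the two printed ranges against the original values is one quick way to see whether a given interpolation kind is adding peaks or troughs that are not in the calculated data.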

Private Methods

`rollingwindows.plotters.RWSimplePlotter._check_duplicate_labels`

Adds numeric suffixes for duplicate milestone labels. Returns a list of unique location dictionaries.

def _check_duplicate_labels(self, locations: List[Dict[str, int]]) -> List[Dict[str, int]]

Parameter	Description	Required
`locations`: List[Dict[str, int]]	A list of location dicts.	Yes
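The numbering of duplicate labels might look something like the sketch below. The function name `number_duplicate_labels` is hypothetical, and the choice to number every occurrence of a repeated label (including the first) is an assumption based on the "milestone1", "milestone2" example given earlier.

```python
from collections import Counter
from typing import Dict, List


def number_duplicate_labels(
    locations: List[Dict[str, int]]
) -> List[Dict[str, int]]:
    """Append a running number to every label that occurs more than once."""
    # Count how often each label occurs across all location dicts
    totals = Counter(label for loc in locations for label in loc)
    seen = Counter()
    result = []
    for loc in locations:
        renamed = {}
        for label, index in loc.items():
            if totals[label] > 1:
                seen[label] += 1
                label = f"{label}{seen[label]}"
            renamed[label] = index
        result.append(renamed)
    return result


locations = [{"chapter": 0}, {"chapter": 120}, {"epilogue": 300}]
print(number_duplicate_labels(locations))
```

Labels that occur only once are left untouched, so only the repeated "chapter" entries receive a suffix. The sketch does not guard against a suffixed label colliding with a label that already ends in a number.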

`rollingwindows.plotters.RWSimplePlotter._get_label_height`

Returns the height of the longest milestone label by using a separate plot to calculate the label height. The method is used to estimate how high to place the title above the plot.

def _get_label_height(self, milestone_labels: List[dict], milestone_labels_rotation: int) -> float

Parameter	Description	Required
`milestone_labels`: List[dict]	A list of milestone_label dicts.	Yes
`milestone_labels_rotation`: int	The rotation of the labels in degrees.	Yes

Public Methods

`rollingwindows.plotters.RWSimplePlotter.run`

Runs the plotter and saves the figure to RWSimplePlotter.fig.

def run(self, df: pd.DataFrame) -> None

Parameter	Description	Required
`df`: pandas.DataFrame	A pandas DataFrame, normally stored in `RollingWindows.result`.	Yes

`rollingwindows.plotters.RWSimplePlotter.save`

Saves the plot to a file. This method is a wrapper for matplotlib.pyplot.savefig().

def save(self, path: str) -> None

Parameter	Description	Required
`path`: str	The path to the file where the figure is to be saved. The image type (e.g. `.png`, `.jpg`, `.pdf`) is determined by the extension on the filename.	Yes

`rollingwindows.plotters.RWSimplePlotter.show`

def show(self, **kwargs) -> None

`rollingwindows.registry`

A registry of "built-in" rolling windows calculators, filters, and plotters. These can be loaded using their string id attributes with the Python catalogue module.
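The idea of looking up a component by its string id can be illustrated with a plain dictionary. This is a simplified stand-in, not the catalogue API and not the contents of registry.py: `REGISTRY`, `register`, `get_component`, and `DemoPlotter` are all hypothetical names.

```python
from typing import Dict

# Maps string ids to component classes (stand-in for a real registry)
REGISTRY: Dict[str, type] = {}


def register(cls: type) -> type:
    """Store a class under the value of its `id` class attribute."""
    REGISTRY[cls.id] = cls
    return cls


def get_component(component_id: str) -> type:
    """Return the class registered under `component_id`."""
    try:
        return REGISTRY[component_id]
    except KeyError:
        raise KeyError(
            f"No component registered with id {component_id!r}"
        ) from None


@register
class DemoPlotter:
    """Hypothetical component with a string id."""

    id = "demo_plotter"


plotter_cls = get_component("demo_plotter")
print(plotter_cls.__name__)
```

Keeping the id on the class itself means a component can be registered and later requested by name without the caller importing the class directly.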

Custom Components

Custom Calculators

Calculators are implemented with the Calculator protocol, which allows you to produce custom calculator classes. A skeleton calculator is given below.

class MyCustomCalculator(Calculator):
    id: str = "my_custom_calculator"

    def __init__(
        self,
        patterns: Union[list, str],
        windows: Iterable
    ):
        """Create an instance of the calculator."""
        self.patterns = patterns
        self.windows = windows
        self.data = None

    def run(self) -> spacy.tokens.doc.Doc:
        """Run the calculator."""
        ...

    def to_df(self) -> pd.DataFrame:
        """Convert the data to a pandas DataFrame."""
        ...

The Calculator protocol automatically builds a metadata dictionary when the class is instantiated. It requires a run() method to perform calculations and save the data to the object's data attribute. It also requires a to_df() method to convert the data to a pandas DataFrame. The data and DataFrame can take any format, as required for your purpose. However, the data must be compatible with the chosen plotter. For instance, if using rollingwindows.plotters.RWSimplePlotter, the DataFrame must be organised with each pattern in a separate column and each window in a separate row.
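The DataFrame layout expected by RWSimplePlotter (one column per pattern, one row per window) can be illustrated with a small stand-alone class. This is a sketch under simplifying assumptions: `SubstringCountCalculator` is hypothetical, it does not subclass Calculator, it treats each window as a plain string, and it returns None from run().

```python
from typing import Iterable, List, Union

import pandas as pd


class SubstringCountCalculator:
    """Hypothetical stand-alone calculator; not a Calculator subclass."""

    id: str = "substring_count_calculator"

    def __init__(
        self,
        patterns: Union[List[str], str],
        windows: Iterable[str]
    ):
        # Accept a single pattern or a list of patterns
        self.patterns = (
            [patterns] if isinstance(patterns, str) else list(patterns)
        )
        self.windows = windows
        self.data = None

    def run(self) -> None:
        """Count each pattern in each window and store rows in self.data."""
        self.data = [
            [window.count(pattern) for pattern in self.patterns]
            for window in self.windows
        ]

    def to_df(self) -> pd.DataFrame:
        """Return one row per window and one column per pattern."""
        return pd.DataFrame(self.data, columns=self.patterns)


calc = SubstringCountCalculator(
    ["a", "th"], ["the cat", "a thin path", "dog"]
)
calc.run()
df = calc.to_df()
print(df)
```

Here the row index corresponds to the window number and each column holds the counts for one pattern, which is the shape described above.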

Custom Filters

Filters are implemented with the Filter protocol, which allows you to produce custom filter classes. A skeleton filter is given below.

class MyCustomFilter(Filter):
    id: str = "my_custom_filter"

    def __init__(
        self,
        doc: spacy.tokens.doc.Doc,
        *,
        spacy_attrs: List[str] = SPACY_ATTRS
    ):
        self.doc = doc
        self.spacy_attrs = spacy_attrs

    @property
    def filtered_token_ids(self) -> set:
        """Get a set of token_ids to keep after filtering."""
        return {
            token.i for token in self.doc
            if token.text.startswith("a")
        }

    def apply(self) -> spacy.tokens.doc.Doc:
        """Apply the filter."""
        return filter_doc(
            self.doc,
            self.filtered_token_ids,
            self.spacy_attrs
        )

The name of the filter is stored in the class attribute id. The filtered_token_ids property retrieves a set of token ids to keep. The apply() method returns a new document with all tokens not in the filtered_token_ids set removed. Notice that it calls the filter_doc() function, which is imported with filters. This function returns a new document in which the attribute labels have been copied from the old one. However, you may call your own function if you wish to adopt a different procedure. Once you have a filtered document, you can use it to create a new RollingWindows instance.

!!! Note If you wish to pass an arbitrary list of token indexes to filter_doc(), it is wise to pass these indexes as a set. Although filter_doc() will accept a Python list, this can increase processing times from less than a second to several minutes, depending on the length of the document.
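The reason for this advice is that a membership test on a list scans the list item by item, whereas a membership test on a set takes roughly constant time on average. The sketch below shows the conversion using plain strings in place of spaCy tokens; `keep_tokens` is a hypothetical stand-in and is not filter_doc() itself.

```python
from typing import Iterable, List, Sequence


def keep_tokens(tokens: Sequence[str], keep_ids: Iterable[int]) -> List[str]:
    """Keep the tokens whose index is in keep_ids."""
    # Convert once so that each `i in keep` check is a fast set lookup
    keep = set(keep_ids)
    return [token for i, token in enumerate(tokens) if i in keep]


tokens = "the quick brown fox jumps over the lazy dog".split()
token_ids = [0, 3, 8]  # an arbitrary list of indexes to keep
print(keep_tokens(tokens, token_ids))
```

When calling filter_doc() with your own indexes, the equivalent step would be to wrap the list in `set(...)` before passing it in.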

Custom Plotters

Plotters are implemented with the BasePlotter protocol, which allows you to produce custom plotter classes. A skeleton plotter is given below.

class MyCustomPlotter(BasePlotter):
    id: str = "my_custom_plotter"

    def __init__(self, **kwargs):
        """Create an instance of the plotter."""
        # Define any attributes here

    def file(self) -> None:
        """Save the figure to a file."""
        ...

    def run(self, data: Any) -> None:
        """Run the plotter on a set of input data."""
        ...

    def show(self) -> None:
        """Display the plot."""
        ...

The BasePlotter protocol automatically builds a metadata dictionary when the class is instantiated. The data can be passed to the run() method in any format as long as the run() method handles the logic of generating a plot from it. However, if the data is to be compatible with a built-in calculator, it must take the form of a pandas DataFrame organised with each pattern in a separate column and each window in a separate row.
