The Lexos Rolling Windows module is used to analyse the frequency of patterns across rolling windows (also known as sliding windows) of units in a text. The module is under development for the next release of the Lexos API. It implements a new programming interface for the Rolling Windows tool currently available in the Lexos web app, but with added functionality. For further information on the use of Rolling Windows, users are encouraged to click the "Help" button at the top right of the Lexos user interface and try out the web app.
The __init__.py file contains the main logic for the module, including the main RollingWindows class used to manage the workflow. There are three main submodules, the functions of which are listed below:
- filters: Classes to manage the filtering of documents prior to analysis.
- calculators: Classes to manage statistical calculations.
- plotters: Classes to manage the plotting of the results of Rolling Windows analyses.
Each submodule contains at least one class, which is registered in registry.py, allowing it to be treated as "built-in". Other built-in classes can be added in future releases, but users can also integrate their own custom classes to manage project-specific tasks not handled by built-in classes.
A fourth submodule, milestones, manages the labelling of structural divisions within documents. Since its use is not limited to Rolling Windows, it will become a component of the main Lexos library in the next release.
The file helpers.py contains functions used by more than one file in the module.
Note
Development of the rollingwindows module suffered from a (still unexplained) malfunction of the development environment, which caused a catastrophic loss of much of the code before it could be pushed to GitHub. The version here is a reconstruction which works but may not be as elegant or efficient as the original code. There may also be legacy code blocks with no function in the current code that have not yet been identified.
Each component of the module is documented separately below.
This is the main component of the rollingwindows module, containing the RollingWindows class and associated functions.
Gets a component from the registry using a string id. Note that this is a near duplicate of rollingwindows.scrubber.registry.load_component.
def get_rw_component(id: str)

| Parameter | Description | Required |
|---|---|---|
| `id`: _str_ | The string id of the component. | Yes |
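The idea of looking up a component by string id can be illustrated with a minimal sketch. This is not the Lexos registry itself: the `_REGISTRY` dict, the `register` decorator, `get_component`, and `DemoCalculator` are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict

# Hypothetical registry: maps string ids to component classes.
_REGISTRY: Dict[str, Callable] = {}


def register(component_id: str) -> Callable:
    """Return a decorator that stores a class under a string id."""

    def decorator(cls):
        _REGISTRY[component_id] = cls
        return cls

    return decorator


def get_component(component_id: str) -> Callable:
    """Look up a registered component, failing with a clear message."""
    try:
        return _REGISTRY[component_id]
    except KeyError:
        raise KeyError(f"No component registered under id {component_id!r}.") from None


@register("demo_calculator")
class DemoCalculator:
    """A stand-in for a built-in or custom component."""

    id = "demo_calculator"


calculator_class = get_component("demo_calculator")
```

A lookup of this kind is what allows a method to accept either a string id or a class or instance for its calculator, filter, or plotter argument.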
The main class for managing the workflow and state of a Rolling Windows analysis.
class RollingWindows(doc: spacy.tokens.doc.Doc, model: str, *, patterns: Union[list, str] = None)

| Attribute | Description | Required |
|---|---|---|
| `doc`: _spacy.tokens.doc.Doc_ | A spaCy doc. | Yes |
| `model`: _str_ | The name of the language model. Default: xx_sent_ud_sm. | Yes |
| `patterns`: _Union[list, str]_ | A pattern or list of patterns to search in each window. Patterns can be strings, regex patterns, or spaCy Matcher rules. Default is None. | No |
The RollingWindows.metadata property returns a dictionary recording the current configuration state.
Gets a preliminary search method based on the type of window unit.
def _get_search_method(self, window_units: str = None) -> str

| Parameter | Description | Required |
|---|---|---|
| `window_units`: _str_ | The units counted to construct windows: characters, lines, sentences, tokens. Default is None. | Yes |
Gets a list of characters, sentences, lines, or tokens from the doc.
def _get_units(self, doc: spacy.tokens.doc.Doc, window_units: str = "characters") -> Union[List[spacy.tokens.span.Span], spacy.tokens.doc.Doc]

| Parameter | Description | Required |
|---|---|---|
| `doc`: _spacy.tokens.doc.Doc_ | A spaCy Doc object. | Yes |
| `window_units`: _str_ | The units counted to construct windows: characters, lines, sentences, tokens. Default is characters. | Yes |
Uses a calculator to generate a rolling windows analysis and assigns the result to RollingWindows.result.

RollingWindows.calculate(patterns: Union[list, str] = None, calculator: Union[Callable, str] = "rw_calculator", query: str = "counts", show_spacy_rules: bool = False) -> None

| Parameter | Description | Required |
|---|---|---|
| `patterns`: _Union[list, str]_ | The patterns to search for. Default is None. | Yes |
| `calculator`: _Union[Callable, str]_ | The calculator to use. Default is the built-in "rw_calculator" calculator. | Yes |
| `query`: _str_ | The type of data to return ("averages", "counts", or "ratios"). Default is "counts". | Yes |
| `show_spacy_rules`: _bool_ | If the calculator uses a spaCy Matcher rule, tell the calculator's to_df method to display the rule as a column header; otherwise, only the value matched by the calculator will be displayed. Default is False. | Yes |
Note
For development purposes, RollingWindows.calculate() has a timer decorator, which will display the time elapsed when the windows are generated.
Uses a plotter to generate a plot of a rolling windows analysis and assigns the result to RollingWindows.fig.

RollingWindows.plot(plotter: Union[Callable, str] = "rw_simple_plotter", file: str = None) -> None

| Parameter | Description | Required |
|---|---|---|
| `plotter`: _Union[Callable, str]_ | The plotter to use. Default is the built-in "rw_simple_plotter". | Yes |
| `file`: _str_ | The path to a file to save the plot. Default is None. | No |
Generates rolling windows, creates a rollingwindows.Windows object, and assigns it to RollingWindows.windows.
RollingWindows.set_windows(n: int = 100, window_units: str = "characters", alignment_mode: str = "strict", filter: Union[Callable, str] = None)

| Parameter | Description | Required |
|---|---|---|
| `n`: _int_ | The number of units per window. Default is 1000. | Yes |
| `window_units`: _str_ | The units counted to construct windows: characters, lines, sentences, tokens. Default is characters. | Yes |
| `alignment_mode`: _str_ | How character-based windows snap to token boundaries.<br>- strict: No snapping.<br>- contract: Window contains all tokens completely within the window's assigned start and end indices.<br>- expand: Window contains all tokens partially within the window's assigned start and end indices.<br>Default is strict. | No |
| `filter`: _Union[Callable, str]_ | The name of a filter or an instance of a filter to be applied to the doc before windows are generated. Default is None. | No |

Note
For development purposes, RollingWindows.set_windows() has a timer decorator, which will display the time elapsed when the windows are generated.
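The three alignment modes can be sketched in plain Python over character offsets. This is only an illustration of the behaviour described in the table, not the module's code: `snap_window` and its `token_spans` argument (a sorted list of `(start, end)` character offsets, one per token) are hypothetical.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]


def snap_window(
    start: int, end: int, token_spans: List[Span], mode: str = "strict"
) -> Optional[Span]:
    """Adjust a character window relative to token boundaries.

    token_spans is assumed to be sorted by start offset.
    """
    if mode == "strict":
        # No snapping: the window keeps its assigned indices.
        return (start, end)
    if mode == "contract":
        # Keep only tokens that lie completely inside the window.
        kept = [(s, e) for s, e in token_spans if s >= start and e <= end]
    elif mode == "expand":
        # Keep every token that overlaps the window at all.
        kept = [(s, e) for s, e in token_spans if s < end and e > start]
    else:
        raise ValueError(f"Unknown alignment mode: {mode!r}")
    if not kept:
        return None  # No token satisfies the chosen mode.
    return (kept[0][0], kept[-1][1])


text = "the quick brown fox"
tokens = [(0, 3), (4, 9), (10, 15), (16, 19)]

contracted = snap_window(2, 12, tokens, "contract")
expanded = snap_window(2, 12, tokens, "expand")
```

With the window `(2, 12)`, contracting leaves only "quick", while expanding takes in "the quick brown".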
Function to create a generator of sliding windows.
def sliding_windows(input: Union[List[spacy.tokens.span.Span], spacy.tokens.doc.Doc], n: int = 1000, window_units: str = "characters", alignment_mode: str = "strict") -> Iterator

| Parameter | Description | Required |
|---|---|---|
| `input`: _Union[List[spacy.tokens.span.Span], spacy.tokens.doc.Doc]_ | Either a list of spaCy Span objects or a spaCy Doc object. | Yes |
| `n`: _int_ | The number of units per window. | Yes |
| `window_units`: _str_ | The units counted to construct windows: characters, lines, sentences, tokens. Default is characters. | Yes |
| `alignment_mode`: _str_ | How character-based windows snap to token boundaries.<br>- strict: No snapping.<br>- contract: Window contains all tokens completely within the window's assigned start and end indices.<br>- expand: Window contains all tokens partially within the window's assigned start and end indices.<br>Default is strict. | Yes |
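As a rough illustration of what a sliding windows generator does, the sketch below yields overlapping windows of `n` units, advancing one unit at a time. It works on any sequence (a string of characters or a list of tokens) and ignores spaCy objects and alignment modes. The name `sliding_windows_sketch` and the one-unit step are assumptions made for the example.

```python
from typing import Iterator, Sequence


def sliding_windows_sketch(units: Sequence, n: int = 1000) -> Iterator[Sequence]:
    """Yield successive windows of n units, advancing one unit at a time."""
    if n < 1:
        raise ValueError("n must be a positive integer.")
    # One window per start position; the last window ends at the final unit.
    # If there are fewer than n units, a single shorter window is yielded.
    for start in range(max(len(units) - n + 1, 1)):
        yield units[start : start + n]


character_windows = list(sliding_windows_sketch("abcde", n=3))
token_windows = list(sliding_windows_sketch(["a", "b", "c"], n=2))
```

Because the function is a generator, no window is built until it is requested, which keeps memory use low for long documents.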
A dataclass for storing a generator of rolling windows and associated metadata.
class Windows(windows: Iterable, window_units: str, n: int, alignment_mode: str = "strict")

| Parameter | Description | Required |
|---|---|---|
| `windows`: _Iterable_ | An iterable (normally a generator) of rolling windows. | Yes |
| `window_units`: _str_ | The units counted to construct windows: characters, lines, sentences, tokens. Default is characters. | Yes |
| `n`: _int_ | The number of units per window. | Yes |
| `alignment_mode`: _str_ | How character-based windows snap to token boundaries.<br>- strict: No snapping.<br>- contract: Window contains all tokens completely within the window's assigned start and end indices.<br>- expand: Window contains all tokens partially within the window's assigned start and end indices.<br>Default is strict. | Yes |

Windows.__iter__ returns the value of self.windows.

def __iter__(self):
    return self.windows

Note
This value is a generator, so iterating through the Windows object will exhaust it.
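The consequence of storing a generator can be seen with plain Python, independently of the Windows class. In this small sketch, the variable names are illustrative only.

```python
text = "abcde"

# A generator of three-character windows, comparable to what Windows stores.
windows = (text[i : i + 3] for i in range(len(text) - 2))

first_pass = list(windows)   # Consumes every window.
second_pass = list(windows)  # Nothing is left to iterate over.

# Materialising the windows in a list allows repeated iteration,
# at the cost of holding every window in memory.
reusable = [text[i : i + 3] for i in range(len(text) - 2)]
```

If the windows need to be inspected more than once, they must be regenerated or copied into a list first.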
Contains registered calculators. There is currently one registered calculator: RWCalculator. Each calculator is a subclass of BaseCalculator, which has a metadata property.
A calculator class to calculate rolling averages, counts, and ratios of a matched pattern or patterns in a document. The property RWCalculator.regex_flags returns the flags used with methods that call the Python re module.
rollingwindows.calculators.RWCalculator has a class attribute id, the value of which is "rw_calculator". This is the id registered in the registry.
class RWCalculator(*, patterns: Union[List, str] = None, windows: Windows = None, mode: str = "exact", case_sensitive: bool = False, alignment_mode: str = "strict", model: str = "xx_sent_ud_sm", original_doc: spacy.tokens.doc.Doc = None, query: str = "counts")

| Parameter | Description | Required |
|---|---|---|
| `patterns`: _Union[list, str]_ | A pattern or list of patterns to search in windows. Default is None. | No |
| `windows`: _Windows_ | A rollingwindows.Windows object containing the windows to search. | No |
| `mode`: _str_ | The type of search method to use:<br>- exact: Search for exact matches to a string pattern.<br>- regex: Search for matches to a regex expression.<br>- spacy_rule: Search for matches using a spaCy Matcher class rule.<br>- multi_token: Search for matches to a regex expression across multiple tokens.<br>- multi_token_exact: Search for matches to a string pattern across multiple tokens.<br>Default is exact. See the explanation below. | No |
| `case_sensitive`: _bool_ | Whether to make searches case-sensitive. Default is False. | No |
| `alignment_mode`: _str_ | Whether to snap searches to token boundaries:<br>- strict: No snapping.<br>- contract: Count all matches that fall completely within token boundaries.<br>- expand: Count all matches that fall partially within token boundaries.<br>Default is strict. | No |
| `model`: _str_ | The name of the language model to be used with spaCy's Matcher class. Default is xx_sent_ud_sm. | No |
| `original_doc`: _spacy.tokens.doc.Doc_ | A copy of the original doc. Default is None. | No, except if mode is set to multi_token or multi_token_exact |
| `query`: _str_ | The type of data to return: "averages", "counts", or "ratios". Default is counts. | No |
If the window is a string, exact and regex modes will match patterns irrespective of token boundaries. If the window is a list of strings or a spaCy Span object, patterns will be matched within the boundaries of each token. The spacy_rule option allows for complex searches using spaCy's Matcher class; however, it searches only for tokens or sequences of tokens that match the patterns. The alternative is the multi_token mode, which will match a regular expression both within and across token boundaries. The multi_token_exact version escapes regex special characters so that raw strings containing characters like .+*?^$()[]{}|\ can be matched. Both multi_token options can be configured with alignment_mode to determine how the matcher responds to token boundaries. Note that, because character indexes from the original document are required to locate the matched character range, multi_token modes require that you also pass the source document to the original_doc attribute.
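The difference between matching a pattern literally and matching it as a regular expression can be sketched for the simplest case, a window that is a plain string. The function `count_in_string` is hypothetical and is not part of RWCalculator. It relies only on Python's `re` module, using `re.escape` to neutralise special characters in the literal case.

```python
import re


def count_in_string(
    window: str, pattern: str, mode: str = "exact", case_sensitive: bool = False
) -> int:
    """Count non-overlapping matches of a pattern in a string window."""
    flags = 0 if case_sensitive else re.IGNORECASE
    if mode == "exact":
        # Escape regex special characters so the pattern is matched literally.
        pattern = re.escape(pattern)
    elif mode != "regex":
        raise ValueError(f"Unsupported mode for this sketch: {mode!r}")
    return sum(1 for _ in re.finditer(pattern, window, flags))


window = "a+b a+b aab"
literal_count = count_in_string(window, "a+b", mode="exact")
regex_count = count_in_string(window, "a+b", mode="regex")
```

Here the literal search finds the two occurrences of "a+b", whereas the regex search reads "a+" as "one or more a" and matches only "aab".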
Try to use configured values if not passed by public functions.
def _assign_variable(self, var_name: str, var: Any) -> Any

| Parameter | Description | Required |
|---|---|---|
| `var_name`: _str_ | The name of the variable. | Yes |
| `var`: _Any_ | The variable to be evaluated. | Yes |
Uses Python count() to count exact character matches in a character window.
def _count_character_patterns_in_character_windows(self, window: str, pattern: str) -> int
| Parameter | Description | Required |
|------------------|-----------------------|----------|
| `pattern`: _str_ | The pattern to match. | Yes |
| `window`: _str_ | The window to search. | Yes |
##### `rollingwindows.calculators.RWCalculator._count_in_character_window`

Chooses the function for counting matches in character windows.

```python
def _count_in_character_window(self, window: str, pattern: str) -> int
```

| Parameter | Description | Required |
|------------------|-----------------------|----------|
| `pattern`: _str_ | The pattern to match. | Yes |
| `window`: _str_ | The window to search. | Yes |
Counts patterns in lists of token strings.
def _count_token_patterns_in_token_lists(self, window: List[str], pattern: str) -> int

| Parameter | Description | Required |
|---|---|---|
| `pattern`: _str_ | A string pattern to search for. | Yes |
| `window`: _List[str]_ | A window consisting of a list of strings. | Yes |
Counts patterns in a spaCy Span object.
def _count_token_patterns_in_span(self, window: spacy.tokens.span.Span, pattern: Union[list, str]) -> int

| Parameter | Description | Required |
|---|---|---|
| `pattern`: _Union[list, str]_ | A string pattern or spaCy rule to search for. | Yes |
| `window`: _spacy.tokens.span.Span_ | A window consisting of a spaCy Span object. | Yes |
Counts patterns in span text with token alignment.

def _count_token_patterns_in_span_text(self, window: str, pattern: str) -> int

| Parameter | Description | Required |
|---|---|---|
| `pattern`: _str_ | A string pattern or spaCy rule to search for. | Yes |
| `window`: _str_ | A string window. | Yes |
Chooses the function for counting matches in token windows.
def _count_token_patterns_in_span_text(self, window: str, pattern: str) -> int

| Parameter | Description | Required |
|---|---|---|
| `pattern`: _Union[list, str]_ | A string pattern or spaCy rule to search for. | Yes |
| `window`: _Union[List[str], spacy.tokens.span.Span]_ | A window consisting of a list of token strings or a spaCy Span object. | Yes |
Extracts a string pattern from a spaCy rule.
def _extract_string_pattern(self, pattern: Union[dict, list, str]) -> str

| Parameter | Description | Required |
|---|---|---|
| `pattern`: _Union[dict, list, str]_ | A pattern to search. | Yes |
Calculates a ratio from a list of two counts.

def _get_ratio(self, counts: List[int]) -> float

| Parameter | Description | Required |
|---|---|---|
| `counts`: _List[int]_ | A list of two counts. | Yes |
Calls character or token window methods, as appropriate.
def _get_window_count(self, window: Union[List[str], spacy.tokens.span.Span, str], pattern: Union[list, str]) -> int

| Parameter | Description | Required |
|---|---|---|
| `pattern`: _Union[list, str]_ | A string pattern or spaCy rule to search for. | Yes |
| `window`: _Union[List[str], spacy.tokens.span.Span, str]_ | A window consisting of a list of token strings, a spaCy Span object, or a string. | Yes |
Calls rollingwindows.calculators.RWCalculator.run with the query set to "averages".

def get_averages(self, windows: Iterable = None, patterns: Union[List, str] = None) -> None

| Parameter | Description | Required |
|---|---|---|
| `patterns`: _Union[List, str]_ | A string pattern or spaCy rule, or a list of either. | No |
| `windows`: _Iterable_ | A rollingwindows.Windows object. | No |

Calls rollingwindows.calculators.RWCalculator.run with the query set to "counts".

def get_counts(self, windows: Iterable = None, patterns: Union[List, str] = None) -> None

| Parameter | Description | Required |
|---|---|---|
| `patterns`: _Union[List, str]_ | A string pattern or spaCy rule, or a list of either. | No |
| `windows`: _Iterable_ | A rollingwindows.Windows object. | No |

Calls rollingwindows.calculators.RWCalculator.run with the query set to "ratios".

def get_ratios(self, windows: Iterable = None, patterns: Union[List, str] = None) -> None

| Parameter | Description | Required |
|---|---|---|
| `patterns`: _Union[List, str]_ | A string pattern or spaCy rule, or a list of either. | No |
| `windows`: _Iterable_ | A rollingwindows.Windows object. | No |
Runs the calculator, which performs calculations and saves the result to RWCalculator.data.

def run(self, windows: Iterable = None, patterns: Union[List, str] = None, query: str = "counts") -> None

| Parameter | Description | Required |
|---|---|---|
| `patterns`: _Union[List, str]_ | A string pattern or spaCy rule, or a list of either. | No |
| `windows`: _Iterable_ | A rollingwindows.Windows object. | No |
| `query`: _str_ | String designating whether to return "counts", "averages", or "ratios". Default is counts. | No |
Converts the data in RWCalculator.data to a pandas DataFrame.

def to_df(self, show_spacy_rules: bool = False) -> pd.DataFrame

| Parameter | Description | Required |
|---|---|---|
| `show_spacy_rules`: _bool_ | If the calculator uses a spaCy Matcher rule, display the rule as a column header; otherwise, only the value matched by the calculator will be displayed. Default is False. | No |
Contains registered filters. There are currently two registered filters: WordFilter and NonStopwordFilter. Each filter is a subclass of the BaseFilter class, which has a metadata property.
Applies a filter to a document and returns a new document.
def filter_doc(doc: spacy.tokens.doc.Doc, keep_ids: List[int], spacy_attrs: List[str] = SPACY_ATTRS, force_ws: bool = True) -> spacy.tokens.doc.Doc

| Parameter | Description | Required |
|---|---|---|
| `doc`: _spacy.tokens.doc.Doc_ | A spaCy Doc object. | Yes |
| `keep_ids`: _List[int]_ | A list of spaCy Token ids to keep in the filtered Doc. | Yes |
| `spacy_attrs`: _List[str]_ | A list of spaCy Token attributes to keep in the filtered Doc. Default is the SPACY_ATTRS list imported with filters.* | No |
| `force_ws`: _bool_ | Force a whitespace at the end of every token except the last. Default is True. | No |

* The default list of spaCy token attributes can be inspected by calling filters.SPACY_ATTRS.
Converts a spaCy Doc object into a numpy array.
def get_doc_array(doc: spacy.tokens.doc.Doc, spacy_attrs: List[str] = SPACY_ATTRS, force_ws: bool = True) -> np.ndarray

| Parameter | Description | Required |
|---|---|---|
| `doc`: _spacy.tokens.doc.Doc_ | A spaCy Doc object. | Yes |
| `spacy_attrs`: _List[str]_ | A list of spaCy Token attributes to keep in the filtered Doc. Default is the SPACY_ATTRS list imported with filters.* | No |
| `force_ws`: _bool_ | Force a whitespace at the end of every token except the last. Default is True. | No |

* The default list of spaCy token attributes can be inspected by calling filters.SPACY_ATTRS.
The following options are available for handling whitespace:
- force_ws=True ensures that token_with_ws and whitespace_ attributes are preserved, but all tokens will be separated by whitespaces in the text of a doc created from the array.
- force_ws=False with SPACY in spacy_attrs preserves the token_with_ws and whitespace_ attributes and their original values. This may cause tokens to be merged if subsequent processing operates on the doc.text.
- force_ws=False without SPACY in spacy_attrs does not preserve the token_with_ws and whitespace_ attributes or their values. By default, doc.text displays a single space between each token.
Returns True if a token is not a Roman numeral. Works only on upper-case Roman numerals.
def is_not_roman_numeral(s: str) -> bool

| Parameter | Description | Required |
|---|---|---|
| `s`: _str_ | A string to match against the Roman numerals pattern. | Yes |
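A check of this kind is commonly written with a regular expression for well-formed upper-case Roman numerals. The sketch below uses one widely used pattern. Whether the module uses the same expression is not stated here, so `ROMAN_NUMERAL` and `is_not_roman_numeral_sketch` should be read as illustrative assumptions.

```python
import re

# A common pattern for upper-case Roman numerals from 1 to 3999.
ROMAN_NUMERAL = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def is_not_roman_numeral_sketch(s: str) -> bool:
    """Return True unless s is a well-formed upper-case Roman numeral."""
    if s == "":
        # The pattern can match the empty string, which is not a numeral.
        return True
    return ROMAN_NUMERAL.fullmatch(s) is None


results = {s: is_not_roman_numeral_sketch(s) for s in ["XIV", "xiv", "IIII", "word", "I"]}
```

Note that a case-sensitive test like this one also treats the English pronoun "I" as a Roman numeral, which may matter when the check is used to filter words.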
A filter class to remove stop words from a document. This is a minimal function that strips punctuation and returns the ids of words not flagged as stop words by the language model or in a list of additional_stopwords. The property NonStopwordFilter.word_ids returns the token ids for all tokens in the document that are not stop words according to these criteria.
rollingwindows.filters.NonStopwordFilter has a class attribute id, the value of which is "non_stopword_filter". This is the id registered in the registry.

class NonStopwordFilter(doc: spacy.tokens.doc.Doc, *, spacy_attrs: List[str] = SPACY_ATTRS, additional_stopwords: List[str] = None, case_sensitive: bool = False)

| Parameter | Description | Required |
|---|---|---|
| `doc`: _spacy.tokens.doc.Doc_ | A spaCy Doc object. | Yes |
| `spacy_attrs`: _List[str]_ | A list of spaCy Token attributes to keep in the filtered Doc. Default is the SPACY_ATTRS list imported with filters. | No |
| `additional_stopwords`: _List[str]_ | A list of stop words to add to those labelled as stop words by the model. Default is None. | No |
| `case_sensitive`: _bool_ | Use only lower case forms if False. Default is True. | No |
Returns True if a token is not a stop word.
def _is_non_stopword(self, token: spacy.tokens.Token) -> bool

| Parameter | Description | Required |
|---|---|---|
| `token`: _spacy.tokens.Token_ | A spaCy Token object. | Yes |
Applies the filter and returns a new, filtered doc.
def apply(self) -> spacy.tokens.doc.Doc

A filter class to remove non-words from a document. The property WordFilter.word_ids returns the token ids for all tokens in the document that are identified as words according to supplied criteria.
rollingwindows.filters.WordFilter has a class attribute id, the value of which is "word_filter". This is the id registered in the registry.

class WordFilter(doc: spacy.tokens.doc.Doc, *, spacy_attrs: List[str] = SPACY_ATTRS, exclude: Union[List[str], str] = [" ", "\n"], exclude_digits: bool = False, exclude_roman_numerals: bool = False, exclude_pattern: Union[List[str], str] = None)

| Parameter | Description | Required |
|---|---|---|
| `doc`: _spacy.tokens.doc.Doc_ | A spaCy Doc object. | Yes |
| `spacy_attrs`: _List[str]_ | A list of spaCy Token attributes to keep in the filtered Doc. Default is the SPACY_ATTRS list imported with filters. | No |
| `exclude`: _Union[List[str], str]_ | A string/regex or list of strings/regex patterns to exclude. Default is [" ", "\n"]. | No |
| `exclude_digits`: _bool_ | If True, digits will not be treated as words. Default is False. | No |
| `exclude_roman_numerals`: _bool_ | If True, Roman numerals will not be treated as words. However, this only works with capitalised Roman numerals. Default is False. | No |
| `exclude_pattern`: _Union[List[str], str]_ | Additional regex patterns to add to the default exclude list. Default is None. | No |
Applies the filter and returns a new, filtered doc.
def apply(self) -> spacy.tokens.doc.Doc

Contains helper functions used by multiple files in the module. rollingwindows.helpers.ensure_doc may be legacy code that is not used in the current version.
Converts input into a spaCy Doc object. The returned Doc is unannotated if it is derived from a string or a list of tokens.

def ensure_doc(input: Union[str, List[str], spacy.tokens.doc.Doc], nlp: Union[Language, str], batch_size: int = 1000) -> spacy.tokens.doc.Doc

| Parameter | Description | Required |
|---|---|---|
| `input`: _Union[str, List[str], spacy.tokens.doc.Doc]_ | A string, list of tokens, or a spaCy doc. | Yes |
| `nlp`: _Union[Language, str]_ | The language model to use. | Yes |
| `batch_size`: _int_ | The number of texts to accumulate in an internal buffer. Default is 1000. | No |
Wraps any input in a list if it is not already a list.
def ensure_list(input: Any) -> list

| Parameter | Description | Required |
|---|---|---|
| `input`: _Any_ | An input variable. | Yes |
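The behaviour described can be sketched in a single line. The name `ensure_list_sketch` is used to make clear that this is an illustration and not the helper's actual source.

```python
from typing import Any


def ensure_list_sketch(value: Any) -> list:
    """Wrap a value in a list unless it is already a list."""
    return value if isinstance(value, list) else [value]


single = ensure_list_sketch("the")          # A string is wrapped, not split.
already = ensure_list_sketch(["the", "a"])  # A list is returned unchanged.
```

A helper like this lets functions accept either one pattern or a list of patterns and then loop over the result without further type checks.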
Converts a spaCy Matcher rule to lower case.
def spacy_rule_to_lower(patterns: Union[Dict, List[Dict]], old_key: Union[List[str], str] = ["TEXT", "ORTH"], new_key: str = "LOWER") -> list

| Parameter | Description | Required |
|---|---|---|
| `patterns`: _Union[Dict, List[Dict]]_ | A spaCy Matcher rule, given as a dict or a list of dicts. | Yes |
| `old_key`: _Union[List[str], str]_ | A dictionary key or list of keys to rename. Default is ["TEXT", "ORTH"]. | No |
| `new_key`: _str_ | The new key name. Default is LOWER. | No |
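Since a spaCy Matcher rule is a list of dictionaries, the conversion amounts to renaming keys and lower-casing their string values. The following sketch shows the idea using plain dictionaries. `rule_to_lower_sketch` is a hypothetical stand-in, and its handling of non-string values (left unchanged) is an assumption.

```python
from typing import Dict, List, Sequence, Union


def rule_to_lower_sketch(
    patterns: Union[Dict, List[Dict]],
    old_key: Union[Sequence[str], str] = ("TEXT", "ORTH"),
    new_key: str = "LOWER",
) -> List[Dict]:
    """Rename case-sensitive keys in a Matcher-style rule and lower-case their values."""
    if isinstance(patterns, dict):
        patterns = [patterns]
    old_keys = [old_key] if isinstance(old_key, str) else list(old_key)
    lowered = []
    for token_spec in patterns:
        new_spec = {}
        for key, value in token_spec.items():
            if key in old_keys:
                # Only plain strings are lower-cased; other values are kept as they are.
                new_spec[new_key] = value.lower() if isinstance(value, str) else value
            else:
                new_spec[key] = value
        lowered.append(new_spec)
    return lowered


rule = [{"TEXT": "Rolling"}, {"ORTH": "Windows"}, {"POS": "NOUN"}]
lowered_rule = rule_to_lower_sketch(rule)
```

The original rule is left untouched, so the case-sensitive and case-insensitive versions can be used side by side.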
Contains registered plotters. There are currently two registered plotters: RWSimplePlotter and RWPlotlyPlotter. Each plotter is a subclass of the BasePlotter class, which has a metadata property.
Returns interpolated points for plots that use interpolation. The interpolation function may be scipy.interpolate.pchip_interpolate, numpy.interp, or one of the options for scipy.interpolate.interp1d. Note, however, that scipy.interpolate.interp1d is deprecated.

def interpolate(x: np.ndarray, y: np.ndarray, xx: np.ndarray, interpolation_kind: str = None) -> np.ndarray

| Parameter | Description | Required |
|---|---|---|
| `x`: _np.ndarray_ | The x values. | Yes |
| `y`: _np.ndarray_ | The y values. | Yes |
| `xx`: _np.ndarray_ | The projected interpolation range. | Yes |
| `interpolation_kind`: _str_ | The interpolation function to use. Default is None. | No |
Generates a plot using Plotly.
rollingwindows.plotters.RWPlotlyPlotter has a class attribute id, the value of which is "rw_plotly_plotter". This is the id registered in the registry.

class RWPlotlyPlotter(width: int = 700, height: int = 450, title: Union[dict, str] = "Rolling Windows Plot", xlabel: str = "Token Count", ylabel: str = "Average Frequency", line_color: str = "variable", showlegend: bool = True, titlepad: float = None, show_milestones: bool = True, milestone_marker_style: dict = {"width": 1, "color": "teal"}, show_milestone_labels: bool = False, milestone_labels: List[dict] = None, milestone_label_rotation: float = 0.0, milestone_label_style: dict = {"size": 10.0, "family": "Open Sans, verdana, arial, sans-serif", "color": "teal"}, **kwargs)

| Attribute | Description | Required |
|---|---|---|
| `width`: _int_ | The figure width in pixels. Default is 700. | No |
| `height`: _int_ | The figure height in pixels. Default is 450. | No |
| `title`: _Union[dict, str]_ | The title of the figure. Styling can be added by passing a dict with the keywords described in Plotly's documentation. Default is Rolling Windows Plot. | No |
| `xlabel`: _str_ | The text to display along the x axis. Default is Token Count. | No |
| `ylabel`: _str_ | The text to display along the y axis. Default is Average Frequency. | No |
| `line_color`: _str_ | The colour to be used for the lines on the line graph. Default is variable. | No |
| `showlegend`: _bool_ | Whether to show the legend. Default is True. | No |
| `titlepad`: _float_ | The margin in pixels between the title and the top of the graph. If not set, the margin will be calculated automatically from milestone label heights if they are shown. Default is None. | No |
| `show_milestones`: _bool_ | Whether to show the milestone markers. Default is False. | No |
| `milestone_marker_style`: _dict_ | A dict containing the styles to apply to the milestone marker. For valid properties, see the Plotly documentation. Default is {"width": 1, "color": "teal"}. | No |
| `show_milestone_labels`: _bool_ | Whether to show the milestone labels. Default is False. | No |
| `milestone_labels`: _Dict[str, int]_ | A dict with keys as milestone labels and values as points on the x-axis. Default is None. | No |
| `milestone_label_rotation`: _Union[float, int]_ | The clockwise rotation of the milestone labels up to 90 degrees. Default is 0.0. | No |
| `milestone_label_style`: _dict_ | A dict containing the styling information for the milestone labels. For valid properties, see the Plotly documentation. Default is {"size": 10.0, "family": "Open Sans, verdana, arial, sans-serif", "color": "teal"}. | No |
Tip
When milestone labels are shown and titlepad is not set manually, the class attempts to detect a suitable margin by using the same trick as RWSimplePlotter: it constructs a plot in matplotlib and measures the longest label to use as a guide. This produces reasonable results unless you change the figure height. In that case, it is advisable to set titlepad manually.
Once the figure is generated, it can be accessed with self.fig. You can then call self.fig.update_layout() and modify the figure using any of the parameters available in the Plotly documentation. This is useful to make changes not enabled by the Lexos API.
Adds numeric suffixes for duplicate milestone labels. Returns a dictionary containing unique keys.
def _check_duplicate_labels(self, locations: List[Dict[str, int]]) -> List[Dict[str, int]]

| Parameter | Description | Required |
|---|---|---|
| `locations`: _List[Dict[str, int]]_ | A list of location dicts. | Yes |
Note
The method is not yet implemented. The documentation here is copied from RWSimplePlotter since it should be substantially the same. That said, the class currently requires milestones to be submitted as a dictionary, which requires unique keys. So this needs some further thought.
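One way such a method could work is sketched below, purely as an illustration of the idea of numeric suffixes. The function name `add_label_suffixes`, the `_1`, `_2` suffix format, and the choice to leave unique labels unchanged are all assumptions, not a description of the eventual implementation.

```python
from collections import Counter
from typing import Dict, List


def add_label_suffixes(locations: List[Dict[str, int]]) -> List[Dict[str, int]]:
    """Append a running number to every label that occurs more than once."""
    totals = Counter(label for location in locations for label in location)
    seen: Counter = Counter()
    result = []
    for location in locations:
        renamed = {}
        for label, x in location.items():
            if totals[label] > 1:
                seen[label] += 1
                renamed[f"{label}_{seen[label]}"] = x
            else:
                renamed[label] = x  # Unique labels are left unchanged.
        result.append(renamed)
    return result


milestones = [{"Chapter": 0}, {"Chapter": 500}, {"Epilogue": 900}]
unique_milestones = add_label_suffixes(milestones)
```

Because each location is its own dict, duplicate labels can be represented in the input, which is not possible when all milestones share a single dictionary.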
Ensures that the title, xlabel, and ylabel values are dicts.
def _get_axis_and_title_labels(self) -> Tuple[bool, str]

Gets a titlepad value based on the height of the longest milestone label if the titlepad class attribute is not set.

def _get_titlepad(self, labels: Dict[str, int]) -> float

| Parameter | Description | Required |
|---|---|---|
| `labels`: _Dict[str, int]_ | A dict with the labels as keys. | Yes |
Adds a milestone label to the Plotly figure.
def _plot_milestone_label(self, label: str, x: int) -> None

| Parameter | Description | Required |
|---|---|---|
| `label`: _str_ | The text of a milestone label. | Yes |
| `x`: _int_ | The location of the milestone label on the x axis. | Yes |
Adds a milestone marker (vertical line) to the Plotly figure.
def _plot_milestone_marker(self, x: int, df_val_min: Union[float, int], df_val_max: Union[float, int]) -> None

| Parameter | Description | Required |
|---|---|---|
| `x`: _int_ | The location of the milestone label on the x axis. | Yes |
| `df_val_min`: _Union[float, int]_ | The minimum value in the pandas DataFrame. | Yes |
| `df_val_max`: _Union[float, int]_ | The maximum value in the pandas DataFrame. | Yes |
Runs the plotter and saves the figure to RWPlotlyPlotter.fig.

def run(self, df: pd.DataFrame) -> None

| Parameter | Description | Required |
|---|---|---|
| `df`: _pandas.DataFrame_ | A pandas DataFrame, normally stored in RollingWindows.result. | Yes |
Saves the plot to a file.
def save(self, path: str, **kwargs) -> None

| Parameter | Description | Required |
|---|---|---|
| `path`: _str_ | The path to the file where the figure is to be saved. | Yes |
Note
If the path ends in .html, this method will attempt to save the figure as a dynamic HTML file. The method accepts any keyword available for Plotly's Figure.write_html method. Otherwise, it will attempt to save the figure as a static file in the format suggested by the extension in the filename (e.g. .png, .jpg). The method accepts any keyword available for Plotly's Figure.write_image method.
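The choice between the two export routes can be sketched as a simple check on the file extension. `save_figure_sketch` is a hypothetical helper, not the plotter's own method. It assumes only that `fig` offers the `write_html` and `write_image` methods found on Plotly figures, and it returns a short string so that the chosen route is visible.

```python
from pathlib import Path


def save_figure_sketch(fig, path: str, **kwargs) -> str:
    """Save a figure as dynamic HTML or as a static image, based on the extension."""
    if Path(path).suffix.lower() == ".html":
        # Dynamic output: keyword arguments are passed on to write_html.
        fig.write_html(path, **kwargs)
        return "html"
    # Static output: the format is inferred from the extension (e.g. .png, .jpg).
    fig.write_image(path, **kwargs)
    return "image"
```

With a real Plotly figure, static export normally also requires an image export engine such as kaleido to be installed.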
Displays a generated figure. This method calls matplotlib.pyplot.show. However, since this does not work with an inline backend like Jupyter notebooks, the method tries to detect this environment via a UserWarning and then just calls the plot attribute.

def show(self, config={"displaylogo": False}, **kwargs) -> None

| Parameter | Description | Required |
|---|---|---|
| `config`: _dict_ | A dictionary supplying Plotly configuration values. | No |
Generates a plot using matplotlib.pyplot.
rollingwindows.plotters.RWSimplePlotter has a class attribute id, the value of which is "rw_simple_plotter". This is the id registered in the registry.

class RWSimplePlotter(width: Union[float, int] = 6.4, height: Union[float, int] = 4.8, figsize: tuple = None, hide_spines: List[str] = ["top", "right"], title: str = "Rolling Windows Plot", titlepad: float = 6.0, title_position: str = "top", show_legend: bool = True, show_grid: bool = False, xlabel: str = "Token Count", ylabel: str = "Average Frequency", show_milestones: bool = False, milestone_colors: Union[List[str], str] = "teal", milestone_style: str = "--", milestone_width: int = 1, show_milestone_labels: bool = False, milestone_labels: List[dict] = None, milestone_labels_ha: str = "left", milestone_labels_va: str = "baseline", milestone_labels_rotation: int = 45, milestone_labels_offset: tuple = (-8, 4), milestone_labels_textcoords: str = "offset pixels", use_interpolation: bool = False, interpolation_num: int = 500, interpolation_kind: str = "pchip", **kwargs)

| Attribute | Description | Default |
|---|---|---|
| `width`: _Union[float, int]_ | The figure width in inches. Default is 6.4. | 6.4 |
| `height`: _Union[float, int]_ | The figure height in inches. Default is 4.8. | 4.8 |
| `figsize`: _tuple_ | A tuple containing the figure width and height in inches (overrides the width and height settings). Default is None. | None |
| `hide_spines`: _List[str]_ | A list of ["top", "right", "bottom", "left"] indicating which spines to hide. Default is ["top", "right"]. | ["top", "right"] |
| `title`: _str_ | The title to use for the plot. Default is Rolling Windows Plot. | Rolling Windows Plot |
| `titlepad`: _float_ | The padding in points to place between the title and the plot, which may need to be increased if you are showing milestone labels. Default is 6.0. | 6.0 |
| `title_position`: _str_ | Show the title on the "bottom" or the "top" of the figure. Default is top. | top |
| `show_legend`: _bool_ | Whether to show the legend. Default is True. | True |
| `show_grid`: _bool_ | Whether to show the grid. Default is False. | False |
| `xlabel`: _str_ | The text to display along the x axis. Default is Token Count. | Token Count |
| `ylabel`: _str_ | The text to display along the y axis. Default is Average Frequency. | Average Frequency |
| `show_milestones`: _bool_ | Whether to show the milestone markers. Default is False. | False |
| `milestone_colors`: _Union[List[str], str]_ | The colour or colours to use for milestone markers. See pyplot.vlines(). Default is teal. | teal |
| `milestone_style`: _str_ | The style of the milestone markers. See pyplot.vlines(). Default is --. | -- |
| `milestone_width`: _int_ | The width of the milestone markers. See pyplot.vlines(). Default is 1. | 1 |
| `show_milestone_labels`: _bool_ | Whether to show the milestone labels. Default is False. | False |
| `milestone_labels`: _List[dict]_ | A list of dicts with keys as milestone labels and values as token indexes. Default is None. | None |
| `milestone_labels_ha`: _str_ | The horizontal alignment of the milestone labels. See pyplot.annotate(). Default is left. | left |
milestone_labels_va: str |
The vertical alignment of the milestone labels. See pyplot.annotate(). Default is baseline. |
baseline |
milestone_labels_rotation: int |
The rotation of the milestone labels in degrees. See pyplot.annotate(). Default is 45. |
45 |
milestone_labels_offset: tuple |
A tuple containing the number of pixels along the x and y axes to offset the milestone labels. See pyplot.annotate(). Default is (-8, 4). |
(-8, 4) |
milestone_labels_textcoords: str |
Whether to offset milestone labels by pixels or points. See pyplot.annotate(). Default is offset pixels. |
offset pixels |
use_interpolation: bool |
Whether to use interpolation on values. Default is False. |
False |
interpolation_num: int |
Number of values to add between points. Default is 500. |
500 |
interpolation_kind: str |
Algorithm to use for interpolation. Default is pchip. |
pchip |
If your RollingWindows.doc has milestones, you can display them as vertical lines on the graph. If show_milestone_labels is set to True, the first token in each milestone will be displayed as a label above the vertical line. If the labels are the same, they will be numbered consecutively ("milestone1", "milestone2", etc.). You can also submit custom labels and locations using the milestone_labels keyword. The other milestone_labels_ parameters control the alignment, rotation, and offset of the labels.
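The following sketch shows how milestone markers and labels of this kind can be drawn directly with matplotlib, using the documented default styling values. It is an illustration under stated assumptions rather than the plotter's own code: the data values and milestone locations are invented.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Invented data: one rolling average and two milestone locations (token indexes)
values = [0.1, 0.3, 0.2, 0.5, 0.4, 0.6, 0.3, 0.2]
milestones = {"milestone1": 2, "milestone2": 5}

fig, ax = plt.subplots(figsize=(6.4, 4.8))
ax.plot(range(len(values)), values, label="pattern")
ymin, ymax = ax.get_ylim()

# One dashed vertical line per milestone
ax.vlines(
    list(milestones.values()), ymin, ymax,
    colors="teal", linestyles="--", linewidths=1,
)

# A rotated label anchored at the top of each line
for label, x in milestones.items():
    ax.annotate(
        label,
        xy=(x, ymax),
        xytext=(-8, 4),
        textcoords="offset pixels",
        ha="left",
        va="baseline",
        rotation=45,
    )
```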
Rolling Windows plots often produce unattractive, squarish lines rather than the smooth curves familiar from line graphs of other types of data. This is because the frequencies of patterns tend to shift abruptly rather than gradually. With use_interpolation, you can introduce smoothing by interpolating points between the values produced by the calculator, giving a more aesthetically pleasing graph. However, the resulting plots should only be used for presentation purposes where the interpretive value has been established in a non-interpolated plot, since interpolation can introduce distortions which may be deceptive. Users are encouraged to compare interpolated and non-interpolated plots of their analysis.

The value of interpolation_num is the number of points to interpolate between points in your data. The interpolation_kind refers to the function used to interpolate the points. The default is scipy's interpolate.pchip_interpolate function. You can also supply any of the kinds allowed by scipy's interpolate.interp1d class, although in practice only "cubic" and "quadratic" are likely to make a difference.
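To make the comparison concrete, here is a small sketch of the two interpolation routes named above, applied to invented data. It assumes, for illustration only, that `interpolation_num` is used as the total number of points on a dense x grid; the plotter's actual handling may differ.

```python
import numpy as np
from scipy.interpolate import interp1d, pchip_interpolate

# Invented rolling-window values with abrupt shifts
x = np.arange(6)
y = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.5])

interpolation_num = 500  # assumption: total points on the dense grid
x_dense = np.linspace(x.min(), x.max(), interpolation_num)

# PCHIP (the documented default) stays within the range of neighbouring points
y_pchip = pchip_interpolate(x, y, x_dense)

# A cubic interpolation via interp1d may overshoot the original values
y_cubic = interp1d(x, y, kind="cubic")(x_dense)
```

Plotting `y_pchip` and `y_cubic` against the original points is a quick way to see where each method departs from the calculated values.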
Adds numeric suffixes for duplicate milestone labels. Returns a list of unique location dictionaries.

    def _check_duplicate_labels(self, locations: List[Dict[str, int]]) -> List[Dict[str, int]]

| Parameter | Description | Required |
|---|---|---|
| locations: List[Dict[str, int]] | A list of location dicts. | Yes |

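The numbering behaviour can be sketched as below. This is a hypothetical stand-alone function written for illustration, not the method's actual implementation; it assumes each location dict maps one label to a token index.

```python
from collections import Counter
from typing import Dict, List


def number_duplicate_labels(locations: List[Dict[str, int]]) -> List[Dict[str, int]]:
    """Append 1, 2, 3... to any label that occurs more than once."""
    totals = Counter(label for loc in locations for label in loc)
    seen: Counter = Counter()
    result = []
    for loc in locations:
        for label, index in loc.items():
            if totals[label] > 1:
                seen[label] += 1
                result.append({f"{label}{seen[label]}": index})
            else:
                # Unique labels are left unchanged
                result.append({label: index})
    return result


labels = number_duplicate_labels(
    [{"chapter": 0}, {"chapter": 120}, {"epilogue": 300}]
)
# labels == [{"chapter1": 0}, {"chapter2": 120}, {"epilogue": 300}]
```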
Returns the height of the longest milestone label by using a separate plot to calculate the label height. The method is used to estimate how high to place the title above the plot.

    def _get_label_height(self, milestone_labels: List[dict], milestone_labels_rotation: int) -> float

| Parameter | Description | Required |
|---|---|---|
| milestone_labels: List[dict] | A list of milestone_label dicts. | Yes |
| milestone_labels_rotation: int | The rotation of the labels in degrees. | Yes |

Runs the plotter and saves the figure to RWSimplePlotter.fig.

    def run(self, df: pd.DataFrame) -> None

| Parameter | Description | Required |
|---|---|---|
| df: pandas.DataFrame | A pandas DataFrame, normally stored in RollingWindows.result. | Yes |

Saves the plot to a file. This method is a wrapper for matplotlib.pyplot.savefig().

    def save(self, path: str) -> None

| Parameter | Description | Required |
|---|---|---|
| path: str | The path to the file where the figure is to be saved. The image type (e.g. .png, .jpg, .pdf) is determined by the extension on the filename. | Yes |

Displays a generated figure. This method calls matplotlib.pyplot.show. However, since this does not work with an inline backend such as the one used in Jupyter notebooks, the method tries to detect this environment via a UserWarning and then just calls the plot attribute.

    def show(self, **kwargs) -> None

A registry of "built-in" rolling windows calculators, filters, and plotters. These can be loaded using their string id attributes with the Python catalogue module.
Calculators are implemented with the Calculator protocol, which allows you to produce custom calculator classes. A skeleton calculator is given below.

    from typing import Iterable, Union

    import pandas as pd
    import spacy

    class MyCustomCalculator(Calculator):
        id: str = "my_custom_calculator"

        def __init__(
            self,
            patterns: Union[list, str],
            windows: Iterable
        ):
            """Create an instance of the calculator."""
            self.patterns = patterns
            self.windows = windows
            self.data = None

        def run(self) -> spacy.tokens.doc.Doc:
            """Run the calculator."""
            ...

        def to_df(self) -> pd.DataFrame:
            """Convert the data to a pandas DataFrame."""
            ...

The Calculator protocol automatically builds a metadata dictionary when the class is instantiated. It requires a run() method to perform calculations and save the data to the object's data attribute. It also requires a to_df() method to convert the data to a pandas DataFrame. The data and DataFrame can take any format, as required for your purpose. However, the data must be compatible with the chosen plotter. For instance, if using rollingwindows.plotters.RWSimplePlotter, the DataFrame must be organised with each pattern in a separate column and each window in a separate row.
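As an illustration of the DataFrame layout expected by RWSimplePlotter, the sketch below fills in the skeleton with a simple counting rule. `TokenCountCalculator` is a hypothetical class: it mirrors the skeleton's interface but deliberately does not subclass the real Calculator protocol, so that it can run without Lexos installed, and it treats each window as a plain list of token strings.

```python
from typing import Iterable, Union

import pandas as pd


class TokenCountCalculator:
    """Hypothetical calculator: counts exact token matches in each window."""

    id: str = "token_count_calculator"

    def __init__(self, patterns: Union[list, str], windows: Iterable):
        # Accept a single pattern or a list of patterns
        self.patterns = [patterns] if isinstance(patterns, str) else list(patterns)
        self.windows = windows
        self.data = None

    def run(self) -> None:
        """Count each pattern in each window and store the rows in self.data."""
        rows = []
        for window in self.windows:
            tokens = list(window)
            rows.append([tokens.count(pattern) for pattern in self.patterns])
        self.data = rows

    def to_df(self) -> pd.DataFrame:
        """Return one column per pattern and one row per window."""
        return pd.DataFrame(self.data, columns=self.patterns)


windows = [["the", "cat", "sat"], ["cat", "sat", "on"], ["sat", "on", "the"]]
calc = TokenCountCalculator(["the", "cat"], windows)
calc.run()
df = calc.to_df()
```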
Filters are implemented with the Filter protocol, which allows you to produce custom filter classes. A skeleton filter is given below.

    from typing import List

    import spacy

    class MyCustomFilter(Filter):
        id: str = "my_custom_filter"

        def __init__(
            self,
            doc: spacy.tokens.doc.Doc,
            *,
            spacy_attrs: List[str] = SPACY_ATTRS
        ):
            self.doc = doc
            self.spacy_attrs = spacy_attrs

        @property
        def filtered_token_ids(self) -> set:
            """Get a set of token_ids to keep after filtering."""
            return {
                token.i for token in self.doc
                if token.text.startswith("a")
            }

        def apply(self) -> spacy.tokens.doc.Doc:
            """Apply the filter."""
            return filter_doc(
                self.doc,
                self.filtered_token_ids,
                self.spacy_attrs
            )

The name of the filter is stored in the class attribute id. The filtered_token_ids property retrieves a set of token ids to keep. The apply() method returns a new document with all tokens not in filtered_token_ids removed. Notice that it calls the filter_doc() function, which is imported with filters. This function returns a new document in which the attribute labels have been copied from the old one. However, you may call your own function if you wish to adopt a different procedure. Once you have a filtered document, you can use it to create a new RollingWindows instance.

!!! Note
    If you wish to pass an arbitrary list of token indexes to filter_doc(), it is wise to pass these indexes as a set. Although filter_doc() will accept a Python list, this can increase processing times from less than a second to several minutes, depending on the length of the document.

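The difference most likely comes from membership testing: checking whether an index is in a list scans the list, whereas a set lookup takes roughly constant time on average. The sketch below illustrates the pattern with plain Python lists; `keep_tokens` is a hypothetical stand-in and says nothing about how filter_doc() is implemented internally.

```python
from typing import Iterable, List


def keep_tokens(tokens: List[str], keep_ids: Iterable[int]) -> List[str]:
    """Keep only the tokens whose positions appear in keep_ids."""
    keep = set(keep_ids)  # convert once so each lookup is fast
    return [token for i, token in enumerate(tokens) if i in keep]


kept = keep_tokens(["an", "apple", "fell", "again"], [0, 1, 3])
# kept == ["an", "apple", "again"]
```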
Plotters are implemented with the BasePlotter protocol, which allows you to produce custom plotter classes. A skeleton plotter is given below.

    from typing import Any

    class MyCustomPlotter(BasePlotter):
        id: str = "my_custom_plotter"

        def __init__(self, **kwargs):
            """Create an instance of the plotter."""
            # Define any attributes here

        def file(self) -> None:
            """Save the figure to a file."""
            ...

        def run(self, data: Any) -> None:
            """Run the plotter on a set of input data."""
            ...

        def show(self) -> None:
            """Display the plot."""
            ...

The BasePlotter protocol automatically builds a metadata dictionary when the class is instantiated. The data can be passed to the run() method in any format as long as the run() method handles the logic of generating a plot from it. However, if the plotter is to be compatible with a built-in calculator, it must accept a pandas DataFrame organised with each pattern in a separate column and each window in a separate row.
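A filled-in version of the skeleton might look like the sketch below. `LinePlotter` is a hypothetical class that follows the skeleton's method names but does not subclass the real BasePlotter protocol, so it runs without Lexos installed; the `path` attribute is an assumption introduced so that file() has somewhere to write.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd


class LinePlotter:
    """Hypothetical plotter: draws one line per DataFrame column."""

    id: str = "line_plotter"

    def __init__(self, title: str = "Rolling Windows Plot", path: str = "plot.png"):
        self.title = title
        self.path = path  # assumed attribute, not part of the skeleton
        self.fig = None

    def run(self, data: pd.DataFrame) -> None:
        """Plot each pattern (column) across the windows (rows)."""
        self.fig, ax = plt.subplots()
        for pattern in data.columns:
            ax.plot(data.index, data[pattern], label=str(pattern))
        ax.set_title(self.title)
        ax.legend()

    def file(self) -> None:
        """Save the figure to the configured path."""
        self.fig.savefig(self.path)

    def show(self) -> None:
        """Display the plot."""
        plt.show()


plotter = LinePlotter()
plotter.run(pd.DataFrame({"the": [1, 0, 1], "cat": [1, 1, 0]}))
```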