modify usfm for chapter-level drafting to avoid import issues; move remarks to chapters#285
modify usfm for chapter-level drafting to avoid import issues; move remarks to chapters#285mshannon-sil wants to merge 6 commits intomainfrom
Conversation
…emarks to chapters
ddaspit
left a comment
There was a problem hiding this comment.
@ddaspit reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on Enkidu93 and mshannon-sil).
machine/corpora/update_usfm_parser_handler.py line 345 at r1 (raw file):
tokens = list(self._tokens) if chapters is not None: tokens = self._get_incremental_draft_tokens(tokens, chapters)
I think we can do something similar, but before we parse instead of after. Instead of calling parse_usfm in update_usfm, we can do something like this:
tokenizer = UsfmTokenizer(self._settings.stylesheet)
tokens = tokenizer.tokenize(usfm)
tokens = filter_tokens_by_chapter(tokens, chapters)
parser = UsfmParser(tokens, handler, self._settings.stylesheet, self._settings.versification)
parser.process_tokens()This would avoid updating the whole book.
|
Previously, ddaspit (Damien Daspit) wrote…
I updated it accordingly, how does it look now? If we change parse_usfm to accept a Sequence[UsfmToken] as well as str, we could call it here and let it instantiate the parser and process the tokens to avoid some code duplication. But not sure if there's a reason parse_usfm only accepts str. |
Enkidu93
left a comment
There was a problem hiding this comment.
@Enkidu93 reviewed 4 files and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on ddaspit and mshannon-sil).
machine/corpora/paratext_project_text_updater_base.py line 97 at r2 (raw file):
in_id_marker = False elif token.type == UsfmTokenType.CHAPTER: if token.data and int(token.data) in chapters:
I think this may be safe now after some recent changes, but you may want to double check what happens with bad chapter references like \c 1. if you haven't already/isn't already covered by tests.
machine/corpora/update_usfm_parser_handler.py line 348 at r2 (raw file):
remark_tokens.append(UsfmToken(UsfmTokenType.TEXT, text=remark)) if len(tokens) > 0: for index, token in enumerate(tokens):
Don't we want to preserve the ability to add book-level remarks? Am I reading this correctly that we'd no longer be able to do so with this change? Peter has also put in a PR related to the per-chapter remarks; are you guys coordinating on this sillsdev/machine#408? I think if we did something like he has there, we would have a bit more flexibility.
|
Previously, Enkidu93 (Eli C. Lowry) wrote…
I was unaware that Peter was making any changes. As far as I'm aware we want it working in SILNLP first (through machine.py) before changes to machine. What I remember from the NT meeting where we discussed remarks was that the team wanted to move all the remarks to the chapter level once we realized that \rem is valid in chapters, so we would no longer have book remarks. I can throw the question to the wider group to see what we think. |
ddaspit
left a comment
There was a problem hiding this comment.
@ddaspit reviewed 4 files and all commit messages, and made 3 comments.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on Enkidu93 and mshannon-sil).
machine/corpora/update_usfm_parser_handler.py line 345 at r1 (raw file):
Previously, mshannon-sil wrote…
I updated it accordingly, how does it look now?
If we change parse_usfm to accept a Sequence[UsfmToken] as well as str, we could call it here and let it instantiate the parser and process the tokens to avoid some code duplication. But not sure if there's a reason parse_usfm only accepts str.
I think it makes sense to update parse_usfm to accept tokens as well.
machine/corpora/update_usfm_parser_handler.py line 348 at r2 (raw file):
Previously, mshannon-sil wrote…
I was unaware that Peter was making any changes. As far as I'm aware we want it working in SILNLP first (through machine.py) before changes to machine.
What I remember from the NT meeting where we discussed remarks was that the team wanted to move all the remarks to the chapter level once we realized that \rem is valid in chapters, so we would no longer have book remarks. I can throw the question to the wider group to see what we think.
The philosophy for implementing features in the Machine library is a bit different than our other applications. For applications, we only build what we need. Because Machine is a shared library, we try to keep any features we add generally useful (within reason). In this case, I think it makes sense to include the ability to add book-level remarks even if we aren't planning on using that specific feature right now.
machine/corpora/__init__.py line 30 at r2 (raw file):
from .paratext_project_settings_parser_base import ParatextProjectSettingsParserBase from .paratext_project_terms_parser_base import KeyTerm, ParatextProjectTermsParserBase from .paratext_project_text_updater_base import ParatextProjectTextUpdaterBase, filter_tokens_by_chapter
This seems unnecessary.
This PR addresses issue #284.
Mostly looking for high-level feedback about the approach at the moment. As we were discussing, is the right place for this functionality in the
get_usfm()method as essentially a post-processing step? Or should we look to implement this feature inprocess_tokens()(and maybe move the remark logic here as well)?Some initial thoughts:
Pros for putting it in
get_usfm():process_token().Pros for putting it in
process_token():This change is