Difference between revisions of "WFC segmentation rules Wordfast Classic"

Latest revision as of 03:41, 4 November 2017

The largest possible unit of segmentation with Wordfast, as with most translation tools, is the paragraph. Paragraphs end with a paragraph mark (ANSI 13, with or without line feed ANSI 10), a page feed (ANSI 12), or an end of cell (ANSI 7). Note that the manual line feed (ANSI 11) does not end a paragraph. Nevertheless, Wordfast can be set up to consider the manual line feed as ending a segment: see the section on customizing ESPs, or the note further below.
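The paragraph boundaries described above can be sketched in a few lines of Python. This is only an illustration of the rule, not Wordfast's own code; the names `PARAGRAPH_END` and `split_paragraphs` are hypothetical.

```python
import re

# Paragraph terminators named in the text: ANSI 13 optionally followed by
# ANSI 10, page feed (ANSI 12), end of cell (ANSI 7).
# The manual line feed (ANSI 11) is deliberately NOT in the pattern.
PARAGRAPH_END = re.compile(r"\r\n?|\x0c|\x07")


def split_paragraphs(text):
    """Return the non-empty paragraphs of text (hypothetical helper)."""
    return [p for p in PARAGRAPH_END.split(text) if p]


if __name__ == "__main__":
    sample = "One\r\nTwo\x0bstill two\x0cThree\x07Four"
    for paragraph in split_paragraphs(sample):
        print(repr(paragraph))
```

In this sketch the manual line feed stays inside its paragraph, so "Two" and "still two" remain together.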

Wordfast attempts to recognize individual segments within a paragraph by parsing the paragraph and looking for End of Segment Punctuations (ESPs). The default ESPs used by Wordfast are . : ! ? as well as the tabulator mark, noted ^t by Wordfast, and the manual line feed, noted ^l . Users can edit the list of ESPs to fine-tune segmentation, although that is not recommended, as it breaks compatibility between their TM and most other TMs.

If all ESPs are deleted, Wordfast segments at the whole paragraph level. This is not recommended, as some paragraphs may exceed the acceptable segment limit of 8,000 characters (nearly two large pages!) imposed by Wordfast, although segments of that size are very rare. If a segment is larger than 8,000 characters, Wordfast ignores the extra characters, which can be segmented with the "ForceSegment" shortcut.

To remain compatible with most other tools, Wordfast does not consider the manual line feed (noted ^l ) as ending a segment. Users can add ^l to the user-defined list of ESPs in Wordfast to break segments when a manual line feed (ANSI code 11, decimal) is encountered, which is generally considered more logical. However, by default, Wordfast does not end a segment at a manual line feed.

Within a paragraph, Wordfast will consider that it has reached the end of a segment if:

  1. the said segment ends with an ESP, AND
  2. a space immediately follows the ESP, AND
  3. the letter following that space is a capital letter, AND
  4. the character immediately before the ESP is not a number.

Rules 2, 3, 4 can be disabled by the user in the Wordfast > Setup > Segments pane. With CJK languages, rule 2 is always disabled, and the "wide-character" equivalent punctuations are also used.

The following sequence                 Example                        Produces
full stop, space, uppercase            Hello world. Hello world.      2 segments
full stop, space, lowercase            Hello world. hello world.      1 segment
full stop, space, number               Hello world. 10 Hello world.   1 segment
full stop, no space, upper/lowercase   Hello world.Hello world.       1 segment
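Rules 1 to 4 and the examples above can be approximated with a short Python sketch. It is an illustration under stated assumptions, not Wordfast's implementation: the function name and its keyword flags (standing in for the options of the Setup > Segments pane) are hypothetical, only the four punctuation ESPs are handled, and the tabulator and manual line feed marks are left out because the text does not say how rules 2 to 4 apply to them.

```python
DEFAULT_ESPS = ".:!?"  # punctuation ESPs only; ^t and ^l are omitted here


def split_segments(paragraph, esps=DEFAULT_ESPS,
                   need_space=True, need_capital=True, no_digit_before=True):
    """Split one paragraph into segments following rules 1-4 (sketch)."""
    segments = []
    start = 0
    n = len(paragraph)
    for i, ch in enumerate(paragraph):
        if ch not in esps:                       # rule 1: must be an ESP
            continue
        pos = i + 1                              # first character after the ESP
        has_space = pos < n and paragraph[pos] == " "
        if need_space and not has_space:         # rule 2: space after the ESP
            continue
        if has_space:
            pos += 1
        if need_capital and not (pos < n and paragraph[pos].isupper()):
            continue                             # rule 3: capital after space
        if no_digit_before and i > 0 and paragraph[i - 1].isdigit():
            continue                             # rule 4: no number before ESP
        segments.append(paragraph[start:i + 1])
        start = pos
    tail = paragraph[start:]
    if tail.strip():                             # whatever is left is a segment
        segments.append(tail)
    return segments


if __name__ == "__main__":
    for text in ("Hello world. Hello world.",
                 "Hello world. hello world.",
                 "Hello world. 10 Hello world.",
                 "Hello world.Hello world."):
        print(len(split_segments(text)), split_segments(text))
```

Passing `need_space=False`, `need_capital=False` or `no_digit_before=False` mimics disabling rule 2, 3 or 4 respectively.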

Rule concerning the beginning of a segment

If a segment begins with a series of numbers (or a combination of numbers and full stops) followed by a full stop, Wordfast assumes that it is a numbering scheme and skips it. With the following text:

10. This is text

the segment will begin with "This is text", skipping the initial "10.". If the initial number is actually part of the segment, translators can press Alt+Delete (Unsegment), then select the entire sentence and press Shift+Alt+Down (ForceSegment). Translators can also set Wordfast to always override the number-skipping behaviour with the "SegmentAll" command in Pandora's Box.
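The number-skipping behaviour can be illustrated with a regular expression. This is a minimal sketch, assuming the numbering must be followed by whitespace (so that a decimal such as "3.5" is left alone); that assumption, the helper name and the `segment_all` flag (a stand-in for the "SegmentAll" command) are not taken from Wordfast itself.

```python
import re

# Digits, optionally grouped with full stops ("10", "1.2.3"), then a final
# full stop that must be followed by whitespace or the end of the text.
LEADING_NUMBERING = re.compile(r"^\s*\d+(?:\.\d+)*\.(?=\s|$)\s*")


def skip_numbering(segment, segment_all=False):
    """Drop an apparent leading numbering scheme unless segment_all is set."""
    if segment_all:
        return segment
    return LEADING_NUMBERING.sub("", segment)


if __name__ == "__main__":
    print(skip_numbering("10. This is text"))
    print(skip_numbering("10. This is text", segment_all=True))
```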

Parts of text not considered as segments.

Isolated series/combinations of numbers, spaces, and punctuation do not constitute a segment. For example,

100

100.89.67.90

100 (9078) // 67-56

will be skipped by Wordfast as being "numbers". But

100a

100.89.67ö90

100 (9078) // 67-56 é

will all be segmented, because at least one letter is present in each series of numbers/punctuations. The "SegmentAll" command in Pandora's Box will force Wordfast to segment isolated series of numbers/spaces/punctuation at all times.
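The test described here amounts to asking whether the text contains at least one letter. A minimal Python sketch follows; the function name and the `segment_all` flag (standing in for the "SegmentAll" command) are hypothetical.

```python
def is_segment(text, segment_all=False):
    """Return True if text would be treated as a segment (sketch)."""
    if segment_all:
        # With SegmentAll, anything that is not blank is segmented.
        return bool(text.strip())
    # str.isalpha() is Unicode-aware, so accented letters count as letters.
    return any(ch.isalpha() for ch in text)


if __name__ == "__main__":
    for sample in ("100", "100.89.67.90", "100 (9078) // 67-56",
                   "100a", "100.89.67ö90", "100 (9078) // 67-56 é"):
        print(repr(sample), is_segment(sample))
```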

Abbreviation

Users can specify a list of abbreviations in WFC > Setup > Segments. WFC will not end a segment if its last series of characters matches any of the abbreviations, case-sensitive. For example, if "Pr." is listed in the user-specified abbreviations, which is the case by default, the following sentence will be considered as making up a whole segment...

Here is Pr. Johnson.

... although "Pr." is followed by a full stop, a space, and a capital letter.
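The abbreviation check can be sketched as a case-sensitive lookup of the last word before a potential break. This is an assumption about how "last series of characters" is delimited (here: by whitespace), and the function name is hypothetical; only "Pr." comes from the text above.

```python
def ends_with_abbreviation(text_so_far, abbreviations):
    """True if the last whitespace-delimited word is a listed abbreviation.

    The comparison is case-sensitive, as described in the text.
    """
    words = text_so_far.split()
    return bool(words) and words[-1] in abbreviations


if __name__ == "__main__":
    abbreviations = {"Pr."}
    # Candidate break after "Pr." in "Here is Pr. Johnson."
    print(ends_with_abbreviation("Here is Pr.", abbreviations))
    print(ends_with_abbreviation("Here is pr.", abbreviations))
```

A segmenter would call such a check before ending a segment at an ESP, and keep going when it returns True.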

There are many translation-time shortcuts and options that let the translator fine-tune segments to expand them, shrink them, or force a selection of text to be considered a whole segment, regardless of rules. However, translators should remember to prefer default segmentation whenever possible, to remain compatible with other TMs.
