Introduction
One of the very nice features of a CAT product is the ability to pretranslate
the text and find the most frequent segments in it. To avoid inconsistent
translations, one can export and translate these segments before actually
translating the documents themselves. This speeds up the translation process
and ensures that frequent segments are translated in a consistent manner.
How about something similar for term bases? Adding to them while working
through the document is always possible, but it is rather distracting. To do
a proper job one has to concentrate on the word alone, maybe check a
dictionary or two, ask friends, and that usually means one enters some
placeholder at those places and then forgets to tackle them later.
Here's a simple method to avoid this. It involves a text processing
program – for example Microsoft Word – and some functionality from Microsoft
Excel, specifically its pivot table.
Cutting up the source into single words
What we are ultimately looking for is a list (and, as a consequence, a
dictionary) of the words present in the text to be processed.
Attention: it always pays to make a copy of what you are working on.
The first step is relatively simple: to break the text into single words,
one replaces all blanks and tabs with carriage return/line feeds. In Word this
is achieved by replacing a blank with ^p. You would do the same replacement for
other kinds of separators, like tabs, commas, colons etc. After the global
replace your original text should have turned into lines consisting of single
words, separated just by carriage returns.
Let us take the first paragraph from A Tale of Two Cities:
It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,
it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair,
we had everything before us, we had nothing before us,
we were all going direct to Heaven, we were all going direct
the other way--in short, the period was so far like the present
period, that some of its noisiest authorities insisted on its
being received, for good or for evil, in the superlative degree
of comparison only.
After the suggested change to one word per line, the text
becomes (sparing you most of the 119 words):
It
was
the
best
……
of
comparison
only.
There may be some exotic single cases like the combination
"way--in" above, which was missing the blanks in my Gutenberg version of
Dickens' text. As the rule of this game is "Do less well", don't
bother.
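For readers who prefer a script to search-and-replace, the same cutting-up step can be sketched in Python. This is only an illustration under assumptions: the function name `split_into_words` is made up here, and the separator set (whitespace, commas, colons, semicolons) is one possible choice. Full stops are left attached, just as "only." is in the list above.

```python
import re

def split_into_words(text: str) -> list[str]:
    """Split text into words at blanks, tabs, line breaks, commas, colons and semicolons."""
    # A run of separators (e.g. a comma followed by a blank) counts as one break;
    # empty strings left over at the start or end are dropped.
    return [word for word in re.split(r"[\s,;:]+", text) if word]

sample = "It was the best of times, it was the worst of times,"
words = split_into_words(sample)

# One word per line, the same shape as the Word result.
print("\n".join(words))
```

Combinations such as "way--in" stay glued together here as well, which is in keeping with not bothering about exotic cases.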
Now copy this text to the clipboard (Ctrl+A and Ctrl+C) and start Excel.
Creating vocabulary and word frequencies
According to Vicipedia, a vocabulary "est verba et translationem verborum in
linguas alias docens", which means it tells you about words and their
translations into other languages. We are not that far yet, as we need the
words first, and here's where Excel comes in handy: it will reduce all the
repeated words to single occurrences and, on top of that, show us how common
they are.
To get this list, you will need the services of a pivot table. I will
assume you have some experience with them, so I hope the following description
is sufficient, if not even superfluous.
With Excel open and our one-word-per-line text copied to the clipboard:
- select one of the worksheets, make sure it is empty, and enter "words" into A1
- activate cell A2
- press Ctrl+V to paste in the text you copied to the clipboard before
- select the complete A column
- in the Data menu, select the pivot table entry
- press "next" once in the first window
- press "next" to confirm the A column as the selected data
- press "Layout"; you should now see the layout of the pivot table and, somewhere at 2 o'clock, a "words" rectangle
- drag the "words" rectangle to "row"
- drag it to "data" as well; it changes to "count of words"
- press "OK" and then "finish" in the next window
A new worksheet appears, containing the distinct words from
your text and their frequencies, i.e. how often they occur in the
original text. In the case of Charles Dickens' A Tale of Two Cities, the top
of this list looks like this:
Count of words
words | result
so | 1
age | 2
all | 2
authorities | 1
before | 2
being | 1
belief | 1
best | 1
Excel found 57 different words in the text, so evidently some of
them turn up more than once. Ordering the pivot table by "result" (by
copying its contents and sorting the copy in decreasing order of "result")
shows the following:
the | 14
of | 12
was | 11
It | 10
we | 4
which is what one would expect, and these are words that do not need to go
into a term base: typing "es" outright in German, for instance, is of course
faster than using the term base to look up "it".
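The counting and sorting that the pivot table does can also be sketched in a few lines of Python. The function name `word_frequencies` is hypothetical, and lower-casing the words is an assumption made here so that "It" and "it" are counted together, which appears to be what the table above does.

```python
from collections import Counter

def word_frequencies(words: list[str]) -> list[tuple[str, int]]:
    """Return (word, count) pairs, most frequent first."""
    # Assumption: case is ignored, so "It" and "it" count as the same word.
    counts = Counter(word.lower() for word in words)
    # Sort by decreasing count; ties are ordered alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

words = "It was the best of times it was the worst of times".split()
for word, count in word_frequencies(words):
    print(f"{word} | {count}")
```

The input is the one-word-per-line list from the first step; the output has the same "word | count" shape as the sorted table.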
Harvesting – a real case
Here's a real example: 1500+ words of an MSDS text, with
the usual suspects at the top:
and | 47
to | 46
the | 39
in | 38
of | 33
be | 33
with | 31
or | 20
… and then, here and there, some words we will be pleased to add to our
term base:
water | 17
material | 11
reaction | 9
diisocyanate | 8
respiratory | 7
isocyanate | 6
heat | 6
carbon | 6
avoid | 5
polyol | 5
dioxide | 5
container | 5
pressure | 5
Conclusion
One can of course build a whole machinery around this simple solution,
adding for instance:
i) exclusion tables: "ignore the words and, it, the…"
ii) exclusion rules: "ignore words shorter than…"
iii) an automatic search for translations
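As a rough idea of what the first two additions could look like, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function name `filter_candidates`, the three-word exclusion table and the length threshold of 4 are invented here; only the sample counts are taken from the MSDS tables above.

```python
def filter_candidates(frequencies, stop_words, min_length=4):
    """Keep only words that are neither excluded nor shorter than min_length."""
    return [
        (word, count)
        for word, count in frequencies
        # Exclusion table: skip listed words. Exclusion rule: skip short words.
        if word.lower() not in stop_words and len(word) >= min_length
    ]

# Hypothetical exclusion table; a real one would be much longer.
STOP_WORDS = {"and", "it", "the"}

frequencies = [("and", 47), ("the", 39), ("or", 20), ("water", 17), ("material", 11)]
candidates = filter_candidates(frequencies, STOP_WORDS)

for word, count in candidates:
    print(f"{word} | {count}")
```

With these sample values only "water" and "material" remain; a short word such as "or" is removed by the length rule without having to be listed in the exclusion table.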
However, just by taking care of the above table (with "water",
"material" etc.) we are 95 pretranslated occurrences richer.
Not bad for a 10-minute job.