Free Open-Source Artificial Intelligence @lemmy.world hok @lemmy.dbzer0.com 1y ago

Is there anything that makes training a translation task easy?

I have thousands of side-by-side translations between two computer languages (lower level to higher level), and I would like to train a model that can translate new data with high accuracy.

Got any suggestions on what to do? I don't think I want to fine-tune a ChatGPT-style model, since I think the task is more structured than that. Also, I consider myself technically competent, but I would probably fail at designing my own model and pipeline.

4 comments

Try looking into OpenNMT; I used it for a similar task.

https://opennmt.net
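For anyone following along: sequence-to-sequence toolkits of this kind are usually trained on two line-aligned text files, where line N of the source file is translated by line N of the target file. Here is a minimal sketch of getting side-by-side pairs into that shape. The `<nl>` marker and the function names are my own assumptions for illustration, not anything the toolkit requires, so check the quickstart for the exact file names and config it expects.

```python
def flatten(snippet):
    """Put a multi-line snippet on one line, marking line breaks with <nl>.

    The <nl> marker is an illustrative assumption: any token that never
    occurs in your code would do.
    """
    lines = snippet.strip().splitlines()
    # Collapse whitespace runs inside each line, then join lines with the marker.
    return " <nl> ".join(" ".join(line.split()) for line in lines)


def write_parallel(pairs, src_path, tgt_path):
    """Write (lower_level, higher_level) pairs to two line-aligned files."""
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for low, high in pairs:
            # One example per line in each file, in the same order.
            src.write(flatten(low) + "\n")
            tgt.write(flatten(high) + "\n")
```

Calling `write_parallel(pairs, "src-train.txt", "tgt-train.txt")` on a list of string pairs would give you the training files; hold back a few hundred pairs and write them to separate validation files the same way.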
- Thanks, the quickstart guide was straightforward to follow. Do you have any suggestions on how to split code into tokens, if that is needed at all? For example, on a test run, I found that the model could not reproduce unique constants correctly, even though the test data consisted only of obvious "a to b" relationships.
  
  If you’re working with a well-known language, then you can probably use NLTK to tokenize your words. Word2vec is also helpful if you want a word embedding approach. https://github.com/nltk/nltk
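
One caveat on the constants problem: if every distinct constant is a single vocabulary entry, a constant that never appeared in training cannot be produced at all. A common workaround is to split literals into characters so the model can spell out new values. Below is a rough regex-based sketch of that idea in plain Python. The token pattern and the `@@` joiner convention are assumptions chosen for illustration, and the pattern would need adapting to the actual syntax of your two languages.

```python
import re

# Order matters: try hex literals before plain decimals and identifiers.
TOKEN_RE = re.compile(r"""
    0[xX][0-9a-fA-F]+     # hex literal such as 0x1F
  | \d+                   # decimal literal such as 42
  | [A-Za-z_]\w*          # identifier or keyword
  | \S                    # any other single non-space character
""", re.VERBOSE)

JOINER = "@@"  # marks "this piece continues into the next token"


def tokenize(code):
    """Split code into tokens, spelling numeric literals out per character."""
    tokens = []
    for match in TOKEN_RE.finditer(code):
        tok = match.group(0)
        if tok[0].isdigit():
            # 0x1F -> 0@@ x@@ 1@@ F, so unseen constants stay representable.
            tokens.extend(ch + JOINER for ch in tok[:-1])
            tokens.append(tok[-1])
        else:
            tokens.append(tok)
    return tokens


def detokenize(tokens):
    """Undo the literal splitting; tokens stay separated by single spaces."""
    return " ".join(tokens).replace(JOINER + " ", "")
```

With this, `tokenize("mov r1, 0x1F")` gives `['mov', 'r1', ',', '0@@', 'x@@', '1@@', 'F']`, and `detokenize` turns the model output back into `mov r1 , 0x1F`. Original spacing is not preserved, so this only suits languages where whitespace inside a line is insignificant. A learned subword model (BPE or similar) is the more general version of the same idea.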