What is the best way to optimise a program that makes a lot of array and file accesses?

I’m currently developing a Java program for my AI module at university. The goal of the program is spam detection via neural networks. So far I have written the part that extracts input data from a large folder of spam and ham emails stored as .txt files.

Currently what happens is this:

  1. The files are accessed and the text contents of each file are extracted and placed into an ArrayList; each string element in the ArrayList corresponds to one email.
  2. Each ArrayList element is passed in turn to a second class, which removes junk such as special characters. Once a string has been ‘purged’ of junk, the cleaned string overwrites the old string in the ArrayList.
  3. The occurrences of each unique word in that element are then counted. Each unique word and its total number of occurrences are used to create a custom object designed to hold these two values.
  4. This is repeated for every email string in the ArrayList: junk characters are removed, the words are counted, and the custom WordOccurence objects are updated with their new tallies.
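
To make the steps above concrete, here is a simplified sketch of that flow. It is not my actual code: the class and method names are hypothetical, the "junk" rule (anything that is not a letter, digit, or whitespace) is an assumption, and a plain `HashMap<String, Integer>` stands in for the custom WordOccurence objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: names and the junk-character rule are illustrative only.
class EmailWordCounter {

    // Step 2: lower-case the text and replace every character that is not
    // a letter, digit, or whitespace with a space.
    static String purge(String email) {
        return email.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9\\s]", " ");
    }

    // Step 3: add the words of one cleaned email to the running tallies.
    // Assumption: a map keyed by word replaces the WordOccurence objects.
    static void countWords(String cleaned, Map<String, Integer> tallies) {
        for (String word : cleaned.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tallies.merge(word, 1, Integer::sum);
            }
        }
    }

    // Step 4: repeat the purge and count for every email in the list.
    static Map<String, Integer> analyse(List<String> emails) {
        Map<String, Integer> tallies = new HashMap<>();
        for (int i = 0; i < emails.size(); i++) {
            String cleaned = purge(emails.get(i));
            emails.set(i, cleaned); // overwrite the original string, as in step 2
            countWords(cleaned, tallies);
        }
        return tallies;
    }

    public static void main(String[] args) {
        List<String> emails = new ArrayList<>(Arrays.asList(
                "Win a FREE prize!!! Free entry.",
                "Meeting at 10am, see you there."));
        Map<String, Integer> tallies = analyse(emails);
        System.out.println("free -> " + tallies.get("free"));
    }
}
```

With a map keyed by word, updating a tally is a single lookup rather than a scan over every word seen so far, which is why the sketch uses it; whether that matches the real program is an open question.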

At the moment the program is very slow: it takes around 15 minutes to analyse a folder of 2000 emails. The run time does not seem to grow linearly, though; the program appears to slow down as more emails are analysed and stored. At first I thought the slow execution was due to accessing each file in turn, so I changed the code to read every email’s data into an ArrayList first and then work from the ArrayList instead of the files. However, this had little impact on the performance of the program.
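
One way to check whether the slowdown really grows with the number of emails already processed would be to time fixed-size batches and compare them. The sketch below is only an illustration of that idea: `BatchTimer`, `timeBatches`, and the batch size are hypothetical names and values, and the `work` callback is a placeholder for whatever per-email processing the real program does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: times how long each batch of items takes to process.
class BatchTimer {

    // Runs `work` on every item and returns the elapsed nanoseconds per batch.
    static List<Long> timeBatches(List<String> items, int batchSize, Consumer<String> work) {
        List<Long> elapsedNanos = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            long t0 = System.nanoTime();
            for (String item : items.subList(start, end)) {
                work.accept(item);
            }
            elapsedNanos.add(System.nanoTime() - t0);
        }
        return elapsedNanos;
    }

    public static void main(String[] args) {
        List<String> emails = Arrays.asList("first email", "second email", "third email");
        // Placeholder work: the real program would purge and count words here.
        List<Long> timings = timeBatches(emails, 2, email -> email.split("\\s+"));
        for (int i = 0; i < timings.size(); i++) {
            System.out.println("batch " + i + ": " + timings.get(i) / 1_000_000.0 + " ms");
        }
    }
}
```

If later batches take consistently longer than earlier ones for emails of similar size, that would point at the work done per word against the stored tallies rather than at file access.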

I’m just wondering whether there are any other out-of-the-box optimisations I could make to run this program faster. My goal is to analyse a folder of 17000 emails, but as it stands this will take a significant amount of time.

I don’t want to just paste the code for my whole program onto GitHub and give you guys a link to search through, as that would not be fair. I’m just asking for any ideas or suggestions you might have that could improve the performance.