fail on retraining tagger, file rewrites accomplished

(This post was originally going to be me figuring out how to train the tagger in NLTK so that it doesn't tag everything it can't recognize as a noun.)

Monty Python scripts:
http://www.montypython.net/scriptsidx.php

NLTK Website docs:
https://www.nltk.org/book/ch03.html

When I started this project, I had no idea what it entailed.
So for the purpose of this post, let me say: this is how far I got.
I still have no clear picture of whether retraining NLTK's POS tagger (pos_tag; punkt is actually the sentence tokenizer) is useful.
More research to do, but I think I need to give it an entirely different data set to train on. The free one I downloaded looks like it handles written language much better than spoken language.

But if you came for some code, here are two methods I wrote to clean up the Monty Python scripts.

First step: copy and paste the script into a new file.
Second step: save it as a .txt file.
Third step: run these .py files on the script to clean it up a bit, so it can serve as training text for NLTK once I figure that bit out.

 ** Finding a regex to remove the 'ARTHUR:' style words (anything ending in ':') from each line of the text file was a fail, so I wrote something else to accomplish what I needed.
I think I just had some simple piece missing, and that's why I couldn't get it to work.  Anyhoo, I'll come back to it.  Enjoy.
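
For the record, here is a rough sketch of a regex that might do the job. It assumes every speaker label sits at the very start of a line, is written in capitals (digits, spaces, '#', apostrophes, dots and hyphens allowed, as in 'SOLDIER #1:'), and ends in a colon. The names SPEAKER_LABEL and strip_speaker_label are made up for this example, and I haven't run it against every script on the site.

```python
import re

# Assumed label shape: optional indent, a capital letter, then more capitals,
# digits, spaces, '#', apostrophes, dots or hyphens, then a colon.
SPEAKER_LABEL = re.compile(r"^[ \t]*[A-Z][A-Z0-9 #'.\-]*:[ \t]*")


def strip_speaker_label(line):
    """Return the line without a leading NAME: label, if it has one."""
    return SPEAKER_LABEL.sub("", line, count=1)


if __name__ == "__main__":
    for sample in ["ARTHUR: Whoa there!\n",
                   "SOLDIER #1: Halt! Who goes there?\n",
                   "Note: mixed-case words before a colon are left alone.\n"]:
        print(strip_speaker_label(sample), end="")
```

Because the pattern is anchored to the start of the line and only allows capitals before the colon, a colon in the middle of ordinary dialogue should be left alone.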

And the code:
#### rm_double_space.py ####


import sys

text_file = sys.argv[1]

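def remove_double_space(somefile):
  # Read every line, then rewrite the file in place without the empty ones.
  with open(somefile, "r+") as f:
    lines = f.readlines()
    f.seek(0)
    for item in lines:
      # A line of length <= 1 is just a newline (or nothing), so skip it.
      if len(item) > 1:
        f.write(item)
    # Cut off whatever is left over from the old, longer contents.
    f.truncate()
  print(f"{somefile} : Empty lines removed.")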

remove_double_space(text_file)

##########


###   rm_colon.py    ###


import sys

### remove the NAME: indicator on lines of script  ###

script = sys.argv[1]

def rm_script_names(arg):
    ## Open file ##
    with open(arg, "r+") as f:
        # get a list of lines from file #
        lines = f.readlines()
        # go back to the start of the file #
        f.seek(0)
        # go through lines in list, and write replacement lines where necessary #
        for sentence in lines:
            # make a list of words in the sentence/line
            alist = sentence.split(" ")
            # keep only the words that are not NAME: labels or all-uppercase
            # (build a new list; removing from a list while looping over it skips words)
            # note: isupper() also drops one-letter words such as "I" and "A"
            kept = [word for word in alist
                    if not word.endswith(':') and not word.isupper()]
            # make newline from altered list
            newline = " ".join(kept)
            # keep the line break if the last word on the line was removed
            if sentence.endswith("\n") and not newline.endswith("\n"):
                newline += "\n"
            # overwrite the old line, with the altered newline
            f.write(newline)
        # cut off the leftover tail of the old, longer contents
        f.truncate()
    #print(f"{arg} file re-written, words with ':' have been removed.")

rm_script_names(script)
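
Both scripts above rewrite the file in place, so a mistake eats the original. As a hedged alternative, here is a small sketch that applies the same two rules (drop empty lines, drop words ending in ':' and all-uppercase words) in one pass and writes to a separate file. The names clean_lines and clean_file are hypothetical, and the same caveat applies: all-uppercase one-letter words like "I" get dropped too.

```python
import sys


def clean_lines(lines):
    """Drop blank lines, and strip NAME: labels and all-caps words from the rest."""
    cleaned = []
    for line in lines:
        if not line.strip():
            continue  # skip empty or whitespace-only lines
        words = [w for w in line.split()
                 if not w.endswith(":") and not w.isupper()]
        if words:
            cleaned.append(" ".join(words) + "\n")
    return cleaned


def clean_file(src, dst):
    """Read src, clean it, and write the result to dst (src is left untouched)."""
    with open(src, "r", encoding="utf-8") as f:
        lines = f.readlines()
    with open(dst, "w", encoding="utf-8") as f:
        f.writelines(clean_lines(lines))


if __name__ == "__main__" and len(sys.argv) == 3:
    # usage: python this_file.py input.txt output.txt
    clean_file(sys.argv[1], sys.argv[2])
```

Keeping the cleaning rules in a function that takes and returns a list of lines makes it easy to try them on a few sample lines before touching a real file.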




