Bentley University Fall 2022 CS 230 Introduction to Programming with Python Program 3: text Pattern identification Text pattern identification is a process to find strings with a specific pattern....

Bentley University Fall 2022
CS 230
Introduction to Programming with Python
Program 3: Text Pattern Identification
Text pattern identification is the process of finding strings that match a specific pattern. It is an important part of Natural Language Processing (NLP). Python packages already exist (e.g., re, which implements regular expressions) that can help with this process, but to understand the algorithms behind it, this assignment requires you to develop your own program to perform simple text pattern identification.
Your program will process the text from a .txt file and produce the following results:
1. The number of words in the text file
2. The longest word or words
3. The number of distinct words with the given text patterns
4. Printing the distinct words with the given text patterns using the required format
Program Required Parameters: para_info.txt
The program will run based on parameters specified in a file named para_info.txt. A sample para_info.txt file will be provided to you, but you can modify any parameters in the file or create your own para_info.txt file using a text editor to test your program.
A para_info.txt file has three lines:
1. Line 1: The name of the input file. You’ll find the words to be processed in this file. Programming_History.txt is provided as an example, but you can use other files to test your program.
2. Line 2: A list of word patterns.
There are in total 7 possible patterns, where * represents exactly one character and % represents any number of characters (zero, one, or more). All patterns are case insensitive:
· The * or % may not appear inside a word. For example, program*ing is not allowed.
· Only one of * or % may be used in a pattern. For example, *rogram% is not a valid pattern.
· **GU*** means a word with 7 characters, there are two characters before gu, and three characters after gu. For example, “regular”.
· ***rn means a word with 5 characters ending with rn. For example, “learn”.
· eXpr****** means a word with 10 characters that starts with expr. For example, “expression”.
· Python means find the word python.
This line contains one or more patterns, separated by “|”. Ignore any spaces before or after each pattern. For example,
| *itat*** | pro******* |
3. Line 3: How to display and store the words found. It can be “left”, “right”, or “center”. All options are case insensitive. For example, “Left”, “LEFT”, and “leFt” all mean “left”.
Program Tasks
Your program will process the text file as described below. This is an exercise in using Files for Reading and Writing, Lists, and String manipulation. The program will perform these tasks:
1. Create a function or functions to read all the content in the para_info.txt file.
2. Create a List of all words stored in the input file (removing punctuation and other characters) and count the number of words (including duplications) in the list.
3. Find the length of the longest word and concatenate the longest word(s) (if there is more than one of the same size) into a String.
4. For each pattern in the pattern_list, find out the words matching that pattern.
· Create functions to detect whether a given word can match a pattern that uses *.
· Create a function to format the pattern-matched words.
5. Display the statistics and pattern-matched words with the specified format.
Create a function to read all the content in the para_info.txt file
Create a function read_para(para_file) with argument para_file to read the lines in the parameter file. Return input_file_name, pattern_list, and alignment stored in each line of the file.
· You can assume all parameters in the para_file are correctly spelled and organized.
· pattern_list is a list and all other returned values are strings.
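The description above can be sketched as follows. This is a minimal sketch, not the official solution: it assumes the file holds exactly the three lines described, with no blank lines before them, and it leaves case handling of the alignment value to the formatting step.

```python
def read_para(para_file):
    """Return (input_file_name, pattern_list, alignment) from the parameter file.

    Assumption: the file has exactly the three lines described in the handout.
    """
    with open(para_file, encoding="utf-8") as f:
        lines = [line.strip() for line in f.readlines()]

    input_file_name = lines[0]
    # Split Line 2 on "|", trim spaces, and drop the empty pieces that the
    # leading and trailing bars leave behind.
    pattern_list = [p.strip() for p in lines[1].split("|") if p.strip()]
    alignment = lines[2]
    return input_file_name, pattern_list, alignment
```

For a file whose second line is `| *itat*** | pro******* |`, this sketch yields the pattern list `['*itat***', 'pro*******']`.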
Create a list of words and count the number of words
Process the text in the file to create a list of words in the file.
· Remove all punctuation. To find all the punctuation marks, use import string and then assign a constant PUNCTUATION = string.punctuation
· Convert all the words to lowercase.
· Create the word list with all words (including duplications).
· Count the number of words in the word list
For example, for the text
“The history of programming languages spans from documentation of early mechanical computers to modern tools for software development. Early programming languages were highly specialized, relying on mathematical notation and similarly obscure syntax.”
You’ll get a word list of 32 words.
[‘the’, ‘history’, ‘of’, ‘programming’, ‘languages’, ‘spans’, ‘from’, ‘documentation’, ‘of’, ‘early’, ‘mechanical’, ‘computers’, ‘to’, ‘modern’, ‘tools’, ‘for’, ‘software’, ‘development’, ‘early’, ‘programming’, ‘languages’, ‘were’, ‘highly’, ‘specialized’, ‘relying’, ‘on’, ‘mathematical’, ‘notation’, ‘and’, ‘similarly’, ‘obscure’, ‘syntax’]
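Hint: if you call the split method with an explicit separator, for example split(" "), and the string contains consecutive spaces, the result will include an empty string between the spaces, and you’ll need to remove it from your word list. Calling split() with no argument splits on runs of whitespace and does not produce empty strings.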
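One possible way to build the word list is sketched below. The function name make_word_list is hypothetical (the handout does not name this step), and the sketch simply deletes punctuation characters, so a hyphenated word such as "well-known" would become "wellknown".

```python
import string

PUNCTUATION = string.punctuation


def make_word_list(text):
    """Return the lowercase words of text with punctuation removed.

    make_word_list is a hypothetical helper name, not one required by the handout.
    """
    # Drop every punctuation character, then convert to lowercase.
    cleaned = "".join(ch for ch in text if ch not in PUNCTUATION).lower()
    # split() with no argument splits on runs of whitespace,
    # so consecutive spaces do not produce empty strings.
    return cleaned.split()


sample = ("The history of programming languages spans from documentation of "
          "early mechanical computers to modern tools for software development. "
          "Early programming languages were highly specialized, relying on "
          "mathematical notation and similarly obscure syntax.")
words = make_word_list(sample)
print(len(words))  # 32
```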
Find out the length of the longest word
· Create a variable to track the max word length in the list. If a word length is larger than the existing max length, replace the max length with the current word length.
Choose one of the following approaches to get the longest word or words:
· Create an empty string to store all longest words in the word list. For each word in the above word list, if the word has the max length and the word is not part of the string yet, concatenate the word to the string and separate the words with a space. You’ll have a string with all longest word(s).
· Create an empty list to store all longest words in the word list. For each word in the above word list, if the word has the max length and the word is not in the “longest word list” yet, append the word to the list. You’ll have a list with all longest word(s). To show them with the required format, use .join().
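The list-based option can be sketched as follows. The function name longest_words is hypothetical, and returning the length together with the joined string is a design choice of this sketch rather than a requirement of the handout.

```python
def longest_words(word_list):
    """Return (max_word_length, longest words joined by a space).

    longest_words is a hypothetical helper name, not one required by the handout.
    """
    # First pass: track the largest word length seen so far.
    max_word_length = 0
    for word in word_list:
        if len(word) > max_word_length:
            max_word_length = len(word)

    # Second pass: collect each distinct word that has the max length.
    longest = []
    for word in word_list:
        if len(word) == max_word_length and word not in longest:
            longest.append(word)

    return max_word_length, " ".join(longest)


print(longest_words(["aa", "bbb", "ccc", "bbb", "d"]))  # (3, 'bbb ccc')
```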
For each pattern in the pattern_list, find out the words matching that pattern.
· For each pattern in pattern_list, traverse all words in the word list and for each word, detect whether it matches the pattern using the function is_pattern(word, pattern).
· If the word matches the pattern, format the word output using function formatted_output(word, max_word_length, alignment).
· Function is_pattern(word, pattern) takes two arguments – word and pattern
· pattern would be one of the four patterns mentioned above (assume the patterns are all written correctly).
· The function returns True if the word matches the pattern, and False otherwise.
· Function formatted_output(word, max_word_length, alignment) takes three arguments – word, max_word_length, and alignment.
· max_word_length is the length of the longest word you find in step 3. alignment takes values from the third line (Line 3) of the para_info.txt and can be one of the following options
· “left” or “l” - aligning the word to the left;
· “right” or “r” - aligning the word to the right;
· “center” or “c” - to center the word.
All alignment values are case insensitive, so Left/LEFT/leFt all mean left. The default value for alignment is “center”.
· The function should return a formatted string with a bar “|” at the beginning and at the end and the word between them. The total number of characters between the two bars should equal max_word_length. For example, if the longest word is “information”, with length 11,
· formatted_output(“world”, 11) or formatted_output(“world”, 11, “center”) would return (3 spaces before and after “world”):
|   world   |
· formatted_output(“world”, 11, “L”) or formatted_output(“world”, 11, “left”) would return (6 spaces after “world”):
|world      |
· formatted_output(“world”, 11, “riGht”) or formatted_output(“world”, 11, “right”) would return (6 spaces before “world”):
|      world|
· Feel free to define your own functions if it helps with the program.
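The two functions described above might be sketched like this. It is a minimal sketch under stated assumptions: is_pattern handles only literal patterns and patterns that use *, comparing the word and pattern position by position, and does not cover %. formatted_output relies on the built-in str.ljust, str.rjust, and str.center methods, and treats any unrecognized alignment value as “center”, which is an assumption of this sketch.

```python
def is_pattern(word, pattern):
    """Return True if word matches pattern, where * stands for exactly one character.

    Sketch only: patterns that use % are not handled here.
    """
    word = word.lower()
    pattern = pattern.lower()
    # Each * consumes exactly one character, so the lengths must be equal.
    if len(word) != len(pattern):
        return False
    # Every non-* position must hold the same character in both strings.
    for word_ch, pattern_ch in zip(word, pattern):
        if pattern_ch != "*" and pattern_ch != word_ch:
            return False
    return True


def formatted_output(word, max_word_length, alignment="center"):
    """Return word padded to max_word_length and wrapped in bars."""
    alignment = alignment.lower()
    if alignment in ("left", "l"):
        body = word.ljust(max_word_length)
    elif alignment in ("right", "r"):
        body = word.rjust(max_word_length)
    else:
        # "center", "c", and (by assumption) anything unrecognized.
        body = word.center(max_word_length)
    return "|" + body + "|"


print(is_pattern("regular", "**GU***"))        # True
print(is_pattern("program", "**GU***"))        # False
print(formatted_output("world", 11, "riGht"))  # |      world|
```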
Display the statistics and pattern-matched words with the specified format.
Display the Text Analysis Report in the output console, which should include
· A header: “Text Analysis Report” with 20 “*”s before and after
· Total Number of Words in the Text File
· The length of the longest word, e.g.: The longest word has 13 characters.
· A separate line with 50 “=”s.
· For each of the patterns in the pattern_list,
· Display one of the following messages depending on the number of words with the pattern:
· There is no word with the pattern “{pattern}”.
· There is only one word with the pattern “{pattern}”.
· There are {n} distinct words with the pattern “{pattern}”.
· Replace {pattern} with the real patterns in the pattern_list, replace {n} with the real number of distinct words with that pattern. Don’t forget the quotation marks (“”) before and after the pattern.
· For each distinct word with the pattern (if any), display the returned value of formatted_output in one line.
· Add a separate line with 50 “=”s.
· An ending line: “End of Report” with 20 “*”s before and after
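The report layout above could be assembled roughly as follows. This is a sketch under several assumptions: build_report is a hypothetical name, the exact wording of the word-count line and the spaces around the header text are guesses, straight quotation marks are used around each pattern, and the two helper functions are compact stand-ins (handling * only) so the block runs on its own.

```python
def is_pattern(word, pattern):
    """Compact stand-in: * matches exactly one character, case insensitive."""
    return len(word) == len(pattern) and all(
        p == "*" or p == w for w, p in zip(word.lower(), pattern.lower())
    )


def formatted_output(word, max_word_length, alignment="center"):
    """Compact stand-in: pad word to max_word_length between two bars."""
    pad = {"left": str.ljust, "l": str.ljust, "right": str.rjust, "r": str.rjust}
    return "|" + pad.get(alignment.lower(), str.center)(word, max_word_length) + "|"


def build_report(word_list, pattern_list, alignment="center"):
    """Return the report as a list of lines (hypothetical helper)."""
    max_word_length = max((len(w) for w in word_list), default=0)
    lines = [
        "*" * 20 + " Text Analysis Report " + "*" * 20,
        f"Total Number of Words in the Text File: {len(word_list)}",
        f"The longest word has {max_word_length} characters.",
        "=" * 50,
    ]
    for pattern in pattern_list:
        # Collect distinct matches in order of first appearance.
        matches = []
        for word in word_list:
            if is_pattern(word, pattern) and word not in matches:
                matches.append(word)
        if not matches:
            lines.append(f'There is no word with the pattern "{pattern}".')
        elif len(matches) == 1:
            lines.append(f'There is only one word with the pattern "{pattern}".')
        else:
            lines.append(
                f'There are {len(matches)} distinct words with the pattern "{pattern}".'
            )
        for word in matches:
            lines.append(formatted_output(word, max_word_length, alignment))
        lines.append("=" * 50)
    lines.append("*" * 20 + " End of Report " + "*" * 20)
    return lines


words = ["early", "learn", "early", "python", "regular"]
for line in build_report(words, ["***rn", "*****", "zz*"], "left"):
    print(line)
```

Returning the lines as a list instead of printing them directly keeps the loop over pattern_list free of hard-coded output and makes the report easy to check.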
Requirements
1. Include a docstring at the beginning of the program containing your name and a brief description of the program.
2. Try different pattern_list values to make sure your code works for any combination of patterns.
3. You may use the Program3_starter.py file as a starting point, but it’s not required.
Requirements and Rubric
Each numbered criterion is followed by its items, with points in parentheses.
1. Function read_para(para_file)
· Correctly read each line in the file (6 points: 2*3)
· Correct return values (2 points)
2. Statistics
· Create word list (2 points)
· Handle punctuation (2 points)
· Find out the number of words in the word list (1 point)
· Find out the length of the longest word (2 points)
· Show the string with all longest words (3 points)
3. Function is_pattern(word, pattern)
· Correctly match each of the four patterns (10 points)
4. Function formatted_output
· Handle default value (2 points)
· Return the format correctly (3 points)
5. Display
· Show statistics in the required format (2 points)
· Show all pattern results using a for loop (do not hard code the output) (4 points)
· For each pattern
· Show the corresponding message with the correct number of distinct words (4 points)
· Show the distinct words with the correct format (4 points)
6. Coding style, including docstring, comments, variable names, etc. (3 points)
Total: 50 points
Grading
This assignment will be worth ten (10) percent of your final grade.
Your program

program3fall2022v3-2-1-zdwnnvvf.docx

Answered 1 days After Nov 13, 2022

Solution

Nidhi answered on Nov 15 2022
