Commit 00c2d0b (1 parent: dd96d7f)

Updated docstrings, added new methods

We updated the docstrings for all of the functions with new and improved descriptions and usage notes. We also added new functions to `replication` and `similarity`.

7 files changed: 133 additions & 17 deletions

duplipy/__init__.py
Lines changed: 3 additions & 3 deletions

```diff
@@ -1,5 +1,5 @@
-import source
+import duplipy
 from .formatting import remove_stopwords, remove_numbers, remove_whitespace, normalize_whitespace, separate_symbols, remove_special_characters, standardize_text, tokenize_text, stem_words, lemmatize_words, pos_tag
-from .replication import replace_word_with_synonym, augment_text_with_synonyms, load_text_file, augment_file_with_synonyms, insert_random_word, delete_random_word, insert_synonym, paraphrase, flip_horizontal, flip_vertical, rotate, random_rotation, resize, crop, random_crop
-from .similarity import edit_distance_score, bleu_score
+from .replication import replace_word_with_synonym, augment_text_with_synonyms, load_text_file, augment_file_with_synonyms, insert_random_word, delete_random_word, insert_synonym, paraphrase, flip_horizontal, flip_vertical, rotate, random_rotation, resize, crop, random_crop, shuffle_words
+from .similarity import edit_distance_score, bleu_score, jaccard_similarity_score
 from .text_analysis import analyze_sentiment
```

duplipy/formatting.py
Lines changed: 33 additions & 0 deletions

```diff
@@ -32,6 +32,9 @@ def remove_stopwords(text):
     """
     Remove stopwords from the input text using NLTK's stopwords.
 
+    Stopwords are frequently used words (e.g., 'the', 'and', 'is') that are often
+    excluded from text processing to focus on more meaningful content.
+
     Parameters:
     - `text` (str): The input text from which stopwords should be removed.
 
@@ -52,6 +55,8 @@ def remove_numbers(text):
     """
     Remove numbers from the input text.
 
+    Numerical digits are removed from the text to focus on the non-numeric content.
+
     Parameters:
     - `text` (str): The input text from which numbers should be removed.
 
@@ -69,6 +74,9 @@ def remove_whitespace(text):
     """
     Remove excess whitespace from the input text.
 
+    Excess whitespace, including leading, trailing, and multiple consecutive spaces,
+    is removed from the text to create a more standardized and readable format.
+
     Parameters:
     - `text` (str): The input text from which excess whitespace should be removed.
 
@@ -86,6 +94,9 @@ def normalize_whitespace(text):
     """
     Normalize multiple whitespaces into a single whitespace in the input text.
 
+    Multiple consecutive whitespaces are replaced with a single whitespace to
+    create a more consistent and readable text format.
+
     Parameters:
     - `text` (str): The input text from which whitespace should be normalized.
 
@@ -103,6 +114,9 @@ def separate_symbols(text):
     """
     Separate symbols and words with a space to ease tokenization.
 
+    Symbols in the input text are separated from words with a space to facilitate
+    easier tokenization and analysis of the text.
+
     Parameters:
     - `text` (str): The input text from which symbols need to be separated.
 
@@ -121,6 +135,9 @@ def remove_special_characters(text):
     """
     Remove special characters from the input text.
 
+    Special characters, such as punctuation and user-defined symbols, are removed
+    to create a text without these non-alphanumeric elements.
+
     Parameters:
     - `text` (str): The input text from which special characters should be removed.
 
@@ -140,6 +157,9 @@ def standardize_text(text):
     """
     Standardize the formatting of the input text.
 
+    The input text is converted to lowercase and leading/trailing whitespaces are removed
+    to create a standardized representation for easier comparison and analysis.
+
     Parameters:
     - `text` (str): The input text which needs to be standardized.
 
@@ -158,6 +178,10 @@ def tokenize_text(text):
     """
     Tokenize the input text into individual words.
 
+    Tokenization is the process of breaking down a text into individual words,
+    facilitating further analysis, such as counting word frequencies or analyzing
+    language patterns.
+
     Parameters:
     - `text` (str): The input text to be tokenized.
 
@@ -171,6 +195,9 @@ def stem_words(words):
     """
     Stem the input words using the Porter stemming algorithm.
 
+    Stemming reduces words to their base or root form, helping to consolidate
+    variations of words and simplify text analysis.
+
     Parameters:
     - `words` (list): A list of words to be stemmed.
 
@@ -185,6 +212,9 @@ def lemmatize_words(words):
     """
     Lemmatize the input words using WordNet lemmatization.
 
+    Lemmatization reduces words to their base or dictionary form, helping to
+    normalize variations and simplify text analysis.
+
     Parameters:
     - `words` (list): A list of words to be lemmatized.
 
@@ -199,6 +229,9 @@ def pos_tag(text):
     """
     Perform part-of-speech (POS) tagging on the input text.
 
+    Part-of-speech tagging assigns a grammatical category (tag) to each word
+    in a text, aiding in syntactic analysis and understanding sentence structure.
+
     Parameters:
     - `text` (str): The input text to be POS tagged.
```
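The full function bodies are not part of this diff, but the behavior the `standardize_text` and `normalize_whitespace` docstrings describe is easy to sketch in plain Python (the helper name below is hypothetical, not DupliPy's actual implementation):

```python
import re

def standardize_text_sketch(text):
    # Collapse runs of whitespace into single spaces, trim the ends,
    # and lowercase -- mirroring what the docstrings above describe.
    return re.sub(r"\s+", " ", text).strip().lower()

print(standardize_text_sketch("  Hello   WORLD  "))  # hello world
```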

duplipy/replication.py
Lines changed: 44 additions & 1 deletion

```diff
@@ -17,13 +17,15 @@
 - `resize(image, size)`: Resize the input image to the specified size.
 - `crop(image, box)`: Crop the input image to the specified rectangular region.
 - `random_crop(image, size)`: Randomly crop a region from the input image.
+- `shuffle_words(text)`: Randomly shuffle the order of words in each sentence.
 """
 
 import random
 import time
 import nltk
 from nltk.corpus import wordnet
 from PIL import Image
+from tqdm import tqdm
 
 nltk.download("wordnet", quiet=True)
 nltk.download("averaged_perceptron_tagger", quiet=True)
@@ -33,6 +35,9 @@ def replace_word_with_synonym(word):
     """
     Replace the given word with a synonym.
 
+    Synonyms are alternative words with similar meanings, and replacing words
+    with synonyms can be used for text augmentation or variation.
+
     Params:
     - `word` (str): The input word to replace with a synonym.
 
@@ -155,6 +160,9 @@ def insert_random_word(text, word):
     """
     Insert a random word into the input text.
 
+    This function randomly inserts a specified word into the input text, creating
+    variations for text augmentation or diversification.
+
     Parameters:
     - `text` (str): The input text for word insertion.
     - `word` (str): The word to be inserted into the text.
@@ -176,6 +184,9 @@ def delete_random_word(text):
     """
     Delete a random word from the input text.
 
+    This function randomly deletes a word from the input text, creating variations
+    for text augmentation or diversity.
+
     Parameters:
     - `text` (str): The input text for word deletion.
 
@@ -197,6 +208,9 @@ def insert_synonym(text, word):
     """
     Insert a synonym of the given word into the input text.
 
+    This function replaces the specified word in the input text with a synonym,
+    introducing variations for text augmentation or diversity.
+
     Parameters:
     - `text` (str): The input text for synonym insertion.
     - `word` (str): The word for which a synonym will be inserted.
@@ -217,6 +231,9 @@ def paraphrase(text):
     """
     Paraphrase the input text.
 
+    This function leverages part-of-speech tagging to identify verbs (VB), nouns (NN),
+    and adjectives (JJ) in the input text, replacing them with synonyms for paraphrasing.
+
     Parameters:
     - `text` (str): The input text to be paraphrased.
 
@@ -327,4 +344,30 @@ def random_crop(image, size):
     upper = random.randint(0, height - size[1])
     right = left + size[0]
     lower = upper + size[1]
-    return crop(image, (left, upper, right, lower))
+    return crop(image, (left, upper, right, lower))
+
+# DupliPy 0.2.0
+
+def shuffle_words(text):
+    """
+    Randomly shuffle the order of words in each sentence.
+
+    This function takes a list of sentences and randomly shuffles the order of words
+    in each sentence, creating variations for text augmentation or diversity.
+
+    Parameters:
+    - `text` (list of str): List of sentences where each sentence's words need to be shuffled.
+
+    Returns:
+    - `list of str`: List of sentences with randomly shuffled words.
+    """
+    # Shuffle the order of words in each sentence
+    shuffled_text = []
+    with tqdm(total=len(text), desc="Shuffling Words") as pbar:
+        for sentence in text:
+            words = sentence.split()
+            shuffled_words = random.sample(words, len(words))
+            shuffled_sentence = ' '.join(shuffled_words)
+            shuffled_text.append(shuffled_sentence)
+            pbar.update(1)
+    return shuffled_text
```
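Since `random.sample(words, len(words))` returns a permutation, the new `shuffle_words` reorders words but never drops or duplicates any. A minimal sketch of that behavior (hypothetical helper name, without the `tqdm` progress bar):

```python
import random

def shuffle_words_sketch(sentences):
    # Each sentence keeps exactly the same words, in a new random order.
    return [" ".join(random.sample(s.split(), len(s.split()))) for s in sentences]

random.seed(0)
shuffled = shuffle_words_sketch(["the quick brown fox"])
# Regardless of the shuffle order, the word multiset is preserved:
print(sorted(shuffled[0].split()))  # ['brown', 'fox', 'quick', 'the']
```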

duplipy/similarity.py
Lines changed: 39 additions & 4 deletions

```diff
@@ -3,6 +3,8 @@
 
 Available functions:
 - `edit_distance_score(text1, text2)`: Calculate the edit distance score between two texts.
+- `bleu_score(reference, candidate)`: Calculate the BLEU score between a reference sentence and a candidate sentence.
+- `jaccard_similarity_score(text1, text2)`: Calculate Jaccard similarity between two texts.
 """
 
 import nltk
@@ -13,12 +15,17 @@ def edit_distance_score(text1, text2):
     """
     Calculate the edit distance score between two texts.
 
+    The edit distance, also known as Levenshtein distance, is a measure of the
+    minimum number of single-character edits (insertions, deletions, or
+    substitutions) required to transform one text into another.
+
     Parameters:
     - `text1` (str): The first text.
     - `text2` (str): The second text.
 
     Returns:
-    - `int`: The edit distance score.
+    - `int`: The edit distance score between the two texts. A lower score
+      indicates greater similarity, with 0 meaning the texts are identical.
     """
     try:
         # Calculate the edit distance
@@ -30,14 +37,17 @@
 
 def bleu_score(reference, candidate):
     """
-    Calculate the BLEU score between a reference sentence and a candidate sentence.
+    Calculate the BLEU (Bilingual Evaluation Understudy) score between a reference sentence and a candidate sentence.
+
+    BLEU is a metric commonly used for evaluating the quality of machine-translated text. It measures the precision of the
+    candidate sentence's n-grams (contiguous sequences of n items) against the reference sentence.
 
     Parameters:
     - `reference` (str): The reference sentence.
     - `candidate` (str): The candidate sentence.
 
     Returns:
-    - `float`: The BLEU score.
+    - `float`: The BLEU score. The score ranges from 0 (no similarity) to 1 (perfect match).
     """
     try:
         # Tokenize the reference and candidate sentences
@@ -49,4 +59,29 @@
         return bleu
     except Exception as e:
         print(f"An error occurred during BLEU score calculation: {str(e)}")
-        return 0.0
+        return 0.0
+
+# DupliPy 0.2.0
+
+def jaccard_similarity_score(text1, text2):
+    """
+    Calculate Jaccard similarity between two texts.
+
+    Jaccard similarity is a measure of similarity between two sets. In the context
+    of text comparison, it calculates the similarity between the sets of words
+    in two texts.
+
+    Parameters:
+    - `text1` (str): The first text for comparison.
+    - `text2` (str): The second text for comparison.
+
+    Returns:
+    - `float`: Jaccard similarity score between the two texts. The score ranges
+      from 0 (no similarity) to 1 (complete similarity).
+    """
+    set1 = set(text1.split())
+    set2 = set(text2.split())
+    intersection = len(set1.intersection(set2))
+    union = len(set1.union(set2))
+    similarity_score = intersection / union if union != 0 else 0
+    return similarity_score
```
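Both metrics documented above are small enough to sketch in plain Python. The versions below (hypothetical names, not DupliPy's code) show a dynamic-programming Levenshtein distance and the word-set Jaccard score the docstrings describe:

```python
def edit_distance_sketch(a, b):
    # Row-by-row dynamic programming: prev[j] is the distance
    # between the prefix of a seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def jaccard_sketch(text1, text2):
    # |intersection| / |union| over the two word sets.
    s1, s2 = set(text1.split()), set(text2.split())
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

print(edit_distance_sketch("kitten", "sitting"))     # 3
# {'the', 'cat'} shared, union of 4 words -> 2/4 = 0.5
print(jaccard_sketch("the cat sat", "the cat ran"))  # 0.5
```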

duplipy/text_analysis.py
Lines changed: 3 additions & 0 deletions

```diff
@@ -14,6 +14,9 @@ def analyze_sentiment(text):
     """
     Analyze the sentiment of the input text using NLTK's SentimentIntensityAnalyzer.
 
+    Sentiment analysis assesses the emotional tone of a text, providing a sentiment
+    score ranging from -1 (negative) to 1 (positive).
+
     Parameters:
     - `text` (str): The input text to be analyzed.
```

readme.md
Lines changed: 8 additions & 7 deletions

````diff
@@ -1,18 +1,15 @@
-# DupliPy 0.1.9
-![Python Version](https://img.shields.io/badge/python-3.11-blue.svg)
+# DupliPy 0.2.0
+![Python Version](https://img.shields.io/badge/python-3.12-blue.svg)
 ![Code Size](https://img.shields.io/github/languages/code-size/infinitode/duplipy)
 ![Downloads](https://pepy.tech/badge/duplipy)
 ![License Compliance](https://img.shields.io/badge/license-compliance-brightgreen.svg)
 ![PyPI Version](https://img.shields.io/pypi/v/duplipy)
 
 An open source Python library for text formatting, augmentation, and similarity calculation tasks in NLP. The package now also includes additional methods for image augmentation.
 
-## Changes to DupliPy
+## Changes to DupliPy 0.2.0
 
-DupliPy now offers support for image augmentation, with functions to rotate, resize and crop images. These are available through:
-```python
-from duplipy.replication import flip_horizontal, flip_vertical, rotate, random_rotation, resize, crop, random_crop
-```
+DupliPy now includes useful method descriptions in docstrings, allowing anyone to quickly see what a method does and why it is used. DupliPy also includes a few extra methods in `replication` and `similarity`, including `shuffle_words()` and `jaccard_similarity_score()`.
 
 ## Installation
 
@@ -32,6 +29,7 @@ DupliPy supports the following Python versions:
 - Python 3.9
 - Python 3.10
 - Python 3.11
+- Python 3.12
 
 Please ensure that you have one of these Python versions installed before using DupliPy. DupliPy may not work as expected on versions of Python lower than those supported.
 
@@ -42,6 +40,9 @@
 - Sentiment Analysis: Find impressions within sentences.
 - Similarity Calculation: Calculate text similarity using various metrics.
 - BLEU Score Calculation: Calculate how well your text-based NLP model performs.
+- Image Augmentation Tasks **(NEW)**
+
+*For full reference documentation, see [DupliPy's official documentation](https://infinitode-docs.gitbook.io/documentation/package-documentation/duplipy-package-documentation).*
 
 ## Usage
 
````

setup.py
Lines changed: 3 additions & 2 deletions

```diff
@@ -2,7 +2,7 @@
 
 setup(
     name='duplipy',
-    version='0.1.9',
+    version='0.2.0',
     author='Infinitode Pty Ltd',
     author_email='infinitode.ltd@gmail.com',
     description='A package for formatting and text replication, with added support for image augmentation.',
@@ -19,7 +19,7 @@
         'pillow',
     ],
     classifiers=[
-        'Development Status :: 3 - Alpha',
+        'Development Status :: 5 - Production/Stable',
         'Intended Audience :: Developers',
         'License :: OSI Approved :: MIT License',
         'Programming Language :: Python :: 3',
@@ -29,6 +29,7 @@
         'Programming Language :: Python :: 3.9',
         'Programming Language :: Python :: 3.10',
         'Programming Language :: Python :: 3.11',
+        'Programming Language :: Python :: 3.12',
     ],
     python_requires='>=3.6',
 )
```
