[AI/NLP] Basic Natural Language Processing Concepts with NLTK (Tokenization, Stopwords, POS tagging, NER, Stemming, Lemmatization)

^__^/ 2024. 4. 25. 10:52

0. What is NLTK?

NLTK (Natural Language Toolkit) is a widely used Python library for natural language processing, popular in both academia and industry.

The main features NLTK offers are listed below, and this post goes through each of them in turn.

Tokenization
Stopwords
POS tagging
NER
Stemming, Lemmatization

1. Tokenization

To process natural language on a computer, an input sentence has to be split into units that can be analyzed. The process of cutting the input text into such chunks is called tokenization, and each chunk is called a token. Tokens are the basic unit of natural language analysis.

The size of a token is up to the user. A unigram (one word), a bigram (two consecutive words), a trigram (three consecutive words), and in general an N-gram (N consecutive words) can all serve as tokens. NLTK mainly supports sentence tokenization and word tokenization.

# Sentence tokenization and word tokenization
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
# nltk.download('punkt')  # one-time download; newer NLTK releases may ask for 'punkt_tab' instead

text = "Strolling through the vibrant city streets, Sarah admired the pinkish-blue sky overhead. Amidst the beauty, she chuckled to herself, remembering the odd advice: 'Never eat cardboard.'"

# Split the text into sentences
tokenized_text = sent_tokenize(text)
print(tokenized_text)
# ['Strolling through the vibrant city streets, Sarah admired the pinkish-blue sky overhead.', "Amidst the beauty, she chuckled to herself, remembering the odd advice: 'Never eat cardboard.'"]

# Split the text into words and punctuation marks
tokenized_word = word_tokenize(text)
print(tokenized_word)
# ['Strolling', 'through', 'the', 'vibrant', 'city', 'streets', ',', 'Sarah', 'admired', 'the', 'pinkish-blue', 'sky', 'overhead', '.', 'Amidst', 'the', 'beauty', ',', 'she', 'chuckled', 'to', 'herself', ',',
# 'remembering', 'the', 'odd', 'advice', ':', "'Never", 'eat', 'cardboard', '.', "'"]
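
As a small illustration of the token sizes mentioned above, the sketch below builds unigrams, bigrams, and trigrams from a list of word tokens in plain Python. The helper name make_ngrams is hypothetical and not part of NLTK; NLTK ships its own equivalent as nltk.util.ngrams.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples (hypothetical helper)."""
    # zip over n shifted views of the token list; zip stops at the shortest view
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["Sarah", "admired", "the", "sky"]
unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print(bigrams)
# [('Sarah', 'admired'), ('admired', 'the'), ('the', 'sky')]
```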

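2. Stopwords

Looking at tokenized text, some tokens show up very frequently without adding much. Words like these, which carry little meaning of their own and get in the way of analyzing the text as a whole, are called stopwords, and they are usually removed during preprocessing. In English, articles such as 'a' and 'the' and forms of the verb be such as 'is' and 'are' are typical stopwords.

NLTK provides a corpus of predefined stopwords, so there is no need to define them one by one; importing the corpus is enough. The code below uses NLTK's English stopwords to remove the stopwords from the tokenized text.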

from nltk.corpus import stopwords
# nltk.download('stopwords')  # one-time download of the stopword corpus

stop_words = set(stopwords.words("english"))
print(len(stop_words))  # 179 here; the count can differ between NLTK versions
print(stop_words)  # prints every English stopword NLTK provides
# {'with', 'after', 'the', 'can', 'did', 'to', 'such', 'his', 'do', 'out', "needn't", 'ain', 'mightn', 'don', "she's", 'needn', "haven't", "you're", 'shan', 'ours', 'hers', 'has', "doesn't", 'their', 'above', 'aren', 'just', 'itself', 'who', 'that', 'we', 'isn', 'a', 'now', 'its', 'before', 'haven', 'each', 'for', "wasn't", ...

# Keep only the tokens that are not stopwords (the comparison is case-sensitive)
filtered_sent = [w for w in tokenized_word if w not in stop_words]
print("Tokenized Sentence:", tokenized_word)
print("Filtered Sentence:", filtered_sent)

# Tokenized Sentence: ['Strolling', 'through', 'the', 'vibrant', 'city', 'streets', ',', 'Sarah', 'admired', 'the', 'pinkish-blue', 'sky', 'overhead', '.', 'Amidst', 'the', 'beauty', ',', 'she', 'chuckled', 'to', 'herself', ',', 'remembering', 'the', 'odd', 'advice', ':', "'Never", 'eat', 'cardboard', '.', "'"]
# Filtered Sentence: ['Strolling', 'vibrant', 'city', 'streets', ',', 'Sarah', 'admired', 'pinkish-blue', 'sky', 'overhead', '.', 'Amidst', 'beauty', ',', 'chuckled', ',', 'remembering', 'odd', 'advice', ':', "'Never", 'eat', 'cardboard', '.', "'"]
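
One detail worth noting about the filtering above: NLTK's English stopwords are all lowercase, so a capitalized token such as 'The' would slip through, and punctuation tokens are kept as well. The sketch below shows one possible way to handle both. The tiny stopword set and the helper name remove_stopwords are assumptions made for illustration, so that the example runs without downloading the NLTK corpus.

```python
import string

# Small illustrative stopword set (an assumption; the real NLTK list is much longer)
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "she", "through", "herself"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop stopwords (case-insensitively) and pure-punctuation tokens."""
    kept = []
    for tok in tokens:
        if tok.lower() in stop_words:
            continue  # stopword, regardless of capitalization
        if all(ch in string.punctuation for ch in tok):
            continue  # token made only of punctuation such as ',' or '.'
        kept.append(tok)
    return kept

tokens = ["The", "sky", "is", "pinkish-blue", ",", "she", "said", "."]
print(remove_stopwords(tokens))
# ['sky', 'pinkish-blue', 'said']
```

Whether punctuation should be dropped depends on the task, so treat that part as optional.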

3. POS (Part-of-Speech) tagging

POS tagging looks at each word in the context of its sentence and assigns it a grammatical label. In other words, the main goal of POS tagging is to map every word to its part of speech.
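
As a rough sketch of what such a mapping looks like in NLTK: the pretrained tagger is available as nltk.pos_tag, which needs a tagger model downloaded first. To stay runnable without any download, the example below instead uses nltk.tag.RegexpTagger with a handful of hand-written toy rules. The rules and the sample tokens are assumptions for illustration, and the tags follow the Penn Treebank tag set.

```python
from nltk.tag import RegexpTagger

# Toy rules, checked in order; the first matching pattern decides the tag.
# These patterns are illustrative assumptions, not a realistic tagger.
patterns = [
    (r"^(the|a|an)$", "DT"),   # determiner
    (r".*ing$", "VBG"),        # gerund / present participle
    (r".*ed$", "VBD"),         # past-tense verb
    (r".*s$", "NNS"),          # plural noun
    (r"^[A-Z].*$", "NNP"),     # capitalized word -> proper noun
    (r".*", "NN"),             # fallback: singular noun
]
tagger = RegexpTagger(patterns)

tokens = ["Sarah", "admired", "the", "city", "streets"]
tagged = tagger.tag(tokens)
print(tagged)
# [('Sarah', 'NNP'), ('admired', 'VBD'), ('the', 'DT'), ('city', 'NN'), ('streets', 'NNS')]
```

A rule-based tagger like this ignores the surrounding words, so for real text the pretrained nltk.pos_tag is the more usual choice.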

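4. NER (Named Entity Recognition)

NER recognizes named entities in a given text, such as people, places, and organizations, and classifies them into predefined categories. Taking the example sentence "Hong Gildong graduated from XX University in Korea in 2026 and then joined Google", NER would recognize the following.

Hong Gildong -> Name
Korea -> Country
XX University -> University
2026 -> Time
Google -> Company

In this way, NER makes it possible to identify the important entities in a text and sort them into categories.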
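
To make the input and output of NER concrete, here is a deliberately simplified sketch that finds entities by dictionary lookup. Real NER systems use trained models rather than a fixed table; the table GAZETTEER and the function find_entities are hypothetical names introduced only for this illustration.

```python
# Hypothetical lookup table mapping known entity strings to categories
GAZETTEER = {
    "Hong Gildong": "Name",
    "Korea": "Country",
    "XX University": "University",
    "2026": "Time",
    "Google": "Company",
}

def find_entities(text, gazetteer=GAZETTEER):
    """Return (entity, category) pairs in order of appearance in the text."""
    found = []
    for entity, category in gazetteer.items():
        position = text.find(entity)
        if position != -1:
            found.append((position, entity, category))
    # Sort by position so the result follows the sentence order
    return [(entity, category) for _, entity, category in sorted(found)]

sentence = ("Hong Gildong graduated from XX University in Korea in 2026 "
            "and then joined Google")
for entity, category in find_entities(sentence):
    print(entity, "->", category)
```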

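Advanced) NER tagging schemes: BIO and BIESO
BIO and BIESO are two of the main tagging schemes used in NER.

1. BIO Scheme: BIO stands for "Begin", "Inside", "Outside". Under this scheme, every token receives one of the following tags.

- B: the beginning of an entity
- I: the inside of an entity
- O: not part of any entity

For example, in the sentence "John Smith was born in America.", "John Smith" and "America" are entities, so tagging the six words with the BIO scheme gives "B-Person I-Person O O O B-Location".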
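
The BIO tags for this example can be produced mechanically from entity spans. The sketch below assumes entities are given as (start, end, label) token ranges with an exclusive end; the function name to_bio and that span format are assumptions made for illustration.

```python
def to_bio(tokens, spans):
    """Tag tokens with BIO labels given (start, end, label) spans (end exclusive)."""
    tags = ["O"] * len(tokens)          # everything is Outside by default
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens of the entity
    return tags

tokens = ["John", "Smith", "was", "born", "in", "America"]
spans = [(0, 2, "Person"), (5, 6, "Location")]
bio_tags = to_bio(tokens, spans)
print(" ".join(bio_tags))
# B-Person I-Person O O O B-Location
```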

2. BIESO Scheme: BIESO stands for "Begin", "Inside", "End", "Singleton", "Outside" and extends the BIO scheme with "End" and "Singleton" (the same scheme is also commonly written BIOES). BIESO is somewhat more useful for marking exactly where an entity starts and where it ends.

- B: the beginning of an entity
- I: the inside of an entity
- E: the last token of an entity (added)
- S: an entity consisting of a single token (added)
- O: not part of any entity

For example, tagging the six words of "John Smith was born in America." with the BIESO scheme gives "B-Person E-Person O O O S-Location".
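
Since BIESO only refines BIO, a BIO sequence can be converted into BIESO by looking one tag ahead. The following is a minimal sketch; the function name bio_to_bieso is hypothetical.

```python
def bio_to_bieso(bio_tags):
    """Convert a BIO tag sequence to BIESO by checking whether the entity continues."""
    bieso = []
    for i, tag in enumerate(bio_tags):
        if tag == "O":
            bieso.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = bio_tags[i + 1] if i + 1 < len(bio_tags) else "O"
        continues = next_tag == f"I-{label}"   # does the same entity go on?
        if prefix == "B":
            bieso.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            bieso.append(f"I-{label}" if continues else f"E-{label}")
    return bieso

bio = ["B-Person", "I-Person", "O", "O", "O", "B-Location"]
bieso_tags = bio_to_bieso(bio)
print(" ".join(bieso_tags))
# B-Person E-Person O O O S-Location
```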

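5. Stemming, Lemmatization

Stemming and lemmatization are two common techniques for reducing the words in a text to a base form (normalization).

Stemming: Stemming normalizes a word by extracting its stem, stripping suffixes and endings to get at the basic form (the result is not guaranteed to be a dictionary word). For example, "running" and "runs" are both stemmed to "run". Stemming is generally fast, but its results can be less accurate than lemmatization.
Lemmatization: Lemmatization converts a word to its dictionary base form (lemma). The lemma is a word that actually exists in the dictionary, and part-of-speech information is taken into account when looking it up. For example, "am", "are", and "is" are all normalized to "be". Lemmatization is more accurate than stemming, at the cost of slower processing.

Of the two, lemmatization is said to be the more commonly used because of its accuracy. That said, for tasks that do not need fine-grained word distinctions, such as sentiment analysis, stemming may be worth considering for its speed.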

# Difference between stemming and lemmatization
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
# nltk.download('wordnet')  # one-time download; the lemmatizer needs the WordNet corpus

lem = WordNetLemmatizer()
stem = PorterStemmer()

word = "flying"
print("Lemmatized Word:", lem.lemmatize(word, "v"))  # "v" tells the lemmatizer the word is a verb
print("Stemmed Word:", stem.stem(word))

# Lemmatized Word: fly
# Stemmed Word: fli
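
The stemming example from the text ("running" and "runs" both becoming "run") can be checked the same way. The sketch below uses only PorterStemmer, which needs no corpus download; the word list is chosen for illustration.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Map each word to its Porter stem
words = ["running", "runs", "flying"]
stems = {w: stemmer.stem(w) for w in words}
for w, s in stems.items():
    print(w, "->", s)
# running -> run
# runs -> run
# flying -> fli
```

The last line shows why a stem is not always a dictionary word.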

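That covers the concepts behind the most basic steps of natural language processing and how to use them in NLTK. They are elementary but important, so make sure to have at least this much down.

🌼One-line summaries🌼
Tokenization: splitting natural language text into units
Stopwords: words that matter little to the analysis and are usually removed
POS tagging: mapping each word to its part of speech
NER: classifying entities into categories such as person, place, and date
Stemming: extracting the stem of a word
Lemmatization: extracting the base (dictionary) form of a word

😊Comments and feedback are welcome!!😊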
