Word Index, Feature Extraction, and Word Representation


미카이 2019. 8. 5. 23:12

Word index: build a dictionary that maps words to numbers. In other words, assign an integer index to each word.

The preprocessing module that Keras provides makes this simple to implement.


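# Tokenizer is part of the legacy Keras 2 preprocessing API and may be missing in newer Keras releases
from keras.preprocessing.text import Tokenizer

samples = ['현재날씨는 10분 단위로 갱신되며, 날씨 아이콘은 강수가 있는 경우에만 제공됩니다.',
           '낙뢰 예보는 초단기예보에서만 제공됩니다.',
           '나 좋은 일이 생겼어',
           '아 오늘 진짜 짜증나']

# Build the word index: every distinct word gets an integer ID
tokenizer = Tokenizer()
tokenizer.fit_on_texts(samples)

word_index = tokenizer.word_index
print("Index of each word:\n", word_index)

# Convert each sentence into a list of word indices
sequences = tokenizer.texts_to_sequences(samples)
print(sequences)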
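
To make the idea of a word index concrete, here is a minimal pure-Python sketch of the same mapping. It is only an approximation of what the Keras Tokenizer does: the helper names `build_word_index` and `texts_to_sequences` are made up for this example, tokens are split on whitespace only, and punctuation is not filtered.

```python
from collections import Counter


def build_word_index(texts):
    """Assign 1-based indices to words, most frequent first (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace each known word with its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


samples = ["나 좋은 일이 생겼어", "아 오늘 진짜 짜증나", "오늘 좋은 일이 생겼어"]
word_index = build_word_index(samples)
sequences = texts_to_sequences(samples, word_index)

print(word_index)
print(sequences)  # [[5, 1, 2, 3], [6, 4, 7, 8], [4, 1, 2, 3]]
```

Index 0 is left unused here, which mirrors the Keras convention of reserving 0 for padding.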

 
 
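Feature extraction (word feature extraction): in natural language processing, feature extraction means converting the words or sentences in text data into feature values. Data that was originally organized as text is turned into numeric features so that it can be fed to a model.
Two representative tools are CountVectorizer and TfidfVectorizer.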
 




# CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

text_data = ['나는 배가 고프다', '내일 점심 뭐먹지', '내일 공부 해야겠다', '점심 먹고 공부 해야지']

count_vectorizer = CountVectorizer()

count_vectorizer.fit(text_data)  # builds the vocabulary automatically
print(count_vectorizer.vocabulary_)  # {'나는': 2, '배가': 6, '고프다': 0, '내일': 3, '점심': 7, '뭐먹지': 5, '공부': 1, '해야겠다': 8, '먹고': 4, '해야지': 9}

# Count how often each vocabulary word occurs in a new sentence
sentence = ["나는 배가 배가 고프다 "]
print(count_vectorizer.transform(sentence).toarray())  # [[1 0 1 0 0 0 2 0 0 0]]
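
The counting that CountVectorizer performs can be sketched by hand. The snippet below is an illustration under simplifying assumptions, not scikit-learn's implementation: it splits on whitespace only, and `count_vector` is a hypothetical helper name. Sorting the distinct tokens gives the same column positions as the vocabulary printed above.

```python
from collections import Counter

text_data = ['나는 배가 고프다', '내일 점심 뭐먹지', '내일 공부 해야겠다', '점심 먹고 공부 해야지']

# Vocabulary: sorted distinct tokens mapped to column positions
tokens = sorted({tok for doc in text_data for tok in doc.split()})
vocabulary = {tok: col for col, tok in enumerate(tokens)}


def count_vector(sentence, vocabulary):
    """Return one count per vocabulary column; unknown tokens are ignored."""
    counts = Counter(sentence.split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


vec = count_vector("나는 배가 배가 고프다", vocabulary)
print(vocabulary)
print(vec)  # [1, 0, 1, 0, 0, 0, 2, 0, 0, 0]
```

The 2 in column 6 reflects that '배가' appears twice in the sentence.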


# TfidfVectorizer
 
from sklearn.feature_extraction.text import TfidfVectorizer
 
text_data = ['나는 배가 고프다', '내일 점심 뭐먹지', '내일 공부 해야겠다', '점심 먹고 공부 해야지']
tfidf_vectorizer = TfidfVectorizer()
 
tfidf_vectorizer.fit(text_data)  # builds the vocabulary automatically
print(tfidf_vectorizer.vocabulary_) # {'나는': 2, '배가': 6, '고프다': 0, '내일': 3, '점심': 7, '뭐먹지': 5, '공부': 1, '해야겠다': 8, '먹고': 4, '해야지': 9}
 
sentence = [text_data[3]] # ['점심 먹고 공부 해야지']
print(tfidf_vectorizer.transform(sentence))
"""
  (0, 9)    0.5552826649411127
  (0, 7)    0.43779123108611473
  (0, 4)    0.5552826649411127
  (0, 1)    0.43779123108611473
"""
 
 
 
print(tfidf_vectorizer.transform(text_data))
"""
  (0, 6)    0.5773502691896257
  (0, 2)    0.5773502691896257
  (0, 0)    0.5773502691896257
  (1, 7)    0.5264054336099155
  (1, 5)    0.6676785446095399
  (1, 3)    0.5264054336099155
  (2, 8)    0.6676785446095399
  (2, 3)    0.5264054336099155
  (2, 1)    0.5264054336099155
  (3, 9)    0.5552826649411127
  (3, 7)    0.43779123108611473
  (3, 4)    0.5552826649411127
  (3, 1)    0.43779123108611473
"""
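
To see where these weights come from, the sketch below recomputes them in plain Python. It assumes TfidfVectorizer's default settings (smoothed IDF, L2 normalization, raw term counts) and whitespace tokenization. The helper name `tfidf` is made up for this example.

```python
import math
from collections import Counter

text_data = ['나는 배가 고프다', '내일 점심 뭐먹지', '내일 공부 해야겠다', '점심 먹고 공부 해야지']
docs = [doc.split() for doc in text_data]
n_docs = len(docs)

# Document frequency: in how many documents each token appears
df = Counter(tok for doc in docs for tok in set(doc))


def tfidf(doc):
    """TF-IDF weights for one tokenized document (assumed default settings)."""
    tf = Counter(doc)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {tok: tf[tok] * (math.log((1 + n_docs) / (1 + df[tok])) + 1) for tok in tf}
    # L2 normalization so every document vector has length 1
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {tok: v / norm for tok, v in raw.items()}


weights = tfidf(docs[3])  # '점심 먹고 공부 해야지'
for tok, w in weights.items():
    print(tok, round(w, 4))
```

Words that appear in only one document ('먹고', '해야지') receive a higher weight, about 0.5553, than words shared with another document ('점심', '공부'), about 0.4378, which agrees with the printed output for row 3.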


 
 
 
Word representation: one-hot encoding, word embedding, word vector
One-hot encoding: after assigning an index to each word, represent the word as a vector whose entry at that word's index is 1 and whose other entries are all 0.




# Word-level one-hot encoding with Keras

from keras.preprocessing.text import Tokenizer
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Create a Tokenizer that keeps only the 1,000 most frequent words
tokenizer = Tokenizer(num_words=1000)

# Build the word index
tokenizer.fit_on_texts(samples)

print("index_word :", tokenizer.index_word)
print("word_index :", tokenizer.word_index)
print("word counts :", tokenizer.word_counts)  # how many times each word occurs
print("document count :", tokenizer.document_count)  # number of input sentences

# Convert each string into a list of integer indices
sequences = tokenizer.texts_to_sequences(samples)
print("sequences", sequences)

# Binary vector per sentence: 1 at every word index that occurs in the sentence
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
print("one_hot_results", one_hot_results)
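
Note that texts_to_matrix returns one row per sentence, so each row marks every word that occurs in that sentence. A one-hot vector for a single word, as described above, can be sketched in plain Python. This is an illustration only: `tokenize` and `one_hot` are hypothetical helpers, and indices are assigned in order of first appearance rather than by frequency.

```python
samples = ['The cat sat on the mat.', 'The dog ate my homework.']


def tokenize(text):
    """Lowercase, drop periods, split on whitespace (a simplification)."""
    return text.lower().replace('.', '').split()


# Assign 1-based indices in order of first appearance; index 0 stays unused
word_index = {}
for sample in samples:
    for word in tokenize(sample):
        if word not in word_index:
            word_index[word] = len(word_index) + 1


def one_hot(word, word_index):
    """Vector of zeros with a single 1 at the word's index."""
    vector = [0] * (len(word_index) + 1)
    vector[word_index[word]] = 1
    return vector


cat_vector = one_hot('cat', word_index)
sentence_matrix = [one_hot(w, word_index) for w in tokenize(samples[0])]

print(word_index)
print(cat_vector)  # [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```

Here `sentence_matrix` holds one one-hot row per word of the first sentence, so 'the' contributes two identical rows.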
 


 
 
 
Word representation
Full source code
- word index : click
- feature extraction : click
- word one hot encoding : click
