How do I get the name of a DataFrame column in PySpark?

probook 2023. 10. 25. 23:22

In pandas, this can be done with column.name.

But how can I do the same when it is a column of a Spark DataFrame?

For example, the calling program has a Spark DataFrame, spark_df:

>>> spark_df.columns
['admit', 'gre', 'gpa', 'rank']

This program calls my function: my_function(spark_df['rank'])
Inside my_function, I need the name of the column, i.e. 'rank'.

If it were a pandas DataFrame, I could use this:

>>> pandas_df['rank'].name
'rank'

You can get the names from the schema:

spark_df.schema.names

Printing the schema can also be useful for visualizing it:

spark_df.printSchema()

The only way is to go down a level, to the underlying JVM object:

df.col._jc.toString().encode('utf8')

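This is also how the column is converted to a str in the PySpark code itself. (On Python 3, .encode('utf8') returns bytes; omit it if you want a str.)

From pyspark/sql/column.py, as quoted when the answer was written: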

def __repr__(self):
    return 'Column<%s>' % self._jc.toString().encode('utf8')

As @number correctly said, column._jc.toString() works fine for unaliased columns.

For aliased columns (i.e. column.alias("whatever")), the alias can be extracted even without regular expressions: str(column).split(" AS ")[1].split("`")[1].
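
The split-based extraction above can be sketched as a small helper that works on the string form alone. This is only an illustration: alias_from_repr is a hypothetical name, and the two sample strings are assumptions about what str(column) looks like. The exact format differs between Spark versions, and some versions do not wrap simple aliases in backticks, in which case this sketch returns None.

```python
from typing import Optional


def alias_from_repr(column_repr: str) -> Optional[str]:
    """Return the backtick-quoted alias found in a Column's string form, or None."""
    # No " AS " means the column was not aliased.
    if " AS " not in column_repr:
        return None
    # Same steps as str(column).split(" AS ")[1].split("`")[1], with a length guard.
    parts = column_repr.split(" AS ")[1].split("`")
    if len(parts) < 3:
        # The alias is not wrapped in backticks, so this approach cannot isolate it.
        return None
    return parts[1]


# Assumed sample strings; a real str(column) may be formatted differently.
aliased = "Column<b'rank AS `whatever`'>"
plain = "Column<b'rank'>"

print(alias_from_repr(aliased))
print(alias_from_repr(plain))
```

The length guard is there so that an unexpected format yields None instead of an IndexError.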

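I don't know the Scala syntax, but I'm sure the same can be done there.

If you want the column names of your DataFrame, you can use the pyspark.sql classes. I'm not sure whether the SDK supports explicitly indexing a DataFrame by column name. I received this traceback: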

>>> df.columns['High']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: list indices must be integers, not str

However, the columns attribute of the DataFrame returns a list of column names:

df.columns will return ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

If you want the column data types, use the dtypes attribute:

df.dtypes will return [('Date', 'timestamp'), ('Open', 'double'), ('High', 'double'), ('Low', 'double'), ('Close', 'double'), ('Volume', 'int'), ('Adj Close', 'double')]
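
Because dtypes is a list of (name, type) pairs, it can be turned into a dictionary for lookup by column name. A minimal sketch in plain Python follows. The literal list is copied from the output above and stands in for df.dtypes, and type_by_name is an illustrative name, not part of PySpark.

```python
# Stand-in for df.dtypes: the (column name, type string) pairs shown above.
dtypes = [
    ('Date', 'timestamp'), ('Open', 'double'), ('High', 'double'),
    ('Low', 'double'), ('Close', 'double'), ('Volume', 'int'),
    ('Adj Close', 'double'),
]

# A list of 2-tuples converts directly into a name -> type mapping.
type_by_name = dict(dtypes)

print(type_by_name['High'])    # double
print(type_by_name['Volume'])  # int
```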

If you want a particular column name, you need to access the list by index:

df.columns[2] will return 'High'
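
Since df.columns is an ordinary Python list, the reverse lookup (from a name to its position) can use list.index, and a membership check can use the in operator. A small sketch in plain Python, with the list literal copied from the output above standing in for df.columns:

```python
# Stand-in for df.columns, copied from the example output.
columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

# Name -> position; list.index raises ValueError if the name is absent.
position = columns.index('High')

print(position)           # 2
print(columns[position])  # High
print('rank' in columns)  # False
```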

I found that the answer is very simple:

// It is in Java, but it should be same in PySpark
Column col = ds.col("colName"); //the column object
String theNameOftheCol = col.toString();

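The variable theNameOftheCol then holds "colName".

This should cover even the strangest cases:

Columns with or without an alias
Several aliases
Aliases containing multiple words
Column names surrounded by backticks
Intentional backticks inside aliases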

def get_col_name(col):
    if str(col)[-3] != '`':
        return str(col).split("'")[-2].split(" AS ")[-1]
    return str(col).replace('``', '`').split(" AS `")[-1].split("`'")[-2]

# Example for when you have several table names.
# Assumes `spark` is an existing SparkSession.

loc = '/mnt/tablename/'  # or e.g. 'whatever_location/table_name/' for an external table or any folder

table_names = ['customer', 'department']

for table in table_names:
    print(table)  # the existing table name

    # create a DataFrame from the Parquet files stored under this table name
    df = spark.read.format('parquet').load(f"{loc}{table.lower()}/")
    for name, dtype in df.dtypes:
        print(name)   # column name
        print(dtype)  # data type of that column

Since none of the answers has been marked as accepted, I may be oversimplifying the OP's question, but:

my_list = spark_df.schema.fields
for field in my_list:
    print(field.name)

Reference URL: https://stackoverflow.com/questions/39746752/how-to-get-name-of-dataframe-column-in-pyspark
