Pandas study notes

Goals:

  • Understand the difference between Numpy and Pandas
  • Understand the difference between Pandas' Series and DataFrame structures
  • Apply Pandas for:
    • Basic data operations
    • Basic data statistics
    • Logical operations on data
    • Arithmetic operations on data
    • Handling missing values
    • Discretizing data
    • Merging data
  • Use crosstab and pivot_table to build cross tables and pivot tables
  • Use groupby and aggregation functions to group and aggregate data
  • Understand Pandas' plot function
  • Use Pandas to read and store data

1 Introduction to Pandas

1.1 Understanding Pandas

  • Pandas is built on Numpy and takes advantage of Numpy's computational performance; it integrates with Matplotlib so plotting is easy; it provides its own data structures; and it makes reading files convenient.

1.2 Case

Compared with Numpy, Pandas makes data manipulation more intuitive and readable.

# Import pandas and numpy
import pandas as pd
import numpy as np

# Create 5 days of normally distributed rise/fall data for 10 stocks
stock_change = np.random.normal(0, 1, (10, 5))

# Use the DataFrame structure from Pandas
stock_day_rise = pd.DataFrame(stock_change)

# Construct a row index sequence
stock_code = ['stock' + str(i) for i in range(stock_day_rise.shape[0])]

# Add the row index
data = pd.DataFrame(stock_change, index=stock_code)

# Generate a date sequence that skips weekend non-trading days
# date_range(start=None, end=None, periods=None, freq='B')
# start: start time, end: end time, periods: number of periods,
# freq: step unit (default one day); 'B' skips weekends
date = pd.date_range('2020-09-01', periods=stock_day_rise.shape[1], freq='B')

# index is the row index, columns is the column index
data = pd.DataFrame(stock_change, index=stock_code, columns=date)

# Result
#         2020-09-01  2020-09-02  2020-09-03  2020-09-04  2020-09-07
# stock0    0.666512   -1.686975   -1.921115   -0.476754    0.567056
# stock1   -1.028429    0.690079    0.515872   -0.245661    0.295913
# stock2    0.887564    0.535831   -0.953737    0.702742    0.340305
# stock3    0.488817    0.106693    1.115562    0.162008   -1.278754
# stock4   -0.948476   -0.838498    0.501498   -1.766902   -0.545747
# stock5   -0.194258    1.231776    0.784383   -0.836579   -0.569891
# stock6    0.584226    0.186989    0.796285   -0.892660    0.053643
# stock7    1.799851    0.641454   -0.442945    0.061169   -2.076740
# stock8   -1.062695   -0.414012   -0.335279   -1.577242    0.578246
# stock9   -0.107383    0.503183    0.610964   -1.932938    0.310824

2 Pandas--data structure

Pandas has two primary data structures: Series and DataFrame.

2.1 Series

Series is an object similar to a one-dimensional array, consisting of a set of data (various Numpy data types) and a set of corresponding indexes (data labels).

  • Features

    • One-dimensional array-like objects
    • Composed of data and index
      • Index (index) on the left, data (values) on the right
      • Index is created automatically
  • create

    import pandas as pd

    pd.Series(data=None, index=None, dtype=None)
    """
    Parameters:
        data: the data to pass in; can be an ndarray, a list, etc.
        index: the index, which must be unique and the same length as the data.
               If no index is passed, an integer index from 0 to N is created by default.
        dtype: the data type
    """
    • Build Series through list, default index
    pd.Series(np.arange(10))
    • Specify index
    pd.Series([6.7, 5.6, 3, 10, 2], index=[1, 2, 3, 4, 5])
    • Created from dictionary data
    color_count = pd.Series({'Red': 100, 'Blue': 200, 'Green': 500, 'Yellow': 1000})
  • Attributes

    • index
    color_count.index
    # Result: Index(['Blue', 'Green', 'Red', 'Yellow'], dtype='object')
    • values
    color_count.values
    # Result: array([ 200,  500,  100, 1000])

2.2 DataFrame

A DataFrame is a tabular data structure that contains an ordered set of columns, and each column can hold a different type of value. A DataFrame has both a row index and a column index. It can be regarded as a dictionary of Series that share the same index, with the data stored in a two-dimensional structure.

  • structure

    The DataFrame object has both row index and column index

    Row index: indicate different rows, horizontal index, index, 0 axis, axis=0

    Column index: indicate different columns, vertical index, column, 1 axis, axis=1

  • create

    • Build DataFrame through ndarray
    import numpy as np

    # Construct a DataFrame through an ndarray
    array = np.random.randn(5, 4)
    df_obj = pd.DataFrame(array)
    • Build DataFrame through dict
    # Construct a DataFrame through a dict
    dict_data = {'A': 1,
                 'B': pd.Timestamp('20170426'),
                 'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                 'D': np.array([3] * 4, dtype='int32'),
                 'E': ["Python", "Java", "C++", "C"],
                 'F': 'ITCast'}
    df_obj2 = pd.DataFrame(dict_data)
  • Attributes

    • shape

      # Get the shape
      data.shape
      # (10, 5)
    • index

      # Get the row index
      data.index
      # Index(['stock0', 'stock1', 'stock2', 'stock3', 'stock4', 'stock5',
      #        'stock6', 'stock7', 'stock8', 'stock9'], dtype='object')
    • columns

      # Get the column index
      data.columns
      # DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05',
      #                '2017-01-06'], dtype='datetime64[ns]', freq='B')
    • values

      # View the values
      data.values
      array([[-0.06544031, -1.30931491, -1.45451514,  0.57973008,  1.48602405],
             [-1.73216741, -0.83413717,  0.45861517, -0.80391793, -0.46878575],
             [ 0.21805567,  0.19901371,  0.7134683 ,  0.5484263 ,  0.38623412],
             [-0.42207879, -0.33702398,  0.42328531, -1.23079202,  1.32843773],
             [-1.72530711,  0.07591832, -1.91708358, -0.16535818,  1.07645091],
             [-0.81576845, -0.28675278,  1.20441981,  0.73365951, -0.06214496],
             [-0.98820861, -1.01815231, -0.95417342, -0.81538991,  0.50268175],
             [-0.10034128,  0.61196204, -0.06850331,  0.74738433,  0.143011  ],
             [ 1.00026175,  0.34241958, -2.2529711 ,  0.93921064,  1.14080312],
             [ 2.52064693,  1.55384756,  1.72252984,  0.61270132,  0.60888092]])
    • T

      # Transpose
      data.T
    • head(5)

      Displays the first 5 rows; with no argument the default is 5 rows, and passing a parameter N displays the first N rows.

    • tail(5)

      Displays the last 5 rows; with no argument the default is 5 rows, and passing a parameter N displays the last N rows.
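
      A minimal sketch of head() and tail() on a small throwaway DataFrame; the column name 'a' is made up for illustration:

      import pandas as pd

      df = pd.DataFrame({'a': range(10)})
      df.head()     # first 5 rows by default
      df.head(3)    # first 3 rows
      df.tail(2)    # last 2 rows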

3 Basic data operations of Pandas

3.1 Index operation

Indexing and slicing by position were already covered for Numpy. Pandas supports similar operations, and in addition you can index directly with column names and row names, or combinations of both.

  • Use row and column names directly for indexing (column first, then row)

    # Index directly by column name and row name (column first, then row)
    data['open']['2018-02-27']
    # 23.53
  • Use index in combination with loc or iloc

    # Use loc: only row and column names can be used
    data.loc['2018-02-27':'2018-02-22', 'open']
    # 2018-02-27    23.53
    # 2018-02-26    22.80
    # 2018-02-23    22.88
    # Name: open, dtype: float64

    # Use iloc: index by integer position
    # Get the first two columns of the first 100 rows
    data.iloc[0:100, 0:2].head()
    #              open   high
    # 2018-02-27  23.53  25.88
    # 2018-02-26  22.80  23.78
    # 2018-02-23  22.88  23.37
  • Use ix composite index

    Get the four indicators ['open', 'close', 'high', 'low'] for rows 1 through 4

    # Use ix to mix positional and label-based indexing
    # (ix is deprecated and has been removed in recent versions of pandas)
    data.ix[0:4, ['open', 'close', 'high', 'low']]

    # Recommended: use loc or iloc instead
    data.loc[data.index[0:4], ['open', 'close', 'high', 'low']]
    data.iloc[0:4, data.columns.get_indexer(['open', 'close', 'high', 'low'])]

    #              open  close   high    low
    # 2018-02-27  23.53  24.16  25.88  23.53
    # 2018-02-26  22.80  23.53  23.78  22.80
    # 2018-02-23  22.88  22.82  23.37  22.71
    # 2018-02-22  22.25  22.28  22.76  22.02

3.2 Assignment

Re-assign the close column in the DataFrame to 1

# Modify the original values directly
data['close'] = 1
# or
data.close = 1

3.3 Sort

There are two forms of sorting, one is to sort the index, and the other is to sort the content

  • dataframe

    • Use df.sort_values(by=, ascending=)

      Sort by single key or multiple keys, default ascending order

      ascending=False: descending order

      ascending=True: ascending order

      # Sort by the size of p_change; ascending specifies the sort order
      data = data.sort_values(by='p_change', ascending=False).head()

      # Sort by multiple keys
      data = data.sort_values(by=['open', 'high'])
    • Use df.sort_index to sort by the index (see the sketch after this list)

  • series

    • Use series.sort_values(ascending=True) to sort

      A Series has only one column, so no by parameter is required when sorting.

    data['p_change'].sort_values(ascending=True).head()
    # 2015-09-01   -10.03
    # 2015-09-14   -10.02
    # 2016-01-11   -10.02
    # 2015-07-15   -10.02
    # 2015-08-26   -10.01
    # Name: p_change, dtype: float64
    • Use series.sort_index() to sort
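
      A minimal sketch of sorting by index for both a Series and a DataFrame; the data here is hypothetical:

      import pandas as pd

      s = pd.Series([3, 1, 2], index=['c', 'a', 'b'])
      s.sort_index()     # reordered by index: a, b, c

      df = pd.DataFrame({'x': [3, 1, 2]}, index=['c', 'a', 'b'])
      df.sort_index()    # rows reordered by index: a, b, c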

4 DataFrame operations

4.1 Arithmetic operations

  • add(other): element-wise addition (see the sketch after the sub() example below)

  • sub(other)

    # 1. Select the two columns
    close = data['close']
    open1 = data['open']
    # 2. Closing price minus opening price
    data['m_price_change'] = close.sub(open1)
    data.head()
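
    A minimal sketch of add() and sub() on a hypothetical Series; other can be a scalar or another Series:

    import pandas as pd

    s = pd.Series([1, 2, 3])
    s.add(10)    # 11, 12, 13
    s.sub(1)     # 0, 1, 2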

4.2 Logical operations

  • Logical operation symbols

    > < | &

  • Logical operation function

    query(expr): query string

    data.query("p_change > 2 & turnover > 15")

    isin(values)

    # Specify values to test against, and filter the rows that match
    data[data['turnover'].isin([4.19, 2.39])]

4.3 Statistical operations

  • describe()

    Comprehensive analysis: many statistical results are produced at once, including count, mean, std, min, max, etc.

    # Calculate count, mean, std, min, max, quartiles, etc.
    data.describe()
  • Statistical function

    When computing a single statistic, the calculation is done per column by default (axis=0, "columns"); to compute per row instead, specify axis=1 ("index"). A small sketch follows the cumulative-statistics table below.

    • Maximum value max(), minimum value min()
    • Standard deviation std(), variance var()
    • Median median()
    • Position of the maximum idxmax(), position of the minimum idxmin()
  • Cumulative statistics function

    function    effect
    cumsum      computes the sum of the first 1/2/3/.../n numbers
    cummax      computes the maximum of the first 1/2/3/.../n numbers
    cummin      computes the minimum of the first 1/2/3/.../n numbers
    cumprod     computes the product of the first 1/2/3/.../n numbers
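
    A minimal sketch of the statistics and cumulative-statistics functions above, on a small hypothetical Series:

    import pandas as pd

    s = pd.Series([1.0, 3.0, 2.0, 5.0])
    s.max(), s.min(), s.std(), s.var(), s.median()
    s.idxmax(), s.idxmin()   # index labels of the max / min
    s.cumsum()    # running sum:     1, 4, 6, 11
    s.cummax()    # running max:     1, 3, 3, 5
    s.cummin()    # running min:     1, 1, 1, 1
    s.cumprod()   # running product: 1, 3, 6, 30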

4.4 Custom calculation

apply(func, axis=0). func: a custom function; axis=0: apply to each column (default); axis=1: apply to each row.

data[['open', 'close']].apply(lambda x: x.max() - x.min(), axis=0)
# open     22.74
# close    22.85
# dtype: float64

5 Pandas drawing

5.1 pandas.DataFrame.plot

Reference: pandas.pydata.org/pandas-docs...

5.2 pandas.Series.plot

Reference: pandas.pydata.org/pandas-docs...
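
A minimal sketch of DataFrame.plot and Series.plot on hypothetical example data; kind selects the chart type ('line', 'bar', 'hist', ...):

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    # Hypothetical data: 5 business days of returns for 3 stocks
    df = pd.DataFrame(np.random.normal(0, 1, (5, 3)),
                      index=pd.date_range('2020-09-01', periods=5, freq='B'),
                      columns=['stock0', 'stock1', 'stock2'])

    df.plot(kind='line', figsize=(8, 4))    # one line per column
    df['stock0'].plot(kind='hist')          # Series.plot works the same way
    plt.show()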

6 File reading and storage

6.1 CSV

  • read_csv

    pandas.read_csv(filepath_or_buffer, sep=',')

    filepath_or_buffer: file path

    usecols: specify the name of the column to be read, in the form of a list

    # Read the file, keeping only the 'open' and 'close' columns
    data = pd.read_csv("./data/stock_day.csv", usecols=['open', 'close'])
  • to_csv

    DataFrame.to_csv(path_or_buf=None, sep=',', columns=None, header=True, index=True, mode='w', encoding=None)

    • path_or_buf :string or file handle, default None

    • sep: character, default ','

    • columns: sequence, optional

    • mode: 'w' overwrite, 'a' append

    • index: whether to write the row index

    • header :boolean or list of string, default True, whether to write the column index value

    # Save 10 rows of data so the result is easy to inspect
    data[:10].to_csv("./data/test.csv", columns=['open'])
    # Read the result back
    pd.read_csv("./data/test.csv")

6.2 HDF5

  • read_hdf (h5 file)

    pandas.read_hdf(path_or_buf, key=None, **kwargs)

    • path_or_buffer: file path
    • key: read key
    • return: the selected object
  • to_hdf

    DataFrame.to_hdf(path_or_buf, key, **kwargs)

  • Remarks

    Prefer to use HDF5 file storage

    HDF5 supports compression when storing. The method used is blosc. This is the fastest and is supported by pandas by default.

    Use compression to improve disk utilization and save space

    HDF5 is also cross-platform and can be easily migrated to hadoop.
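
    A minimal sketch of writing and reading an HDF5 file; the file path and key name are made up, and the PyTables package ('tables') must be installed:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(3, 2), columns=['a', 'b'])
    df.to_hdf("./data/test.h5", key="df")               # write under the key 'df'
    restored = pd.read_hdf("./data/test.h5", key="df")  # read it back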

6.3 JSON

  • read_json

    pandas.read_json(path_or_buf=None, orient=None, typ='frame', lines=False)

    • Converts JSON into the default Pandas DataFrame format

    • orient: string, indication of the expected JSON string format

      • 'split': dict like {index -> [index], columns -> [columns], data -> [values]}

        • 'split' separates the result into three parts: the index, the column names, and the data values
      • 'records': list like [{column -> value}, ..., {column -> value}]

        • 'records' outputs one {column: value} dict per row
      • 'index': dict like {index -> {column -> value}}

        • 'index' outputs in the form {index: {column: value}}
      • 'columns': dict like {column -> {index -> value}}, the default format

        • 'columns' outputs in the form {column: {index: value}}
      • 'values': just the values array

        • 'values' outputs only the values
    • lines: boolean, default False

      • Read the JSON objects line by line
    • typ: default 'frame'; specifies whether to convert into a Series or a DataFrame

    # orient specifies the JSON storage format; lines=True reads one record per line
    json_read = pd.read_json("./data/Sarcasm_Headlines_Dataset.json", orient="records", lines=True)
  • to_json

    DataFrame.to_json(path_or_buf=None, orient=None, lines=False)

    • Store Pandas objects in json format
    • path_or_buf=None : file address
    • orient: the stored json form, {'split','records','index','columns','values'}
    • lines: whether to store each object as a single line
    json_read.to_json("./data/test.json", orient='records')

7 Advanced processing

7.1 Treatment of missing values

  • Handling NaN

    • judgment

      pd.isnull(df)

      pd.notnull(df)

    • deal with

      • If the missing values are np.nan

        1. Delete the missing values: dropna(axis='rows'). Note: the original data is not modified, so the return value must be assigned to something.

        # dropna does not modify the original data
        movie.dropna()
        # Assign the result to a new variable, or reuse the original variable name
        data = movie.dropna()

        2. Replace missing values: fillna(value, inplace=True). value: the replacement value; inplace=True modifies the original data, inplace=False leaves it unchanged and returns a new object.

        # Fill the missing values in a column with the column mean (the median also works)
        movie['Revenue (Millions)'].fillna(movie['Revenue (Millions)'].mean(), inplace=True)
      • If the missing values are not nan but some default marker such as '?'

        # Globally disable certificate verification (needed when downloading the dataset over https)
        import ssl
        ssl._create_default_https_context = ssl._create_unverified_context

        1. First replace the '?' with np.nan

        # Replace missing values marked with another symbol by np.nan
        # to_replace: the value to be replaced; value: the replacement value
        wis = wis.replace(to_replace='?', value=np.nan)

        2. Perform missing value processing

        # Delete the missing values
        wis = wis.dropna()

7.2 Discretization of data

The purpose of discretizing continuous attributes is to simplify the data structure; discretization reduces the number of values a given continuous attribute takes.

Discretizing a continuous attribute means dividing its value range into a number of discrete intervals, and then using a distinct symbol or integer to represent the attribute values that fall into each sub-interval.

  • Discretization of the rise and fall of stocks

    pd.qcut(data, q): split the data into q groups of roughly equal size. Grouping is usually combined with value_counts to count the samples in each group.

    series.value_counts(): counts the number of samples in each group

    # Read the data
    data = pd.read_csv("./data/stock_day.csv")
    p_change = data['p_change']

    # Let qcut choose the group boundaries (10 equal-frequency groups)
    qcut = pd.qcut(p_change, 10)
    # Count how many samples fall into each group
    qcut.value_counts()

    # Specify the group intervals yourself
    bins = [-100, -7, -5, -3, 0, 3, 5, 7, 100]
    p_counts = pd.cut(p_change, bins)
  • one-hot encoding

    Generate a Boolean column for each category; for a given sample, only one of these columns takes the value 1. This is also called one-hot or dummy encoding.

    pandas.get_dummies(data, prefix=None). prefix: prefix for the generated column names

    As shown below:

    # Get the one-hot encoded matrix
    dummies = pd.get_dummies(p_counts, prefix="rise")

7.3 Merger

  • pd.concat

    pd.concat([data1, data2], axis=1). Concatenate along rows or columns: axis=0 stacks along the row index (vertically), axis=1 concatenates along the column index (side by side).

    # Concatenate the one-hot encoding from above with the original data along the columns
    pd.concat([data, dummies], axis=1)
  • pd.merge

    pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None). Merge on key columns shared by the two DataFrames, or on separately specified left and right keys.

    • left: a DataFrame object
    • right: another DataFrame object
    • on: columns (names) to join on; must be found in both the left and right DataFrame objects
    • left_on=None, right_on=None: specify the left and right keys separately
    Merge method    SQL Join Name       Description
    left            LEFT OUTER JOIN     Use keys from left frame only
    right           RIGHT OUTER JOIN    Use keys from right frame only
    outer           FULL OUTER JOIN     Use union of keys from both frames
    inner           INNER JOIN          Use intersection of keys from both frames
    left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                         'key2': ['K0', 'K1', 'K0', 'K1'],
                         'A': ['A0', 'A1', 'A2', 'A3'],
                         'B': ['B0', 'B1', 'B2', 'B3']})

    right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                          'key2': ['K0', 'K0', 'K0', 'K0'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})

    # Inner join by default
    result = pd.merge(left, right, on=['key1', 'key2'])

    # Left join
    result = pd.merge(left, right, how='left', on=['key1', 'key2'])

    # Right join
    result = pd.merge(left, right, how='right', on=['key1', 'key2'])

    # Outer join
    result = pd.merge(left, right, how='outer', on=['key1', 'key2'])

7.4 Crosstab and Pivot Table

  • Cross table

    A cross table computes, for one column of data, the group counts with respect to another column (to explore the relationship between the two columns); a small sketch follows the pivot-table example below.

    • pd.crosstab(value1, value2)
    • DataFrame.pivot_table([], index=[])
  • Pivot table

    # A pivot table makes the whole process much simpler
    data.pivot_table(['posi_neg'], index='week')
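
    A minimal sketch of pd.crosstab on hypothetical 'week' and 'posi_neg' columns, counting how often each combination occurs:

    import pandas as pd

    df = pd.DataFrame({'week': [0, 0, 1, 1, 2],
                       'posi_neg': [1, 0, 1, 1, 0]})
    pd.crosstab(df['week'], df['posi_neg'])
    # posi_neg  0  1
    # week
    # 0         1  1
    # 1         0  2
    # 2         1  0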

7.5 Grouping and aggregation

Grouping and aggregation is a common way to analyze data; it is usually combined with statistical functions to inspect each group of data.

  • Grouping

    DataFrame.groupby(key, as_index=False).

    • key: the column(s) to group by; multiple keys are allowed
    # Import Starbucks store data
    import matplotlib.pyplot as plt
    starbucks = pd.read_csv("./data/starbucks/directory.csv")
    # Group by country and count the number of Starbucks stores in each country
    count = starbucks.groupby(['Country']).count()
    count['Brand'].plot(kind='bar', figsize=(20, 8))
    plt.show()
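
    A minimal sketch of groupby combined with an aggregation function, on hypothetical data (the column names are made up):

    import pandas as pd

    df = pd.DataFrame({'Country': ['US', 'US', 'CN', 'CN'],
                       'Stores': [10, 20, 5, 15]})
    df.groupby('Country')['Stores'].sum()
    # Country
    # CN    20
    # US    30
    # Name: Stores, dtype: int64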

That wraps up the Pandas summary for now. The focus is on day-to-day use: write more and it becomes second nature. It's that simple. Next, let's start a summary study of the big-data-related components.

Make progress together, come on! ! !