#
# Copyright (c) 2018-2022 Cadence Design Systems, Inc.
#
# Permission is hereby granted, free of charge, to any person obtaining
# a copy of this software and associated documentation files (the
# "Software"), to use this Software with Cadence processor cores only and
# not with any other processors and platforms, subject to
# the following conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
# IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
# CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
# TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
# SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
#

======================================================================
Cadence HiFi 4 Neural Network (NN) Library
======================================================================

======================================================================
Revision History
======================================================================

Version 2.9.0 API 1.4: October 14, 2022

+ Intermediate Release of HiFi4 NN Library
+ Tested using RI.9 tools, xt-clang/xt-clang++ compiler
+ Adds support for depthwise-conv2d, matXvec and fully connected kernels for
  sym8sxsym16s datatype
+ Adds support for new datatypes for the following TFLM operators:
  REQUANTIZE (asym16s to asym16s)
  STRIDED_SLICE (int8)
+ Adds support for kernel-dimension greater than input-dimension for standard 
  convolution quantized datatypes (TENA-3528)
+ Fixes unexpected behaviour for null bias data pointer for kernel
  matmul_asym8sxasym8s_asym8s (TENA-3526)
+ Changes input_left_shift handling from returning an error to clipping the
  value to the range -31 to +31 for sigmoid_asym8s, sigmoid_asym8 &
  tanh_asym8s kernels (TENA-3615)
+ Fine tunes relevant kernels for HiFi1-enhanced core (RI.9 HiFi1e) for 
  cycles and code size
+ Removes membank conflict issues to improve performance for kernels
  conv1d_std_8x8, conv1d_std_8x16, conv2d_std_sym8sxsym16s,
  conv1d_std_f32xf32, conv1d_std_16x16, conv1d_std_asym8xasym8,
  matXvec_batch_8x8_32, fully_connected_asym8xasym8_asym8
+ Optimizes performance for the following kernels:
  quant8 compare kernels,
  conv2d_point_sym8sxasym8s,
  broadcast_4D (cases when one input is constant)
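
The input_left_shift change listed above (TENA-3615) can be pictured with a
short C sketch. The names clip_left_shift and clip_left_shift_demo are
hypothetical and are not part of the library API; the sketch only illustrates
saturating an out-of-range shift to -31..+31 instead of rejecting it, and it
assumes nothing else about how the kernels use the value.

```c
#include <assert.h>

/* Hypothetical helper (not a library function): saturate a shift value
 * to the range [-31, +31] instead of reporting an error. */
static int clip_left_shift(int shift)
{
    if (shift < -31) return -31;
    if (shift > 31)  return 31;
    return shift;
}

/* Small self-check showing the intended behaviour. */
void clip_left_shift_demo(void)
{
    assert(clip_left_shift(-40) == -31); /* below range: clipped up   */
    assert(clip_left_shift(5) == 5);     /* in range: unchanged       */
    assert(clip_left_shift(64) == 31);   /* above range: clipped down */
}
```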

Note
> Processor Configuration must be built with SP-VFP option in order to compile 
  floating point variants of the kernels.
> The ANN test bench and the supporting libraries need support for C++11 standard. 
  This is supported only with Xtensa C library. 

Known issues
> Compiler optimizations are disabled for ANN C++ source files
> 3 failures seen in ANN testcases because of differences in reference
  code implementations:
    DEPTHWISE_CONV2D_FLOAT_LARGE_2_WEIGHTS_AS_INPUTS
    LOCAL_RESPONSE_NORM_FLOAT_4
    LOCAL_RESPONSE_NORM_FLOAT_4_RELAXED

----------------------------------------------------------------------

Version 2.8.0 API 1.3: August 11, 2022

+ Intermediate Release of HiFi4 NN Library
+ Tested using RI.6 tools, xt-clang/xt-clang++ compiler.
+ For HiFi1 core, the library has been tested using RI.8 tools,
  xt-clang/xt-clang++ compiler.
+ Adds Single Step Rounding support for applicable SEANet kernels
+ Adds support for matmul, matXvec and fully connected kernels
  for asym8s datatype.
+ Adds support for 4D broadcast for following TFLM operators:
  ADD (asymmetric quantized int8 and int16)
  SUB (asymmetric quantized int8 and int16)
  MUL (asymmetric quantized int8)
  SQUARED_DIFF (asymmetric quantized int8)
+ Adds support for new datatypes for quantize TFLM operator:
  QUANTIZE (single precision float to asymmetric quantized int8, 
            asymmetric quantized int8 to asymmetric quantized int8)
+ Increases the supported axis dimension limit from 127 to 1024 in kernel
  xa_nn_reduce_mean_4D_asym8s_asym8s
+ Fixes mismatch with respect to TFLM in kernel
  xa_nn_vec_hard_swish_asym8s_asym8s.
+ Changes the valid range of input and output shifts in elementwise add/sub
  kernels from [-31, 31] to [-31, 0].
+ Fixes input_range_radius check in xa_nn_vec_tanh_asym8s_asym8s,
  xa_nn_vec_sigmoid_asym8u_asym8u and xa_nn_vec_sigmoid_asym8s_asym8s.
+ Fixes input_zero_bias range in xa_nn_vec_leaky_relu_asym8s_asym8s to match
  with TFLM Ref.
+ Modifies CNN, LSTM and GRU testbenches to give more detailed error descriptions.
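
The 4D broadcast support listed above for ADD, SUB, MUL and SQUARED_DIFF
follows the usual TFLM-style broadcasting rule: along each axis the two input
sizes must either match or one of them must be 1. A minimal C sketch of that
shape rule is given here. The helper name broadcast_shape_4d and its return
convention are assumptions made for illustration; the library kernels take
their own argument lists, which are described in the Programmer's Guide.

```c
#include <assert.h>

/* Hypothetical helper (not a library function): compute the broadcast
 * output shape of two 4D shapes.
 * Returns 0 on success, -1 if the shapes cannot be broadcast together. */
static int broadcast_shape_4d(const int a[4], const int b[4], int out[4])
{
    for (int i = 0; i < 4; i++) {
        if (a[i] == b[i])   out[i] = a[i];  /* sizes match            */
        else if (a[i] == 1) out[i] = b[i];  /* a is stretched along i */
        else if (b[i] == 1) out[i] = a[i];  /* b is stretched along i */
        else                return -1;      /* incompatible sizes     */
    }
    return 0;
}

/* Small self-check showing the intended behaviour. */
void broadcast_shape_4d_demo(void)
{
    const int a[4] = {1, 8, 1, 16};
    const int b[4] = {2, 1, 4, 16};
    int out[4];

    assert(broadcast_shape_4d(a, b, out) == 0);
    assert(out[0] == 2 && out[1] == 8 && out[2] == 4 && out[3] == 16);
}
```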

Note
> Processor Configuration must be built with SP-VFP option in order to compile 
  floating point variants of the kernels.
> The ANN test bench and the supporting libraries need support for C++11 standard. 
  This is supported only with Xtensa C library. 

Known issues
> Compiler optimizations are disabled for ANN C++ source files
> 3 failures seen in ANN testcases because of differences in reference
  code implementations:
    DEPTHWISE_CONV2D_FLOAT_LARGE_2_WEIGHTS_AS_INPUTS
    LOCAL_RESPONSE_NORM_FLOAT_4
    LOCAL_RESPONSE_NORM_FLOAT_4_RELAXED

----------------------------------------------------------------------

Version 2.7.0 API 1.2: May 11, 2022

+ Intermediate Release of HiFi4 NN Library
+ Tested using RI.6 tools, xt-clang/xt-clang++ compiler.
+ Single Step Rounding support is added for all applicable quantize kernels
  (except the newly added SEANet kernels).
+ Added sigmoid/tanh TIE based optimizations for asym8s and 16-bit precisions.
+ Improved performance of conv2d_depth, conv1d_std, broadcast_8_8 and reduce_max_asym8s 
  kernels through memcpy optimization.
+ Membank conflict reduction in all variants of matXvec 8x8, 8x16 and 16x16 precisions.
+ Added support for SEANet kernels. This includes:
  conv2d_std_sym8sxsym16s
  conv2d_pointwise_sym8sxsym16s
  transpose_conv_sym8sxsym16s
  leaky_relu_asym16s
  elm_add_asym16s
  strided_slice_int16
  elm_sub_broadcast_asym16s
  pad_16

Note
> Processor Configuration must be built with SP-VFP option in order to compile 
  floating point variants of the kernels.
> The ANN test bench and the supporting libraries need support for C++11 standard. 
  This is supported only with Xtensa C library. 

Known issues
> Compiler optimizations are disabled for ANN C++ source files
> 3 failures seen in ANN testcases because of differences in reference
  code implementations:
    DEPTHWISE_CONV2D_FLOAT_LARGE_2_WEIGHTS_AS_INPUTS
    LOCAL_RESPONSE_NORM_FLOAT_4
    LOCAL_RESPONSE_NORM_FLOAT_4_RELAXED

----------------------------------------------------------------------

Version 2.6.0 API 1.1: February 1, 2022

+ GA Release of HiFi4 NN Library
+ Tested using RI.6 tools, xt-clang/xt-clang++ compiler.
+ Fixed following issues:
  Fixes hang in matXvec sample testbench. (TENA-3062)
  Fixes sigmoid_asym8, sigmoid_asym8s and tanh_asym8s bit-exactness issue (TENA-2918)
  Adds -Wsign-compare flag in build set up. (TENA-2953)
  Allows null bias for fully connected (sym8sxasym8s) kernel. (TENA-3091)
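
The null bias item above (TENA-3091) can be illustrated with a plain C
reference loop. This is a sketch under stated assumptions, not the library
implementation: the function name fc_accumulate is hypothetical, a NULL bias
pointer is assumed to mean "add no bias", and the zero-point offsets and
requantization that a real quantized kernel performs are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference loop (not a library function): accumulate one
 * output row of a fully connected layer in 32 bits.
 * bias may be NULL, in which case the accumulator starts at zero. */
static int32_t fc_accumulate(const int8_t *weights, const int8_t *input,
                             const int32_t *bias, int row, int cols)
{
    int32_t acc = (bias != NULL) ? bias[row] : 0;

    for (int c = 0; c < cols; c++) {
        acc += (int32_t)weights[row * cols + c] * (int32_t)input[c];
    }
    return acc;
}

/* Small self-check showing the intended behaviour. */
void fc_accumulate_demo(void)
{
    const int8_t  w[4] = {1, 2, 3, 4};   /* 2 rows x 2 columns */
    const int8_t  x[2] = {5, 6};
    const int32_t b[2] = {10, -9};

    assert(fc_accumulate(w, x, NULL, 0, 2) == 17);  /* 1*5 + 2*6      */
    assert(fc_accumulate(w, x, b, 1, 2) == 30);     /* 3*5 + 4*6 - 9  */
}
```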

Note
> Processor Configuration must be built with SP-VFP option in order to compile 
  floating point variants of the kernels.
> The ANN test bench and the supporting libraries need support for C++11 standard. 
  This is supported only with Xtensa C library. 

Known issues
> Compiler optimizations are disabled for ANN C++ source files
> 3 failures seen in ANN testcases because of differences in reference
  code implementations:
    DEPTHWISE_CONV2D_FLOAT_LARGE_2_WEIGHTS_AS_INPUTS
    LOCAL_RESPONSE_NORM_FLOAT_4
    LOCAL_RESPONSE_NORM_FLOAT_4_RELAXED

----------------------------------------------------------------------

Version 2.5.0 API 1.1: November 9, 2021

+ Intermediate Release of HiFi4 NN Library
+ Tested using RI.6 tools, xt-clang/xt-clang++ compiler.
+ Adds support for build on HiFi1 core.
+ Adds support for asymmetric quantized 8 bit (int8) variants of following
  TFLM operators:
  DEPTH_TO_SPACE, SPACE_TO_DEPTH, PAD, PAD_V2
  BATCH_TO_SPACE_ND, SPACE_TO_BATCH_ND
  L2Normalization
  LogicalAND, LogicalOR, LogicalNOT
  ReduceMax, ReduceMean
  Equal, Greater, GreaterEqual
  Less, LessEqual, NotEqual
  Logistic, Hardswish, Tanh
  Maximum, Minimum
  Prelu, Relu, leaky_relu
  Sub, Add, Mul
+ Adds support for single precision float (float32) variants of following
  TFLM operators:
  SIN, COS, LOG
  SQRT, RSQRT, SQUARE
  FILL, CEIL, ROUND
  ABS, NEG
+ Adds support for Dilated Standard 2D Convolution.
+ Adds 16-bit input/output variants for sigmoid and tanh kernels based on 
  Tensorflow.
+ Adds matXvec batch kernel with accumulation (8-bit input, sym8s kernel and
  asym16s output).
+ Adds support for TFLM Dequantize (asymmetric int8 to float32) operators.
+ Adds support for additional data types (int8 to int32, int16 to int32) for 
  TFLM Quantize operator.
+ Adds broadcast kernel for int8 data type.
+ Adds broadcast variants of elementwise minimum and maximum kernels for int8
  data type
+ Improves performance for following kernels:
  xa_nn_conv1d_std_8x8
  xa_nn_conv1d_std_8x16
  xa_nn_conv1d_std_f32
  xa_nn_conv2d_depthwise_8x8
  xa_nn_conv2d_depthwise_8x16
  xa_nn_conv2d_depthwise_asym8xasym8
  xa_nn_conv2d_depthwise_per_chan_sym8sxasym8s
  xa_nn_conv2d_depthwise_per_chan_nhwc_sym8sxasym8s
  xa_nn_conv2d_pointwise_nhwc_asym8xasym8
  xa_nn_conv2d_pointwise_per_chan_nhwc_sym8sxasym8s
  xa_nn_matXvec_asym8xasym8_asym8
  fully_connected_asym8xasym8_asym8
  xa_nn_maxpool_8
+ Improves code-size for following kernels:
  xa_nn_matXvec_asym8uxasym8u_asym8u
  xa_nn_matmul_asym8uxasym8u_asym8u
  xa_nn_conv2d_std_per_chan_sym8sxasym8s
  xa_nn_conv2d_pointwise_asym8uxasym8u
  xa_nn_conv2d_depthwise_per_chan_sym8sxasym8s
+ Fixes issues TENA-2938, TENA-2976 and TENA-2994
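
The Dequantize support listed above (asymmetric int8 to float32) corresponds
to the standard TFLite affine mapping real = scale * (q - zero_point). A
minimal C sketch of that formula follows. The helper name dequantize_asym8s
and its parameter order are hypothetical and do not describe the library's
kernel signature.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a library function): map one asymmetric
 * quantized int8 value back to float32 using
 *     real = scale * (q - zero_point)                                 */
static float dequantize_asym8s(int8_t q, int32_t zero_point, float scale)
{
    return scale * (float)((int32_t)q - zero_point);
}

/* Small self-check using values that are exact in float32. */
void dequantize_asym8s_demo(void)
{
    assert(dequantize_asym8s(5, -3, 0.5f) == 4.0f);        /* 0.5 * 8 */
    assert(dequantize_asym8s(-128, -128, 0.25f) == 0.0f);  /* q == zp */
}
```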

Note
> For configurations without VFPU, the floating point variants of the kernels
  are not supported and should not be used.
> The ANN test bench and the supporting libraries need support for C++11 standard. 
  This is supported only with Xtensa C library. 

Known issues
> Compiler optimizations are disabled for C++ source files 
> 3 failures seen in ANN testcases because of differences in reference
  code implementations:
    DEPTHWISE_CONV2D_FLOAT_LARGE_2_WEIGHTS_AS_INPUTS
    LOCAL_RESPONSE_NORM_FLOAT_4
    LOCAL_RESPONSE_NORM_FLOAT_4_RELAXED

----------------------------------------------------------------------
Version 2.4.0 API 1.0: Feb 10, 2021

+ Intermediate Release of HiFi 4 NN Library
+ Tested using RI.5 tools, xt-xcc/xt-xc++ and xt-clang/xt-clang++ compiler
+ Added support for asymmetric quantized 8 bit (int8) variants of 
  following TFLM operators
    Quantize
    SVDF
+ Updated TFLM example testbench to newer version of TFLM framework
+ Improved optimization for remainder loop in xa_nn_vec_softmax_asym8s_16
+ Reduced memory banking conflicts in FC kernel xa_nn_fully_connected_sym8sxasym8s_asym8s

Limitations
> The ANN test bench, TFLite Micro example application and the supporting libraries
  need support for C++11 standard. This is supported only with Xtensa C library. 
> For configurations without VFPU, the floating point variants of the kernels
  are not supported and should not be used. 

Known issues
> Compiler optimizations are disabled for C++ source files 
> 3 failures seen in ANN testcases because of differences in reference
  code implementations:
    DEPTHWISE_CONV2D_FLOAT_LARGE_2_WEIGHTS_AS_INPUTS
    LOCAL_RESPONSE_NORM_FLOAT_4
    LOCAL_RESPONSE_NORM_FLOAT_4_RELAXED
 
----------------------------------------------------------------------
Version 2.3.0 API 1.0: Dec 22, 2020

+ Intermediate Release of HiFi 4 NN Library
+ Tested using RI.5 tools, xt-xcc/xt-xc++ and xt-clang/xt-clang++ compiler
+ Added support for asymmetric quantized 8 bit (int8) variants of 
  following TFLM operators
    Conv2D (Standard convolution)
    Depthwise convolution
    Fully connected
    Softmax
    Average Pool (using existing 8x8 variant)
    Max Pool (using existing 8x8 variant)
+ Enabled cross compilation for HiFi 3. 
+ Removed kernel padding requirement for depthwise convolution kernels.
+ [TENA-2712] Convolution 1D: Fix incorrect zero bias values in the call 
  to function xa_nn_matXvec_asym8xasym8_asym8_circ_nb.

Limitations
> The ANN test bench, TFLite Micro example application and the supporting libraries
  need support for C++11 standard. This is supported only with Xtensa C library. 
> For configurations without VFPU, the floating point variants of the kernels
  are not supported and should not be used. 

Known issues
> Compiler optimizations are disabled for C++ source files 
> 3 failures seen in ANN testcases because of differences in reference
  code implementations:
    DEPTHWISE_CONV2D_FLOAT_LARGE_2_WEIGHTS_AS_INPUTS
    LOCAL_RESPONSE_NORM_FLOAT_4
    LOCAL_RESPONSE_NORM_FLOAT_4_RELAXED
 
----------------------------------------------------------------------

Version 2.2.0 API 1.0: April 22, 2020

+ GA Release
+ Tested using RI.2 tools and xt-xcc/xt-xc++, xt-clang/xt-clang++ compilers
+ Improved support for NHWC (depth-first) data format for convolution and pooling kernels
+ Improved implementation of pointwise convolution kernels
+ Improved Relu support
+ Added support for SVDF operator (floating point variant) in TFLiteMicro
+ Enabled compilation of the NN library for cores with newlib as well as xclib
+ Enabled cross compilation for HiFi3z, HiFi5 and Fusion-F1 AVS with 16-bit Quad MAC
+ Added mechanism for controlling code size of the library
+ Updated TFLiteMicro speech commands example application
+ Added two supporting libraries for the TFLiteMicro speech commands example
+ Added a supporting library containing code of the ANN API support 
+ Programmer's Guide updated

Limitations
> The ANN test bench, TFLite Micro example application and the supporting libraries
  need support for C++11 standard. This is supported only with Xtensa C library. 
> For configurations without VFPU, the floating point variants of the kernels
  are not supported and should not be used. 

Known issues
> Compiler optimizations are disabled for C++ source files 
> 3 failures seen in ANN testcases because of differences in reference
  code implementations:
    DEPTHWISE_CONV2D_FLOAT_LARGE_2_WEIGHTS_AS_INPUTS
    LOCAL_RESPONSE_NORM_FLOAT_4
    LOCAL_RESPONSE_NORM_FLOAT_4_RELAXED
 
----------------------------------------------------------------------

Version 2.1.0 API 1.0: September 6, 2019

+ GA Release
+ Tested using RI.1 tools and xt-xcc/xt-xc++ compiler
+ Needs HiFi4 core with Xtensa C Library 
+ Programmer's Guide updated

Known issues
> Building with xt-clang compiler is not supported currently
> 3 failures seen in ANN testcases because of differences in reference
  code implementations:
    DEPTHWISE_CONV2D_FLOAT_LARGE_2_WEIGHTS_AS_INPUTS
    LOCAL_RESPONSE_NORM_FLOAT_4
    LOCAL_RESPONSE_NORM_FLOAT_4_RELAXED
 
----------------------------------------------------------------------

Version 2.0.0 API 1.0: August 8, 2019

+ Beta Release
+ Added support for Android NN API v1.1 (Android P)
+ Asym8 datatype support added to low level kernels for ANN compatibility
+ Added 38 ANN operations from Android NN API v1.1 to the library
+ Added TF Lite Micro example testbench
+ Tested using RI.1 tools
+ Needs HiFi4 core with Xtensa C Library 
+ [TENX-47056] Removed the ISS mode switching call from the testbenches 

Known issues
> Programmer's Guide is not finalized
> 3 failures seen in ANN tests:
    DEPTHWISE_CONV2D_FLOAT_LARGE_2_WEIGHTS_AS_INPUTS
    LOCAL_RESPONSE_NORM_FLOAT_4
    LOCAL_RESPONSE_NORM_FLOAT_4_RELAXED
 
----------------------------------------------------------------------

Version 1.0.1 API 1.0: February 11, 2019

+ GA Release
+ Also available in xws file format
+ Programmer's Guide updated
+ Adapted to 3 digit release version number mechanism
 
----------------------------------------------------------------------

Version 1.0 API 1.0: January 7, 2019

+ GA Release
+ Added convolution, pooling low level kernels
+ Added LSTM Layer (forward path), CNN Layer
 
----------------------------------------------------------------------

Version 0.1_Alpha API 0.5: September 21, 2018

+ Alpha Release - HiFi4 Source Package only
+ NN Library implements:
  Matrix-X-vector multiplication kernels
  Activation function kernels
  GRU layer implementation

+ Programmer's Guide (PG) - partially complete, DRAFT version
+ Refer to the PG for GRU details
+ Refer to xa_nnlib_kernels_api.h and the PG Appendix for kernel details
+ Refer to the PG Appendix for kernel performance
  (Measured with RG.5 tools, AE_HiFi4_LE5 with VFPU core)

+ Use 'build/runperf.sh' script for performance testing.
  The script requires test vectors from 'nnlib_testvectors_v0.1_Alpha.tgz'
  This test vector package is shared separately.
----------------------------------------------------------------------
